Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SuperSQL

SuperSQL is a Pipe SQL adapted for super-structured data. The language is a superset of SQL query syntax and includes a modern type system with sum types to represent heterogeneous data.

Similar to a Unix pipeline, a SuperSQL query is expressed as a data source followed by a number of operators that manipulate the data:

from source | operator | operator | ...

As with SQL, SuperSQL is declarative and the SuperSQL compiler often optimizes a query into an implementation different from the dataflow implied by the pipeline to achieve the same semantics with better performance.

While SuperSQL at its core is a pipe-oriented language, it is also backward compatible with relational SQL in that any arbitrarily complex SQL query may appear as a single pipe operator anywhere in a SuperSQL pipe query.

In other words, a single pipe operator that happens to be a standalone SQL query is also a SuperSQL pipe query. For example, these are all valid SuperSQL queries:

SELECT 'hello, world'
SELECT * FROM table
SELECT * FROM f1.json JOIN f2.json ON f1.id=f2.id
SELECT watchers FROM https://api.github.com/repos/brimdata/super

Interactive UX

To support an interactive pattern of usage, SuperSQL includes search syntax reminiscent of Web or email keyword search along with operator shortcuts.

With shortcuts, verbose queries can be typed in a shorthand facilitating rapid data exploration. For example, the query

SELECT count(), key
FROM source
GROUP BY key

can be simplified as from source | count() by key.

With search, all of the string fields in a value can easily be searched for patterns, e.g., this query

from source
| ? example.com urgent message_length > 100

searches for the strings “example.com” and “urgent” in all of the string values in the input and also includes a numeric comparison regarding the field message_length.

Pipe Queries

The entities that transform data within a SuperSQL pipeline are called pipe operators and take super-structured input from the upstream operator or data source, operate upon the input, and produce zero or more super-structured values as output.

Unlike relational SQL, SuperSQL pipe queries define their computation in terms of dataflow through the directed graph of operators. But instead of relational tables propagating from one pipe operator to another (e.g., as in ZetaSQL pipe syntax), any sequence of potentially heterogeneously typed data may flow between SuperSQL pipe operators.

When a super-structured sequence conforms to a single, homogeneous record type, then the data is equivalent to a SQL relation. And because any SQL query is also a valid pipe operator, SuperSQL is thus a superset of SQL. In particular, a single operator defined as pure SQL is an acceptable SuperSQL query so all SQL query texts are also SuperSQL queries.

Unlike a Unix pipeline, a SuperSQL query can be forked and joined, e.g.,

from source
| operator
| fork
  ( operator | ... )
  ( operator | ... )
| join on condition
| ...
| switch expr
  case value ( operator | ... )
  case value ( operator | ... )
  default ( operator | ... )
| ...

A query can also include multiple data sources, e.g.,

fork
  ( from source1 | ... )
  ( from source2 | ... )
| ...

Pipe Sources

Like SQL, input data for a query is typically sourced with the from operator.

When from is not present, the file arguments to the super command are used as input to the query as if there is an implied from operator, e.g.,

super -c "op1 | op2 | ..." input.json

is equivalent to

super -c "from input.json | op1 | op2 | ..."

When neither from nor file arguments are specified, a single null value is provided as input to the query.

super -c "pass"

results in

null

This pattern provides a simple means to produce a constant input within a query using the values operator, wherein values takes as input a single null and produces each constant expression in turn, e.g.,

super -c "values 1,2,3"

results in

1
2
3

When running on the local file system, from may refer to a file or an HTTP URL indicating an API endpoint. When connected to SuperDB database, from typically refers to a collection of super-structured data called a pool and is referenced using the pool’s name similar to SQL referencing a relational table by name.

For more detail, see the reference page of the from operator, but as an example, you might use its from to fetch data from an HTTP endpoint and process it with super, in this case, to extract the description and license of a GitHub repository:

super -f line -c "
from https://api.github.com/repos/brimdata/super
| values description,license.name
"

Relational Scoping

In SQL queries, data from tables is generally referenced with expressions that specify a table name and a column name within that table, e.g., referencing a column x in a table T as

SELECT T.x FROM (VALUES (1),(2),(3)) AS T(x)

More commonly, when the column name is unambiguous, the table name can be omitted as in

SELECT x FROM (VALUES (1),(2),(3)) AS T(x)

When SQL queries are nested, joined, or invoked as subqueries, scoping rules define how identifiers and dotted expressions resolve to the different available table names and columns reachable via containing scopes. To support such semantics, SuperSQL implements SQL scoping rules inside of any SQL pipe operator but not between pipe operators.

In other words, table aliases and column references all work within a SQL query written as a single pipe operator but scoping of tables and columns does not reach across pipe operators. Likewise, a pipe query embedded inside of a nested SQL query cannot access tables and columns in the containing SQL scope.

Pipe Scoping

For pipe queries, SuperSQL takes a different approach to scoping called pipe scoping.

Here, a pipe operator takes any sequence of input values and produces any computed sequence of output values and all data references are limited to these inputs and outputs. Since there is just one sequence of values, it may be referenced as special value with a special name, which for SuperSQL is the value this.

This scoping model can be summarized as follows:

  • all input is referenced as a single value called this, and
  • all output is emitted into a single value called this.

As mentioned above, when processing a set of homogeneously-typed records, the data resembles a relational table where the record type resembles a relational schema and each field in the record models the table’s column. In other words, the record fields of this can be accessed with the dot operator reminiscent of a table.column reference in SQL.

For example, the SQL query from above can thus be written in pipe form using the values operator as:

values {x:1}, {x:2}, {x:3} | select this.x

which results in:

{x:1}
{x:2}
{x:3}

As with SQL table names, where this is implied, it is optional can be omitted, i.e.,

values {x:1}, {x:2}, {x:3} | select x

produces the same result.

Referencing this is often convenient, however, as in this query

values {x:1}, {x:2}, {x:3} | aggregate collect(this)

which collects each input value into an array and emits the array resulting in

[{x:1},{x:2},{x:3}]

Combining Piped Data

If all data for all operators were always presented as a single input sequence called this, then there would be no way to combine data from different entities, which is otherwise a hallmark of SQL and the relational model.

To remedy this, SuperSQL extends pipe scoping to joins and subqueries where multiple entities can be combined into the common value this.

Join Scoping

To combine joined entities into this via pipe scoping, the join operator includes an as clause that names the two sides of the join, e.g.,

... | join ( from ... ) as {left,right} | ...

Here, the joined values are formed into a new two-field record whose first field is left and whose second field is right where the left values come from the parent operator and the right values come from the parenthesized join query argument.

For example, suppose the contents of a file f1.json is

{"x":1}
{"x":2}
{"x":3}

and f2.json is

{"y":4}
{"y":5}

then a join can bring these two entities together into a common record which can then be subsequently operated upon, e.g.,

from f1.json
| cross join (from f2.json) as {f1,f2}

computes a cross-product over all the two sides of the join and produces the following output

{f1:{x:1},f2:{y:4}}
{f1:{x:2},f2:{y:4}}
{f1:{x:3},f2:{y:4}}
{f1:{x:1},f2:{y:5}}
{f1:{x:2},f2:{y:5}}
{f1:{x:3},f2:{y:5}}

A downstream operator can then operate on these records, for example, merging the two sides of the join using spread operators (...), i.e.,

from f1.json
| cross join (from f2.json) as {f1,f2}
| values {...f1,...f2}

produces

{x:1,y:4}
{x:2,y:4}
{x:3,y:4}
{x:1,y:5}
{x:2,y:5}
{x:3,y:5}

In contrast, relational scoping using identifier scoping in a SELECT clause with the table source identified in FROM and JOIN clauses, e.g., this query produces the same result:

SELECT f1.x, f2.y FROM f1.json as f1 CROSS JOIN f2.json as f2

Subquery Scoping

A subquery embedded in an expression can also combine data entities via pipe scoping as in

from f1.json | values {outer:this,inner:[from f2.json | ...]}

Here data from the outer query can be mixed in with data from the inner array subquery embedded in the expression inside of the values operator.

The array subquery produces an array value so it is often desirable to unnest this array with respect to the outer values as in

from f1.json | unnest {outer:this,inner:[from f2.json | ...]} into ( <scope> )

where <scope> can be an arbitrary pipe query that processes each collection of unnested values separately as a unit for each outer value. The into ( <scope> ) body is an optional component of unnest, and if absent, the unnested collection boundaries are ignored and all of the unnested data is output.

With the unnest operator, we can now consider how a correlated subquery from SQL can be implemented purely as a pipe query with pipe scoping. For example,

SELECT (SELECT sum(f1.x+f2.y) FROM f1.json) AS s FROM f2.json

results in

{s:18}
{s:21}

To implement this with pipe scoping, the correlated subquery is carried out by unnesting the data from the subquery with the values coming from the outer scope as in

from f2.json
| unnest {f2:this,f1:[from f1.json]} into ( s:=sum(f1.x+f2.y) )

giving the same result

{s:18}
{s:21}

Strong Typing

Data in SuperSQL is always strongly typed.

Like relational SQL, SuperSQL data sequences can conform to a static schema that is type-checked at compile time. And like document databases and SQL++, data sequences may also be dynamically typed, but unlike these systems, SuperSQL data is always strongly typed.

To perform type checking of dynamic data, SuperSQL utilizes a novel approach based on fused types. Here, the compiler interrogates the data sources for their fused type and uses these types (instead of relational schemas) to perform type checking. This is called fused type checking.

Because a relational schema is a special case of a fused type, fused type checking works for both traditional SQL as well as for super-structured pipe queries. These fused types are maintained in the super-structured database and the binary forms of super-structured file formats provide efficient ways to retrieve their fused type.

For traditional formats like JSON or CSV, the file is read by the compiler and the fused type computed on the fly. When such files are sufficiently large creating too much overhead for the compilation stage, this step may be skipped using a configurable limit and the compilation completed with more limited type checking, instead creating runtime errors when type errors are encountered.

In other words, dynamic data is statically type checked when possible. This works by computing fused types of each operator’s output and propagating these types in a dataflow analysis. When types are unknown, the analysis flexibly models them as having any possible type.

For example, this query produces the expected output

$ super -c "select b from (values (1,2),(3,4)) as T(b,c)"
{b:1}
{b:2}

But this query produced a compile-time error:

$ super -c "select a from (values (1,2),(3,4)) as T(b,c)"
column "a": does not exist at line 1, column 8:
select a from (values (1,2),(3,4)) as T(b,c)
       ~

Now supposing this data is in the file input.json:

{"b":1,"c":2}
{"b":3,"c":4}

If we run this the query from above without type checking data from the source (enabled with the -dynamic flag), then the query runs even though there are type errors. In this case, “missing” values are produced:

$ super -dynamic -c "select a from input.json"
{a:error("missing")}
{a:error("missing")}

Even though the reference to column “a” is dynamically evaluated, all the data is still strongly typed, i.e.,

$ super -c "from input.json | values typeof(this)"
<{b:int64,c:int64}>
<{b:int64,c:int64}>

Data Order

Data sequences from sources may have a natural order. For example, the values in a file being read are presumed to have the order they appear in the file. Likewise, data stored in a database organized by a sort constraint is presumed to have the sorted order.

For order-preserving pipe operators, this order is preserved. For order-creating operators like sort an output order is created independent of the input order. For other operators, the output order is undefined.

Each operator defines whether or not it is order is preserved, created, or discarded.

For example, the where drops values that do not meet the operator’s condition but otherwise preserves data order, whereas the sort creates an output order defined by the sort expressions. The aggregate creates an undefined order at output.

When a pipe query branches as in join, fork, or switch, the order at the merged branches is undefined.