SuperSQL
SuperSQL is a Pipe SQL adapted for super-structured data. The language is a superset of SQL query syntax and includes a modern type system with sum types to represent heterogeneous data.
Similar to a Unix pipeline, a SuperSQL query is expressed as a data source followed by a number of operators that manipulate the data:
from source | operator | operator | ...
As with SQL, SuperSQL is declarative and the SuperSQL compiler often optimizes a query into an implementation different from the dataflow implied by the pipeline to achieve the same semantics with better performance.
While SuperSQL at its core is a pipe-oriented language, it is also backward compatible with relational SQL in that any arbitrarily complex SQL query may appear as a single pipe operator anywhere in a SuperSQL pipe query.
In other words, a single pipe operator that happens to be a standalone SQL query is also a SuperSQL pipe query. For example, these are all valid SuperSQL queries:
SELECT 'hello, world'
SELECT * FROM table
SELECT * FROM f1.json JOIN f2.json ON f1.id=f2.id
SELECT watchers FROM https://api.github.com/repos/brimdata/super
Interactive UX
To support an interactive pattern of usage, SuperSQL includes search syntax reminiscent of Web or email keyword search along with operator shortcuts.
With shortcuts, verbose queries can be typed in a shorthand facilitating rapid data exploration. For example, the query
SELECT count(), key
FROM source
GROUP BY key
can be simplified as from source | count() by key.
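To see the shortcut form in action, it can be run over inline data (a sketch; the key values here are invented for illustration):
super -c "values {key:'a'},{key:'b'},{key:'a'} | count() by key"
which should produce output along the lines of
{key:"a",count:2(uint64)}
{key:"b",count:1(uint64)}
though the order of the aggregated rows is undefined.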
With search, all of the string fields in a value can easily be searched for patterns, e.g., this query
from source
| ? example.com urgent message_length > 100
searches for the strings “example.com” and “urgent” in all of the string values in
the input and also applies a numeric comparison to the field message_length.
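As a runnable sketch of this style of search (with invented field names and data), the query
super -c "
values {msg:'urgent: example.com is down',message_length:120},
       {msg:'routine check',message_length:200}
| ? example.com urgent message_length > 100
"
should match only the first value, since both keywords must appear among its string values and the numeric comparison must hold.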
Pipe Queries
The entities that transform data within a SuperSQL pipeline are called pipe operators and take super-structured input from the upstream operator or data source, operate upon the input, and produce zero or more super-structured values as output.
Unlike relational SQL, SuperSQL pipe queries define their computation in terms of dataflow through the directed graph of operators. But instead of relational tables propagating from one pipe operator to another (e.g., as in ZetaSQL pipe syntax), any sequence of potentially heterogeneously typed data may flow between SuperSQL pipe operators.
When a super-structured sequence conforms to a single, homogeneous record type, the data is equivalent to a SQL relation. And because any SQL query is also a valid pipe operator, SuperSQL is thus a superset of SQL: a single operator consisting of pure SQL is a valid SuperSQL query, so every SQL query text is also a SuperSQL query.
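For example (a sketch), a homogeneous pipe sequence can be consumed directly by a SQL operator appearing mid-pipeline:
values {x:1},{x:2},{x:3} | SELECT sum(x) AS total
which should produce
{total:6}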
Unlike a Unix pipeline, a SuperSQL query can be forked and joined, e.g.,
from source
| operator
| fork
( operator | ... )
( operator | ... )
| join on condition
| ...
| switch expr
case value ( operator | ... )
case value ( operator | ... )
default ( operator | ... )
| ...
A query can also include multiple data sources, e.g.,
fork
( from source1 | ... )
( from source2 | ... )
| ...
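As a minimal, runnable sketch of a multi-source query (using inline values in place of external sources), two forked branches can be merged and then ordered deterministically with sort, since the merge order is otherwise undefined:
super -c "
fork
  ( values 1,2 )
  ( values 3,4 )
| sort this
"
which should produce
1
2
3
4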
Pipe Sources
As in SQL, input data for a query is typically sourced with the
from operator.
When from is not present, the file arguments to the
super command are used as input to the query
as if there is an implied
from operator, e.g.,
super -c "op1 | op2 | ..." input.json
is equivalent to
super -c "from input.json | op1 | op2 | ..."
When neither
from nor file arguments are specified, a single null value
is provided as input to the query.
super -c "pass"
results in
null
This pattern provides a simple means to produce a constant input within a
query using the values operator, wherein
values takes as input a single null and produces each constant
expression in turn, e.g.,
super -c "values 1,2,3"
results in
1
2
3
When running on the local file system,
from may refer to a file or an HTTP URL
indicating an API endpoint.
When connected to a SuperDB database,
from typically
refers to a collection of super-structured data called a pool,
which is referenced by the pool’s name, similar to SQL referencing
a relational table by name.
For more detail, see the reference page of the from operator,
but as an example, you might use
from to fetch data from an
HTTP endpoint and process it with super, in this case, to extract the description
and license of a GitHub repository:
super -f line -c "
from https://api.github.com/repos/brimdata/super
| values description,license.name
"
Relational Scoping
In SQL queries, data from tables is generally referenced with expressions that
specify a table name and a column name within that table,
e.g., referencing a column x in a table T as
SELECT T.x FROM (VALUES (1),(2),(3)) AS T(x)
More commonly, when the column name is unambiguous, the table name can be omitted as in
SELECT x FROM (VALUES (1),(2),(3)) AS T(x)
When SQL queries are nested, joined, or invoked as subqueries, scoping rules define how identifiers and dotted expressions resolve to the table names and columns reachable in containing scopes. To support such semantics, SuperSQL implements SQL scoping rules inside any SQL pipe operator but not between pipe operators.
In other words, table aliases and column references all work within a SQL query written as a single pipe operator but scoping of tables and columns does not reach across pipe operators. Likewise, a pipe query embedded inside of a nested SQL query cannot access tables and columns in the containing SQL scope.
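To illustrate this boundary (a sketch), the table alias T below resolves inside the single SQL operator, but the downstream pipe operator must reference its input as this (or by field name) since T is not in scope across the pipe:
super -c "SELECT T.x FROM (VALUES (1),(2),(3)) AS T(x) | values this.x"
which should produce
1
2
3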
Pipe Scoping
For pipe queries, SuperSQL takes a different approach to scoping called pipe scoping.
Here, a pipe operator takes any sequence of input values
and produces any computed sequence of output values and all
data references are limited to these inputs and outputs.
Since there is just one sequence of values, it may be
referenced as a single special value with a special name, which for
SuperSQL is the value this.
This scoping model can be summarized as follows:
- all input is referenced as a single value called this, and
- all output is emitted into a single value called this.
As mentioned above,
when processing a set of homogeneously-typed records,
the data resembles a relational table where the record type resembles a
relational schema and each field in the record models a column of the table.
In other words, the record fields of this can be accessed with the dot operator
reminiscent of a table.column reference in SQL.
For example, the SQL query from above can thus be written in pipe form
using the values operator as:
values {x:1}, {x:2}, {x:3} | select this.x
which results in:
{x:1}
{x:2}
{x:3}
As with SQL table names, where this is implied, it is optional and can be omitted, i.e.,
values {x:1}, {x:2}, {x:3} | select x
produces the same result.
Referencing this is often convenient, however, as in this query
values {x:1}, {x:2}, {x:3} | aggregate collect(this)
which collects each input value into an array and emits the array resulting in
[{x:1},{x:2},{x:3}]
Combining Piped Data
If all data for all operators were always presented as a single input sequence
called this, then there would be no way to combine data from different entities,
which is otherwise a hallmark of SQL and the relational model.
To remedy this, SuperSQL extends pipe scoping to
joins and
subqueries
where multiple entities can be combined into the common value this.
Join Scoping
To combine joined entities into this via pipe scoping, the
join operator
includes an as clause that names the two sides of the join, e.g.,
... | join ( from ... ) as {left,right} | ...
Here, the joined values are formed into a new two-field record
whose first field is left and whose second field is right where the
left values come from the parent operator and the right values come
from the parenthesized join query argument.
For example, suppose the contents of a file f1.json is
{"x":1}
{"x":2}
{"x":3}
and f2.json is
{"y":4}
{"y":5}
then a join can bring these two entities together into a common record
which can then be subsequently operated upon, e.g.,
from f1.json
| cross join (from f2.json) as {f1,f2}
computes a cross product of the two sides of the join and produces the following output
{f1:{x:1},f2:{y:4}}
{f1:{x:2},f2:{y:4}}
{f1:{x:3},f2:{y:4}}
{f1:{x:1},f2:{y:5}}
{f1:{x:2},f2:{y:5}}
{f1:{x:3},f2:{y:5}}
A downstream operator can then operate on these records,
for example, merging the two sides of the join using
spread operators (...), i.e.,
from f1.json
| cross join (from f2.json) as {f1,f2}
| values {...f1,...f2}
produces
{x:1,y:4}
{x:2,y:4}
{x:3,y:4}
{x:1,y:5}
{x:2,y:5}
{x:3,y:5}
In contrast, relational scoping resolves identifiers in a SELECT clause
against the table sources named in the FROM and JOIN clauses, e.g., this query
produces the same result:
SELECT f1.x, f2.y FROM f1.json as f1 CROSS JOIN f2.json as f2
Subquery Scoping
A subquery embedded in an expression can also combine data entities via pipe scoping as in
from f1.json | values {outer:this,inner:[from f2.json | ...]}
Here data from the outer query can be mixed in with data from the
inner array subquery embedded in the expression inside of the
values operator.
The array subquery produces an array value so it is often desirable to
unnest this array with respect to the outer
values as in
from f1.json | unnest {outer:this,inner:[from f2.json | ...]} into ( <scope> )
where <scope> can be an arbitrary pipe query that processes each
collection of unnested values separately as a unit for each outer value.
The into ( <scope> ) body is an optional component of unnest, and if absent,
the unnested collection boundaries are ignored and all of the unnested data is output.
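As a minimal sketch of this distinction (assuming unnest accepts an array-valued expression as described above), unnesting without into flattens everything into one sequence:
super -c "values {a:[1,2]},{a:[3,4]} | unnest a"
should produce
1
2
3
4
whereas adding an into scope should aggregate each unnested collection separately:
super -c "values {a:[1,2]},{a:[3,4]} | unnest a into ( sum(this) )"
giving something like
{sum:3}
{sum:7}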
With the unnest operator, we can now consider how a correlated subquery from
SQL can be implemented purely as a pipe query with pipe scoping.
For example,
SELECT (SELECT sum(f1.x+f2.y) FROM f1.json) AS s FROM f2.json
results in
{s:18}
{s:21}
To implement this with pipe scoping, the correlated subquery is carried out by unnesting the subquery’s data together with the values from the outer scope, as in
from f2.json
| unnest {f2:this,f1:[from f1.json]} into ( s:=sum(f1.x+f2.y) )
giving the same result
{s:18}
{s:21}
Strong Typing
Data in SuperSQL is always strongly typed.
Like relational SQL, SuperSQL data sequences can conform to a static schema that is type-checked at compile time. And like document databases and SQL++, data sequences may also be dynamically typed, but unlike these systems, SuperSQL data is always strongly typed.
To perform type checking of dynamic data, SuperSQL utilizes a novel approach based on fused types. Here, the compiler interrogates the data sources for their fused type and uses these types (instead of relational schemas) to perform type checking. This is called fused type checking.
Because a relational schema is a special case of a fused type, fused type checking works for both traditional SQL as well as for super-structured pipe queries. These fused types are maintained in the super-structured database and the binary forms of super-structured file formats provide efficient ways to retrieve their fused type.
For traditional formats like JSON or CSV, the file is read by the compiler and the fused type computed on the fly. When such files are large enough that this creates too much overhead for the compilation stage, this step may be skipped via a configurable limit, and compilation completes with more limited type checking, instead producing runtime errors when type errors are encountered.
In other words, dynamic data is statically type checked when possible. This works by computing fused types of each operator’s output and propagating these types in a dataflow analysis. When types are unknown, the analysis flexibly models them as having any possible type.
For example, this query produces the expected output
$ super -c "select b from (values (1,2),(3,4)) as T(b,c)"
{b:1}
{b:2}
But this query produces a compile-time error:
$ super -c "select a from (values (1,2),(3,4)) as T(b,c)"
column "a": does not exist at line 1, column 8:
select a from (values (1,2),(3,4)) as T(b,c)
~
Now suppose this data is in the file input.json:
{"b":1,"c":2}
{"b":3,"c":4}
If we run the query from above without type checking of data from the source
(by passing the -dynamic flag), then the query runs even though there are type
errors. In this case, “missing” values are produced:
$ super -dynamic -c "select a from input.json"
{a:error("missing")}
{a:error("missing")}
Even though the reference to column “a” is dynamically evaluated, all the data is still strongly typed, i.e.,
$ super -c "from input.json | values typeof(this)"
<{b:int64,c:int64}>
<{b:int64,c:int64}>
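Relatedly, a fused type can be observed directly (a sketch) by applying the fuse operator to a heterogeneous sequence and inspecting the result:
super -c "values {a:1},{b:'hello'} | fuse | values typeof(this)"
which should produce the fused record type for both values:
<{a:int64,b:string}>
<{a:int64,b:string}>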
Data Order
Data sequences from sources may have a natural order. For example, the values in a file being read are presumed to have the order they appear in the file. Likewise, data stored in a database organized by a sort constraint is presumed to have the sorted order.
For order-preserving pipe operators, this order is preserved.
For order-creating operators like sort,
an output order is created independent of the input order.
For other operators, the output order is undefined.
Each operator defines whether order is preserved, created, or discarded.
For example, the where operator drops values that do
not meet its condition but otherwise preserves data order,
whereas the sort operator creates an output order defined
by its sort expressions. The aggregate operator's
output order is undefined.
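These behaviors are easy to observe (a sketch):
super -c "values 3,1,2 | where this > 1"
preserves input order, producing
3
2
while
super -c "values 3,1,2 | sort this"
creates a new order:
1
2
3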
When a pipe query branches as in
join,
fork, or
switch,
the order at the merged branches is undefined.