Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SQL

SuperSQL is backward compatible with SQL in that any SQL query is a SuperSQL pipe operator. A SQL query used as a pipe operator in SuperSQL is called a SQL operator. SQL operators can also be used recursively inside of a SQL operation as defined below.

SQL Operator

A SQL operator is a query having the form of a <sql-op> defined as

[ <with> ]
<sql-body>
[ <order-by> ]
[ <limit> ]

where

  • <with> is an optional WITH clause containing one or more comma-separated common table-expressions (CTEs),
  • <sql-body> is a recursively defined query structure as defined below,
  • <order-by> is an optional list of one or more sort expressions in an ORDER BY clause, and
  • <limit> is a LIMIT clause constraining the number of rows in the output.

A SQL operator produces relational data in the form of sets of records and may appear in several contexts including:

  • the top-level query,
  • as a parenthesized data-source element of a FROM clause,
  • as a data-source element of JOIN clause embedded in a FROM clause,
  • as a subquery in expressions, and
  • as operands in a set operation.

The optional <with> component creates named SQL queries that are available to any FROM clause contained (directly or recursively) within the <sql-body>.

The output of the <sql-body> may be optionally sorted by the <order-by> component and limited in length by the <limit> component.

Note that all of the elements of a <sql-op> are optional except the <sql-body>. Thus, any form of a simple <sql-body> may appear anywhere a <sql-op> may appear.

Note

The WINDOW clause is not yet available in SuperSQL.

SQL Body

A <sql-body> component has one of the following forms:

Thus, a <sql-body> must include either a SELECT or VALUES component as its foundation, i.e., a <sql-body> at core is either a SELECT or VALUES query. Then, this core query may be combined in optionally parenthesized set operations involving other <sql-body> or <sql-op> components.

Table Structure

Relational tables in SuperSQL are modeled as a sequence of records with a uniform type. With input in this form, standard SQL syntax may define a table alias that references an input sequence so that the fields of the record type then correspond to relational columns.

When the record type of the input data is known, the SuperSQL treats it as a relational schema thereby enabling familiar SQL concepts like static type checking and unqualified column resolution.

However, SuperSQL also allows for non-record data as well as data whose type is unknown at compile time (e.g., large JSON files that are not parsed for their type information prior to compilation). A table reference to input data who type is unknown is called a dynamic table.

Dynamic tables pose a challenge to traditional SQL semantics because the compiler cannot know the columns that comprise the input and thus cannot resolve a column reference to a dynamic table. Also, static type checking as in traditional SQL cannot be carried out.

To remedy this, SuperSQL supports dynamic tables in SQL operators but restrict how they may be used as described below.

Note

The restrictions on dynamic tables avoid a situation where the semantics of a query is dependent on whether the input type is known. If column bindings were to change when the input type goes from unknown to known, then the semantics of the query would change simply because type information happened to be known. The constraints on dynamic tables are imposed to avoid this pitfall.

When SQL operators encounter data that is not in table form, errors typically arise, e.g., compile-time errors indicating a query referencing a non-existent column or, in the case of a dynamic table, runtime errors indicating error("missing").

Note

When querying highly heterogeneous data (e.g., JSON events), it is typically preferable to use pipe operators on arbitrary data instead of SQL queries on tables.

Relational Scopes

Identifiers that appear in SQL expressions are resolved in accordance with the relational model, typically referring to tables and columns and by name.

Each SQL pipe operator defines one or more relational namespaces that are independent of other SQL pipe operators and does not span across pipe operator boundaries. A set of columns (from one or more tables) comprising such a namespace is called a relational scope.

A FROM clause creates a relational scope defined by a namespace comprising one or more table names each containing one or more column names from the top-level tables that appear in the FROM body.

A VALUES clause creates a relational scope defined by the default column names c0, c1, etc. and typically appears as a table expression in a FROM clause with a table alias to rename the columns.

Table names and column names do not need to be unique but when non-unique names cause ambiguous references, then errors are reported and the query fails to compile.

A particular column is referenced by name using the syntax

<column>

or

<table> . <column>

where <table> and <column> are identifiers. The first form is called an unqualified column reference while the second form is called a qualified column reference.

Note

The . operator here is overloaded as it may (1) indicate a column inside of a table or (2) dereference a record value.

A table referenced without a column qualifier, as in

<table>

is simply called a table reference. Table references within expressions result in values that comprise the entire row of the table as a record.

Input Scope

A relational scope defined by the optional FROM clause is called an input scope.

An input scope is comprised of the table and constituent column that FROM defines, which may in turn contain JOIN clauses and additional tables and columns. Any of the tables defined in subqueries embedded in the FROM clause are not part of the input scope and thus not visible.

For example, this query

SELECT *
FROM (VALUES (1),(2)) T(x)
CROSS JOIN (VALUES (3,4),(5,6)) U(y,z)

creates an input scope with columns T.x, U.y, and U.z.

Output Scope

A relational scope defined by a SQL operator that is not a SELECT operation — i.e., set operations or a VALUES clause — is called an output scope.

An output scope does not have a table name and is an anonymous scope for which only unqualified column references apply.

When an output scope appears as a table subquery within a FROM clause, the output scope may be named with a table alias and becomes part of the input scope for the SELECT operation in which it appears.

Relational Bindings

While identifiers in SQL expressions typically resolve to columns in table, they may also refer to lexically-scoped declarations for constants, named queries, and so forth. These bindings have a precedence higher than than relational bindings so an identifier is first resolved via lexical binding.

When an identifier does not resolve to a declaration in a lexical scope, then it is resolved as a table or column reference from the input scope or to a column alias.

The relational identifiers are bound to a table, column, or input expression as follows:

  • when alias-first resolution is in effect: if the identifier matches a column alias, then the identifier is substituted with the corresponding column’s input expression;
  • if the identifier resolves as an unqualified reference without error, then the identifier binds to that column;
  • if the identifier resolves as an unqualified reference with an ambiguous column error, then the error is reported and the query fails to compile;
  • when alias resolution is in effect (but column-first is not in effect): if the identifier matches a column alias, then the identifier is substituted with the corresponding column’s input expression;
  • when the identifier is a candidate for a qualified reference (i.e., the identifier is followed by a . and a second identifier):
    • if the identifier pair resolves as an unqualified reference without error, then the pair binds to that column;
    • if the identifier pair resolves as an unqualified reference with an ambiguous column error, then the error is reported and the query fails to compile;
  • when the identifier is not a candidate for a qualified reference:
    • if the identifier resolves as a table reference without error, then the identifier binds to that table;
    • if the identifier resolves as a table reference with an ambiguous table error, then the error is reported and the query fails to compile.

If no such matches are found, then an error is reported indicating a non-existent table or column reference and the query fails to compile.

Unqualified References

An unqualified reference of the form <column> is resolved by searching the input scope over all columns where the identifier and <column> match:

  • if there is exactly one match, then the identifier binds to that column;
  • if there is more than one match, then an error is reported indicating an ambiguous column reference and the query fails to compile;
  • if there is no match, then the resolution fails without error.

When there are multiple tables in scope and at least one of the tables is dynamic, then unqualified references are not allowed. In this case, an error is reported and the query fails to compile.

Qualified References

A qualified reference of the form <table> . <column> is resolved by searching the input scope over all tables that match <table> and where the table contains a column matching <column>:

  • if there is exactly one match, then the identifier binds to that column;
  • if there is more than one match, then an error is reported indicating an ambiguous column reference and the query fails to compile;
  • if there is no match, then the resolution fails without error.

For dynamic tables, all qualified references bind to any column name and runtime errors are generated when referencing columns that do not exist in the dynamic table.

Table References

A table reference of the form <table> is resolved by searching the input scope over all tables that match <table>:

  • if there is exactly one match, then the identifier binds to that table;
  • if there is more than one match, then an error is reported indicating an ambiguous table reference and the query fails to compile;
  • if there is no match, then the resolution fails without error.

Table references for dynamic tables are not allowed. In this case, an error is reported and the query fails to compile.

Column Aliases

Depending on the particular clause, column aliases may be referenced in expressions.

For expressions in GROUP BY and ORDER BY clauses (and when pragma pg is false):

  • if the identifier matches a column alias, then the corresponding expression is substituted in place of the identifier,
  • otherwise, the identifier is resolved to a table or column reference from the input scope as described previously.

When pragma pg is true, the column alias check is performed after the input scope but only for GROUP BY expressions, which is in line with PostgreSQL semantics.

For expressions in an ORDER BY clause, the column alias check is always performed before the input scope check independent of the pragma pg setting.

For expressions in a WHERE clause, the identifier is resolved using only the input scope, i.e., column aliases are not allowed.

this

When querying dynamic tables, the * selector for all columns is not available as the input columns are unknown. Also, the input is not guaranteed to be relational.

To remedy this, SuperSQL allows this to be referenced in expressions, which resolves to the input row for WHERE, GROUP BY and the projected expressions and resolves the output row for HAVING and ORDER BY.

For example, the SELECT statement here, places this into a first column called that thereby producing a relational output:

# spq
values 1, 'foo', {x:1}
| SELECT this
# input

# expected output
{that:1}
{that:"foo"}
{that:{x:1}}

SQL Compatibility

With so many incompatible SQL dialects in existence, SuperSQL’s goal is to be (mostly) compatible with the the PostgreSQL dialect.

In addition to SQL tests in the SuperDB GitHub repository, PostgreSQL compatibility is tested using sqllogictest queries from SQLite.

As of the first GA release of SuperDB, 3,615,296 of 3,619,718 eligible queries (99.88% success rate) from the SQLite set produce the same result as PostgreSQL when run via super. The following open issues are known to be a cause of one or more of the 0.12% remaining unsuccessful queries and will be addressed in future SuperDB releases.

IssueDescription
super#6549SQL: Correlated subqueries
super#6033SQL: INTERSECT and EXCEPT
super#6074SQL: Large cartesian product causes very long query runtime
super#5984SQL: NULL values absent from JOIN output
super#6536SQL: Signed zero
super#6517SQL: Promoting return types to common type of arguments

Additional details on how the tests are assembled and executed can be found in the sqllogic-ztests repo.