Introduction

SuperDB is a new type of analytics database that promises an easier approach to modern data because it unifies relational tables and eclectic JSON in a powerful, new data model called super-structured data.

Note

The SuperDB implementation is open source and available as a GitHub repository. Pre-built binaries may be downloaded and installed via customary mechanisms.

Super-structured data is

dynamic so that data collections can vary by type and are not handcuffed by schemas,
strongly typed ensuring that all the benefits of a comprehensive type system apply to dynamic data, and
self-describing thus obviating the need to define schemas up front.

SuperDB has taken many of the best ideas of current data systems and adapted them for super-structured data with the introduction of:

the super-structured data model,
several super-structured serialization formats,
a SQL-compatible query language adapted for super-structured data,
a super-structured query engine, and
a super-structured database format compatible with cloud object stores.

To achieve high performance for the dynamically typed data that lies at the heart of super-structured data, SuperDB has devised a novel vectorized runtime built as a clean slate around algebraic types. This contrasts with the Frankenstein approach taken by other analytics systems that shred variants into relational columns.

Putting JSON into a relational table — whether adding a JSON or variant column to a relational table or performing schema inference that does not always work — is like putting a square peg in a round hole. SuperDB turns this status quo upside down where JSON and schema-constrained relational tables are simply special cases of the more general and holistic super-structured data model.

This leads to ergonomics for SuperDB that are far better for the query language and for managing data end to end because there is not one way for handling relational data and a different way for managing dynamic data — relational tables and eclectic JSON data are treated in a uniform way from the ground up. For example, there’s no need for a set of Parquet input files to all be schema-compatible and it’s easy to mix and match Parquet with JSON across queries.

Super-structured Data

Super-structured data is strongly typed and self describing. While compatible with relational schemas, SuperDB does not require such schemas as they can be modeled as super-structured records.

More specifically, a relational table in SuperDB is simply a collection of uniformly typed records, whereas a collection of dynamic but strongly-typed data can model any sequence of JSON values, e.g., observability data, application events, system logs, and so forth.

Thus, data in SuperDB is

strongly typed like databases, but
dynamically typed like JSON.

Self-describing data makes data easier: when transmitting data from one entity to another, there is no need for the two sides to agree up front what the schemas must be in order to communicate and land the data.

Likewise, when extracting and serializing data from a query, there is never any loss of information as the super-structured formats capture all aspects of the strongly-typed data whether in human-readable form, binary row-like form, or columnar-like form.

The `super` Command

SuperDB is implemented as the standalone, dependency-free super command. super is a little like DuckDB and a little like jq but super-structured data ties these two command styles together with strong typing of dynamic data.

Because super has no dependencies, it’s easy to get going — just install the binary and you’re off and running.

SuperDB separates compute and storage and is decomposed into a runtime system that

runs directly on any data inputs like files, streams, or APIs, or
manipulates and queries data in a persistent storage layer — the SuperDB database — that rhymes in design with the emergent lakehouse pattern but is based on super-structured data.

To invoke the SuperDB runtime without a database, just run super without the db subcommand and specify an optional query with -c:

super -c "SELECT 'hello, world'"

To interact with a SuperDB database, invoke the db subcommands of super and/or program against the database API.

Note

The persistent database layer is still under development and not yet ready for turnkey production use.

Why Not Relational?

The fashionable argument against a new system like SuperDB is that SQL and the relational model (RM) are perfectly good solutions that have stood the test of time so there’s no need to replace them. In fact, a recent paper from legendary database experts argues that any attempt to supplant SQL or the RM is doomed to fail because any good ideas that arise from such efforts will simply be incorporated into SQL and the RM.

Yet, the incorporation of the JSON data model into the relational model never fails to disappoint. One must typically choose between creating columns of a JSON or variant type that layers in a parallel set of operators and behaviors that diverge from core SQL semantics, or rely upon schema inference to convert variant data into relational tables, which unfortunately does not always work.

To understand the difficulty of schema inference, consider this simple line of JSON data is in a file called example.json:

{"a":[1,"foo"]}

Note

The literal [1,"foo"] is a contrived example but it adequately represents the challenge of mixed-type JSON values, e.g., an API returning an array of JSON objects with varying shape.

Surprisingly, this simple JSON input causes unpredictable schema inference across different SQL systems. ClickHouse converts the JSON number 1 to a string:

$ clickhouse -q "SELECT * FROM 'example.json'"
['1','foo']

DuckDB does only partial schema inference and leaves the contents of the array as type JSON:

$ duckdb -c "SELECT * FROM 'example.json'"
┌──────────────┐
│      a       │
│    json[]    │
├──────────────┤
│ [1, '"foo"'] │
└──────────────┘

And DataFusion fails with an error:

$ datafusion-cli -c "SELECT * FROM 'example.json'"
DataFusion CLI v46.0.1
Error: Arrow error: Json error: whilst decoding field 'a': expected string got 1

It turns out there’s no easy way to represent this straightforward literal array value [1,'foo'] in these SQLs, e.g., simply including this value in a SQL expression results in errors:

$ clickhouse -q "SELECT [1,'foo']"
Code: 386. DB::Exception: There is no supertype for types UInt8, String because some of them are String/FixedString/Enum and some of them are not. (NO_COMMON_TYPE)

$ duckdb -c "SELECT [1,'foo']"
Conversion Error:
Could not convert string 'foo' to INT32

LINE 1: SELECT [1,'foo']
                  ^

$ datafusion-cli -c "SELECT [1,'foo']"
DataFusion CLI v46.0.1
Error: Arrow error: Cast error: Cannot cast string 'foo' to value of Int64 type

The more recent innovation of an open variant type is more general than JSON but suffers from similar problems. In both these cases, the JSON type and the variant type are not individual types but rather entire type systems that differ from the base relational type system and so are shoehorned into the relational model as a parallel type system masquerading as a specialized type to make it all work.

Maybe there is a better way?

Enter Algebraic Types

What’s missing here is an easy and native way to represent mixed-type entities. In modern programming languages, such entities are enabled with a sum type or tagged union.

While the original conception of the relational data model anticipated “product types” — in fact, describing a relation’s schema in terms of a product type — it unfortunately did not anticipate sum types.

Note

Codd’s original paper on the relational model has a footnote that essentially describes as a product type:

But sum types were notably absent.

Armed with both sum and product types, super-structured data provides a comprehensive algebraic type system that can represent any JSON value as a concrete type. And since relations are simply product types as originally envisioned by Codd, any relational table can be represented also as a super-structured product type. Thus, JSON and relational tables are cleanly unified with an algebraic type system.

In this way, SuperDB “just works” when it comes to processing the JSON example from above:

$ super -c "SELECT * FROM 'example.json'"
{a:[1,"foo"]}
$ super -c "SELECT [1,'foo'] AS a"
{a:[1,"foo"]}

In fact, we can see algebraic types at work here if we interrogate the type of such an expression:

$ super -c "SELECT typeof(a) as type FROM (SELECT [1,'foo'] AS a)"
{type:<[int64|string]>}

In this super-structured representation, the type field is a first-class type value representing an array type of elements having a sum type of int64 and string.

SuperSQL

Since super-structured data is a superset of the relational model, it turns out that a query language for super-structured data can be devised that is a superset of SQL. The SuperDB query language is a Pipe SQL adapted for super-structured data called SuperSQL.

SuperSQL is particularly well suited for data-wrangling use cases like ETL and data exploration and discovery. Syntactic shortcuts, keyword search, and the pipe syntax make interactively querying data a breeze. And language features like recursive functions with re-entrant subqueries allow for traversing nested data in a general and powerful fashion.

Instead of operating upon statically typed relational tables as SQL does, SuperSQL operates upon super-structured data. When such data happens to look like a table, then SuperSQL can work just like SQL:

$ super -c "
SELECT a+b AS x, a-b AS y FROM (
  VALUES (1,2),	(3,0)
) AS T(a,b)
"
{x:3,y:-1}
{x:3,y:3}

But when data does not conform to the relational model, SuperSQL can still handle it with its super-structured runtime:

$ super -c "
SELECT avg(radius) as R, avg(width) as W FROM (
  VALUES
    {kind:'circle',radius:1.5},
    {kind:'rect',width:2.0,height:1.0},
    {kind:'circle',radius:2},
    {kind:'rect',width:1.0,height:3.5}
)
"
{R:1.75,W:1.5}

Things get more interesting when you want to do different kinds of processing for differently typed entities, e.g., let’s compute an average radius of circles, and double the width of each rectangle. This time we’ll use the pipe syntax with shortcuts and employ first-class errors to flag unknown types:

$ super -c "
values
  {kind:'circle',radius:1.5},
  {kind:'rect',width:2.0,height:1.0},
  {kind:'circle',radius:2},
  {kind:'rect',width:1.0,height:3.5},
  {kind:'triangle',base:2.0,height:3.0}
| switch kind
    case 'circle' (
        R:=avg(radius)
    )
    case 'rect' (
        width:=width*2
    )
    default (
        error({message:'unknown shape',value:this})
    )
"
{R:1.75}
{kind:"rect",width:4.,height:1.}
{kind:"rect",width:2.,height:3.5}
error({message:"unknown shape",value:{kind:"triangle",base:2.,height:3.}})

So what’s going on here? The data model here is acting both as a strongly typed representation of JSON-like sequences as well as a means to represent relational tables. And SuperSQL is behaving like SQL when applied to table-like data, but at the same time is a pipe-syntax language for arbitrarily typed data. The super-structured data model ties it all together.

To make this all work, the runtime must handle arbitrarily typed data. Hence, every operator in SuperSQL has defined behavior for every possible input type. This is the key point of departure for super-structured data: instead of the unit of processing being a relational table, which requires a statically defined schema, the unit of processing is a collection of arbitrarily typed values. In a sense, SuperDB generalizes Codd’s relational algebra to polymorphic operators. All of Codd’s relational operators can be recast in this fashion forming the polymorphic algebra of super-structured data implemented by SuperDB.

Evolving SQL

Despite SQL’s enduring success, it is widely accepted that there are serious flaws in the language and a number of authors argue that SQL should be replaced in its entirety. Among many such works, here are some noteworthy arguments:

A Critique of the SQL Database Language a 1983 paper by C.J. Date,
Against SQL by Jamie Brandon,
A Critique of Modern SQL And A Proposal Towards A Simple and Expressive Query Language by Neumann and Leis.

A very different approach is taken in SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL, which argues that SQL should be merely improved upon and not replaced outright. Here the authors argue that except for compositional syntax, SQL is perfectly reasonable and we should live with its anachronisms (see Section 2.4). Thus, their Pipe SQL specification carries forward SQL eccentricities into their modern adaptation of pipes for SQL.

SuperSQL takes a different approach and seizes the opportunity to modernize the ergonomics of a SQL-compatible query language. While embracing backward compatibility, SuperSQL diverges significantly from SQL anachronisms in the pipe portion of the language by introducing pipe scoping semantics that coexist next to the relational scoping semantics of SQL tables and columns.

The vision here is that comprehensive backward compatibility can reside in the SQL operators while a modernized syntax and and improved ergonomics can reside in the pipe operators, e.g.,

array indexing can be configured as 1-based in SQL clauses but 0-based in pipe operators,
column names in SQL clauses are case insensitive while record field references are case sensitive in pipe operators,
complex scoping rules for table aliases and column references are required in relational SQL while binding from names to data in pipe operators is managed in a uniform and simple way as derefenced paths on this,
the syntactic structure of SQL clauses means all data must conform to a table whereas pipe operators can emit any data type desired in a varying fashion, and
sum types are integral to piped data allowing mix-typed data processing and results that need not fit in a uniform table.

With this approach, SuperSQL can be adopted and used for existing use cases based on legacy SQL while incrementally expanding and embracing the pipe model tied to super-structured data. Perhaps this could enable a long-term and gradual transition away from relational SQL toward a modern and more ergonomic replacement.

The jury is out as to whether a Pipe SQL for super-structured data is the right approach for curing SQL’s ills, but it certainly provides a framework for exploring entirely new language abstractions while maintaining complete backward compatibility with SQL all in the same query language.

Note

If SuperDB or super-structured data has piqued your interest, you can dive deeper by:

exploring the SuperSQL query language,

learning about the super-structured data model and formats underlying SuperDB, or

browsing the tutorials.

We’d love your feedback and we hope to build a thriving community around SuperDB so please feel free to reach out to us via

Slack

GitHub issues, or

GitHub pull requests.

See you online!

Keyboard shortcuts

SuperDB