Getting Started
Trying out SuperDB is easy: just install the command-line tool `super` and run through its usage documentation.
✵ Note ✵
The SuperDB code and docs are still under construction. Once you’ve installed `super`, we recommend focusing first on the functionality shown in the `super` command doc. Feel free to explore other docs and try things out, but please don’t be shocked if you hit speedbumps in the near term, particularly in areas like performance and full SQL coverage. We’re working on it! 😉
Once you’ve tried it out, we’d love to hear your feedback via our community Slack.
Compared to putting JSON data in a relational column, the super-structured data model makes it really easy to mash up JSON with your relational tables. The `super` command is a little like DuckDB and a little like `jq`, but super-structured data ties the two patterns together with strong typing of dynamic values.
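As a minimal sketch of what strong typing of dynamic values looks like in practice, the following pipes heterogeneously-typed JSON into `super` on stdin (the `-c` flag and the `-` stdin argument appear later on this page; the `typeof` function is an assumption based on the SuperSQL function library):

```
# Two JSON values whose "id" fields have different types; super
# preserves the distinction rather than coercing both to strings
# ("typeof" is an assumed function name).
printf '{"id":1}\n{"id":"a1b2"}\n' | super -c "typeof(id)" -
```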
For a non-technical user, SuperDB is as easy to use as web search, while for a technical user, it exposes its technical underpinnings in a gradual slope, providing as much detail as desired, packaged up in the easy-to-understand Super JSON data format and the SuperSQL language.
While `super` and its accompanying data formats are production quality for some use cases, the project’s SuperDB data lake is a bit earlier in development.
Terminology
“Super” is an umbrella term that describes a number of different elements of the system:
- The super data model is the abstract definition of the data types and semantics that underlie the super-structured data formats.
- The super-structured data formats are a family of human-readable (Super JSON, JSUP), sequential (Super Binary, BSUP), and columnar (Super Columnar, CSUP) formats that all adhere to the same abstract super data model.
- SuperSQL is the system’s language for performing queries, searches, analytics, transformations, or any of the above combined together.
- A SuperSQL pipe query (SPQ) is a query that employs SuperSQL’s unique pipeline extensions and shortcuts to perform data operations that are difficult or impossible in standard SQL (see the sketch after this list).
- A SuperDB data lake is a collection of super-structured data stored across one or more data pools with ACID commit semantics and accessed via a Git-like API.
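To make the terminology concrete, here is an illustrative sketch (not a definitive reference; see the format and language docs for exact syntax) of a Super JSON input queried with a short pipe query:

```
# Super JSON allows unquoted field names and first-class types
# like IP addresses (illustrative; see the JSUP format doc).
printf '{host:"a",addr:192.168.1.1}\n{host:"b",addr:10.0.0.1}\n' > hosts.jsup

# An SPQ-style pipeline: stages separated by "|". The "where"
# operator and "count()" aggregate are assumptions modeled on
# the SuperSQL docs.
super -c 'where host=="a" | count()' hosts.jsup
```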
Digging Deeper
The SuperSQL language documentation is the best way to learn about `super` in depth. Most examples that appear throughout the docs can be executed right in your browser and can easily be copied to the command line for execution with `super`. Run `super -h` for a list of command options and brief help.
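For instance, a docs example copied to the command line might look like the following sketch, where the input file and the aggregate function name are assumptions for illustration:

```
# A hypothetical input file:
printf '{"name":"alice","age":30}\n{"name":"bob","age":40}\n' > people.json

# Compute an aggregate over it ("avg" is an assumed function name):
super -c "avg(age)" people.json
```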
The `super db` documentation is the best way to learn about the SuperDB data lake. All of its examples use `super db` commands run on the command line. Run `super db -h` or `-h` with any subcommand for a list of command options and online help. The same language query that works for `super` operating on local files or streams also works for `super db query` operating on a lake.
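A lake session might look like the following sketch; only `query` and `serve` appear on this page, so treat the other subcommands and flags as assumptions to be checked against `super db -h`:

```
# Sketch of a SuperDB data lake workflow (the subcommand and flag
# names below are assumptions modeled on the lake docs).
super db init                         # initialize an empty lake
super db create Demo                  # create a pool named "Demo"
super db load -use Demo demo.json     # load a file into the pool
super db query "from Demo | count()"  # same language as plain super
```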
Design Philosophy
The design philosophy for SuperDB is based on composable building blocks built from self-describing data structures. Everything in a SuperDB data lake is built from super-structured data and each system component can be run and tested in isolation.
Since super-structured data is self-describing, this approach makes stream composition very easy. Data from a query can trivially be piped to a local instance of `super` by feeding the resulting output stream to stdin of `super`, for example:
super db query "from pool | ...remote query..." | super -c "...local query..." -
There is no need to configure the SuperDB entities with schema information like protobuf configs or connections to schema registries.
A SuperDB data lake is completely self-contained, requiring no auxiliary databases (like the Hive metastore) or other third-party services to interpret the lake data. Once a lake is copied, a new service can be instantiated by pointing `super db serve` at the copy. Functionality like data compaction and retention is all API-driven.
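For example, because a lake is just files, a plain copy yields a fully working lake; the paths and the `-lake` flag below are illustrative assumptions, so check `super db serve -h` for the actual option:

```
# Copy the lake with ordinary file tools...
cp -r /data/lake /backup/lake
# ...and serve the copy ("-lake" is an assumed flag name):
super db serve -lake /backup/lake
```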
Bite-sized components are unified by the super-structured data, usually in the BSUP format:
- All lake meta-data is available via meta-queries (see the sketch after this list).
- All lake operations available through the service API are also available directly via the `super db` command.
- Lake management is agent-driven through the API. For example, instead of complex policies like data compaction being implemented in the core with some fixed set of algorithms and policies, an agent can simply hit the API to obtain the meta-data of the objects in the lake, analyze the objects (e.g., looking for too much key space overlap), and issue API commands to merge overlapping objects and delete the old fragmented objects, all with the transactional consistency of the commit log.
- Components are easily tested and debugged in isolation.
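As a sketch of the meta-query and agent patterns above, an agent might start by listing the lake’s pools; the `:pools` meta-source name is an assumption modeled on the lake meta-query docs:

```
# List pool meta-data via a meta-query (":pools" is assumed) —
# the first step of an agent that analyzes objects and then
# issues merge/delete commands through the same API.
super db query "from :pools"
```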