For jq Users
SuperSQL’s pipes and shortcuts provide a flexible and powerful way for SQL users to query their JSON data while leveraging their existing skills. However, users who’ve traditionally wrangled their JSON with other tools such as `jq` will find `super` equally powerful even if they don’t know SQL or just prefer to work primarily in shortcuts. This tour walks through a number of examples using `super` at the command line with `jq` as a reference point.
We’ll start with some simple one-liners where we feed some data to `super` with `echo` and specify `-` as input to indicate that standard input should be used, e.g.,
echo '"hello, world"' | super -
Then, toward the end of the tour, we’ll experiment with some real-world GitHub data pulled from the GitHub API.
Of course, if SQL is your preference, you can write many of the examples shown using the SQL equivalent, e.g.,
super -c "SELECT VALUE 'hello, world'"
or a mix of SQL and pipeline extensions as suits your preferences. However, to make this tutorial relevant to `jq` users, we’ll lean heavily on SuperSQL’s use of pipes and shortcuts.
If you want to follow along on the command line, just make sure the `super` command is installed, as well as `jq`.
But JSON
While `super` is based on a new type of data model, its human-readable format Super JSON (JSUP) just so happens to be a superset of JSON. So if all you ever use `super` for is manipulating JSON data, it can serve you well as a handy, go-to tool. In this way, `super` is kind of like `jq`. As you probably know, `jq` is a popular command-line tool for taking a sequence of JSON values as input, doing interesting things on that input, and emitting results, of course, as JSON.
`jq` is awesome and powerful, but its syntax and computational model can sometimes be daunting and difficult. We tried to make `super` really easy and intuitive, and it is usually faster, sometimes much faster, than `jq`.
To this end, if you want full JSON compatibility without having to delve into the details of JSUP, just use the `-j` option with `super`, which tells it to expect JSON values as input and produce JSON values as output, much like `jq`.
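For example, here’s a minimal sketch of a JSON-in, JSON-out round trip (the `val:=val+1` mutation syntax is covered later in this tour):
echo '{"s":"hello","val":1}' | super -j -c 'val:=val+1' -
produces plain JSON:
{"s":"hello","val":2}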
✵ Tip ✵
If your downstream JSON tooling expects only a single JSON value, we can use `-j` along with `collect()` to aggregate multiple input values into an array. A `collect()` example is shown later in this tutorial.
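For instance, here’s a minimal sketch of that pattern, relying on the fact (shown later) that a lone aggregate function call yields its result directly:
echo '1 2 3' | super -j -c 'collect(this)' -
produces a single JSON array:
[1,2,3]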
`this` vs `.`
For example, to add 1 to some numbers with `jq`, you say:
echo '1 2 3' | jq '.+1'
and you get
2
3
4
With `super`, the mysterious `jq` value `.` is instead called the almost-as-mysterious value `this` and you say:
echo '1 2 3' | super -z -c 'this+1' -
which also gives
2
3
4
✵ Note ✵
We are using the `-z` option with `super` in all of the examples, which formats the output as JSUP. When running `super` on the terminal, you do not need `-z` as it is the default, but we include it here for clarity and because all of these examples are run through automated testing, which is not attached to a terminal.
Search vs Transformation
Generally, `jq` leads with transformation. By comparison, `super` leads with search, but transformation is also pretty easy. Let’s show what we mean here, starting from a minimal example.
If we run this `jq` command,
echo '1 2 3' | jq 2
we get
2
2
2
Hmm, that’s a little odd, but it did what we told it to do. In `jq`, the expression `2` is evaluated for each input value, and the value 2 is produced each time, so three copies of 2 are emitted.
In `super`, a lonely `2` all by itself is not a valid query, but adding a leading `?` (shorthand for the search operator)
echo '1 2 3' | super -z -c '? 2' -
produces this “search result”:
2
In fact, this search syntax generalizes, and if we search over a more complex input:
echo '1 2 [1,2,3] [4,5,6] {r:{x:1,y:2}} {r:{x:3,y:4}} "hello" "Number 2"' |
super -z -c '? 2' -
we naturally find all the places 2 appears, whether as a value, inside a value, or inside a string:
2
[1,2,3]
{r:{x:1,y:2}}
"Number 2"
You can also do keyword-text search, e.g.,
echo '1 2 [1,2,3] [4,5,6] {r:{x:1,y:2}} {r:{x:3,y:4}} "hello" "Number 2"' |
super -z -c '? hello or Number' -
produces
"hello"
"Number 2"
Doing searches like this in `jq` would be hard.
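For the curious, here’s one rough way such a search might be approximated in `jq`, combining a recursive-descent match with a substring test (an illustrative sketch of our own, not an official recipe):
echo '1 2 [1,2,3] [4,5,6] {"r":{"x":1,"y":2}} {"r":{"x":3,"y":4}} "hello" "Number 2"' |
jq -c 'select(any(..; . == 2) or (type == "string" and test("2")))'
This reproduces the four matches above, but it’s already harder to read and doesn’t generalize to other search patterns.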
That said, we can emulate the `jq` transformation stance by explicitly indicating that we want to `yield` the result of the expression evaluated for each input value, e.g.,
echo '1 2 3' | super -z -c 'yield 2' -
now gives the same answer as `jq`:
2
2
2
Cool, but doesn’t it seem like search is a better disposition for shorthand syntax? What do you think?
On to JSUP
JSON is super easy and ubiquitous, but it can be limiting and frustrating when trying to do high-precision stuff with data.
When using `super`, it’s handy to operate in the domain of super-structured data and only output to JSON when needed. Providing human-readability without losing detail is what Super JSON (JSUP) is all about.
JSUP is nice because it has a comprehensive type system, and you can go from JSUP to an efficient binary row format (Super Binary, BSUP) or a columnar format (Super Columnar, CSUP), and vice versa, with complete fidelity and no loss of information. In this tour, we’ll stick to JSUP, though for large data sets Super Binary is much faster.
The first thing you’ll notice about JSUP is that you don’t need quotation marks around field names. We can see this by taking some JSON as input (the JSON format is auto-detected by `super`) and formatting it as pretty-printed JSUP with `-Z`:
echo '{"s":"hello","val":1,"a":[1,2],"b":true}' | super -Z -
which gives
{
    s: "hello",
    val: 1,
    a: [
        1,
        2
    ],
    b: true
}
`s`, `val`, `a`, and `b` all appear as unquoted identifiers here.
Of course, if you have funny characters in a field name, JSUP can handle it with quotes just like JSON:
echo '{"funny@name":1}' | super -z -
produces
{"funny@name":1}
Moreover, JSUP is fully compatible with all of JSON’s corner cases like empty string as a field name and empty object as a value, e.g.,
echo '{"":{}}' | super -z -
produces
{"":{}}
Comprehensive Types
JSUP also has a comprehensive type system.
For example, here is a JSUP “record” with a taste of different types of values as record fields:
{
    v1: 1.5,
    v2: 1,
    v3: 1 (uint8),
    v4: 2018-03-24T17:30:20.600852Z,
    v5: 2m30s,
    v6: 192.168.1.1,
    v7: 192.168.1.0/24,
    v8: [
        1,
        2,
        3
    ],
    v9: |[
        "GET",
        "PUT",
        "POST"
    ]|,
    v10: |{
        "key1": 123,
        "key2": 456
    }|,
    v11: {
        a: 1,
        r: {
            s1: "hello",
            s2: "world"
        }
    }
}
The first seven values are all primitive types in the super data model.
Here, `v1` is a 64-bit IEEE floating-point value just like JSON. Unlike JSON, `v2` is a 64-bit integer. And there are other integer types, as with `v3`, which utilizes a JSUP type decorator, in this case to clarify that its specific integer type is unsigned 8 bits.
`v4` has type `time` and `v5` type `duration`. `v6` is type `ip` and `v7` type `net`.
`v8` is an array of elements of type `int64`, which is a type written as `[int64]`.
`v9` is a “set of strings”, which is written like an array but with the enclosing syntax `|[` and `]|`.
`v10` is a “map” type, which in other languages is often called a “table” or a “dictionary”. In the super data model, a value of any type can be used for the key or value in a map, though all of the keys and all of the values must have the same type.
Finally, `v11` is a “record”, which is similar to a JSON “object”, but the keys are called “fields” and the order of the fields is significant and is always preserved.
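You can also produce decorated values yourself by casting, using a type name as a function (the same pattern as the `time` cast used later in this tour), e.g., this little sketch:
echo '128' | super -z -c 'yield uint8(this)' -
produces
128(uint8)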
Records
As is often the case with semi-structured systems, you deal with nested values all the time: in JSON, data is nested with objects and arrays, while in super-structured data, data is nested with “records” and arrays (as well as other complex types).
Record expressions are rather flexible with `super` and look a bit like JavaScript or `jq` syntax, e.g.,
echo '1 2 3' | super -z -c 'yield {kind:"counter",val:this}' -
produces
{kind:"counter",val:1}
{kind:"counter",val:2}
{kind:"counter",val:3}
Note that, like the search shortcut, you can also drop the `yield` keyword here because the record literal implies the `yield` operator, e.g.,
echo '1 2 3' | super -z -c '{kind:"counter",val:this}' -
also produces
{kind:"counter",val:1}
{kind:"counter",val:2}
{kind:"counter",val:3}
`super` can also use a spread operator like JavaScript, e.g.,
echo '{a:{s:"foo", val:1}}{b:{s:"bar"}}' | super -z -c '{...a,s:"baz"}' -
produces
{s:"baz",val:1}
{s:"baz"}
while
echo '{a:{s:"foo", val:1}}{b:{s:"bar"}}' | super -z -c '{d:2,...a,...b}' -
produces
{d:2,s:"foo",val:1}
{d:2,s:"bar"}
Record Mutation
Sometimes you just want to extract or mutate certain fields of records.
Similar to the Unix `cut` command, the `cut` operator extracts fields, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'cut s' -
produces
{s:"foo"}
{s:"bar"}
while the `put` operator mutates existing fields or adds new fields, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'put val:=123,pi:=3.14' -
produces
{s:"foo",val:123,pi:3.14}
{s:"bar",val:123,pi:3.14}
Note that `put` is also an implied operator, so the command with `put` omitted
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'val:=123,pi:=3.14' -
produces the very same output:
{s:"foo",val:123,pi:3.14}
{s:"bar",val:123,pi:3.14}
Finally, it’s worth mentioning that errors in the super data model are first class. This means they can just show up in the data as values. In particular, a common error is `error("missing")`, which occurs most often when referencing a field that does not exist, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'cut val' -
produces
{val:1}
{val:error("missing")}
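The same error value appears when referencing the missing field directly, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'yield val' -
produces
1
error("missing")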
Sometimes you expect “missing” errors to occur sporadically and just want to ignore them, which you can easily do with the `quiet` function, e.g.,
echo '{s:"foo", val:1}{s:"bar"}' | super -z -c 'cut quiet(val)' -
produces
{val:1}
Union Types
One of the tricks `super` uses to represent JSON data in its structured type system is union types. Most of the time you don’t need to worry about unions, but they do show up from time to time. Even when they do, `super` just tries to “do the right thing”, so you usually don’t have to think about them at all.
For example, this query is perfectly happy to operate on the union values that are implied by a mixed-type array:
echo '[1, "foo", 2, "bar"]' | super -z -c 'yield this[2],this[1]' -
produces
2
"foo"
but under the covers, the elements of the array have a union type of `int64` and `string`, which is written `(int64,string)`, e.g.,
echo '[1, "foo", 2, "bar"]' | super -z -c 'yield typeof(this)' -
produces
<[(int64,string)]>
which is a type value representing an array of union values.
As you learn more about super-structured data and want to use `super` to do data discovery and preparation, union types are really quite powerful. They allow records with fields of different types or mixed-type arrays to be easily expressed while also having a very precise type definition. This is the essence of the new super-structured data model.
First-class Types
Note that in the type value above, the type is wrapped in angle brackets. This is how JSUP represents types when expressed as values. In other words, the super data model has first-class types.
The type of any value in `super` can be accessed via the `typeof` function, e.g.,
echo '1 "foo" 10.0.0.1' | super -z -c 'yield typeof(this)' -
produces
<int64>
<string>
<ip>
What’s the big deal here? We can print out the type of something. Yawn.
Au contraire, this is really quite powerful because we can pass types as values to functions, e.g., as a dynamic argument to the `cast` function:
echo '{a:0,b:"2"}{a:0,b:"3"}' | super -z -c 'yield cast(b, typeof(a))' -
produces
2
3
But more powerfully, types can be used anywhere a value can be used; in particular, they can be grouping keys, e.g.,
echo '{x:1,y:2}{s:"foo"}{x:3,y:4}' |
super -f table -c "count() by this['shape']:=typeof(this) | sort count" -
produces
shape count
<{s:string}> 1
<{x:int64,y:int64}> 2
When run over large data sets, this gives you an insightful count of each “shape” of data in the input. This is a powerful building block for data discovery.
It’s worth mentioning that `jq` also has a `type` operator, but it produces a simple string instead of first-class types, and arrays and objects carry no detail about their structure, e.g.,
echo '1 true [1,2,3] {"s":"foo"}' | jq type
produces
"number"
"boolean"
"array"
"object"
Moreover, if we compare types of different objects
echo '{"a":{"s":"foo"},"b":{"x":1,"y":2}}' | jq '(.a|type)==(.b|type)'
we get “object” here for each type and thus the result:
true
i.e., they match even though their underlying shape is different.
With `super`, of course, these are different super-structured types, so the result is false, e.g.,
echo '{"a":{"s":"foo"},"b":{"x":1,"y":2}}' |
super -z -c 'yield typeof(a)==typeof(b)' -
produces
false
Sample
Sometimes you’d like to see a sample value of each shape, not its type.
This is easy to do with the `any` aggregate function, e.g.,
echo '{x:1,y:2}{s:"foo"}{x:3,y:4}' |
super -z -c 'val:=any(this) by typeof(this) | sort val | yield val' -
produces
{s:"foo"}
{x:1,y:2}
We like this pattern so much that there is a shortcut `sample` operator, e.g.,
echo '{x:1,y:2}{s:"foo"}{x:3,y:4}' | super -z -c 'sample this | sort this' -
emits the same result:
{s:"foo"}
{x:1,y:2}
Fuse
Sometimes JSON data can get really messy with lots of variations in fields, with null values appearing sometimes and sometimes not, and with the same fields having different data types. Most annoyingly, when you see a JSON object like this in isolation:
{a:1,b:null}
you have no idea what the expected data type of `b` will be. Maybe it’s another number? Or maybe a string? Or maybe an array or an embedded object?
`super` and JSUP don’t have this problem because every value (even `null`) is comprehensively typed. However, `super` in fact must deal with this thorny problem when reading JSON and converting it to super-structured data.
This is where you might have to spend a little bit of time coding up the right query logic to disentangle a JSON mess. But once the data is cleaned up, you can leave it in a super-structured format and not worry again.
To do so, the `fuse` operator comes in handy. Let’s say you have this sequence of data:
{a:1,b:null}
{a:null,b:[2,3,4]}
As we said, you can’t tell by looking at either value what the types of both `a` and `b` should be. But if you merge the values into a common type, things begin to make sense, e.g.,
echo '{a:1,b:null}{a:null,b:[2,3,4]}' | super -z -c fuse -
produces this transformed and comprehensively-typed JSUP output:
{a:1,b:null([int64])}
{a:null(int64),b:[2,3,4]}
Now you can see all the detail.
This turns out to be so useful, especially with large amounts of messy input data, that you will often find yourself fusing data and then sampling it, e.g.,
echo '{a:1,b:null}{a:null,b:[2,3,4]}' | super -Z -c 'fuse | sample' -
produces a comprehensively-typed sample:
{
    a: 1,
    b: null ([int64])
}
As you explore data in this fashion, you will often type various searches to slice and dice the data as you get a feel for it, all while sending your interactive search results to `fuse | sample`.
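For example, a hypothetical step in such a session (the search term and file name here are made up for illustration) might look like:
super -Z -c '? timeout | fuse | sample' logs.bsup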
To appreciate all this, let’s have a look next at some real-world data…
Real-world GitHub Data
Now that we’ve covered the basics of `super` and its query language, let’s use the query patterns from above to explore some GitHub data.
First, we need to grab the data. You can use `curl` for this, or you can just use `super`, as it can take URLs in addition to file name arguments. This command will grab descriptions of the first 30 PRs created in the public `super` repository and place them in a file called `prs.json`:
super -f json \
https://api.github.com/repos/brimdata/super/pulls\?state\=all\&sort\=desc\&per_page=30 \
> prs.json
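Alternatively, assuming a typical `curl` installation, the equivalent download looks like:
curl -s "https://api.github.com/repos/brimdata/super/pulls?state=all&sort=desc&per_page=30" > prs.json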
✵ Note ✵
As we get into the exercise below, we’ll reach a step where we encounter some unexpected empty objects in the original data. It seems the GitHub API must have been having a bad day when we first ran this exercise, as these empty records no longer appear if the download is repeated today using the same URL shown above. But taming glitchy data is a big part of data discovery, so to relive the magic of our original experience, you can download this archived copy of the `prs.json` we originally saw.
Now that you have this JSON file on your local file system, how would you query it with `super`?
Data Discovery
Before you can do anything, you need to know the structure of the data, but you generally don’t know anything after pulling some random data from an API.
So, let’s poke around a bit and figure it out. This process of data introspection is often called data discovery.
You could start by using `jq` to pretty-print the JSON data,
jq . prs.json
That’s 10,592 lines. Ugh, quite a challenge to sift through.
Instead, let’s start out by figuring out how many values are in the input, e.g.,
super -f text -c 'count()' prs.json
produces
1
Hmm, there’s just one value. It’s probably a big JSON array, but let’s check with the `kind` function, and as expected:
super -z -c 'kind(this)' prs.json
produces
"array"
Ok, got it. But how many items are in the array?
super -z -c 'len(this)' prs.json
produces
30
Of course! We asked GitHub to return 30 items and the API returns the pull-request objects as elements of one array representing a single JSON value.
Let’s see what sorts of things are in this array. Here, we need to enumerate the items from the array and do something with them. So how about we use the `over` operator to traverse the array and count the array items by their “kind”,
super -z -c 'over this | count() by kind(this)' prs.json
produces
{kind:"record",count:30(uint64)}
Ok, they’re all records. Good, this should be easy!
The records were all originally JSON objects. Maybe we can just use “sample” to have a deeper look…
super -Z -c 'over this | sample' prs.json
✵ Tip ✵
Here we are using `-Z`, which is like `-z`, but instead of formatting each JSUP value on its own line, it pretty-prints with vertical formatting like `jq` does for JSON.
Ugh, that output is still pretty big. It’s not 10k lines but it’s still more than 700 lines of pretty-printed JSUP.
Ok, maybe it’s not so bad. Let’s check how many shapes there are with `sample`…
super -z -c 'over this | sample | count()' prs.json
produces
3(uint64)
All that data across the samples and only three shapes. They must each be really big. Let’s check that out.
We can use the `len` function on the records to see the size of each of the three sampled records:
super -z -c 'over this | sample | len(this) | sort this' prs.json
and we get
0
36
36
Ok, this isn’t so bad… two shapes each have 36 fields but one is length zero?! That outlier could only be the empty record. Let’s check:
super -z -c 'over this | sample | len(this)==0' prs.json
produces
{}
Sure enough, there it is. We can also double-check with `jq` that there are blank records in the GitHub results:
jq '.[] | select(length==0)' prs.json
produces
{}
{}
Try opening your editor on that JSON file to look for the empty objects. Who knows why they are there? No fun. Real-world data is messy.
How about we fuse the 3 shapes together and have a look at the result:
super -Z -c 'over this | fuse | sample' prs.json
We won’t display the result here as it’s still pretty big. But you can give it a try. It’s 379 lines.
But let’s break down what’s taking up all this space.
We can take the output from `fuse | sample` and list the fields and their “kind”. Note that when we do an `over this` with records as input, we get a new record value for each field, structured as a key/value pair:
super -f table -c '
over this
| fuse
| sample
| over this
| {field:key[0],kind:kind(value)}
' prs.json
produces
field kind
url primitive
id primitive
node_id primitive
html_url primitive
diff_url primitive
patch_url primitive
issue_url primitive
number primitive
state primitive
locked primitive
title primitive
user record
body primitive
created_at primitive
updated_at primitive
closed_at primitive
merged_at primitive
merge_commit_sha primitive
assignee primitive
assignees array
requested_reviewers array
requested_teams array
labels array
milestone primitive
draft primitive
commits_url primitive
review_comments_url primitive
review_comment_url primitive
comments_url primitive
statuses_url primitive
head record
base record
_links record
author_association primitive
auto_merge primitive
active_lock_reason primitive
With this list of top-level fields, we can easily explore the different pieces of their structure with `sample`. Let’s have a look at a few of the record fields by giving these one-liners each a try and looking at the output:
super -Z -c 'over this | sample head' prs.json
super -Z -c 'over this | sample base' prs.json
super -Z -c 'over this | sample _links' prs.json
While these fields have some useful information, we’ll decide to drop them here and focus on other top-level fields. To do this, we can use the `drop` operator to whittle down the data:
super -Z -c 'over this | fuse | drop head,base,_links | sample' prs.json
Ok, this looks more reasonable and is now only 120 lines of pretty-printed JSUP.
One more annoying detail here about JSON: time values are stored as strings, in this case in ISO format. We can pull one of these values out with this query:
super -z -c 'over this | head 1 | yield created_at' prs.json
which produces this string:
"2019-11-11T19:50:46Z"
Since the super data model has a native `time` type and we might want to do native date comparisons on these time fields, we can easily translate the string to a time with a cast, e.g.,
super -z -c 'over this | head 1 | yield time(created_at)' prs.json
produces the native time value:
2019-11-11T19:50:46Z
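Once cast, such values support native time comparisons. For instance, a sketch like the following (output omitted; it assumes a bare time literal is allowed in the comparison and uses the implied filtering shown earlier) would count the PRs created after a given instant:
super -z -c 'over this | time(created_at) > 2019-11-15T00:00:00Z | count()' prs.json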
To be sure, you can check any value’s type with the `typeof` function, e.g.,
super -z -c 'over this | head 1 | yield time(created_at) | typeof(this)' prs.json
produces the expected type:
<time>
Cleaning up the Messy JSON
Okay, now that we’ve explored the data, we have a sense of it and can “clean it up” with some transformative queries. We’ll do this one step at a time, then put it all together.
First, let’s get rid of the outer array by turning its elements into a sequence of records, filter out the empty records, and fuse the rest:
super -c 'over this | len(this) != 0 | fuse' prs.json > prs1.bsup
We can check that this worked with `count`:
super -z -c 'count()' prs1.bsup
super -z -c 'sample | count()' prs1.bsup
produces
{count:28(uint64)}
{count:1(uint64)}
Okay, good. There are 28 values (the 30 requested less the two empty records) and exactly one shape since the data was fused.
Now, let’s drop the fields we aren’t interested in:
super -c 'drop head,base,_links' prs1.bsup > prs2.bsup
Finally, let’s clean up those dates. To track down all the candidates, we can run this query to group field names by their type and limit the output to primitive types:
super -z -c '
over this
| kind(value)=="primitive"
| fields:=union(key[0]) by type:=typeof(value)
' prs2.bsup
which gives
{type:<string>,fields:|["url","body","state","title","node_id","diff_url","html_url","closed_at","issue_url","merged_at","patch_url","created_at","updated_at","commits_url","comments_url","statuses_url","merge_commit_sha","author_association","review_comment_url","review_comments_url"]|}
{type:<int64>,fields:|["id","number"]|}
{type:<bool>,fields:|["draft","locked"]|}
{type:<null>,fields:|["assignee","milestone","auto_merge","active_lock_reason"]|}
✵ Note ✵
This use of `over` traverses each record and generates a key-value pair for each field in each record.
Looking through the fields that are strings, the candidates for ISO dates appear to be `closed_at`, `merged_at`, `created_at`, and `updated_at`.
You can do a quick check of the theory by running…
super -z -c '{closed_at,merged_at,created_at,updated_at}' prs2.bsup
and you will get strings that are all ISO dates:
{closed_at:"2019-11-11T20:00:22Z",merged_at:"2019-11-11T20:00:22Z",created_at:"2019-11-11T19:50:46Z",updated_at:"2019-11-11T20:00:25Z"}
{closed_at:"2019-11-11T21:00:15Z",merged_at:"2019-11-11T21:00:15Z",created_at:"2019-11-11T20:57:12Z",updated_at:"2019-11-11T21:00:26Z"}
...
To fix those strings, we simply transform the fields in place using the (implied) `put` operator and redirect the final output as BSUP to the file `prs.bsup`:
super -c '
closed_at:=time(closed_at),
merged_at:=time(merged_at),
created_at:=time(created_at),
updated_at:=time(updated_at)
' prs2.bsup > prs.bsup
We can check the result with our type analysis:
super -z -c '
over this
| kind(value)=="primitive"
| fields:=union(key[0]) by type:=typeof(value)
| sort type
' prs.bsup
which now gives:
{type:<int64>,fields:|["id","number"]|}
{type:<time>,fields:|["closed_at","merged_at","created_at","updated_at"]|}
{type:<bool>,fields:|["draft","locked"]|}
{type:<string>,fields:|["url","body","state","title","node_id","diff_url","html_url","issue_url","patch_url","commits_url","comments_url","statuses_url","merge_commit_sha","author_association","review_comment_url","review_comments_url"]|}
{type:<null>,fields:|["assignee","milestone","auto_merge","active_lock_reason"]|}
and we can see that the date fields are correctly typed as type `time`!
✵ Note ✵
We sorted the output values here using the `sort` operator to produce a consistent output order, since aggregations can be run in parallel to achieve scale and do not guarantee their output order.
Putting It All Together
Instead of running each step above into a temporary file, we can put all the transformations together in a single pipeline, where the full query text might look like this:
over this // traverse the array of objects
| len(this) != 0 // skip empty objects
| fuse // fuse objects into records of a combined type
| drop head,base,_links // drop fields that we don't need
| closed_at:=time(closed_at), // transform string dates to type time
merged_at:=time(merged_at),
created_at:=time(created_at),
updated_at:=time(updated_at)
✵ Note ✵
The `//` syntax indicates a single-line comment.
We can then put this in a file, called say `transform.spq`, and use the `-I` argument to run all the transformations in one fell swoop:
super -I transform.spq prs.json > prs.bsup
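As a quick sanity check, the resulting file should hold the same 28 fused records we produced step by step above, e.g.,
super -z -c 'count()' prs.bsup
produces
{count:28(uint64)}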
Running Analytics
Now that we’ve cleaned up our data, we can reliably and easily run analytics on the finalized BSUP file `prs.bsup`.
Super-structured data gives us the best of both worlds of JSON and relational tables: we have the structure and clarity of the relational model while retaining the flexibility of JSON’s document model. No need to create tables then issue SQL insert commands to put your clean data into all the right places.
Let’s start with something simple. How about we output a “PR Report” listing the title of each PR along with its PR number and creation date:
super -f table -c '{DATE:created_at,NUMBER:f"PR #{number}",TITLE:title}' prs.bsup
and you’ll see this output…
DATE NUMBER TITLE
2019-11-11T19:50:46Z PR #1 Make "make" work in zq
2019-11-11T20:57:12Z PR #2 fix install target
2019-11-11T23:24:00Z PR #3 import github.com/looky-cloud/lookytalk
2019-11-12T16:25:46Z PR #5 Make zq -f work
2019-11-12T16:49:07Z PR #6 a few clarifications to the zson spec
...
Note that we used a formatted string literal to convert the field `number` into a string and format it with surrounding text.
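Formatted string literals work in any expression context; here’s a minimal standalone sketch (with a made-up PR number):
echo '{number:101}' | super -z -c 'yield f"PR #{number}"' -
produces
"PR #101"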
Instead of old PRs, we can get the latest list of PRs using the `tail` operator, since we know the data is sorted chronologically. This command retrieves the last five PRs in the dataset:
super -f table -c '
tail 5
| {DATE:created_at,"NUMBER":f"PR #{number}",TITLE:title}
' prs.bsup
and the output is:
DATE NUMBER TITLE
2019-11-18T22:14:08Z PR #26 ndjson writer
2019-11-18T22:43:07Z PR #27 Add reader for ndjson input
2019-11-19T00:11:46Z PR #28 fix TS_ISO8601, TS_MILLIS handling in NewRawAndTsFromJSON
2019-11-19T21:14:46Z PR #29 Return count of "dropped" fields from zson.NewRawAndTsFromJSON
2019-11-20T00:36:30Z PR #30 zval.sizeBytes incorrect
How about some aggregations? We can count the number of PRs opened by each user and sort with the highest count first:
super -z -c "count() by user:=user.login | sort count desc" prs.bsup
produces
{user:"mattnibs",count:10(uint64)}
{user:"aswan",count:7(uint64)}
{user:"mccanne",count:6(uint64)}
{user:"nwt",count:4(uint64)}
{user:"henridf",count:1(uint64)}
How about getting a list of all of the reviewers? To do this, we need to traverse the records in the `requested_reviewers` array and collect up the `login` field from each record:
super -z -c 'over requested_reviewers | collect(login)' prs.bsup
Oops, this gives us an array of the reviewer logins with repetitions, since `collect` collects each item that it encounters into an array:
["mccanne","nwt","henridf","mccanne","nwt","mccanne","mattnibs","henridf","mccanne","mattnibs","henridf","mccanne","mattnibs","henridf","mccanne","nwt","aswan","henridf","mccanne","nwt","aswan","philrz","mccanne","mccanne","aswan","henridf","aswan","mccanne","nwt","aswan","mikesbrown","henridf","aswan","mattnibs","henridf","mccanne","aswan","nwt","henridf","mattnibs","aswan","aswan","mattnibs","aswan","henridf","aswan","henridf","mccanne","aswan","aswan","mccanne","nwt","aswan","henridf","aswan"]
What we’d prefer is a set of reviewers where each reviewer appears only once. This is easily done with the `union` aggregate function (not to be confused with union types), which computes the set-wise union of its input and produces a `set` type as its output. In this case, the output is a set of strings, written `|[string]|` in the query language. For example:
super -z -c 'over requested_reviewers | reviewers:=union(login)' prs.bsup
produces
{reviewers:|["nwt","aswan","philrz","henridf","mccanne","mattnibs","mikesbrown"]|}
Ok, that’s pretty neat.
Let’s close with an analysis that’s a bit more sophisticated. Suppose we want to look at the reviewers that each user tends to ask for. We can think about this question as a “graph problem” where the user requesting reviews is one node in the graph and each set of reviewers is another node.
So as a first step, let’s figure out how to create each edge, where an edge is a relation between the requesting user and the set of reviewers. We can create this with a “lateral subquery”. Instead of computing a set-union over all the reviewers across all PRs, we instead want to compute the set-union over the reviewers in each PR. We can do this as follows:
super -z -c 'over requested_reviewers => ( reviewers:=union(login) )' prs.bsup
which produces an output like this:
{reviewers:|["nwt","mccanne"]|}
{reviewers:|["nwt","henridf","mccanne"]|}
{reviewers:|["mccanne","mattnibs"]|}
{reviewers:|["henridf","mccanne","mattnibs"]|}
{reviewers:|["henridf","mccanne","mattnibs"]|}
...
Note that the syntax `=> ( ... )` defines a lateral scope where any subquery can run in isolation over the input values created from the sequence of values traversed by the outer `over`.
But we need a “graph edge” between the requesting user and the reviewers. To do this, we need to reference the `user.login` from the top-level scope within the lateral scope. This can be done by bringing that value into the scope using a `with` clause appended to the `over` expression and yielding a record literal with the desired value:
super -z -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
| {user,reviewers}
)
| sort user,len(reviewers)
' prs.bsup
which gives us
{user:"aswan",reviewers:|["mccanne"]|}
{user:"aswan",reviewers:|["nwt","mccanne"]|}
{user:"aswan",reviewers:|["nwt","henridf","mccanne"]|}
{user:"aswan",reviewers:|["henridf","mccanne","mattnibs"]|}
{user:"aswan",reviewers:|["henridf","mccanne","mattnibs"]|}
{user:"henridf",reviewers:|["nwt","aswan","mccanne"]|}
{user:"mattnibs",reviewers:|["aswan","mccanne"]|}
{user:"mattnibs",reviewers:|["aswan","henridf"]|}
...
The final step is to simply aggregate the “reviewer sets” with the `user` field as the grouping key:
super -Z -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
| {user,reviewers}
)
| groups:=union(reviewers) by user
| sort user,len(groups)
' prs.bsup
and we get
{
    user: "aswan",
    groups: |[
        |[
            "mccanne"
        ]|,
        |[
            "nwt",
            "mccanne"
        ]|,
        |[
            "nwt",
            "henridf",
            "mccanne"
        ]|,
        |[
            "henridf",
            "mccanne",
            "mattnibs"
        ]|
    ]|
}
{
    user: "henridf",
    groups: |[
        |[
            "nwt",
            "aswan",
            "mccanne"
        ]|
    ]|
}
{
    user: "mattnibs",
    groups: |[
        |[
            "aswan",
            "henridf"
        ]|,
        |[
            "aswan",
            "mccanne"
        ]|,
        |[
            "aswan",
            "henridf",
            "mccanne"
        ]|,
        |[
            "nwt",
            "aswan",
            "henridf",
            "mccanne"
        ]|,
        |[
            "nwt",
            "aswan",
            "mccanne",
            "mikesbrown"
        ]|,
        |[
            "nwt",
            "aswan",
            "philrz",
            "henridf",
            "mccanne"
        ]|
    ]|
}
{
    user: "mccanne",
    groups: |[
        |[
            "nwt"
        ]|,
        |[
            "aswan"
        ]|,
        |[
            "mattnibs"
        ]|
    ]|
}
{
    user: "nwt",
    groups: |[
        |[
            "aswan"
        ]|,
        |[
            "aswan",
            "mattnibs"
        ]|,
        |[
            "henridf",
            "mattnibs"
        ]|,
        |[
            "mccanne",
            "mattnibs"
        ]|
    ]|
}
After a quick glance here, you can tell that `mccanne` looks for very targeted reviews while `mattnibs` casts a wide net, at least for the PRs from the beginning of the repo.
To quantify this concept, we can easily modify this query to compute the average number of reviewers requested instead of the set of groups of reviewers. To do this, we just average the reviewer set size with an aggregation:
super -z -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
| {user,reviewers}
)
| avg_reviewers:=avg(len(reviewers)) by user
| sort avg_reviewers
' prs.bsup
which produces
{user:"mccanne",avg_reviewers:1.}
{user:"nwt",avg_reviewers:1.75}
{user:"aswan",avg_reviewers:2.4}
{user:"mattnibs",avg_reviewers:2.9}
{user:"henridf",avg_reviewers:3.}
Of course, if you’d like the query output in JSON, you can just say `-j` and `super` will happily format the sets as JSON arrays, e.g.,
super -j -c '
over requested_reviewers with user=user.login => (
reviewers:=union(login)
| {user,reviewers}
)
| groups:=union(reviewers) by user
| sort user,len(groups)
' prs.bsup
produces
{"user":"aswan","groups":[["mccanne"],["nwt","mccanne"],["nwt","henridf","mccanne"],["henridf","mccanne","mattnibs"]]}
{"user":"henridf","groups":[["nwt","aswan","mccanne"]]}
{"user":"mattnibs","groups":[["aswan","henridf"],["aswan","mccanne"],["aswan","henridf","mccanne"],["nwt","aswan","henridf","mccanne"],["nwt","aswan","mccanne","mikesbrown"],["nwt","aswan","philrz","henridf","mccanne"]]}
{"user":"mccanne","groups":[["nwt"],["aswan"],["mattnibs"]]}
{"user":"nwt","groups":[["aswan"],["aswan","mattnibs"],["henridf","mattnibs"],["mccanne","mattnibs"]]}
Key Takeaways
So to summarize, we gave you a tour here of how `super` and the super data model provide a powerful way to do search, transformation, and analytics in a structured way on data that begins its life as semi-structured JSON and is transformed into the powerful super-structured format, all without having to create relational tables and schemas.
As you can see, `super` is a general-purpose tool that you can add to your bag of tricks to:
- explore messy and confusing JSON data using shaping and sampling,
- transform JSON data in ad hoc ways, and
- develop transform logic for hitting APIs like the GitHub API to produce clean data for analysis by `super` or even export into other systems or for testing.
If you’d like to learn more, feel free to read through the language docs in depth or see how you can organize data into a lake using a git-like commit model.