Join Overview

Table of Contents

This is a brief primer on the SuperPipe join operator.

Currently, join is limited in that only equi-join (i.e., a join predicate containing =) is supported.

Example Data

The first input data source for our usage examples is fruit.json, which describes the characteristics of some fresh produce.

{"name":"apple","color":"red","flavor":"tart"}
{"name":"banana","color":"yellow","flavor":"sweet"}
{"name":"avocado","color":"green","flavor":"savory"}
{"name":"strawberry","color":"red","flavor":"sweet"}
{"name":"dates","color":"brown","flavor":"sweet","note":"in season"}
{"name":"figs","color":"brown","flavor":"plain"}

The other input data source is people.json, which describes the traits and preferences of some potential eaters of fruit.

{"name":"morgan","age":61,"likes":"tart"}
{"name":"quinn","age":14,"likes":"sweet","note":"many kids enjoy sweets"}
{"name":"jessie","age":30,"likes":"plain"}
{"name":"chris","age":47,"likes":"tart"}

Inner Join

We’ll start by outputting only the fruits liked by at least one person. The name of the matching person is copied into a field of a different name in the joined results.

Because we’re performing an inner join (the default), the explicit inner is not strictly necessary, but including it clarifies our intention.

The SuperPipe query inner-join.spq:

file fruit.json
|> inner join (
  file people.json
) on flavor=likes eater:=name

Executing the query:

Error:
super -z -I inner-join.spq

produces

Left Join

Note

In some databases a left join is called a left outer join.

By performing a left join that targets the same key fields, now all of our fruits will be shown in the results even if no one likes them (e.g., avocado).

As another variation, we’ll also copy over the age of the matching person. By referencing only the field name rather than using := for assignment, the original field name age is maintained in the results.

The query left-join.spq:

file fruit.json
|> left join (
  file people.json
) on flavor=likes eater:=name,age

Executing the query:

Error:
super -z -I left-join.spq

produces

Right join

Note

In SQL, a right join is called a right outer join.

Next we’ll change the join type from left to right. Notice that this causes the note field from the right-hand input to appear in the joined results.

The query right-join.spq:

file fruit.json
|> right join (
  file people.json
) on flavor=likes fruit:=name

Executing the query:

Error:
super -z -I right-join.spq

produces

Anti join

Note

In some databases an anti join is called a left anti join.

The join type anti allows us to see which fruits are not liked by anyone. Note that with anti join only values from the left-hand input appear in the results.

The query anti-join.spq:

file fruit.json
|> anti join (
  file people.json
) on flavor=likes

Executing the query:

Error:
super -z -I anti-join.spq

produces

Inputs from Pools

In the examples above, we used the file operator to read our respective inputs from named file sources. However, if the inputs are stored in pools in a SuperDB data lake, we would instead specify the sources as data pools using the from operator.

Here we’ll load our input data to pools in a temporary data lake, then execute our inner join using super db query.

The query inner-join-pools.spq:

from fruit
|> inner join (
  from people
) on flavor=likes eater:=name

Populating the pools, then executing the query:

Error:
export SUPER_DB_LAKE=lake
super db init -q
super db create -q -orderby flavor:asc fruit
super db create -q -orderby likes:asc people
super db load -q -use fruit fruit.json
super db load -q -use people people.json
super db query -z -I inner-join-pools.spq

produces

Alternate Syntax

In addition to the syntax shown so far, join supports an alternate syntax in which left and right inputs are specified by the two branches of a preceding fork operator, from operator, or switch operator.

Here we’ll use the alternate syntax to perform the same inner join shown earlier in the Inner Join section.

The query inner-join-alternate.spq:

from (
  file fruit.json
  file people.json
) |> inner join on flavor=likes eater:=name

Executing the query:

Error:
super -z -I inner-join-alternate.spq

produces

Self Joins

In addition to the named files and pools like we’ve used in the prior examples, SuperPipe also works on a single sequence of data that is split and joined to itself. Here we’ll combine our file sources into a stream that we’ll pipe into super via stdin. Because join requires two separate inputs, here we’ll use the has() function inside a switch operator to identify the records in the stream that will be treated as the left and right sides. Then we’ll use the alternate syntax for join to read those two inputs.

The query inner-join-streamed.spq:

switch (
  case has(color) => pass
  case has(age) => pass
) |> inner join on flavor=likes eater:=name

Executing the query:

Error:
cat fruit.json people.json | super -z -I inner-join-streamed.spq -

produces

Multi-value Joins

The equality test in a join accepts only one named key from each input. However, joins on multiple matching values can still be performed by making the values available in comparable complex types, such as embedded records.

To illustrate this, we’ll introduce some new input data inventory.json that represents a vendor’s available quantity of fruit for sale. As the colors indicate, they separately offer both ripe and unripe fruit.

{"name":"banana","color":"yellow","quantity":1000}
{"name":"banana","color":"green","quantity":5000}
{"name":"strawberry","color":"red","quantity":3000}
{"name":"strawberry","color":"white","quantity":6000}

Let’s assume we’re interested in seeing the available quantities of only the ripe fruit in our fruit.json records. In the query multi-value-join.spq, we create the keys as embedded records inside each input record, using the same field names and data types in each. We’ll leave the created fruitkey records intact to show what they look like, but since it represents redundant data, in practice we’d typically drop it after the join in our pipeline.

file fruit.json |> put fruitkey:={name,color}
|> inner join (
  file inventory.json |> put invkey:={name,color}
) on fruitkey=invkey quantity

Executing the query:

Error:
super -z -I multi-value-join.spq

produces

Joining More Than Two Inputs

While the join operator takes only two inputs, more inputs can be joined by extending the pipeline.

To illustrate this, we’ll introduce some new input data in prices.json.

{"name":"apple","price":3.15}
{"name":"banana","price":4.01}
{"name":"avocado","price":2.50}
{"name":"strawberry","price":1.05}
{"name":"dates","price":6.70}
{"name":"figs","price": 1.60}

In our query three-way-join.spq we’ll extend the pipeline we used previously for our inner join by piping its output to an additional join against the price list.

file fruit.json
|> inner join (
  file people.json
) on flavor=likes eater:=name
|> inner join (
  file prices.json
) on name=name price:=price

Executing the query:

Error:
super -z -I three-way-join.spq

produces

Including the entire opposite record

In the current join implementation, explicit entries must be provided in the [field-list] in order to copy values from the opposite input into the joined results (a possible future enhancement super/2815 may improve upon this). This can be cumbersome if your goal is to copy over many fields or you don’t know the names of all desired fields.

One way to work around this limitation is to specify this in the field list to copy the contents of the entire opposite record into an embedded record in the result.

The query embed-opposite.spq:

file fruit.json
|> inner join (
  file people.json
) on flavor=likes eaterinfo:=this

Executing the query:

Error:
super -z -I embed-opposite.spq

produces

If embedding the opposite record is undesirable, the left and right records can easily be merged with the spread operator. Additional processing may be necessary to handle conflicting field names, such as in the example just shown where the name field is used differently in the left and right inputs. We’ll demonstrate this by augmenting embed-opposite.spq to produce merge-opposite.spq.

file fruit.json
|> inner join (
  file people.json
) on flavor=likes eaterinfo:=this
|> rename fruit:=name
|> yield {...this,...eaterinfo}
|> drop eaterinfo

Executing the query:

Error:
super -z -I merge-opposite.spq

produces

SuperDB