Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

from

🎲  source data from databases, files, or URLs

Synopsis

from <entity> [ ( <options> ) ]
from <named-query>

where <entity> has the form of

  • a text entity representing a file, URL, or pool name,
  • an f-string representing a file, URL, or pool name,
  • a glob matching files in the local file system or pool names in a database, or
  • a regular expression matching pool names in a database;

<options> is an optional concatenation of named options; and,

<named-query> is an identifier referencing a declared query.

Description

The from operator identifies one or more sources of data as input to a query and transmits the data required by the query to its output.

Unlike the FROM clause in a SQL query, the pipe from merely sources data to its downstream operators and does not include relational joins or table subqueries, .

As a pipe operator, from preserves the order of the data within a file, URL, or a sorted pool but when multiple sources are identified (e.g., as a file-system glob or regular expression matching pools), the data may be read in parallel and interleaved in an undefined order.

Optional arguments to from may be appended as a parenthesized concatenation of named arguments.

Entity Syntax

How the entity is interpreted depends on whether the query is run attached to or detached from a database.

When detached from a database, the entity must be a text entity, f-string, or glob. A glob matches files in the file system while a text entity or f-string is an URL if it parses as an URL; otherwise, it is presumed to be a file path.

When attached to a database, the entity must be a text entity, f-string, glob, or a slash-delimitated regular expression. A regular expression matches pools in the attached database. A text entity or f-string is an URL if it parses as an URL and otherwise, is presumed to be a pool name.

Local files are not accessible when attached to a database.

Note

While pool names and file names have overlapping syntax, their use is disambiguated by the presence or absence of an attached database.

When the entity is an f-string, the from operator reads data from its upstream pipe operator and for each input value, the f-string expression is evaluated and used as the <entity> string argument. Each such entity is scanned one at a time and the data is fed to the output of from. When an entity does not exist, a structured error is produced and the query continues execution.

Options

Options to from may be appended as a parenthesized list of name/value pairs having the form:

( <name> <value> [ <name> <value> ... ] )

Each entity type supports a specific set of named options as described below. When the entity comprises multiple sources (e.g., with a glob), then the options apply to every entity matched.

Format Detection

When reading data from files or URLs, the serialization format of the input data is determined by the presence of a well-known extension (e.g., .json, .sup, etc.) on the file path or URL, or if the extension is not present or unknown, the format is inferred by inspecting the input data.

To manually specify the format of a source and override these heuristics, a format argument may be appended as an argument and has the form

format <fmt>

where <fmt> is the name of a supported serialization format and is parsed as a text entity.

Files

When the <entity> argument is recognized as a file, the file data required for the query is read from the local file system, parsed as its specified or detected serialization format, and emitted to its output.

File-system paths are interpreted relative to the directory in which the super command is running.

The only allowed option for file entities is the format option described above.

Here are some examples of file syntax:

from file.json
from 'file-with-dash.sup'
from /path/to/file.csv
from file*.parq (format parquet)

Pools

When the <entity> argument is recognized as a database pool, the data required for the query is ready from the database and emitted to its output.

The only allowed option for a pool is the commit argument having the form

commit <commitsh>

where <commitish> specifies a specific commit in the pool’s log thereby allowing time travel.

The the commit argument may be abbreviated by appending to the pool name an @ character followed by the commitish, e.g.,

from Pool (commit 36AwHUt9s8usF7pi9x3l6LOl8IB)

maybe be instead written as

from Pool@36AwHUt9s8usF7pi9x3l6LOl8IB

When a single pool name is specified without a commit option, or when using a regular expression, the tip of the main branch of each pool is accessed.

It is an error to specify a format option when the entity is is a pool.

Note

Metadata from database pools also may be sourced using from. This will be documented in a future release of SuperDB.

URLs

Data sources identified by URLs can be accessed either when attached or detached from a database.

As a text entity, typical URLs need not be quoted though URLs with special characters must be quoted.

When the <entity> argument begins with http: or https: and has the form of a valid URL, then the source is fetched remotely using either HTTP or HTTPS.

When the URL begins with s3: then data is fetched via the Amazon S3 object service using the settings defined by a local configuration.

Named options for URL entities include format, method, headers, and body as in

from <url> [ ( format <fmt> method <method> headers <expr> body <string> ) ]

where

  • <method> is one of GET, PUT, POST, or DELETE,
  • <expr> is a record expression that defines the names and values to be included as HTTP header options, and
  • <body> is a text-entity string to be included as the body of the HTTP request.

Currently, the headers expression must evaluate to a compile-time constant though this may change to allow dynamic computation in a future version of SuperSQL. Each field of this record must either be a string or (to specify a header option appearing multiple times with different values) an array or set of strings.

These options cannot be used with S3 URLs.

Note

Currently, the headers expression must evaluate to a compile-time constant though this may change to allow run-time evaluation in a future version of SuperSQL.

Combining Data

To combine data from multiple sources using pipe operators, from may be used in combination with other operators like fork and join.

For example, multiple pools can be accessed in parallel and combined in undefined order:

fork
  ( from PoolOne | op1 | op2 | ... )
  ( from PoolTwo | op1 | op2 | ... )
| ...

or joined according to a join condition:

fork
  ( from PoolOne | op1 | op2 | ... )
  ( from PoolTwo | op1 | op2 | ... )
| join as {left,right} on left.key=right.key
| ...

Alternatively, the right-hand leg of the join may be written as a subquery of join:

from PoolOne | op1 | op2 | ...
| join ( from PoolTwo | op1 | op2 | ... )
    as {left,right} on left.key=right.key
| ...

File Examples


Source structured data from a JSON file

echo '{"greeting":"hello world!"}' > hello.json
super -s -c 'from hello.json | values greeting'
"hello world!"

Source super-structured from a local file

echo '1 2 {x:1} {s:1::(int64|string)} {s:"hello"::(int64|string)}' > vals.sup
super -s -c 'from vals.sup'
1
2
{x:1}
{s:1::(int64|string)}
{s:"hello"::(int64|string)}

HTTP Example


Source data from a URL

super -s -c 'from https://api.github.com/repos/brimdata/super | values name'
"super"

F-String Example

Read from dynamically defined files and add a column

echo '{a:1}{a:2}' > a.sup
echo '{b:3}{b:4}' > b.sup
echo '"a.sup" "b.sup"' | super -s -c "from f'{this}' | c:=coalesce(a,b)+1" -
{a:1,c:2}
{a:2,c:3}
{b:3,c:4}
{b:4,c:5}

Database Examples

The remaining examples below assume the existence of the SuperDB database created and populated by the following commands:

export SUPER_DB=example
super db -q init
super db -q create -orderby flip:desc coinflips
echo '{flip:1,result:"heads"} {flip:2,result:"tails"}' |
  super db load -q -use coinflips -
super db branch -q -use coinflips trial
echo '{flip:3,result:"heads"}' | super db load -q -use coinflips@trial -
super db -q create numbers
echo '{number:1,word:"one"} {number:2,word:"two"} {number:3,word:"three"}' |
  super db load -q -use numbers -
super db -f line -c '
  from :branches
  | values pool.name || "@" || branch.name
  | sort'

The database then contains the two pools and three branches:

coinflips@main
coinflips@trial
numbers@main

The following file hello.sup is also used.

{greeting:"hello world!"}

Source data from the main branch of a pool

super db -db example -s -c 'from coinflips'
{flip:2,result:"tails"}
{flip:1,result:"heads"}

Source data from a specific branch of a pool

super db -db example -s -c 'from coinflips@trial'
{flip:3,result:"heads"}
{flip:2,result:"tails"}
{flip:1,result:"heads"}

Count the number of values in the main branch of all pools

super db -db example -s -c 'from * | count()'
5

Join the data from multiple pools

super db -db example -s -c '
  from coinflips
  | join ( from numbers ) on left.flip=right.number
  | values {...left, word:right.word}
  | sort'
{flip:1,result:"heads",word:"one"}
{flip:2,result:"tails",word:"two"}

Use pass to combine our join output with data from yet another source

super db -db example -s -c '
  from coinflips
  | join ( from numbers ) on left.flip=right.number
  | values {...left, word:right.word}
  | fork
    ( pass )
    ( from coinflips@trial
      | c:=count()
      | values f"There were {c} flips" )
  | sort this'
"There were 3 flips"
{flip:1,result:"heads",word:"one"}
{flip:2,result:"tails",word:"two"}