from
source data from databases, files, or URLs
Synopsis
from <entity> [ ( <options> ) ]
from <named-query>
where <entity> has the form of
- a text entity representing a file, URL, or pool name,
- an f-string representing a file, URL, or pool name,
- a glob matching files in the local file system or pool names in a database, or
- a regular expression matching pool names in a database;
<options> is an optional concatenation of named options; and,
<named-query> is an identifier referencing a
declared query.
Description
The from operator identifies one or more sources of data as input to
a query and transmits the data required by the query to its output.
Unlike the FROM clause in a SQL query,
the pipe from merely sources data to its downstream operators
and does not include relational joins or table subqueries.
As a pipe operator,
from preserves the order of the data within a file,
URL, or a sorted pool, but when multiple sources are identified
(e.g., as a file-system glob or regular expression matching pools),
the data may be read in parallel and interleaved in an undefined order.
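Because interleaving across multiple sources is undefined, a downstream sort can impose a deterministic order. A minimal sketch, assuming files a.sup and b.sup exist in the current directory:

```
from *.sup | sort this
```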
Optional arguments to from may be appended as a parenthesized concatenation of named arguments.
Entity Syntax
How the entity is interpreted depends on whether the query is run attached to or detached from a database.
When detached from a database, the entity must be a text entity, f-string, or glob. A glob matches files in the file system, while a text entity or f-string is a URL if it parses as one; otherwise, it is presumed to be a file path.
When attached to a database, the entity must be a text entity, f-string, glob, or a slash-delimited regular expression. A regular expression matches pools in the attached database. A text entity or f-string is a URL if it parses as one; otherwise, it is presumed to be a pool name.
Local files are not accessible when attached to a database.
Note
While pool names and file names have overlapping syntax, their use is disambiguated by the presence or absence of an attached database.
When the entity is an f-string,
the from operator reads data from its upstream pipe operator
and for each input value, the f-string expression is evaluated and
used as the <entity> string argument. Each such entity is scanned
one at a time and the data is fed to the output of from.
When an entity does not exist, a structured error is produced and
the query continues execution.
Options
Options to from may be appended as a parenthesized list of name/value pairs
having the form:
( <name> <value> [ <name> <value> ... ] )
Each entity type supports a specific set of named options as described below. When the entity comprises multiple sources (e.g., with a glob), then the options apply to every entity matched.
Format Detection
When reading data from files or URLs, the serialization format of the
input data is determined by the presence of a
well-known extension
(e.g., .json, .sup, etc.) on the file path or URL,
or if the extension is not present or unknown, the format is
inferred
by inspecting the input data.
To manually specify the format of a source and override these heuristics, a format option may be appended, having the form
format <fmt>
where <fmt> is the name of a supported
serialization format and is
parsed as a text entity.
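For example, the format option can force JSON parsing of a file whose extension would not otherwise trigger detection (the file name data.txt here is hypothetical):

```
from data.txt (format json)
```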
Files
When the <entity> argument is recognized as a file, the file
data required for the query
is read from the local file system, parsed as its specified or
detected serialization format, and emitted to its output.
File-system paths are interpreted relative to the directory in which the super command is running.
The only allowed option for file entities is the format option described above.
Here are some examples of file syntax:
from file.json
from 'file-with-dash.sup'
from /path/to/file.csv
from file*.parq (format parquet)
Pools
When the <entity> argument is recognized as a database pool,
the data required for the query is read from the database and
emitted to its output.
The only allowed option for a pool is the commit option having the form
commit <commitish>
where <commitish> specifies a specific
commit in the pool’s log thereby allowing time travel.
The commit option may be abbreviated by appending to the pool name
an @ character followed by the commitish, e.g.,
from Pool (commit 36AwHUt9s8usF7pi9x3l6LOl8IB)
may instead be written as
from Pool@36AwHUt9s8usF7pi9x3l6LOl8IB
When a single pool name is specified without a commit option, or
when using a regular expression, the tip of the main branch
of each pool is accessed.
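As a sketch of the regular-expression form, a query attached to a database with pools named coinflips and numbers (as in the examples below) could read the main branch of both pools and count all values:

```
from /coinflips|numbers/ | count()
```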
It is an error to specify a format option when the entity is a pool.
Note
Metadata from database pools also may be sourced using
from. This will be documented in a future release of SuperDB.
URLs
Data sources identified by URLs can be accessed either when attached or detached from a database.
As text entities, typical URLs need not be quoted, though URLs with special characters must be quoted.
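For instance, a URL containing a character outside typical URL syntax, such as a space, can be quoted as a text entity (the URL here is hypothetical):

```
from 'https://example.com/reports/2024 summary.json'
```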
When the <entity> argument begins with http: or https:
and has the form of a valid URL, then the source is fetched remotely
using either HTTP or HTTPS.
When the URL begins with s3: then data is fetched via
the Amazon S3 object service using the settings defined
by a local configuration.
Named options for URL entities include format, method, headers,
and body as in
from <url> [ ( format <fmt> method <method> headers <expr> body <string> ) ]
where
- <method> is one of GET, PUT, POST, or DELETE,
- <expr> is a record expression that defines the names and values to be included as HTTP header options, and
- <string> is a text entity to be included as the body of the HTTP request.
Currently, the headers expression must evaluate to a compile-time constant though this may change to allow dynamic computation in a future version of SuperSQL. Each field of this record must either be a string or (to specify a header option appearing multiple times with different values) an array or set of strings.
These options cannot be used with S3 URLs.
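Putting the URL options together, a POST request with a custom header and a request body might be sketched as follows (the endpoint and header value are hypothetical):

```
from https://example.com/api/search (
  format json
  method POST
  headers {Accept:"application/json"}
  body '{"query":"test"}'
)
```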
Combining Data
To combine data from multiple sources using pipe operators, from may be
used in combination with other operators like fork and join.
For example, multiple pools can be accessed in parallel and combined in undefined order:
fork
( from PoolOne | op1 | op2 | ... )
( from PoolTwo | op1 | op2 | ... )
| ...
or joined according to a join condition:
fork
( from PoolOne | op1 | op2 | ... )
( from PoolTwo | op1 | op2 | ... )
| join as {left,right} on left.key=right.key
| ...
Alternatively, the right-hand leg of the join may be written as a subquery of join:
from PoolOne | op1 | op2 | ...
| join ( from PoolTwo | op1 | op2 | ... )
as {left,right} on left.key=right.key
| ...
File Examples
Source structured data from a JSON file
echo '{"greeting":"hello world!"}' > hello.json
super -s -c 'from hello.json | values greeting'
"hello world!"
Source super-structured data from a local file
echo '1 2 {x:1} {s:1::(int64|string)} {s:"hello"::(int64|string)}' > vals.sup
super -s -c 'from vals.sup'
1
2
{x:1}
{s:1::(int64|string)}
{s:"hello"::(int64|string)}
HTTP Example
Source data from a URL
super -s -c 'from https://api.github.com/repos/brimdata/super | values name'
"super"
F-String Example
Read from dynamically defined files and add a column
echo '{a:1}{a:2}' > a.sup
echo '{b:3}{b:4}' > b.sup
echo '"a.sup" "b.sup"' | super -s -c "from f'{this}' | c:=coalesce(a,b)+1" -
{a:1,c:2}
{a:2,c:3}
{b:3,c:4}
{b:4,c:5}
Database Examples
The remaining examples below assume the existence of the SuperDB database created and populated by the following commands:
export SUPER_DB=example
super db -q init
super db -q create -orderby flip:desc coinflips
echo '{flip:1,result:"heads"} {flip:2,result:"tails"}' |
super db load -q -use coinflips -
super db branch -q -use coinflips trial
echo '{flip:3,result:"heads"}' | super db load -q -use coinflips@trial -
super db -q create numbers
echo '{number:1,word:"one"} {number:2,word:"two"} {number:3,word:"three"}' |
super db load -q -use numbers -
super db -f line -c '
from :branches
| values pool.name || "@" || branch.name
| sort'
The database then contains the two pools and three branches:
coinflips@main
coinflips@trial
numbers@main
The following file hello.sup is also used.
{greeting:"hello world!"}
Source data from the main branch of a pool
super db -db example -s -c 'from coinflips'
{flip:2,result:"tails"}
{flip:1,result:"heads"}
Source data from a specific branch of a pool
super db -db example -s -c 'from coinflips@trial'
{flip:3,result:"heads"}
{flip:2,result:"tails"}
{flip:1,result:"heads"}
Count the number of values in the main branch of all pools
super db -db example -s -c 'from * | count()'
5
Join the data from multiple pools
super db -db example -s -c '
from coinflips
| join ( from numbers ) on left.flip=right.number
| values {...left, word:right.word}
| sort'
{flip:1,result:"heads",word:"one"}
{flip:2,result:"tails",word:"two"}
Use pass to combine our join output with data from yet another source
super db -db example -s -c '
from coinflips
| join ( from numbers ) on left.flip=right.number
| values {...left, word:right.word}
| fork
( pass )
( from coinflips@trial
| c:=count()
| values f"There were {c} flips" )
| sort this'
"There were 3 flips"
{flip:1,result:"heads",word:"one"}
{flip:2,result:"tails",word:"two"}