Queries
The syntactical structure of a query consists of
- an optional concatenation of declarations, followed by
- a sequence of pipe operators
separated by a pipe symbol (| or |>).
Any valid SQL query may appear as a pipe operator and thus be embedded in a pipe query. A SQL query expressed as a pipe operator is called a SQL operator.
Operator sequences may be parenthesized and nested to form lexical scopes.
Operators utilize expressions in composable variations to perform their computations and all expressions share a common expression syntax. While operators consume a sequence of values, the expressions embedded within an operator are typically evaluated once for each value processed by the operator.
Scope
A scope is formed by enclosing a set of declarations along with an operator sequence in parentheses, with the structure:
(
<declarations>
<operators>
)
Scope blocks may appear
- anywhere a pipe operator may appear,
- as a subquery in an expression, or
- as the body of a declared operator.
The parenthesized block forms a lexical scope, and the bindings created by declarations within the scope are reachable only within that scope, including any scopes nested inside it.
A declaration cannot redefine an identifier that was previously defined in the same scope but can override identifiers defined in ancestor scopes.
The topmost scope is the global scope, which is not enclosed in parentheses; identifiers declared there are available everywhere.
Note that this lexical scope refers only to the declared identifiers. Scoping of references to data input is defined by pipe scoping and relational scoping.
For example, this constant declaration
const PI=3.14
values PI
emits the value 3.14 whereas
(
const PI=3.14
values PI
)
| values this+PI
emits error("missing") because the second reference to PI is not
in the scope of the declared constant and thus the identifier is interpreted
as a field reference this.PI via pipe scoping.
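The declaration rules above can be modeled outside of SuperSQL. The following Python sketch is only an illustration and not how any engine implements scoping: the `Scope` class and its method names are hypothetical. It shows that redefinition in the same scope is rejected, that a nested scope may override an ancestor's binding, and that a binding is not visible from a scope that does not enclose it.

```python
class Scope:
    """Toy model of a lexical scope holding declared identifiers."""

    def __init__(self, parent=None):
        self.parent = parent
        self.bindings = {}

    def declare(self, name, value):
        # Redefining an identifier within the same scope is rejected...
        if name in self.bindings:
            raise ValueError(f"{name!r} is already declared in this scope")
        # ...but overriding a binding from an ancestor scope is allowed.
        self.bindings[name] = value

    def lookup(self, name):
        # Search this scope first, then each enclosing scope in turn.
        scope = self
        while scope is not None:
            if name in scope.bindings:
                return scope.bindings[name]
            scope = scope.parent
        raise KeyError(name)


global_scope = Scope()
global_scope.declare("PI", 3.14)

inner = Scope(parent=global_scope)
inner.declare("PI", 3.14159)  # overrides the ancestor binding

sibling = Scope(parent=global_scope)  # sees only the global binding
```

In this model an unresolved name raises `KeyError`. In a query, as the example above shows, an identifier that is not in scope is instead interpreted as a field reference.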
Identifiers
Identifiers are names that arise in many syntactical structures and
may be any sequence of UTF-8 characters. When not quoted,
an identifier may consist of Unicode letters, $, _,
and digits [0-9], but may not start with a digit.
To express an identifier that does not meet the requirements of an
unquoted identifier, arbitrary text may be quoted inside of backtick (`)
quotes.
Escape sequences in backtick-quoted identifiers are interpreted as in
string literals. In particular, a backtick (`)
character may be included in a backtick string with Unicode escape \u0060.
In SQL expressions, identifiers may also be enclosed in double-quoted strings.
The special value this is also available in SQL but has peculiar semantics
due to SQL scoping rules. To reference a column called this
in a SQL expression, simply use double quotes, i.e., "this".
An unquoted identifier cannot be true, false, null, NaN, or Inf.
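As a rough check of the unquoted-identifier rules, the Python sketch below tests whether a name can appear without backticks and quotes it otherwise. The function names are hypothetical. It assumes that Python's `str.isalpha` is an acceptable stand-in for "Unicode letter" and that a backslash inside a backtick-quoted identifier is escaped as `\\`, as in typical string literals; neither assumption is confirmed by the text above.

```python
# Words that cannot be used as unquoted identifiers, exactly as listed above.
RESERVED = {"true", "false", "null", "NaN", "Inf"}
ASCII_DIGITS = "0123456789"


def is_unquoted_identifier(name: str) -> bool:
    """Return True if name satisfies the unquoted-identifier rules."""
    if not name or name in RESERVED:
        return False
    if name[0] in ASCII_DIGITS:
        return False
    # Assumption: str.isalpha approximates "Unicode letter".
    return all(ch.isalpha() or ch in "$_" or ch in ASCII_DIGITS for ch in name)


def quote_identifier(name: str) -> str:
    """Return name as-is if it is valid unquoted, else backtick-quoted."""
    if is_unquoted_identifier(name):
        return name
    # Assumption: backslashes are escaped as in ordinary string literals.
    escaped = name.replace("\\", "\\\\").replace("`", "\\u0060")
    return "`" + escaped + "`"
```

For instance, `quote_identifier("my field")` yields the backtick-quoted form because of the space, while `quote_identifier("foo_1")` returns the name unchanged.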
Patterns
For ease of use, several operators utilize a syntax for string entities outside of expression syntax where quotation marks for such entities may be conveniently elided.
For example, when sourcing data from a file on the file system, the file path can be expressed as a text entity and need not be quoted:
from file.json | ...
Likewise, in the search operator, the syntax for a
regular expression search can be specified as
search /\w+(foo|bar)/
whereas an explicit function call like regexp must be invoked to utilize
regular expressions in expressions as in
where len(regexp(r'\w+(foo|bar)', this)) > 0
Regular Expression
Regular expressions follow the syntax and semantics of the RE2 regular expression library, which is documented in the RE2 Wiki.
When used in an expression, e.g., as a parameter to a function, the RE2 text is simply passed as a string, e.g.,
regexp('foo|bar', this)
To avoid the escaping that would otherwise be necessary to
represent a regular expression as a string, use a raw string,
which is prefixed with r, e.g.,
regexp(r'\w+(foo|bar)', this)
But when used outside of expressions where an explicit indication of
a regular expression is required (e.g., in a search or from operator),
the RE2 is instead prefixed and suffixed with a /, e.g.,
/foo|bar/
matches the string "foo" or "bar".
Glob
Globs provide a convenient short-hand for regular expressions and follow
the familiar pattern of “file globbing” supported by Unix shells.
Globs are a simple, special case that utilizes only the * wildcard.
Like regular expressions, globs may be used in a search operator or a from operator.
Valid glob characters include letters, digits (excepting the leading character),
any valid string escape sequence
(along with escapes for *, =, +, -), and the unescaped characters:
_ . : / % # @ ~ *
A glob cannot begin with a digit.
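Because a glob is described as shorthand for a regular expression, the relationship can be sketched as a translation. The Python below is an assumption-laden illustration: the function names are hypothetical, it uses Python's `re` module rather than RE2, it assumes a glob must match the whole string, and it does not handle escaped characters such as `\*`.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a glob that uses only the * wildcard into a regex string."""
    # Every literal piece is escaped so that characters like "." stay literal;
    # each * becomes ".*" (any run of characters).
    literal_parts = glob.split("*")
    body = ".*".join(re.escape(part) for part in literal_parts)
    # Assumption: the glob is anchored to the entire string.
    return "^" + body + "$"


def glob_match(glob: str, text: str) -> bool:
    """Return True if text matches the glob under the assumptions above."""
    return re.search(glob_to_regex(glob), text) is not None
```

Under these assumptions `glob_match("*.json", "file.json")` is true, while `glob_match("*.json", "file.jsonl")` is false because the translated pattern is anchored at the end.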
Text Entity
A text entity represents a string where quotes can be omitted for certain common use cases regarding URLs and file paths.
Text entities are syntactically valid as targets of a from operator
and as named arguments to from and the load operator.
Specifically, a text entity is one of:
- a string literal (double quoted, single quoted, or raw string),
- a path consisting of a sequence of characters consisting of letters, digits, _, $, ., and /, or
- a simple URL consisting of a sequence of characters beginning with http:// or https://, followed by dotted strings of letters, digits, -, and _, and in turn optionally followed by / and a sequence of characters consisting of letters, digits, _, $, ., and /.
If a URL does not meet the constraints of the simple URL rule,
e.g., containing a : or &, then it must be quoted.
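The path and simple URL rules can be approximated with two regular expressions. This Python sketch is illustrative only: the names are hypothetical, "letters" is assumed to mean ASCII letters, and the character sequence after the optional / is assumed to be allowed to be empty. It reports whether a given string would need to be written as a quoted string literal.

```python
import re

# A path: one or more letters, digits, _, $, ., or /.
PATH_RE = re.compile(r"[A-Za-z0-9_$./]+")

# A simple URL: scheme, dotted host labels, then an optional /path.
SIMPLE_URL_RE = re.compile(
    r"https?://"
    r"[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*"  # dotted strings of letters, digits, -, _
    r"(?:/[A-Za-z0-9_$./]*)?"               # optional / and path characters
)


def needs_quotes(text: str) -> bool:
    """Return True if text fits neither the path rule nor the simple URL rule."""
    is_path = PATH_RE.fullmatch(text) is not None
    is_simple_url = SIMPLE_URL_RE.fullmatch(text) is not None
    return not (is_path or is_simple_url)
```

With these patterns, `file.json` and `https://example.com/data/file.json` can be left unquoted, whereas a URL with a port or a query string, such as `https://example.com:8080/data`, must be quoted.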
Comments
Single-line comments are SQL style: they begin with two dashes -- and end at the
subsequent newline.
Multi-line comments are C style and begin with /* and end with */.
# spq
values 1, 2 -- , 3
/*
| aggregate sum(this)
*/
| aggregate sum(this / 2.0)
# input
# expected output
1.5