Output
Many commands produce output, which always originates in super-structured form, but can be serialized into a number of supported formats. The super-structured formats are generally preferred because they retain the full richness of the super-structured data model.
Output is written to standard output by default or, if -o is specified,
to the indicated file or directory.
When writing to stdout and stdout is a terminal, the default
output format is SUP.
Otherwise, the default format is BSUP.
These defaults may be overridden with -f, -s, or -S.
Note
While BSUP is currently the default, a forthcoming release will make CSUP the default once CSUP supports streaming.
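If you want the columnar format right away, you can always request it explicitly. As a minimal sketch, assuming csup is the -f name for the CSUP format, something like this should write a CSUP file:
super -f csup -o out.csup -c 'values 1, 2, 3'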
Since SUP is a common format choice for interactive use,
the -s flag is shorthand for -f sup.
Also, -S is a shortcut for -f sup with -pretty 2 as
described below.
And since plain JSON is another common format choice, the -j flag
is a shortcut for -f json and -J is a shortcut for pretty-printing JSON.
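For instance, the following two commands should be equivalent, each printing the (illustrative) one-record input as SUP text:
echo '{a:1}' | super -s -
echo '{a:1}' | super -f sup -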
Note
Having the default output format depend on the terminal status causes an occasional surprise (e.g., forgetting
-f or -s in a scripted test that works fine on the command line but fails in CI). However, it avoids the performance problem of a data pipeline deployed to production accidentally using SUP instead of CSUP; since super gracefully handles any input, such a mistake would be hard to detect. Alternatively, making CSUP always the default would cause much annoyance when binary data is written to the terminal.
If no query is specified with -c, the inputs are scanned without modification
and output in the specified format,
providing a convenient means to convert files from one format to another, e.g.,
super -f arrows -o out.arrows file1.json file2.parquet file3.csv
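Since super generally detects the input format automatically, the converted file can then be read back like any other input; for example, assuming out.arrows was written as above, this should display its contents as SUP text:
super -s out.arrows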
Pretty Printing
SUP and plain JSON text may be “pretty printed” with the -pretty option, which takes
the number of spaces to use for indentation. As this is a common option,
the -S option is a shortcut for -f sup -pretty 2 and -J is a shortcut
for -f json -pretty 2.
For example,
echo '{a:{b:1,c:[1,2]},d:"foo"}' | super -S -
produces
{
  a: {
    b: 1,
    c: [
      1,
      2
    ]
  },
  d: "foo"
}
and
echo '{a:{b:1,c:[1,2]},d:"foo"}' | super -f sup -pretty 4 -
produces
{
    a: {
        b: 1,
        c: [
            1,
            2
        ]
    },
    d: "foo"
}
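Plain JSON pretty-printing works the same way; as a sketch, -J should behave like -f json -pretty 2, so
echo '{a:{b:1},d:"foo"}' | super -J -
should produce something like
{
  "a": {
    "b": 1
  },
  "d": "foo"
}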
When pretty printing, colorization is enabled by default when writing to a terminal,
and can be disabled with -color false.
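For example, to capture uncolored pretty-printed output even when running at a terminal, something like this should work:
echo '{a:1}' | super -S -color false -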
Pipeline-friendly Formats
Though they’re compressed formats, CSUP and BSUP data are self-describing and stream-oriented and thus are pipeline friendly.
Since the data is self-describing, you can simply take the super-structured output of one command and pipe it to the input of another. It doesn’t matter if the value sequence is scalars, complex types, or records. There is no need to declare or register schemas or “protos” with the downstream entities.
In particular, super-structured data can simply be concatenated together, e.g.,
super -f bsup -c 'values 1, [1,2,3]' > a.bsup
super -f bsup -c "values {s:'hello'}, {s:'world'}" > b.bsup
cat a.bsup b.bsup | super -s -
produces
1
[1,2,3]
{s:"hello"}
{s:"world"}
Schema-rigid Outputs
Certain data formats like Arrow and Parquet are schema rigid in the sense that they require a schema to be defined before values can be written into the file, and all the values in the file must conform to this schema.
SuperDB, however, has a fine-grained type system instead of schemas, so a sequence of data values is completely self-describing and may be heterogeneous in nature. This creates a challenge when converting the type-flexible super-structured formats to a schema-rigid format like Arrow or Parquet.
For example, this seemingly simple conversion:
echo '{x:1}{s:"hello"}' | super -o out.parquet -f parquet -
causes this error
parquetio: encountered multiple types (consider 'fuse'): {x:int64} and {s:string}
To write heterogeneous data to a schema-based file format, the data must
first be converted to a monolithic type. You can either fuse
the data into a single merged type or specify
the -split flag to indicate a destination directory that receives
a separate output file for each output type.
Fused Data
The fuse operator uses type fusion to merge different record types into a blended type, e.g.,
echo '{x:1}{s:"hello"}' | super -o out.parquet -f parquet -c fuse -
super -s out.parquet
which produces
{x:1,s:null::string}
{x:null::int64,s:"hello"}
The downside of this approach is that the data must be changed (by inserting nulls) to conform to a single type.
Also, data fusion can sometimes involve sum types that are not representable in a format like Parquet. While a bit cumbersome, you could write a query that adjusts the output by renaming columns so that heterogeneous column types are avoided. This modified data could then be fused without sum types and output to Parquet.
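Since fuse is just an operator, you can also preview its result as SUP text before attempting a Parquet conversion, which makes it easy to spot inserted nulls or sum types up front, e.g.,
echo '{x:1}{s:"hello"}' | super -s -c fuse -
should print the same fused values shown above.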
Splitting Schemas
An alternative approach to the schema-rigid limitation of Arrow and Parquet is to create a separate file for each schema.
super can do this too with its -split option, which specifies a path
to a directory for the output files. If the path is ., then files
are written to the current directory.
The files are named using the -o option as a prefix; the suffix is
-<n>.<ext>, where <ext> is determined from the output format and
<n> is a unique integer for each distinct output file.
For example, the heterogeneous input shown above would produce two output files, which can then be read separately to reproduce the original data, e.g.,
echo '{x:1}{s:"hello"}' | super -o out -split . -f parquet -
super -s out-*.parquet
produces the original data
{x:1}
{s:"hello"}
While the -split option is most useful for schema-rigid formats, it can
be used with any output format.
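For instance, the same approach works with a text format. Assuming the sup output format uses a .sup file extension for split files, something like this should write one file per distinct type and read them back together:
echo '{x:1}{s:"hello"}' | super -o out -split . -f sup -
super -s out-*.sup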
Database Metadata
TODO: We should get rid of this. Or document it as an internal format. It’s not a format that people should rely upon.
The db format is used to pretty-print lake metadata, such as in
super db sub-command outputs. Because it’s super db’s default output format,
it’s rare to request it explicitly via -f. However, since super db
can generate output in any supported format, the db format is useful
for converting such output back to its familiar pretty-printed form.
For example, imagine you’d executed a meta-query via
super db query -S "from :pools" and saved the output in the file pools.sup:
{
  ts: 2024-07-19T19:28:22.893089Z,
  name: "MyPool",
  id: 0x132870564f00de22d252b3438c656691c87842c2::=ksuid.KSUID,
  layout: {
    order: "desc"::=order.Which,
    keys: [
      [
        "ts"
      ]::=field.Path
    ]::=field.List
  }::=order.SortKey,
  seek_stride: 65536,
  threshold: 524288000
}::=pools.Config
Using super -f db, this can be rendered in the same pretty-printed form as it
would have originally appeared in the output of super db ls, e.g.,
super -f db pools.sup
produces
MyPool 2jTi7n3sfiU7qTgPTAE1nwTUJ0M key ts order desc
Line Format
The line format is convenient for interacting with other Unix-style tooling that
consumes and produces text a line at a time.
When -i line is specified as the input format, each line of input is read
as a value of the string type.
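For example, reading plain text a line at a time should yield one string value per line, e.g.,
printf 'hello\nworld\n' | super -i line -s -
should produce
"hello"
"world"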
When -f line is specified as the output format, each value is formatted
on its own line. String values are printed as is, with escape sequences
rendered as their native characters in the output, e.g.,
| Escape Sequence | Rendered As |
|---|---|
| \n | Newline |
| \t | Horizontal tab |
| \\ | Backslash |
| \" | Double quote |
| \r | Carriage return |
| \b | Backspace |
| \f | Form feed |
| \u | Unicode escape (e.g., \u0041 for A) |
Non-string values are formatted as SUP.
For example:
echo '"hi" "hello\nworld" { time_elapsed: 86400s }' | super -f line -
produces
hi
hello
world
{time_elapsed:1d}
Because embedded newlines create multi-line output with -f line, reading that
output back with -i line can alter the sequence of values, e.g.,
super -c "values 'foo\nbar' | count()"
results in 1 but
super -f line -c "values 'foo\nbar'" | super -i line -c "count()" -
results in 2.