Shaping and Type Fusion

Shaping

Data that originates from heterogeneous sources typically has inconsistent structure and is thus difficult to reason about or query. To unify disparate data sources, data is often cleaned up to fit into a well-defined set of schemas, which combines the data into a unified store like a data warehouse.

In Zed, this cleansing process is called “shaping” the data, and Zed leverages its rich, super-structured type system to perform core aspects of data transformation. In a data model with nesting and multiple scalar types (such as Zed or JSON), shaping includes converting the type of leaf fields, adding or removing fields to “fit” a given shape, and reordering fields.

While shaping remains an active area of development, the core functions in Zed that currently perform shaping are:

cast - coerce a value to a different type
crop - remove fields from a value that are missing in a specified type
fill - add null values for missing fields
order - reorder record fields
shape - apply cast, fill, and order

They all have the same signature, taking two parameters: the value to be transformed and a type value for the target type.

Another type of transformation that’s needed for shaping is renaming fields, which is supported by the rename operator. Also, the yield operator is handy for simply emitting new, arbitrary record literals based on input values and mixing in these shaping functions in an embedded record literal. The fuse aggregate function is also useful for fusing values into a common schema, though a type is returned rather than values.

In the examples below, we will use the following named type connection:

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

We’ll also use this sample JSON input:

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Cast

The cast function applies a cast operation to each leaf value that matches the field path in the specified type.

In the following example we cast the address fields to type ip, the port fields to type port (which is a named type for type uint16) and the address port pairs to type socket without modifying the uid field or changing the order of the server and client fields:

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

cast(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

cast(this, <connection>)' -

Crop

Cropping is useful when you want records to “fit” a schema tightly.

In the following example we remove the uid field since it is not in the connection type:

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

crop(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

crop(this, <connection>)' -

Fill

Use fill when you want to fill out missing fields with nulls.

In the following example we add a null-valued vlan field since the input value is missing it and the connection type has it:

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

fill(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

fill(this, <connection>)' -

Order

The order function changes the order of fields in its input to match the order in the specified type, as field order is significant in Zed records.

The following example reorders the client and server fields to match the input but does nothing about the uid field as it is not in the connection type.

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

order(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

order(this, <connection>)' -

As an alternative to the order function, record expressions can be used to reorder fields without specifying types. For example:

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

yield {kind,client,server,...this}

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

yield {kind,client,server,...this}' -

Shape

The shape function brings everything together by applying cast, fill, and order all in one step.

In the following example we reorder the client and server fields to match the input but do not impact the uid field as it is not in the connection type.

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

shape(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

shape(this, <connection>)' -

To get a tight shape of the target type, apply crop to the output of shape, e.g., to dropping the uid after shaping:

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

shape(this, <connection>)
| crop(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

shape(this, <connection>)
| crop(this, <connection>)' -

Error Handling

A failure during shaping produces an error value in the problematic leaf field.

In the next two examples, we use a malformed variation of our input data. When we apply our shaper to it, we now see two errors.

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

shape(this, <connection>)

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "39 Elm Street",
    "port": 41772
  },
  "vlan": "available"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "39 Elm Street",
    "port": 41772
  },
  "vlan": "available"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

shape(this, <connection>)' -

Since these error values are nested inside an otherwise healthy record, adding has_error(this) downstream in our pipeline could help find or exclude such records. If the failure to shape any single field is considered severe enough to render the entire input record unhealthy, a conditional expression could be applied to wrap the input record as an error while including detail to debug the problem, e.g.,

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

yield {original: this, shaped: shape(this, <connection>)}
| yield has_error(shaped)
  ? error({
      msg: "shaper error (see inner errors for details)",
      original,
      shaped
    })
  : shaped

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "39 Elm Street",
    "port": 41772
  },
  "vlan": "available"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "39 Elm Street",
    "port": 41772
  },
  "vlan": "available"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

yield {original: this, shaped: shape(this, <connection>)}
| yield has_error(shaped)
  ? error({
      msg: "shaper error (see inner errors for details)",
      original,
      shaped
    })
  : shaped' -

If you require awareness about changes made by the shaping functions that aren’t surfaced as errors, a similar wrapping approach can be used with a general check for equality. For example, to treat cropped fields as an error, we can execute

Query

type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

yield {original: this, cropped: crop(this, <connection>)}
| yield original==cropped
  ? original
  : error({msg: "data was cropped", original, cropped})

Input

{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}

Result

Loading...

echo '{
  "kind": "dns",
  "server": {
    "addr": "10.0.0.100",
    "port": 53
  },
  "client": {
    "addr": "10.47.1.100",
    "port": 41772
  },
  "uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
    kind: string,
    client: socket,
    server: socket,
    vlan: uint16
}

yield {original: this, cropped: crop(this, <connection>)}
| yield original==cropped
  ? original
  : error({msg: "data was cropped", original, cropped})' -

Type Fusion

Type fusion is another important building block of data shaping. Here, types are operated upon by fusing them together, where the result is a single fused type. Some systems call a related process “schema inference” where a set of values, typically JSON, is analyzed to determine a relational schema that all the data will fit into. However, this is just a special case of type fusion as fusion is fine-grained and based on Zed’s type system rather than having the narrower goal of computing a schema for representations like relational tables, Parquet, Avro, etc.

Type fusion utilizes two key techniques.

The first technique is to simply combine types with a type union. For example, an int64 and a string can be merged into a common type of union (int64,string), e.g., the value sequence 1 "foo" can be fused into the single-type sequence:

1((int64,string))
"foo"((int64,string))

The second technique is to merge fields of records, analogous to a spread expression. Here, the value sequence {a:1}{b:"foo"} may be fused into the single-type sequence:

{a:1,b:null(string)}
{a:null(int64),b:"foo"}

Of course, these two techniques can be powerfully combined, e.g., where the value sequence {a:1}{a:"foo",b:2} may be fused into the single-type sequence:

{a:1((int64,string)),b:null(int64)}
{a:"foo"((int64,string)),b:2}

To perform fusion, Zed currently includes two key mechanisms (though this is an active area of development):

the fuse operator, and
the fuse aggregate function.

Fuse Operator

The fuse operator reads all of its input, computes a fused type using the techniques above, and outputs the result, e.g.,

Query

fuse

Input

{x:1}
{y:"foo"}
{x:2,y:"bar"}

Result

Loading...

echo '{x:1}
{y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse' -

Whereas a type union for field x is produced in the following:

Query

fuse

Input

{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}

Result

Loading...

echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse' -

Fuse Aggregate Function

The fuse aggregate function is most often useful during data exploration and discovery where you might interactively run queries to determine the shapes of some new or unknown input data and how those various shapes relate to one another.

For example, in the example sequence above, we can use the fuse aggregate function to determine the fused type rather than transforming the values, e.g.,

Query

fuse(this)

Input

{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}

Result

Loading...

echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse(this)' -

Since the fuse here is an aggregate function, it can also be used with grouping keys. Supposing we want to divide records into categories and fuse the records in each category, we can use a grouped aggregation. In this simple example, we will fuse records based on their number of fields using the len function:

Query

fuse(this) by len(this) | sort len

Input

{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}

Result

Loading...

echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse(this) by len(this) | sort len' -

Now, we can turn around and write a “shaper” for data that has the patterns we “discovered” above, e.g.,

Query

switch len(this) (
    case 1 => pass
    case 2 => yield shape(this, <{x:(int64,string),y:string}>)
    default => yield error({kind:"unrecognized shape",value:this})
)
| sort this desc

Input

{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
{a:1,b:2,c:3}

Result

Loading...

echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
{a:1,b:2,c:3}' \
| super -z -c 'switch len(this) (
    case 1 => pass
    case 2 => yield shape(this, <{x:(int64,string),y:string}>)
    default => yield error({kind:"unrecognized shape",value:this})
)
| sort this desc' -