Shaping and Type Fusion
Shaping
Data that originates from heterogeneous sources typically has inconsistent structure and is thus difficult to reason about or query. To unify disparate data sources, data is often cleaned up to fit into a well-defined set of schemas, which combines the data into a unified store like a data warehouse.
In Zed, this cleansing process is called “shaping” the data, and Zed leverages its rich, super-structured type system to perform core aspects of data transformation. In a data model with nesting and multiple scalar types (such as Zed or JSON), shaping includes converting the type of leaf fields, adding or removing fields to “fit” a given shape, and reordering fields.
While shaping remains an active area of development, the core functions in Zed that currently perform shaping are:
cast
- coerce a value to a different typecrop
- remove fields from a value that are missing in a specified typefill
- add null values for missing fieldsorder
- reorder record fieldsshape
- applycast
,fill
, andorder
They all have the same signature, taking two parameters: the value to be transformed and a type value for the target type.
Another type of transformation that’s needed for shaping is renaming fields, which is supported by the
rename
operator. Also, theyield
operator is handy for simply emitting new, arbitrary record literals based on input values and mixing in these shaping functions in an embedded record literal. Thefuse
aggregate function is also useful for fusing values into a common schema, though a type is returned rather than values.
In the examples below, we will use the following named type connection
:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
We’ll also use this sample JSON input:
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Cast
The cast
function applies a cast operation to each leaf value that matches the
field path in the specified type.
In the following example we cast the address fields to type ip
, the port fields to type port
(which is a named type for type uint16
) and the address port pairs to
type socket
without modifying the uid
field or changing the
order of the server
and client
fields:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
cast(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
cast(this, <connection>)' -
Crop
Cropping is useful when you want records to “fit” a schema tightly.
In the following example we remove the uid
field since it is not in the connection
type:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
crop(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
crop(this, <connection>)' -
Fill
Use fill
when you want to fill out missing fields with nulls.
In the following example we add a null-valued vlan
field since the input value is missing it and
the connection
type has it:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
fill(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
fill(this, <connection>)' -
Order
The order
function changes the order of fields in its input to match the
order in the specified type, as field order is significant in Zed records.
The following example reorders the client
and server
fields to match
the input but does nothing about the uid
field as it is not in the
connection
type.
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
order(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
order(this, <connection>)' -
As an alternative to the order
function,
record expressions can be used to reorder
fields without specifying types. For example:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
yield {kind,client,server,...this}
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
yield {kind,client,server,...this}' -
Shape
The shape
function brings everything together by applying cast
,
fill
, and order
all in one step.
In the following example we reorder the client
and server
fields to match
the input but do not impact the uid
field as it is not in the connection
type.
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
shape(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
shape(this, <connection>)' -
To get a tight shape of the target type,
apply crop
to the output of shape
, e.g.,
to dropping the uid
after shaping:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
shape(this, <connection>)
| crop(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
shape(this, <connection>)
| crop(this, <connection>)' -
Error Handling
A failure during shaping produces an error value in the problematic leaf field.
In the next two examples, we use a malformed variation of our input data. When we apply our shaper to it, we now see two errors.
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
shape(this, <connection>)
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "39 Elm Street",
"port": 41772
},
"vlan": "available"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "39 Elm Street",
"port": 41772
},
"vlan": "available"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
shape(this, <connection>)' -
Since these error values are nested inside an otherwise healthy record, adding
has_error(this)
downstream in our pipeline
could help find or exclude such records. If the failure to shape any single
field is considered severe enough to render the entire input record unhealthy,
a conditional expression
could be applied to wrap the input record as an error while including detail
to debug the problem, e.g.,
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
yield {original: this, shaped: shape(this, <connection>)}
| yield has_error(shaped)
? error({
msg: "shaper error (see inner errors for details)",
original,
shaped
})
: shaped
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "39 Elm Street",
"port": 41772
},
"vlan": "available"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "39 Elm Street",
"port": 41772
},
"vlan": "available"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
yield {original: this, shaped: shape(this, <connection>)}
| yield has_error(shaped)
? error({
msg: "shaper error (see inner errors for details)",
original,
shaped
})
: shaped' -
If you require awareness about changes made by the shaping functions that aren’t surfaced as errors, a similar wrapping approach can be used with a general check for equality. For example, to treat cropped fields as an error, we can execute
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
yield {original: this, cropped: crop(this, <connection>)}
| yield original==cropped
? original
: error({msg: "data was cropped", original, cropped})
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
Loading...
echo '{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}' \
| super -z -c 'type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
yield {original: this, cropped: crop(this, <connection>)}
| yield original==cropped
? original
: error({msg: "data was cropped", original, cropped})' -
Type Fusion
Type fusion is another important building block of data shaping. Here, types are operated upon by fusing them together, where the result is a single fused type. Some systems call a related process “schema inference” where a set of values, typically JSON, is analyzed to determine a relational schema that all the data will fit into. However, this is just a special case of type fusion as fusion is fine-grained and based on Zed’s type system rather than having the narrower goal of computing a schema for representations like relational tables, Parquet, Avro, etc.
Type fusion utilizes two key techniques.
The first technique is to simply combine types with a type union.
For example, an int64
and a string
can be merged into a common
type of union (int64,string)
, e.g., the value sequence 1 "foo"
can be fused into the single-type sequence:
1((int64,string))
"foo"((int64,string))
The second technique is to merge fields of records, analogous to a spread
expression. Here, the value sequence {a:1}{b:"foo"}
may be
fused into the single-type sequence:
{a:1,b:null(string)}
{a:null(int64),b:"foo"}
Of course, these two techniques can be powerfully combined,
e.g., where the value sequence {a:1}{a:"foo",b:2}
may be
fused into the single-type sequence:
{a:1((int64,string)),b:null(int64)}
{a:"foo"((int64,string)),b:2}
To perform fusion, Zed currently includes two key mechanisms (though this is an active area of development):
- the
fuse
operator, and - the
fuse
aggregate function.
Fuse Operator
The fuse
operator reads all of its input, computes a fused type using
the techniques above, and outputs the result, e.g.,
fuse
{x:1}
{y:"foo"}
{x:2,y:"bar"}
Loading...
echo '{x:1}
{y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse' -
Whereas a type union for field x
is produced in the following:
fuse
{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
Loading...
echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse' -
Fuse Aggregate Function
The fuse
aggregate function is most often useful during data exploration and discovery
where you might interactively run queries to determine the shapes of some new
or unknown input data and how those various shapes relate to one another.
For example, in the example sequence above, we can use the fuse
aggregate function to determine
the fused type rather than transforming the values, e.g.,
fuse(this)
{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
Loading...
echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse(this)' -
Since the fuse
here is an aggregate function, it can also be used with
grouping keys. Supposing we want to divide records into categories and fuse
the records in each category, we can use a grouped aggregation. In this simple example, we
will fuse records based on their number of fields using the
len
function:
fuse(this) by len(this) | sort len
{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
Loading...
echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}' \
| super -z -c 'fuse(this) by len(this) | sort len' -
Now, we can turn around and write a “shaper” for data that has the patterns we “discovered” above, e.g.,
switch len(this) (
case 1 => pass
case 2 => yield shape(this, <{x:(int64,string),y:string}>)
default => yield error({kind:"unrecognized shape",value:this})
)
| sort this desc
{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
{a:1,b:2,c:3}
Loading...
echo '{x:1}
{x:"foo",y:"foo"}
{x:2,y:"bar"}
{a:1,b:2,c:3}' \
| super -z -c 'switch len(this) (
case 1 => pass
case 2 => yield shape(this, <{x:(int64,string),y:string}>)
default => yield error({kind:"unrecognized shape",value:this})
)
| sort this desc' -