Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Type Fusion

Type fusion is a process by which a set of input types is merged together to form one output type where all values of the input types are subtypes of the output type such that any value of an input type is representable in a reasonable way (e.g., by inserting null values) as an equivalent value in the output type.

Type fusion is implemented by the fuse aggregate function.

A fused type determined by a collection of data can in turn be used in a cast operation over that collection of data to create a new collection of data comprising the single, fused type. The process of transforming data in this fashion is called data fusion.

Data fusion is implemented by the fuse operator.

The simplest fusion function is a sum type, e.g., int64 and string fuse into int64|string. More interesting fusion functions apply to record types, e.g., these two record types

{a:int64}
{b:string}

might fuse to type

{a:int64,b:string}

When the output type models a relational schema and the input types are derived from semi-structured data, then this technique resembles schema inference in other systems.

Note

Schema inference also involves the inference of particular primitive data types from string data when the strings represent dates, times, IP addresses, etc. This step is orthogonal to type fusion and can be applied to the input types of any type fusion algorithm.

A fused type computed over heterogeneous data represents a typical design pattern of a data warehouse, where a relational table with a single very-wide type-fused schema defines slots for all possible input values and the columns are sparsely populated by each row value with missing columns set to null.

While super-structured data natively represents heterogeneous data and fortunately does not require a fused schema to persist data, type fusion is nonetheless very useful:

  • for data exploration, when sampling or filtering data to look at slices of raw data that are fused together;
  • for exporting super-structured data to other systems and formats, where formats like Parquet or a tabular structure like CSV require fixed schemas; and,
  • for ETL, where data might be gathered from APIs using SuperDB, transformed in a SuperDB pipeline, and written to another data warehouse.

Unfortunately, when data leaves a super-structured format using type fusion to accomplish this, the original data must be altered to fit into the rigid structure of these output formats.

The Mechanism

The foundation of type fusion is to merge record types by their field names while other types are generally merged with sum types.

For example, type {a:int64} merged with type {b:string} would simply be type {a:int64,b:string}.

When fields overlap, {a:int64,c:bool} merged with type {b:string,c:bool} would naturally be type {a:int64,b:string,c:bool}.

But when fields overlap and their types conflict, then a sum type is used to represent the overlapping types, e.g., the types

{a:int64,c:bool}
{b:string,c:time}

are fused as

{a:int64,c:string,c:bool|time}

Arrays, maps, and sets are merged by merging their constituent types, e.g., the element type of the array or set and the key/value types of a map.

For example, the types

{a:[int64],b:|[string]|}
{a:[string]}

are fused as

{a:[int64|string],b:|[string]|}

Detailed Algorithm

Type fusion may be formally defined as a function over types:

T = F(T1, T2, ... Tn)

where T is the fused type and T1...Tn are the input types.

When F() can be decomposed into an iterative application of a merge function m(type, type) -> type

F(T1, T2, ... Tn) = m(m(m(T1, T2), T3), ... Tn)

then type fusion may be computing iteratively over a set of arbitrary types.

And when m() is commutative and associative, then the fused type can be computed in parallel without respect to the input order of the data.

The merge function m(t1,t2) implemented by SuperSQL combines complex types by merging their structure and combines otherwise incompatible types with a union type as follows.

When t1 and t2 are different categories of types (e.g., record and array, a set and a primitive type, two different primitive types, and so forth), then the merged type is simply their sum type t1|t2.

When one type (e.g., t1) is a union type and the other (e.g., t2) is not, then the t2 is added to the elements of t1 if not already a member of the union.

When t1 and t2 are the same category of type, then they are merged as follows:

  • for record types, the fields of t1 and t2 are matched by name and for each matched field present in both types, the merged field is the recursive merge of the two field types, and any non-matching fields are simply appended to the resulting record type,
  • for array types, the result is an array type whose element type is the recursive merge of the two element types,
  • for set types, the result is a set type whose element type is the recursive merge of the two element types,
  • for map types, the result is a set type whose key type is the recursive merge of the two key types and whose value type is the recursive merge of the two value type,
  • for union types, the result is a union type comprising all the elemental types of t1 and t2,
  • for error types, the result is an error type comprising the recursive merge of the two contained types,
  • for enum types, the result is the sum type of the two types t1|t2
  • for named types, the result is the sum type of the two types t1|t2.

For further information and examples of type fusion, see the documentation for the fuse aggregate function and the fuse pipe operator.