grok

Function

grok — parse a string using a Grok pattern

Synopsis

grok(p: string, s: string) -> record
grok(p: string, s: string, definitions: string) -> record

Description

The grok function parses a string s using Grok pattern p and returns a record containing the parsed fields. Each reference in pattern p has the form %{pattern:field_name}, where pattern is the name of the pattern to match in s and field_name is the name of the field in the returned record that holds the captured value.

When provided with three arguments, definitions is a string of named patterns in the format PATTERN_NAME PATTERN each separated by newlines (\n). The named patterns can then be referenced in argument p.
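
To make the relationship between p and definitions concrete, here is a minimal Python sketch of how %{pattern:field_name} references might be expanded into an ordinary regular expression. It is an illustration only, not SuperPipe's implementation: the names toy_grok, parse_definitions, expand, and the two-entry BUILTINS table are hypothetical, Python's re module stands in for RE2, and the choice to search rather than anchor the match is an assumption.

```python
import re

# Hypothetical stand-ins for a couple of built-in patterns (assumption:
# the real included set is much larger and defined differently).
BUILTINS = {"NUMBER": r"\d+", "WORD": r"\w+"}

def parse_definitions(definitions):
    """Turn newline-separated 'PATTERN_NAME PATTERN' lines into a dict."""
    named = {}
    for line in definitions.split("\n"):
        line = line.strip()
        if not line:
            continue
        name, pattern = line.split(" ", 1)
        named[name] = pattern
    return named

def expand(p, named):
    """Replace each %{NAME:field} or %{NAME} with a regex group."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        body = expand(named[name], named)  # a definition may reference others
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return re.sub(r"%\{(\w+)(?::(\w+))?\}", repl, p)

def toy_grok(p, s, definitions=""):
    """Return captured fields as a dict of strings, or None if no match."""
    named = {**BUILTINS, **parse_definitions(definitions)}
    m = re.search(expand(p, named), s)
    return m.groupdict() if m else None

print(toy_grok("%{PH_PREFIX:prefix}-%{NUMBER:line}", "555-1212", "PH_PREFIX \\d{3}"))
```

As in grok itself, every captured value in this sketch comes back as a string.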

Included Patterns

The grok function by default includes a set of built-in named patterns that can be referenced in any pattern. The included named patterns can be seen here.

Comparison to Other Implementations

Although Grok functionality appears in many open source tools, it lacks a formal specification. As a result, example parsing configurations found via web searches may not all plug seamlessly into SuperPipe’s grok function without modification.

Logstash was the first tool to widely promote the approach via its Grok filter plugin, so it serves as the de facto reference implementation. Many articles have been published by Elastic and others that provide helpful guidance on becoming proficient in Grok. To help you adapt what you learn from these resources to the use of the grok function, review the tips below.

Note

As these represent areas of possible future SuperPipe enhancement, links to open issues are provided. If you find a functional gap significantly impacts your ability to use the grok function, please add a comment to the relevant issue describing your use case.

  1. Logstash’s Grok offers an optional data type conversion syntax, e.g.,

    %{NUMBER:num:int}

    to store num as an integer type instead of as a string. SuperPipe currently accepts this trailing :type syntax but effectively ignores it and stores all parsed values as strings. Use the cast function downstream for data type conversion instead. (super/4928)

  2. Some Logstash Grok examples use an optional square bracket syntax for storing a parsed value in a nested field, e.g.,

    %{GREEDYDATA:[nested][field]}

    to store a value into {"nested": {"field": ... }}. In SuperPipe, the more common dot-separated field naming convention nested.field can be combined with downstream use of the nest_dotted function to store values in nested fields. (super/4929)

  3. SuperPipe’s regular expression syntax does not currently support the “named capture” syntax shown in the Logstash docs. (super/4899)

    Instead, use the approach shown later in that section of the Logstash docs by including a custom pattern in the definitions argument, e.g.,

    echo '"Jan  1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>"' |
      super -Z -c 'yield grok("%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}",
                        this,
                        "POSTFIX_QUEUEID [0-9A-F]{10,11}")' -

  4. The Grok implementation for Logstash uses the Oniguruma regular expression library, while SuperPipe’s grok uses Go’s regexp and RE2 syntax. These implementations share the same basic syntax, which should suffice for most parsing needs. However, per a detailed comparison, Oniguruma provides some advanced syntax not available in RE2, such as recursion, look-ahead, look-behind, and backreferences. To avoid compatibility issues, we recommend building configurations starting from the RE2-based included patterns.
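
When porting a pattern written for Logstash, it can help to scan it for the Oniguruma-only constructs named in the last tip before trying it with grok. The Python sketch below is a rough heuristic under stated assumptions: the function name re2_warnings and the UNSUPPORTED table are hypothetical, the checks are simple text matches rather than a real parser, and the list is not exhaustive.

```python
import re

# Textual signatures of constructs the tips above list as unavailable in RE2.
# Assumption: a plain substring-style check is good enough for a first pass;
# it can misfire on escaped backslashes or text inside character classes.
UNSUPPORTED = {
    "look-ahead": r"\(\?[=!]",
    "look-behind": r"\(\?<[=!]",
    "backreference": r"\\[1-9]|\\k<",
    "recursion": r"\\g<",
}

def re2_warnings(pattern):
    """Return the names of suspect constructs found in a regex pattern string."""
    return [name for name, rx in UNSUPPORTED.items() if re.search(rx, pattern)]

print(re2_warnings(r"foo(?=bar)"))   # flags the look-ahead
print(re2_warnings(r"[+-]\d{4}"))    # nothing flagged
```

An empty result does not guarantee the pattern is valid RE2; the only reliable check is to run it through grok.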

Note

If you absolutely require features of Logstash’s Grok that are not currently present in SuperPipe, you can create a Logstash-based preprocessing pipeline that uses its Grok filter plugin and sends its output as JSON to SuperPipe. Issue super/3151 provides some tips for getting started. If you pursue this approach, please add a comment to the issue describing your use case or come talk to us on the community Slack.

Debugging

Much like creating complex regular expressions, building sophisticated Grok configurations can be frustrating because single-character mistakes can make the difference between perfect parsing and total failure.

A recommended workflow is to start by successfully parsing a small, simple portion of your target data, then incrementally add more parsing logic, re-testing at each step.
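
The incremental workflow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a SuperPipe tool: the function first_failing_step is hypothetical, the pieces are plain regular expressions standing in for already-expanded Grok references, and Python's re module is used in place of RE2, so the final pattern should still be re-tested with grok itself.

```python
import re

def first_failing_step(pieces, sample):
    """Grow the pattern one piece at a time and report where matching breaks.

    Returns the index of the first piece whose addition makes the match
    fail, or None if the whole sequence matches the start of the sample.
    """
    for i in range(1, len(pieces) + 1):
        if re.match("".join(pieces[:i]), sample) is None:
            return i - 1
    return None

# Hypothetical pieces for a line such as "2020-09-16 DEBUG 12ms".
pieces = [r"\d{4}-\d{2}-\d{2}", r" [A-Z]+", r" \d+ms"]
print(first_failing_step(pieces, "2020-09-16 DEBUG took 12ms"))
```

Knowing which piece first breaks the match narrows the search for a single-character mistake to one part of the pattern.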

To aid in this workflow, you may find an interactive Grok debugger helpful. However, note that these tools have their own differences and limitations. If you devise a working Grok config in such a tool, be sure to incrementally test it with SuperPipe’s grok. Be mindful of necessary adjustments such as those described above and in the examples.

Need Help?

If you have difficulty with your Grok configurations, please come talk to us on the community Slack.

Examples

Parsing a simple log line using the built-in named patterns:

echo '"2020-09-16T04:20:42.45+01:00 DEBUG This is a sample debug log message"' |
  super -Z -c 'yield grok("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}",
                    this)' -

As with any string literal, the leading backslash in escape sequences in string arguments must be doubled, such as changing the \d to \\d if we repurpose the included pattern for NUMTZ as a definitions argument:

echo '"+7000"' |
  super -z -c 'yield grok("%{MY_NUMTZ:tz}",
                    this,
                    "MY_NUMTZ [+-]\\d{4}")' -

In addition to using \n newline escapes to separate multiple named patterns in the definitions argument, string concatenation via + may further enhance readability.

echo '"(555)-1212"' |
  super -z -c 'yield grok("\\(%{PH_PREFIX:prefix}\\)-%{PH_LINE_NUM:line_number}",
                    this,
                    "PH_PREFIX \\d{3}\n" +
                    "PH_LINE_NUM \\d{4}")' -