parquet_encode
This component is experimental and therefore subject to change or removal outside of major version releases.
Encodes Parquet files from a batch of structured messages.
Introduced in version 1.0.0.
# Common config fields, showing default values
label: ""
parquet_encode:
  schema: [] # No default (required)
  default_compression: uncompressed
# All config fields, showing default values
label: ""
parquet_encode:
  schema: [] # No default (required)
  default_compression: uncompressed
  default_encoding: DELTA_LENGTH_BYTE_ARRAY
This processor uses https://github.com/parquet-go/parquet-go, which is itself experimental. Therefore changes could be made to how this processor functions outside of major version releases.
Examples
Writing Parquet Files to AWS S3
In this example we use the batching mechanism of an aws_s3 output to collect a batch of messages in memory, which is then converted into a Parquet file and uploaded.
output:
  aws_s3:
    bucket: TODO
    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: weight
                type: DOUBLE
              - name: content
                type: BYTE_ARRAY
              - name: attributes
                type: MAP
                fields:
                  - { name: key, type: UTF8 }
                  - { name: value, type: INT64 }
              - name: tags
                type: LIST
                fields:
                  - { name: element, type: UTF8 }
            default_compression: zstd
Using STRUCT Types for Nested Objects
This example shows how to use STRUCT types to store complex nested objects natively in Parquet format, avoiding JSON serialization overhead.
output:
  aws_s3:
    bucket: TODO
    path: 'events/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: timestamp
                type: INT64
              - name: cloud
                type: STRUCT
                optional: true
                fields:
                  - name: provider
                    type: UTF8
                  - name: region
                    type: UTF8
                    optional: true
                  - name: account
                    type: STRUCT
                    optional: true
                    fields:
                      - name: uid
                        type: UTF8
                        optional: true
              - name: metadata
                type: STRUCT
                optional: true
                fields:
                  - name: version
                    type: UTF8
                  - name: product
                    type: STRUCT
                    optional: true
                    fields:
                      - name: name
                        type: UTF8
                      - name: version
                        type: UTF8
                        optional: true
            default_compression: zstd
Fields
schema
Parquet schema.
Type: array
schema[].name
The name of the column.
Type: string
schema[].type
The type of the column, only applicable for leaf columns with no child fields. STRUCT represents nested objects with defined field schemas. MAP supports only string keys, but its values may be of any supported type. Logical types such as UTF8 can also be specified here.
Type: string
Options: BOOLEAN, INT8, INT16, INT32, INT64, DECIMAL64, DECIMAL32, FLOAT, DOUBLE, BYTE_ARRAY, UTF8, MAP, LIST, STRUCT.
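As a quick sketch of the logical type distinction, the hypothetical raw_payload column below is stored as plain bytes while message uses the UTF8 logical type (both column names are illustrative):

parquet_encode:
  schema:
    - name: raw_payload
      type: BYTE_ARRAY # raw bytes, no logical annotation
    - name: message
      type: UTF8 # BYTE_ARRAY annotated as a UTF-8 string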
schema[].decimal_precision
Precision to use for the DECIMAL32/DECIMAL64 types.
Type: int
Default: 0
schema[].decimal_scale
Scale to use for the DECIMAL32/DECIMAL64 types.
Type: int
Default: 0
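For instance, a monetary value such as 12345.67 fits a DECIMAL64 column with a precision of 9 and a scale of 2; the price column name here is illustrative:

parquet_encode:
  schema:
    - name: price
      type: DECIMAL64
      decimal_precision: 9
      decimal_scale: 2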
schema[].repeated
Whether the field is repeated.
Type: bool
Default: false
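A minimal sketch of a repeated leaf column, assuming each input message carries an array of numbers under a hypothetical scores key (the LIST type shown in the examples above is an alternative way to model arrays):

parquet_encode:
  schema:
    - name: scores
      type: DOUBLE
      repeated: true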
schema[].optional
Whether the field is optional.
Type: bool
Default: false
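For example, marking a column as optional allows input messages to omit it or set it to null, as in this sketch with a hypothetical nickname column:

parquet_encode:
  schema:
    - name: nickname
      type: UTF8
      optional: true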
schema[].fields
A list of child fields.
Type: array
# Examples

fields:
  - name: foo
    type: INT64
  - name: bar
    type: BYTE_ARRAY
default_compression
The default compression type to use for fields.
Type: string
Default: "uncompressed"
Options: uncompressed, snappy, gzip, brotli, zstd, lz4raw.
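As a sketch, the default codec can be switched, for example to snappy (the payload column is illustrative):

parquet_encode:
  schema:
    - { name: payload, type: BYTE_ARRAY }
  default_compression: snappy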
default_encoding
The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support DELTA_LENGTH_BYTE_ARRAY and is therefore best left unset where possible.
Type: string
Default: "DELTA_LENGTH_BYTE_ARRAY"
Requires version 1.0.0 or newer
Options: DELTA_LENGTH_BYTE_ARRAY, PLAIN.
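For example, if a downstream reader cannot decode DELTA_LENGTH_BYTE_ARRAY, the default can be lowered to PLAIN (the content column is illustrative):

parquet_encode:
  schema:
    - { name: content, type: BYTE_ARRAY }
  default_encoding: PLAIN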