parquet_encode
This component is experimental and therefore subject to change or removal outside of major version releases.
Encodes Parquet files from a batch of structured messages.
Introduced in version 1.0.0.
# Common config fields, showing default values
label: ""
parquet_encode:
  schema: [] # No default (required)
  default_compression: uncompressed
# All config fields, showing default values
label: ""
parquet_encode:
  schema: [] # No default (required)
  default_compression: uncompressed
  default_encoding: DELTA_LENGTH_BYTE_ARRAY
This processor uses https://github.com/parquet-go/parquet-go, which is itself experimental. Therefore changes could be made to how this processor functions outside of major version releases.
Examples
Writing Parquet Files to AWS S3
In this example we use the batching mechanism of an aws_s3 output to collect a batch of messages in memory, which is then converted into a Parquet file and uploaded.
output:
  aws_s3:
    bucket: TODO
    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: weight
                type: DOUBLE
              - name: content
                type: BYTE_ARRAY
              - name: attributes
                type: MAP
                fields:
                  - { name: key, type: UTF8 }
                  - { name: value, type: INT64 }
              - name: tags
                type: LIST
                fields:
                  - { name: element, type: UTF8 }
            default_compression: zstd
Using STRUCT Types for Nested Objects
This example shows how to use STRUCT types to store complex nested objects natively in Parquet format, avoiding JSON serialization overhead.
output:
  aws_s3:
    bucket: TODO
    path: 'events/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: timestamp
                type: INT64
              - name: cloud
                type: STRUCT
                optional: true
                fields:
                  - name: provider
                    type: UTF8
                  - name: region
                    type: UTF8
                    optional: true
                  - name: account
                    type: STRUCT
                    optional: true
                    fields:
                      - name: uid
                        type: UTF8
                        optional: true
              - name: metadata
                type: STRUCT
                optional: true
                fields:
                  - name: version
                    type: UTF8
                  - name: product
                    type: STRUCT
                    optional: true
                    fields:
                      - name: name
                        type: UTF8
                      - name: version
                        type: UTF8
                        optional: true
            default_compression: zstd
Fields
schema
Parquet schema.
Type: array
schema[].name
The name of the column.
Type: string
schema[].type
The type of the column, only applicable for leaf columns with no child fields. STRUCT represents nested objects with defined field schemas. MAP supports only string keys, but its values may be of any supported type. Logical types such as UTF8 can also be specified here.
Type: string
Options: BOOLEAN, INT8, INT16, INT32, INT64, DECIMAL64, DECIMAL32, FLOAT, DOUBLE, BYTE_ARRAY, UTF8, MAP, LIST, STRUCT.
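As a quick sketch of the logical type distinction, the hypothetical raw_payload column below is stored as plain bytes while message uses the UTF8 logical type (both column names are illustrative):

parquet_encode:
  schema:
    - name: raw_payload
      type: BYTE_ARRAY # raw bytes, no logical annotation
    - name: message
      type: UTF8 # BYTE_ARRAY annotated as a UTF-8 string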
schema[].decimal_precision
Precision to use for the DECIMAL32/DECIMAL64 types.
Type: int
Default: 0
schema[].decimal_scale
Scale to use for the DECIMAL32/DECIMAL64 types.
Type: int
Default: 0
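For instance, a monetary value such as 12345.67 fits a DECIMAL64 column with a precision of 9 and a scale of 2; the price column name here is illustrative:

parquet_encode:
  schema:
    - name: price
      type: DECIMAL64
      decimal_precision: 9
      decimal_scale: 2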
schema[].repeated
Whether the field is repeated.
Type: bool
Default: false
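A minimal sketch of a repeated leaf column, assuming each input message carries an array of numbers under a hypothetical scores key (the LIST type shown in the examples above is an alternative way to model arrays):

parquet_encode:
  schema:
    - name: scores
      type: DOUBLE
      repeated: true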
schema[].optional
Whether the field is optional.
Type: bool
Default: false
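For example, marking a column as optional allows input messages to omit it or set it to null, as in this sketch with a hypothetical nickname column:

parquet_encode:
  schema:
    - name: nickname
      type: UTF8
      optional: true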
schema[].fields
A list of child fields.
Type: array
# Examples

fields:
  - name: foo
    type: INT64
  - name: bar
    type: BYTE_ARRAY
default_compression
The default compression type to use for fields.
Type: string
Default: "uncompressed"
Options: uncompressed, snappy, gzip, brotli, zstd, lz4raw.
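As a sketch, the default codec can be switched, for example to snappy (the payload column is illustrative):

parquet_encode:
  schema:
    - { name: payload, type: BYTE_ARRAY }
  default_compression: snappy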
default_encoding
The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support DELTA_LENGTH_BYTE_ARRAY and is therefore best left unset where possible.
Type: string
Default: "DELTA_LENGTH_BYTE_ARRAY"
Requires version 1.0.0 or newer
Options: DELTA_LENGTH_BYTE_ARRAY, PLAIN.
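For example, if a downstream reader cannot decode DELTA_LENGTH_BYTE_ARRAY, the default can be lowered to PLAIN (the content column is illustrative):

parquet_encode:
  schema:
    - { name: content, type: BYTE_ARRAY }
  default_encoding: PLAIN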