parquet_decode

EXPERIMENTAL

This component is experimental and therefore subject to change or removal outside of major version releases.

Decodes Parquet files into a batch of structured messages.

Introduced in version 1.0.0.

# Common config fields, showing default values
label: ""
parquet_decode: {}
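
For reference, a sketch of the config with both documented fields spelled out at their defaults (the fields are detailed below):

# All config fields, showing default values
label: ""
parquet_decode:
  use_parquet_list_format: true
  strict_schema: true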

This processor uses https://github.com/parquet-go/parquet-go, which is itself experimental. Therefore, changes could be made to how this processor functions outside of major version releases.

Fields

use_parquet_list_format

Whether to decode LIST type columns into their Parquet logical type format {"list": [{"element": value_1}, {"element": value_2}, ...]} instead of a Go slice [value_1, value_2, ...]. A sketch showing the effect of this flag follows the field details below.

caution

In a future version this flag will default to false and eventually be deprecated, with the logical format dropped in favour of the Go slice representation.

Type: bool
Default: true
Requires version 1.8.0 or newer
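
As a sketch of the difference, assume a file with a LIST column named tags (a hypothetical name); with the flag left at its default of true the column decodes into the logical format described above, while the following config opts into the plain slice instead:

# Hypothetical LIST column "tags" containing ["a", "b"]:
#   use_parquet_list_format: true  -> {"tags": {"list": [{"element": "a"}, {"element": "b"}]}}
#   use_parquet_list_format: false -> {"tags": ["a", "b"]}
pipeline:
  processors:
    - parquet_decode:
        use_parquet_list_format: false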

strict_schema

Whether to enforce strict Parquet schema validation. When set to false, the processor can read files with non-standard schema structures (such as non-standard LIST formats). Disabling strict mode reduces validation but increases compatibility.

Type: bool
Default: true
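
A minimal sketch, assuming the files were written with a non-standard LIST layout that strict validation would reject:

pipeline:
  processors:
    - parquet_decode:
        strict_schema: false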

Examples

In this example we consume files from AWS S3 as they're written, by listening to an SQS queue for upload events. We make sure to use the to_the_end scanner, which reads each file into memory in full, allowing the parquet_decode processor to expand each file into a batch of messages. Finally, we write the data out to local files as newline-delimited JSON.

input:
  aws_s3:
    bucket: TODO
    prefix: foos/
    scanner:
      to_the_end: {}
    sqs:
      url: TODO
  processors:
    - parquet_decode: {}

output:
  file:
    codec: lines
    path: './foos/${! metadata("s3_key") }.jsonl'