parquet_decode

EXPERIMENTAL

This component is experimental and therefore subject to change or removal outside of major version releases.

Decodes Parquet files into a batch of structured messages.

Introduced in version 1.0.0.

# Common config fields, showing default values
label: ""
parquet_decode: {}
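
For reference, a sketch of the config with both documented fields spelled out at their defaults (the fields are detailed below):

# All config fields, showing default values
label: ""
parquet_decode:
  use_parquet_list_format: true
  strict_schema: true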

This processor uses https://github.com/parquet-go/parquet-go, which is itself experimental. Therefore, changes could be made to how this processor functions outside of major version releases.

Fields

use_parquet_list_format

Whether to decode LIST type columns into their Parquet logical type format {"list": [{"element": value_1}, {"element": value_2}, ...]} instead of a Go slice [value_1, value_2, ...]. A sketch showing the effect of this flag follows the field details below.

caution

In a future version this flag will default to false and eventually be deprecated, with the logical format dropped in favour of the Go slice representation.

Type: bool
Default: true
Requires version 1.8.0 or newer
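
As a sketch of the difference, assume a file with a LIST column named tags (a hypothetical name); with the flag left at its default of true the column decodes into the logical format described above, while the following config opts into the plain slice instead:

# Hypothetical LIST column "tags" containing ["a", "b"]:
#   use_parquet_list_format: true  -> {"tags": {"list": [{"element": "a"}, {"element": "b"}]}}
#   use_parquet_list_format: false -> {"tags": ["a", "b"]}
pipeline:
  processors:
    - parquet_decode:
        use_parquet_list_format: false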

strict_schema

Whether to enforce strict Parquet schema validation. When set to false, the processor can read files with non-standard schema structures (such as non-standard LIST formats). Disabling strict mode reduces validation but increases compatibility.

Type: bool
Default: true
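
A minimal sketch, assuming the files were written with a non-standard LIST layout that strict validation would reject:

pipeline:
  processors:
    - parquet_decode:
        strict_schema: false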

Examples

In this example we consume files from AWS S3 as they're written, by listening to an SQS queue for upload events. We make sure to use the to_the_end scanner, which reads each file into memory in full, allowing the parquet_decode processor to expand each file into a batch of messages. Finally, we write the data out to local files as newline-delimited JSON.

input:
  aws_s3:
    bucket: TODO
    prefix: foos/
    scanner:
      to_the_end: {}
    sqs:
      url: TODO
  processors:
    - parquet_decode: {}

output:
  file:
    codec: lines
    path: './foos/${! metadata("s3_key") }.jsonl'