
opensnowcat

EXPERIMENTAL

This component is experimental and therefore subject to change or removal outside of major version releases.

Processes OpenSnowcat/Snowplow enriched TSV events: converts enriched TSV to flattened JSON, filters events, and transforms sensitive fields for privacy compliance.

Introduced in version 1.12.0.

# Common config fields, showing default values
label: ""
opensnowcat:
  output_format: tsv

This processor provides comprehensive event processing capabilities:

Features

Format Conversion

  • Convert enriched TSV to flattened JSON with automatic context extraction
  • Maintain TSV format for OpenSnowcat/Snowplow downstream compatibility

Event Filtering

  • Drop events based on field values (IP addresses, user agents, etc.)
  • Filter by schema property paths in contexts, derived_contexts, and unstruct_event
  • OR logic: an event is dropped if ANY filter matches (see the sketch below)
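
As a minimal sketch, a configuration that drops events from an internal IP range or from bot user agents could look like this. The contains key and exact criterion shape are assumptions based on the substring matching described above; the matched values are illustrative:

pipeline:
  processors:
    - opensnowcat:
        output_format: json
        filters:
          drop:
            user_ipaddress:
              contains: "10.0."   # assumed criterion shape; drops events whose IP contains this substring
            useragent:
              contains: "bot"     # either match is enough to drop the event (OR logic)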

Field Transformations

Transform sensitive fields for PII compliance and privacy:

  • hash: Hash field values using configurable algorithms (MD5, SHA-1, SHA-256, SHA-384, SHA-512) with salt
  • redact: Replace field values with a fixed string (e.g., "[REDACTED]")
  • anonymize_ip: Mask IP addresses while preserving network information (supports both IPv4 and IPv6)
    • IPv4: Mask last N octets using anon_octets parameter
    • IPv6: Mask last N segments using anon_segments parameter

All transformations support both direct TSV columns and schema property paths, as in the sketch below.
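
For instance, IP anonymization on the user_ipaddress column might be configured as in this sketch (the placement under filters.transform.fields follows the field reference below; the octet and segment counts are illustrative):

opensnowcat:
  filters:
    transform:
      fields:
        user_ipaddress:
          strategy: anonymize_ip
          anon_octets: 2      # mask the last two octets of IPv4 addresses
          anon_segments: 4    # mask the last four segments of IPv6 addresses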

Examples

Converts OpenSnowcat/Snowplow enriched TSV events to flattened JSON format, extracting all contexts, derived contexts, and unstruct events into top-level fields.

pipeline:
  processors:
    - opensnowcat:
        output_format: json

Fields

output_format

Output format for processed events.

Type: string
Default: "tsv"

Options:

  • enriched_json: Convert to database-optimized nested JSON with key-based schema structure. Each schema becomes a key (vendor_name) containing version and data array. Compatible with BigQuery, Snowflake, Databricks, Redshift, PostgreSQL, ClickHouse, and Iceberg tables. Enables simple queries without UNNEST and schema evolution without table mutations.
  • json: Convert enriched TSV to flattened JSON with contexts, derived_contexts, and unstruct_event automatically flattened into top-level objects.
  • tsv: Maintain enriched TSV format without conversion.
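
For example, selecting the warehouse-oriented format is just a matter of changing output_format:

pipeline:
  processors:
    - opensnowcat:
        output_format: enriched_json

Each attached schema then appears under its own vendor_name key holding the version and a data array, as summarized above.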

filters

Filter and transformation configurations

Type: object

filters.drop

Map of field names to filter criteria. Events matching ANY criteria will be dropped (OR logic). Supports both regular TSV columns (e.g., user_ipaddress, useragent) and schema property paths (e.g., com.snowplowanalytics.snowplow.ua_parser_context.useragentFamily). Each filter uses 'contains' for substring matching.

Type: object
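
As a sketch reusing the assumed contains shape from the earlier example, a drop rule on a schema property path could look like this (the matched value is illustrative):

opensnowcat:
  filters:
    drop:
      com.snowplowanalytics.snowplow.ua_parser_context.useragentFamily:
        contains: "HeadlessChrome"   # drops events whose parsed user-agent family contains this substring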

filters.transform

Field transformation configuration for anonymization, hashing, and redaction

Type: object

filters.transform.salt

Global default salt for hashing operations. Can be overridden per field.

Type: string

filters.transform.hash_algo

Global default hash algorithm. Can be overridden per field.

Type: string
Default: "SHA-256"
Options: MD5, SHA-1, SHA-256, SHA-384, SHA-512.

filters.transform.fields

Map of field names to transformation configurations. Each field configuration accepts the following keys (see the sketch after this list):

  • strategy (required): Transformation type - "hash", "redact", or "anonymize_ip"
  • hash_algo (optional): Algorithm for hash strategy - "MD5", "SHA-1", "SHA-256", "SHA-384", "SHA-512" (overrides global default)
  • salt (optional): Salt for hash strategy (overrides global default)
  • redact_value (optional): Replacement value for redact strategy (default: "[REDACTED]")
  • anon_octets (optional): Number of IPv4 octets to mask for anonymize_ip strategy (default: 0)
  • anon_segments (optional): Number of IPv6 segments to mask for anonymize_ip strategy (default: 0)

Supports both TSV columns (e.g., user_id, user_ipaddress) and schema property paths (e.g., com.vendor.schema.field).

Type: object
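
A sketch combining the global defaults with per-field overrides (the field names and values are illustrative, and com.vendor.schema.field stands in for a real schema property path):

opensnowcat:
  filters:
    transform:
      salt: "global-salt"            # global default salt for hashing
      hash_algo: "SHA-256"           # global default algorithm
      fields:
        user_id:
          strategy: hash             # hashed with the global salt and SHA-256
        user_ipaddress:
          strategy: hash
          hash_algo: "SHA-512"       # overrides the global default
          salt: "per-field-salt"
        useragent:
          strategy: redact
          redact_value: "[REMOVED]"  # default replacement is "[REDACTED]"
        com.vendor.schema.field:
          strategy: redact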

schema_discovery

Schema discovery configuration

Type: object

schema_discovery.enabled

Enable schema discovery feature

Type: bool
Default: false

schema_discovery.flush_interval

Interval between schema discovery flushes

Type: string
Default: "5m"

schema_discovery.endpoint

HTTP endpoint to send schema discovery data

Type: string
Default: "https://api.snowcatcloud.com/internal/schema-discovery"

schema_discovery.template

Template for the schema discovery payload. Use the {{SCHEMAS}} variable for the schema list

Type: string
Default: "{\"schemas\": {{SCHEMAS}}}"