Closed Bug 1365012 Opened 8 years ago Closed 8 years ago

Add direct to parquet output for telemetry.duplicate messages

Categories

(Data Platform and Tools :: General, enhancement, P1)

x86_64
Windows 10
enhancement
Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: trink, Assigned: whd)

Details

(Whiteboard: [SvcOps])

filename = "s3_parquet.lua" message_matcher = "Type == 'telemetry.duplicate'" preserve_data = false ticker_interval = 60 parquet_schema = [=[ message telemetry_duplicates { required int64 Timestamp; required group Fields { required binary appBuildId (UTF8); required binary appName (UTF8); required binary appUpdateChannel (UTF8); required binary appVersion (UTF8); required binary docType (UTF8); required binary documentId (UTF8); required int32 duplicateDelta (UINT_8); required binary normalizedChannel (UTF8); optional binary geoCity (UTF8); optional binary geoCountry (UTF8); } } ]=] metadata_group = nil json_objects = nil s3_path_dimensions = { {name = "submission", source = "Timestamp", dateformat = "%Y-%m-%d-%H"}, } batch_dir = "parquet" max_writers = 5 max_rowgroup_size = 10000 max_file_size = 1024 * 1024 * 300 max_file_age = 3600 hive_compatible = true
Assignee: nobody → whd
Points: --- → 1
Priority: -- → P1
(In reply to Mike Trinkala [:trink] from comment #0)
> filename = "s3_parquet.lua"
> message_matcher = "Type == 'telemetry.duplicate'"
> preserve_data = false
> ticker_interval = 60
>
> parquet_schema = [=[
> message telemetry_duplicates {
>     required int64 Timestamp;
>     required group Fields {
>         required binary appBuildId (UTF8);
>         required binary appName (UTF8);
>         required binary appUpdateChannel (UTF8);
>         required binary appVersion (UTF8);
>         required binary docType (UTF8);
>         required binary documentId (UTF8);
>         required int32 duplicateDelta (UINT_8);
>         required binary normalizedChannel (UTF8);
>         optional binary geoCity (UTF8);
>         optional binary geoCountry (UTF8);
>     }
> }
> ]=]
>
> metadata_group = nil
> json_objects = nil
> s3_path_dimensions = {
>     {name = "submission", source = "Timestamp", dateformat = "%Y-%m-%d-%H"},

I don't think we need to include the hour in the path dimension... one partition per day should suffice. Also, shouldn't we use the "%Y%m%d" format, for consistency with the `submissionDate` / `submission_date_s3` field used elsewhere in the pipeline?

> }
>
> batch_dir = "parquet"
> max_writers = 5
> max_rowgroup_size = 10000
> max_file_size = 1024 * 1024 * 300
> max_file_age = 3600
> hive_compatible = true
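If that suggestion is adopted, the dimension would look something like this (a sketch only; just the dateformat changes, and every other setting from comment #0 stays the same):

s3_path_dimensions = {
    -- one partition per day; "%Y%m%d" matches the submissionDate /
    -- submission_date_s3 formatting used elsewhere in the pipeline
    {name = "submission", source = "Timestamp", dateformat = "%Y%m%d"},
}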
(In reply to Mark Reid [:mreid] from comment #1)

Yes, that was what I used while testing. The max_file_age and batch_dir will also be adjusted by whd.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Whiteboard: [SvcOps]
Component: Pipeline Ingestion → General