Bug 1365012 (Closed)
Add direct to parquet output for telemetry.duplicate messages
Opened 8 years ago · Closed 8 years ago
Categories: Data Platform and Tools :: General, enhancement, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: trink; Assigned: whd
Whiteboard: [SvcOps]

Description
filename = "s3_parquet.lua"
message_matcher = "Type == 'telemetry.duplicate'"
preserve_data = false
ticker_interval = 60
parquet_schema = [=[
message telemetry_duplicates {
    required int64 Timestamp;
    required group Fields {
        required binary appBuildId (UTF8);
        required binary appName (UTF8);
        required binary appUpdateChannel (UTF8);
        required binary appVersion (UTF8);
        required binary docType (UTF8);
        required binary documentId (UTF8);
        required int32 duplicateDelta (UINT_8);
        required binary normalizedChannel (UTF8);
        optional binary geoCity (UTF8);
        optional binary geoCountry (UTF8);
    }
}
]=]
metadata_group = nil
json_objects = nil
s3_path_dimensions = {
{name = "submission", source = "Timestamp", dateformat = "%Y-%m-%d-%H"},
}
batch_dir = "parquet"
max_writers = 5
max_rowgroup_size = 10000
max_file_size = 1024 * 1024 * 300
max_file_age = 3600
hive_compatible = true
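For illustration, the `s3_path_dimensions` entry above partitions output by hour of the message `Timestamp`, and with `hive_compatible = true` the partition is rendered as a `key=value` path component. A minimal Python sketch of that path rendering (an assumption for illustration only; Heka/Hindsight timestamps are nanoseconds since the epoch, and the actual plugin is Lua):

```python
from datetime import datetime, timezone

def s3_partition_path(timestamp_ns, dateformat="%Y-%m-%d-%H", hive_compatible=True):
    """Render the 'submission' dimension path component (hypothetical helper)."""
    dt = datetime.fromtimestamp(timestamp_ns / 1e9, tz=timezone.utc)
    value = dt.strftime(dateformat)
    return "submission=%s" % value if hive_compatible else value

# Example: a message stamped 2017-05-16 10:30:00 UTC
ts = int(datetime(2017, 5, 16, 10, 30, tzinfo=timezone.utc).timestamp() * 1e9)
print(s3_partition_path(ts))  # submission=2017-05-16-10
```

With the config as written, each hour of data lands under its own `submission=...` prefix in S3.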
Reporter
Updated•8 years ago
Assignee: nobody → whd
Points: --- → 1
Priority: -- → P1
Comment 1•8 years ago
(In reply to Mike Trinkala [:trink] from comment #0)
> filename = "s3_parquet.lua"
> message_matcher = "Type == 'telemetry.duplicate'"
> preserve_data = false
> ticker_interval = 60
>
> parquet_schema = [=[
> message telemetry_duplicates {
> required int64 Timestamp;
> required group Fields {
> required binary appBuildId (UTF8);
> required binary appName (UTF8);
> required binary appUpdateChannel (UTF8);
> required binary appVersion (UTF8);
> required binary docType (UTF8);
> required binary documentId (UTF8);
> required int32 duplicateDelta (UINT_8);
> required binary normalizedChannel (UTF8);
> optional binary geoCity (UTF8);
> optional binary geoCountry (UTF8);
> }
> }
> ]=]
>
> metadata_group = nil
> json_objects = nil
> s3_path_dimensions = {
> {name = "submission", source = "Timestamp", dateformat = "%Y-%m-%d-%H"},
I don't think we need to include the hour in the path dimension; one partition per day should suffice. Also, shouldn't we use the "%Y%m%d" format, for consistency with the `submissionDate` / `submission_date_s3` field used elsewhere in the pipeline?
> }
>
> batch_dir = "parquet"
> max_writers = 5
> max_rowgroup_size = 10000
> max_file_size = 1024 * 1024 * 300
> max_file_age = 3600
> hive_compatible = true
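The suggestion in comment 1 amounts to swapping the hourly date format for a daily one. A quick Python comparison of the two strftime formats (illustrative only; the plugin config itself is Lua):

```python
from datetime import datetime, timezone

dt = datetime(2017, 5, 16, 10, 30, tzinfo=timezone.utc)
print(dt.strftime("%Y-%m-%d-%H"))  # 2017-05-16-10  (hourly, as in the original config)
print(dt.strftime("%Y%m%d"))       # 20170516       (daily, matching submission_date_s3)
```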
Reporter
Comment 2•8 years ago
(In reply to Mark Reid [:mreid] from comment #1)
Yes, that is what I used while testing. The max_file_age and batch_dir will also be adjusted by whd.
Assignee
Comment 3•8 years ago
https://github.com/mozilla-services/puppet-config/pull/2582, deployed today.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Whiteboard: [SvcOps]
Updated•3 years ago
Component: Pipeline Ingestion → General