Backfill historical ping tables in BQ to get complete event and user_pref data
Categories: Data Platform and Tools :: General, task
Tracking: Not tracked
People: Reporter: klukas, Unassigned
Now that we have completed https://bugzilla.mozilla.org/show_bug.cgi?id=1568042, we have moved event and user_pref data out of additional_properties and into their own fields, except for a window from 2019-09-01 to approximately 2019-10-15. We are going to backfill from decoded bytes for the main ping, event ping, etc. for this time period.
Tentatively, we are aiming to move forward with this the week of 2019-11-11.
Reporter | Comment 1 • 6 years ago
Note that some of these tables, such as event, did not get user_pref added as we had thought. We could delay backfilling until that's in place, but I think the main ping is the major issue here, and it has the relevant schema updates. We should move ahead with the backfill next week to ensure we have continuity for event fields. Going back and doing another backfill for ping types other than main will be relatively cheap and fast.
Reporter | Comment 2 • 6 years ago
This is done. The configurations used were:
mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
--runner=Dataflow \
--jobName=$JOB_NAME \
--project=$PROJECT \
--geoCityDatabase=gs://backfill-test-public1/GeoIP2-City.mmdb \
--geoCityFilter=gs://backfill-test-public1/cities15000.txt \
--schemasLocation=gs://backfill-test-public1/201911070121_d2c5dbc.tar.gz \
--inputType=bigquery_query \
--input=\"WITH distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4\` WHERE DATE(submission_timestamp) = '$DT' GROUP BY document_id), base AS (SELECT * FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4\` JOIN distinct_document_ids USING (document_id, submission_timestamp) WHERE DATE(submission_timestamp) = '$DT'), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * EXCEPT(_n) FROM numbered_duplicates WHERE _n = 1\" \
--bqReadMethod=export \
--outputType=bigquery \
--bqWriteMethod=file_loads \
--bqClusteringFields=normalized_channel,sample_id \
--output=${PROJECT}:telemetry_stable.\${document_type}_v\${document_version} \
--errorOutputType=bigquery \
--errorOutput=${PROJECT}:telemetry_stable.error_v1 \
--experiments=shuffle_mode=service \
--region=us-central1 \
--usePublicIps=false \
--gcsUploadBufferSizeBytes=16777216 \
"
and, for ping types other than main:
mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Sink -Dexec.args="\
--runner=Dataflow \
--jobName=$JOB_NAME \
--project=$PROJECT \
--schemasLocation=gs://backfill-test-public1/201911070121_d2c5dbc.tar.gz \
--inputType=bigquery_query \
--input=\"WITH distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__*\` WHERE DATE(submission_timestamp) = '$DT' AND _TABLE_SUFFIX != 'main_v4' GROUP BY document_id), base AS (SELECT * FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__*\` JOIN distinct_document_ids USING (document_id, submission_timestamp) WHERE DATE(submission_timestamp) = '$DT' AND payload IS NOT NULL AND _TABLE_SUFFIX != 'main_v4'), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * EXCEPT(_n) FROM numbered_duplicates WHERE _n = 1\" \
--bqReadMethod=export \
--outputType=bigquery \
--bqWriteMethod=file_loads \
--bqClusteringFields=normalized_channel,sample_id \
--output=${PROJECT}:telemetry_stable.\${document_type}_v\${document_version} \
--errorOutputType=bigquery \
--errorOutput=${PROJECT}:telemetry_stable.error_v1 \
--experiments=shuffle_mode=service \
--region=us-central1 \
--usePublicIps=false \
--gcsUploadBufferSizeBytes=16777216 \
"
Of note, we performed the deduplication as part of the input query on the payload_bytes_decoded tables, which lets us skip a separate dedup step afterwards. It is also relatively inexpensive, since we dedupe based on the compressed payload_bytes rows and therefore scan only about 1/4 as much data as we would when deduping from live to stable tables.