Closed Bug 1593872 Opened 6 years ago Closed 6 years ago

Backfill historical ping tables in BQ to get complete event and user_pref data

Categories

(Data Platform and Tools :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Unassigned)

References

Details

Now that we have completed https://bugzilla.mozilla.org/show_bug.cgi?id=1568042, event and user_pref data have been moved out of additional_properties and into their own fields, except in a window from 2019-09-01 to approximately 2019-10-15. We are going to backfill from decoded bytes for the main ping, event ping, etc. for this time period.

Tentatively, we are aiming to move forward with this the week of 2019-11-11.

Depends on: 1594902
No longer depends on: 1594902

Note that some of these tables, such as event, did not get user_pref added as we had thought. We could delay the backfill until that's in place, but I think the main ping is the major concern here, and it already has the relevant schema updates. We should move ahead with the backfill next week to ensure we have continuity for event fields. Going back later and doing another backfill for ping types other than main will be relatively cheap and fast.
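As a quick check on whether a given stable table has already picked up the user_pref schema change, the table's schema can be inspected directly. A minimal sketch, assuming the event ping stable table is moz-fx-data-shared-prod:telemetry_stable.event_v4 (the table name is chosen for illustration, not confirmed in this bug):

# Sketch: look for user_pref fields in a stable table's schema.
# The table name below is an assumption used for illustration only.
bq show --schema --format=prettyjson \
    moz-fx-data-shared-prod:telemetry_stable.event_v4 \
  | grep -i user_pref \
  || echo "no user_pref fields found in schema"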

This is done. The configurations used were:

mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
    --runner=Dataflow \
    --jobName=$JOB_NAME \
    --project=$PROJECT  \
    --geoCityDatabase=gs://backfill-test-public1/GeoIP2-City.mmdb \
    --geoCityFilter=gs://backfill-test-public1/cities15000.txt \
    --schemasLocation=gs://backfill-test-public1/201911070121_d2c5dbc.tar.gz \
    --inputType=bigquery_query \
    --input=\"WITH distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4\` WHERE DATE(submission_timestamp) = '$DT' GROUP BY document_id), base AS (SELECT * FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4\`  JOIN distinct_document_ids USING (document_id, submission_timestamp) WHERE DATE(submission_timestamp) = '$DT'), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * EXCEPT(_n) FROM numbered_duplicates WHERE _n = 1\" \
    --bqReadMethod=export \
    --outputType=bigquery \
    --bqWriteMethod=file_loads \
    --bqClusteringFields=normalized_channel,sample_id \
    --output=${PROJECT}:telemetry_stable.\${document_type}_v\${document_version} \
    --errorOutputType=bigquery \
    --errorOutput=${PROJECT}:telemetry_stable.error_v1 \
    --experiments=shuffle_mode=service \
    --region=us-central1 \
    --usePublicIps=false \
    --gcsUploadBufferSizeBytes=16777216 \
"

and

mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Sink -Dexec.args="\
    --runner=Dataflow \
    --jobName=$JOB_NAME \
    --project=$PROJECT  \
    --schemasLocation=gs://backfill-test-public1/201911070121_d2c5dbc.tar.gz \
    --inputType=bigquery_query \
    --input=\"WITH distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__*\` WHERE DATE(submission_timestamp) = '$DT' AND _TABLE_SUFFIX != 'main_v4' GROUP BY document_id), base AS (SELECT * FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__*\`  JOIN distinct_document_ids USING (document_id, submission_timestamp) WHERE DATE(submission_timestamp) = '$DT' AND payload IS NOT NULL AND _TABLE_SUFFIX != 'main_v4'), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * EXCEPT(_n) FROM numbered_duplicates WHERE _n = 1\" \
    --bqReadMethod=export \
    --outputType=bigquery \
    --bqWriteMethod=file_loads \
    --bqClusteringFields=normalized_channel,sample_id \
    --output=${PROJECT}:telemetry_stable.\${document_type}_v\${document_version} \
    --errorOutputType=bigquery \
    --errorOutput=${PROJECT}:telemetry_stable.error_v1 \
    --experiments=shuffle_mode=service \
    --region=us-central1 \
    --usePublicIps=false \
    --gcsUploadBufferSizeBytes=16777216 \
"

Of note, we did the deduplication as part of the input query on the payload_bytes_decoded tables, which lets us skip a separate dedup step afterwards. It's also relatively inexpensive, since we dedupe based on the compressed payload_bytes rows and therefore scan only about a quarter as much data as we would when deduping from live to stable tables.
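For readability, the escaped --input string in the Decoder job boils down to the dedup pattern below. The SQL is the same query reformatted; wrapping it in a bq dry run (which only estimates bytes scanned, without executing) is just one illustrative way to cost-check it beforehand:

# Sketch: the dedup pattern from the Decoder job's --input query, reformatted.
# Shown as a dry run to estimate scanned bytes; this does not execute the query.
bq query --nouse_legacy_sql --dry_run "
WITH distinct_document_ids AS (
  SELECT document_id, MIN(submission_timestamp) AS submission_timestamp
  FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4\`
  WHERE DATE(submission_timestamp) = '$DT'
  GROUP BY document_id
), base AS (
  SELECT *
  FROM \`moz-fx-data-shared-prod.payload_bytes_decoded.telemetry_telemetry__main_v4\`
  JOIN distinct_document_ids USING (document_id, submission_timestamp)
  WHERE DATE(submission_timestamp) = '$DT'
), numbered_duplicates AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base
)
SELECT * EXCEPT(_n) FROM numbered_duplicates WHERE _n = 1
"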

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED