Closed Bug 1455383 Opened 7 years ago Closed 7 years ago

Move core ping d2p output to (submission_date_s3, app_name, os) partitioning

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points: 2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: relud)

We should also move existing parquet output over. This will have to be done as a Spark job, since e.g. os is not an existing partition.
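A minimal sketch of the target layout, assuming Hive-style `column=value` directories in the order given in the bug title. The `core/v3` prefix and the record field values are hypothetical, and the percent-encoding is an illustrative choice rather than the pipeline's actual escaping. In a Spark job the equivalent layout would typically come from `DataFrameWriter.partitionBy` with the same three columns.

```python
from urllib.parse import quote

# Partition columns in the order named in the bug title.
PARTITION_COLUMNS = ("submission_date_s3", "app_name", "os")

# Placeholder Hive and Spark use for a null partition value.
NULL_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_path(record: dict, prefix: str = "core/v3") -> str:
    """Return the Hive-style partition path for one record.

    `prefix` is a hypothetical dataset location, not the real bucket layout.
    """
    parts = []
    for col in PARTITION_COLUMNS:
        value = record.get(col)
        if value is None:
            value = NULL_PARTITION
        # Percent-encode so a value containing "/" cannot add a path level.
        parts.append(f"{col}={quote(str(value), safe='')}")
    return "/".join([prefix, *parts])
```

For example, a record with `submission_date_s3="20180501"`, `app_name="Fennec"`, and `os="Android"` maps to `core/v3/submission_date_s3=20180501/app_name=Fennec/os=Android`.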
Component: Datasets: General → Datasets: Mobile
Points: --- → 3
Priority: -- → P3
Assignee: nobody → dthorn
Points: 3 → 2
Priority: P3 → P2
Will bump to P1 on Friday.
Plan:
1. Backfill from v2 to v3, up to "yesterday" (relative to step 3).
2. Notify fx-data-dev@mozilla.org of a cutover period.
3. Merge and deploy https://github.com/mozilla-services/puppet-config/pull/2726 and https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/149.
4. Backfill the partial day when the deploy from step 3 completes.
5. At the same time as step 4, update the Athena/Presto tables to v3.
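The backfill window in step 1 could be computed along these lines. This is only a sketch: the `YYYYMMDD` format for `submission_date_s3` and the function name `backfill_days` are assumptions, and the start date is whatever the first v2 partition happens to be.

```python
from datetime import date, timedelta
from typing import List


def backfill_days(first_day: date, deploy_day: date) -> List[str]:
    """List the full days to backfill, ending the day before the deploy.

    The deploy day itself is excluded; it is the partial day that gets
    backfilled separately once the deploy completes.
    """
    last_full_day = deploy_day - timedelta(days=1)
    days = []
    current = first_day
    while current <= last_full_day:
        # Assumed partition value format: YYYYMMDD.
        days.append(current.strftime("%Y%m%d"))
        current += timedelta(days=1)
    return days
```

With a deploy on 2018-05-01 and a hypothetical start of 2018-04-28, this yields `20180428`, `20180429`, and `20180430`.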
Since these appear to be entirely output schema changes, I'd prefer to run multiple outputs for a time to avoid a cutover period. It is my understanding that the timing of (5), unless we explicitly configure p2h otherwise, will happen automatically as the new version of the dataset becomes available, even in partial form. If the goal is to have the unversioned pointer for the dataset only point to the latest version when it's fully backfilled, we will need to manage that explicitly.
Running multiple outputs sounds like a good plan. I'll update my PRs for multiple outputs. I'm going to backfill into the telemetry-backfill bucket, so as not to trigger automatic table detection. The unversioned table pointer will then get updated when we start the second output, at which time I will copy the backfill over.
This will be a backwards-incompatible change, as the column channel will become metadata.normalizedChannel. I will notify fx-data-dev@mozilla.org of the upcoming change, asking people dependent on the column to change to using the versioned table. Then I'll track down recurring queries in STMO that will break, and make sure those get updated. :frank does that sound good to you?
Flags: needinfo?(fbertsch)
I'm currently aiming to deploy these changes Tuesday morning (May 1).
Priority: P2 → P1
stmo query used to find and update queries to avoid breakage: https://sql.telemetry.mozilla.org/queries/52792/source
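The contents of the linked query are not reproduced in this bug. As a rough sketch of the kind of check involved, the following flags saved queries that read the unversioned table and also reference the old `channel` column. The table name `core_parquet`, the query ids, and the function name are all hypothetical.

```python
import re
from typing import Dict, List


def queries_to_update(queries: Dict[int, str], table: str,
                      column: str = "channel") -> List[int]:
    """Return ids of queries that use the unversioned table and the old column.

    Word boundaries keep `core_parquet_v3` from matching `core_parquet`
    and `normalizedChannel` from matching `channel`.
    """
    table_re = re.compile(rf"\b{re.escape(table)}\b", re.IGNORECASE)
    column_re = re.compile(rf"\b{re.escape(column)}\b", re.IGNORECASE)
    return sorted(
        qid for qid, sql in queries.items()
        if table_re.search(sql) and column_re.search(sql)
    )
```

A text match like this can produce false positives (for example, `channel` appearing in a comment), so each hit would still need a manual look before editing the query.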
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Flags: needinfo?(fbertsch)
Component: Datasets: Mobile → General