Closed
Bug 1374831
Opened 7 years ago
Closed 7 years ago
Schedule Sync bookmark validation job to run on Airflow
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: lina, Assigned: amiyaguchi)
References
Details
Attachments
(2 files)
I think we're happy with the backfilled data, so let's schedule the job to run continuously.

Link to notebook: https://gist.github.com/kitcambridge/364f56182f3e96fb3131bf38ff648609

The windowing could probably use some work. `sync_bmk_validation_problems` doesn't need the 10-day delay, since it doesn't aggregate, but `sync_bmk_total_per_day` does. So even if we have partial validation results for the last 10 days, we won't have denominators for the total number of users and bookmarks we validated.

WDYT, Anthony?
Assignee
Comment 1•7 years ago
It sounds like this should be split into two pieces -- generating the unnested table and then generating the aggregates. The first part only depends on the current day of sync_summary, while the second part depends on a window of the derived bookmark validation table. I think there are a few modifications that should be made to reduce the amount of repeated work, but otherwise it looks good.

There are a few steps to getting this dataset to a production-ready state. I outlined these in bug 1349070 comment #2, but we've already gone through the process of querying against this data. Once the critical changes are made to the notebook, I'll go ahead and schedule the notebook on ATMO. The notebook should then be converted to be part of python_mozetl and scheduled on Airflow.
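A minimal sketch of the two-piece split described above, in plain Python rather than Spark so it runs anywhere. The row layout (`submission_date`, `uid`, `problems`, `name`, `count`) and both function names are assumptions made for illustration; they are not taken from the notebook or from `sync_summary`'s actual schema.

```python
from collections import defaultdict
from datetime import date, timedelta


def unnest_day(sync_summary_rows, day):
    """Stage 1: flatten validation problems for a single submission day.

    Only rows whose (hypothetical) submission_date equals `day` are read,
    so this stage can run once per day without looking at other days.
    """
    out = []
    for row in sync_summary_rows:
        if row["submission_date"] != day:
            continue
        for problem in row["problems"]:
            out.append({
                "submission_date": day,
                "uid": row["uid"],
                "name": problem["name"],
                "count": problem["count"],
            })
    return out


def aggregate_window(unnested_rows, end_day, window_days):
    """Stage 2: total problem counts per day over a trailing window.

    Reads only the derived (unnested) rows, never the raw summary.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    totals = defaultdict(int)
    for row in unnested_rows:
        if start_day <= row["submission_date"] <= end_day:
            totals[row["submission_date"]] += row["count"]
    return dict(totals)
```

Keeping stage 1 keyed on a single day means a rerun only recomputes that day, while stage 2 can be rerun over any window of the derived table without touching the raw data again.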
Assignee: nobody → amiyaguchi
Reporter
Comment 2•7 years ago
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #1)
> There are a few steps to getting this dataset to a production ready state. I
> outlined these in bug 1349070 comment #2, but we've already gone through the
> process of querying against this data.

Anthony, is there anything I can help with here? What kinds of modifications do we need to make to reduce repeated work?
Flags: needinfo?(amiyaguchi)
Assignee
Comment 3•7 years ago
Please remove everything related to submission_start_window and submission_end_window. This includes the `all_engine_validation_results.filter` on the `when` column, at least until the latency between an event and its submission is quantified. It only complicates the query.

Then make the query dependent on the current date. When you backfill, you should run this job for each day.

Then schedule this job on ATMO via the following link: https://analysis.telemetry.mozilla.org/jobs/new/

To run this on Airflow, the notebook needs to be turned into a Python module that fits in `python_mozetl` (https://github.com/mozilla/python_mozetl). This provides another level of support to the job for retries and notifications. ATMO might fulfill most of your needs.
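As a rough illustration of "run this job for each day" when backfilling, the sketch below calls a job once per date in a range. `backfill`, `days_between`, and the `YYYYMMDD` date-string format are assumptions for the example, and `run_job` stands in for whatever entry point the converted notebook ends up exposing.

```python
from datetime import date, timedelta


def days_between(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_job, start, end):
    """Call run_job once per day with a YYYYMMDD string.

    run_job is a hypothetical callable taking a single date string;
    the list of processed date strings is returned for logging.
    """
    processed = []
    for day in days_between(start, end):
        ds = day.strftime("%Y%m%d")
        run_job(ds)
        processed.append(ds)
    return processed
```

For example, `backfill(print, date(2017, 7, 30), date(2017, 8, 2))` would invoke the job for four consecutive days, crossing the month boundary.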
Flags: needinfo?(amiyaguchi)
Assignee
Comment 4•7 years ago
Assignee
Comment 5•7 years ago
Comment on attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102

A month's worth of data (201707-201708) can be found at the following s3 path: `s3://net-mozaws-prod-us-west-2-pipeline-analysis/kit/sync/bmk_total_per_day/v1/`

I've broken down the query by day, but it can be easily modified to handle chunks of data (by quarter, for example). This affects the bookmark totals by day.

I've also plotted the latency of the sync summary dataset as the cumulative distribution of (submission_date - when): https://sql.telemetry.mozilla.org/queries/19707/source#50369

Most of the sync pings that are seen have a timestamp that is +/- 1 day from the submission date.
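For readers who want to reproduce the latency figure outside of the linked query, here is a small sketch of an empirical cumulative distribution of (submission_date - when) in whole days. It is not the query behind the link; `latency_cdf` and its input shape (pairs of `date` objects) are assumptions for illustration.

```python
from collections import Counter
from datetime import date


def latency_cdf(pairs):
    """Empirical CDF of submission latency in days.

    pairs: iterable of (submission_date, when) tuples of datetime.date.
    Returns a list of (latency_days, cumulative_fraction) sorted by latency.
    Negative latencies are kept, since `when` can be later than the
    submission date.
    """
    latencies = Counter((submitted - when).days for submitted, when in pairs)
    total = sum(latencies.values())
    cdf = []
    running = 0
    for days in sorted(latencies):
        running += latencies[days]
        cdf.append((days, running / total))
    return cdf
```

Reading the fraction at latencies -1 and 1 gives the share of pings within a day of the submission date.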
Attachment #8898515 -
Flags: review?(kit)
Reporter
Comment 6•7 years ago
Comment on attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102

Thanks, Anthony!
Attachment #8898515 -
Flags: review?(kit) → review+
Comment 8•7 years ago
Is this still active? If so, what's the priority?
Assignee
Comment 9•7 years ago
This has been regularly scheduled on ATMO for a while now -- but it hasn't been scheduled on Airflow yet.
Points: --- → 2
Priority: -- → P1
Updated•7 years ago
Priority: P1 → P2
Assignee
Updated•7 years ago
Priority: P2 → P1
Comment 10•7 years ago
Assignee
Comment 11•7 years ago
I've removed the job from ATMO since it's now scheduled on Airflow.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•2 years ago
Component: Datasets: General → General