Closed Bug 1374831 Opened 7 years ago Closed 7 years ago

Schedule Sync bookmark validation job to run on Airflow

Categories

(Data Platform and Tools :: General, enhancement, P1)

x86
macOS
enhancement
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lina, Assigned: amiyaguchi)

References

Details

Attachments

(2 files)

I think we're happy with the backfilled data, so let's schedule the job to run continuously.

Link to notebook: https://gist.github.com/kitcambridge/364f56182f3e96fb3131bf38ff648609

The windowing could probably use some work. `sync_bmk_validation_problems` doesn't need the 10-day delay, since it doesn't aggregate, but `sync_bmk_total_per_day` does. So even if we have partial validation results for the last 10 days, we won't have denominators for the total number of users and bookmarks we validated. WDYT, Anthony?
It sounds like this should be split into two pieces -- generating the unnested table and then generating the aggregates. The first part only depends on the current day of the sync_summary, while the second part depends on a window of the derived bookmark validation table. I think there are a few modifications that should be made to reduce the amount of repeated work, but otherwise it looks good.
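To illustrate the split, here is a minimal sketch in plain Python rather than Spark. The record layout (`submission_day`, `uid`, `problems`, `name`, `count`) and both function names are assumptions made for the example, not the notebook's actual schema.

```python
from collections import defaultdict


def unnest_validation(pings):
    """Stage 1: flatten each ping into one row per reported problem.

    Only needs the pings for a single day. Field names are hypothetical.
    """
    rows = []
    for ping in pings:
        for problem in ping["problems"]:
            rows.append({
                "submission_day": ping["submission_day"],
                "uid": ping["uid"],
                "problem": problem["name"],
                "count": problem["count"],
            })
    return rows


def aggregate_window(rows, start_day, end_day):
    """Stage 2: sum problem counts per (day, problem) over a window of stage 1 rows.

    Days are YYYYMMDD strings, so string comparison orders them correctly.
    """
    totals = defaultdict(int)
    for row in rows:
        if start_day <= row["submission_day"] <= end_day:
            totals[(row["submission_day"], row["problem"])] += row["count"]
    return dict(totals)
```

Stage 1 can run as soon as a day of sync_summary lands, while stage 2 can be rerun over whatever window of derived rows is considered complete.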

There are a few steps to getting this dataset to a production ready state. I outlined these in bug 1349070 comment #2, but we've already gone through the process of querying against this data.

Once the critical changes are made to the notebook, I'll go ahead and schedule the notebook on ATMO. The notebook should be converted to be part of python_mozetl, and scheduled on airflow.
Assignee: nobody → amiyaguchi
Blocks: 1349065
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #1)
> There are a few steps to getting this dataset to a production ready state. I
> outlined these in bug 1349070 comment #2, but we've already gone through the
> process of querying against this data.

Anthony, is there anything I can help with here? What kinds of modifications do we need to make to reduce repeated work?
Flags: needinfo?(amiyaguchi)
Please remove everything related to submission_start_window and submission_end_window. This includes the `all_engine_validation_results.filter` for the `when` column, at least until the submission latency between an event and when it's submitted is quantified. It only complicates the query.

Then make the query dependent on the current date. When you backfill, you should run this job for each day. 
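A rough sketch of what a per-day backfill could look like, assuming the job is exposed as a function that takes a single `YYYYMMDD` date string. `run_for_day`, `days_between` and `backfill` are hypothetical names for the example.

```python
from datetime import date, timedelta


def days_between(start, end):
    """Yield each date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_day, start, end):
    """Call the single-day job once per day and collect the results in order."""
    return [run_for_day(day.strftime("%Y%m%d")) for day in days_between(start, end)]
```

For example, `backfill(job, date(2017, 7, 1), date(2017, 8, 31))` would call `job` once for each day in that range.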

Then schedule this job on atmo via the following link: https://analysis.telemetry.mozilla.org/jobs/new/

To run this on airflow, the notebook needs to be turned into a python module that fits in `python_mozetl` (https://github.com/mozilla/python_mozetl). This provides another level of support for the job, with retries and notifications. ATMO might fulfill most of your needs.
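As a very rough idea of the shape such a module could take, here is a sketch using only the standard library. It is not the `python_mozetl` layout: the option names, the `run` placeholder and the output path format are all assumptions for illustration.

```python
import argparse


def run(submission_date, bucket, prefix):
    """Placeholder for the per-day job (hypothetical).

    A real job would read one day of sync_summary and write the derived
    tables; here it only returns the location it would write to.
    """
    return "s3://{}/{}/submission_date={}".format(bucket, prefix, submission_date)


def parse_args(argv=None):
    """Parse the options a scheduler would pass in (names are assumptions)."""
    parser = argparse.ArgumentParser(description="Sync bookmark validation job")
    parser.add_argument("--submission-date", required=True, help="day to process, YYYYMMDD")
    parser.add_argument("--bucket", required=True, help="output bucket")
    parser.add_argument("--prefix", required=True, help="output prefix")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: parse the options and run the job for one day."""
    args = parse_args(argv)
    return run(args.submission_date, args.bucket, args.prefix)
```

Taking the date as an argument keeps the job a pure function of the day it is asked to process, which makes scheduled runs, retries and backfills use the same code path.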
Flags: needinfo?(amiyaguchi)
Comment on attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102

A month's worth of data (201707-201708) can be found at the following s3 path:
`s3://net-mozaws-prod-us-west-2-pipeline-analysis/kit/sync/bmk_total_per_day/v1/`

I've broken down the query by day, but it can be easily modified to handle chunks of data (by quarter for example). This affects the bookmark totals by day.

I've also plotted the latency of the sync summary dataset as a cumulative distribution of (submission_date - when): https://sql.telemetry.mozilla.org/queries/19707/source#50369

Most of the sync pings that are seen have a timestamp that is +/- 1 day from the submission date.
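For reference, the cumulative distribution can be sketched in plain Python as below. This is not the query behind the link above; the input is assumed to be pairs of `datetime.date` values, and `latency_cdf` is a hypothetical name.

```python
from collections import Counter
from datetime import date


def latency_cdf(pairs):
    """Return [(latency_days, cumulative_fraction)] sorted by latency.

    pairs is an iterable of (submission_date, when) date values; the latency
    is submission_date - when in whole days and may be negative.
    """
    latencies = Counter((submitted - when).days for submitted, when in pairs)
    total = sum(latencies.values())
    cdf = []
    running = 0
    for days in sorted(latencies):
        running += latencies[days]
        cdf.append((days, running / total))
    return cdf
```

The fraction of pings within +/- 1 day is then the cumulative value at 1 minus the cumulative value just below -1.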
Attachment #8898515 - Flags: review?(kit)
Comment on attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102

Thanks, Anthony!
Attachment #8898515 - Flags: review?(kit) → review+
See Also: → 1394946
Is this still active? If so, what's the priority?
This has been regularly scheduled on ATMO for a while now -- but it hasn't been scheduled on airflow yet.
Points: --- → 2
Priority: -- → P1
Priority: P1 → P2
Priority: P2 → P1
I've removed the job from ATMO since it's now scheduled on airflow.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General