Schedule Sync bookmark validation job to run on Airflow

RESOLVED FIXED

Status

Data Platform and Tools
Datasets: General
P1
normal
RESOLVED FIXED
a year ago
7 months ago

People

(Reporter: lina, Assigned: amiyaguchi)

Tracking

(Blocks: 1 bug)

Details

Attachments

(2 attachments)

I think we're happy with the backfilled data, so let's schedule the job to run continuously.

Link to notebook: https://gist.github.com/kitcambridge/364f56182f3e96fb3131bf38ff648609

The windowing could probably use some work. `sync_bmk_validation_problems` doesn't need the 10-day delay, since it doesn't aggregate, but `sync_bmk_total_per_day` does. So even if we have partial validation results for the last 10 days, we won't have denominators for the total number of users and bookmarks we validated. WDYT, Anthony?
(Assignee)

Comment 1

a year ago
It sounds like this should be split into two pieces -- generating the unnested table and then generating the aggregates. The first part only depends on the current day of the sync_summary, while the second part depends on a window of the derived bookmark validation table. I think there are a few modifications that should be made to reduce the amount of repeated work, but otherwise it looks good.
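
A minimal sketch of the two-stage split described above, written in plain Python rather than Spark so that it stays self-contained. The record layout (`uid`, `submission_date`, `problems`) and both function names are assumptions for illustration, not the notebook's actual schema.

```python
from collections import defaultdict


def unnest_validation(pings):
    """Stage 1: flatten one day of pings into one row per validation problem.

    This stage only needs the pings for a single submission date.
    """
    rows = []
    for ping in pings:
        for problem in ping["problems"]:
            rows.append({
                "uid": ping["uid"],
                "submission_date": ping["submission_date"],
                "name": problem["name"],
                "count": problem["count"],
            })
    return rows


def totals_per_day(rows):
    """Stage 2: aggregate the derived rows into per-day totals.

    This stage reads the output of stage 1 (possibly over a window of
    days) and never touches the raw pings again.
    """
    users = defaultdict(set)
    problems = defaultdict(int)
    for row in rows:
        users[row["submission_date"]].add(row["uid"])
        problems[row["submission_date"]] += row["count"]
    return {
        day: {"users": len(users[day]), "problems": problems[day]}
        for day in sorted(users)
    }
```

Keeping the stages separate means the unnested table is computed once per day, and the aggregates can be recomputed over any window without repeating that work.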

There are a few steps to getting this dataset to a production ready state. I outlined these in bug 1349070 comment #2, but we've already gone through the process of querying against this data.

Once the critical changes are made to the notebook, I'll go ahead and schedule the notebook on ATMO. The notebook should be converted to be part of python_mozetl, and scheduled on airflow.
Assignee: nobody → amiyaguchi
(Assignee)

Updated

a year ago
Blocks: 1349065
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #1)
> There are a few steps to getting this dataset to a production ready state. I
> outlined these in bug 1349070 comment #2, but we've already gone through the
> process of querying against this data.

Anthony, is there anything I can help with here? What kinds of modifications do we need to make to reduce repeated work?
Flags: needinfo?(amiyaguchi)
(Assignee)

Comment 3

11 months ago
Please remove everything related to submission_start_window and submission_end_window. This includes the `all_engine_validation_results.filter` on the `when` column, at least until the latency between an event and its submission is quantified. It only complicates the query.

Then make the query dependent on the current date. When you backfill, you should run this job for each day. 
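
A rough sketch of what a per-day backfill could look like once the query depends only on the current date. `run_for_day` is a hypothetical callable standing in for the single-day job, and the YYYYMMDD date format is an assumption.

```python
from datetime import date, timedelta


def days_between(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_day, start, end):
    """Run the single-day job once for each day in the range.

    `run_for_day` is a hypothetical callable that takes a YYYYMMDD string.
    """
    return [
        run_for_day(day.strftime("%Y%m%d"))
        for day in days_between(start, end)
    ]
```

For example, `backfill(job, date(2017, 7, 1), date(2017, 7, 31))` would call `job` 31 times, once per day of July.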

Then schedule this job on atmo via the following link: https://analysis.telemetry.mozilla.org/jobs/new/

To run this on airflow, the notebook needs to be turned into a python module that fits in `python_mozetl` (https://github.com/mozilla/python_mozetl). This provides another level of support for the job, with retries and notifications. ATMO might fulfill most of your needs.
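
As a hedged illustration of the module shape, the notebook's top-level cells could become functions behind a small command-line entry point that a scheduler calls with one date per run. This sketch uses the standard library's `argparse` to stay self-contained. The `--date` and `--bucket` options and the function names are assumptions, not the actual `python_mozetl` interface.

```python
import argparse
from datetime import datetime


def parse_date(value):
    """Parse a YYYYMMDD string into a date object."""
    return datetime.strptime(value, "%Y%m%d").date()


def parse_args(argv=None):
    """Parse the job parameters (hypothetical option names)."""
    parser = argparse.ArgumentParser(
        description="Sync bookmark validation job (sketch)")
    parser.add_argument("--date", required=True, type=parse_date,
                        help="submission date to process, as YYYYMMDD")
    parser.add_argument("--bucket", required=True,
                        help="output bucket name")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: the scheduler passes a single date per run."""
    args = parse_args(argv)
    # The notebook's extract/transform/load steps would go here.
    return {"date": args.date, "bucket": args.bucket}
```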
Flags: needinfo?(amiyaguchi)
(Assignee)

Comment 4

10 months ago
Created attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102
(Assignee)

Comment 5

10 months ago
Comment on attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102

A month's worth of data (201707-201708) can be found at the following s3 path:
`s3://net-mozaws-prod-us-west-2-pipeline-analysis/kit/sync/bmk_total_per_day/v1/`

I've broken down the query by day, but it can easily be modified to handle larger chunks of data (by quarter, for example). This affects the bookmark totals by day.

I've also plotted the latency of the sync summary dataset as a cumulative distribution of (submission_date - when): https://sql.telemetry.mozilla.org/queries/19707/source#50369

Most of the sync pings seen have a timestamp within +/- 1 day of the submission date.
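
For reference, a cumulative distribution of (submission_date - when) can be computed along these lines. This is a sketch in plain Python, not the query behind the linked plot, and the input shape (pairs of dates) is an assumption.

```python
from collections import Counter
from datetime import date


def latency_cdf(pairs):
    """Return [(latency_days, cumulative_fraction)] sorted by latency.

    `pairs` is an iterable of (submission_date, when) date tuples.
    A negative latency means `when` is later than the submission date.
    """
    counts = Counter((submitted - when).days for submitted, when in pairs)
    total = sum(counts.values())
    cdf = []
    running = 0
    for latency in sorted(counts):
        running += counts[latency]
        cdf.append((latency, running / total))
    return cdf
```

The fraction of pings within +/- 1 day is then the cumulative value at latency 1 minus the cumulative value just below latency -1.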
Attachment #8898515 - Flags: review?(kit)
Comment on attachment 8898515 [details] [review]
Bug 1374831 - Sync bookmark validation #102

Thanks, Anthony!
Attachment #8898515 - Flags: review?(kit) → review+
Duplicate of this bug: 1319576
See Also: → bug 1394946

Comment 8

8 months ago
Is this still active? If so, what's the priority?
(Assignee)

Comment 9

8 months ago
This has been regularly scheduled on ATMO for a while now -- but it hasn't been scheduled on airflow yet.
Points: --- → 2
Priority: -- → P1

Updated

8 months ago
Priority: P1 → P2
(Assignee)

Updated

7 months ago
Priority: P2 → P1
(Assignee)

Comment 11

7 months ago
I've removed the job from ATMO since it's now scheduled on airflow.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → FIXED