Closed Bug 1400041 Opened 7 years ago Closed 7 years ago

Schedule sync flattened data backfill.

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tcsc, Assigned: amiyaguchi)

References

Details

The code from bug 1381641 has landed and is in Airflow. There is also a bug filed for importing it into re:dash regularly, but we'd like some amount of historical data backfilled for this dataset.

A backfill from the start of the year is probably sufficient, but it's possible we want more. needinfo to ADavis to answer how much backfill we want for that data.
Flags: needinfo?(adavis)
I'm checking with Leif. We're going to base our decision on performance, so we'll double-check with the data that is already there. If we think we can do 12 months and still have fast queries, then we'll do that. If not, we'll do a shorter timeline to maintain performance.

Leaving ni? on me to make sure this doesn't fall through the cracks.
Adding Leif.
Flags: needinfo?(loines)
I think going back to the start of the year is fine. We just need to be smart about writing our queries by filtering on a pre-selected date range first. Also, we can make sure to use approx_distinct() when we want to count distinct users or devices, so queries run quickly even over 1-2 months of data.
Flags: needinfo?(adavis)
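For illustration, a minimal sketch of the kind of query described above, run against Presto from Python with pyhive. The Presto host, the way the table is exposed, and the `uid` column used for the distinct count are assumptions, not a verified schema:

# Sketch only: filter on a pre-selected date range first, and use
# approx_distinct() for user/device counts so the scan stays fast.
# The Presto host and the `uid` column are assumptions for illustration.
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8080)
cursor = conn.cursor()

cursor.execute("""
    SELECT submission_date_s3,
           approx_distinct(uid) AS approx_users
    FROM sync_flat_summary
    WHERE submission_date_s3 BETWEEN '20170801' AND '20170930'
    GROUP BY submission_date_s3
    ORDER BY submission_date_s3
""")
for day, users in cursor.fetchall():
    print(day, users)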
Based on the `sync_flat_summary` run for 20170913, a 5-node cluster completes one day in about 40 minutes, 15 minutes of which is bootstrap overhead. There are roughly 260 days since the beginning of the year.

If batched properly (so the bootstrap is paid only once), it will take about 108 hours (260 days * 25 minutes / 60 minutes per hour) at 5 nodes to complete this backfill, or roughly 18 hours at 30 nodes assuming linear scaling. I plan on doing this on a single 30-node cluster instead of Airflow because of the bootstrapping overhead (a rough sketch of the batched run is at the end of this comment).

> $ aws s3 ls s3://telemetry-parquet/sync_flat_summary/v1/submission_date_s3=20170913/ --human-readable --summarize
> ...
> Total Objects: 100
>   Total Size: 11.9 GiB

This is only about 10% larger than the nested sync_summary for the same day. The current size of sync_summary since the start of the year is 2.1 TiB, so the total compressed size estimate for the flattened dataset is roughly 2.3 TiB (2.1 + 0.2) across ~2000 partitions.

This will be a sizable dataset to query through, so I'll split the work into two backfills: the last 3 months first, then the rest of the year. The first should help determine whether query performance is adequate (and avoids an overnight backfill).
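For reference, a rough sketch of the batched single-cluster run described above: one long-lived cluster pays the bootstrap cost once and then loops over submission dates. The spark-submit class name, jar, and flags are assumptions standing in for the real telemetry-batch-view invocation:

# Sketch only: drive the backfill day by day on one long-lived cluster so
# the ~15-minute bootstrap is paid once. The class name, jar, and flags
# below are assumptions, not the exact telemetry-batch-view command line.
import subprocess
from datetime import date, timedelta

start, end = date(2017, 1, 1), date(2017, 9, 20)

day = start
while day <= end:
    ds = day.strftime("%Y%m%d")
    subprocess.check_call([
        "spark-submit",
        "--class", "com.mozilla.telemetry.views.SyncFlatView",  # assumed class name
        "telemetry-batch-view.jar",                              # assumed jar path
        "--from", ds,
        "--to", ds,
        "--bucket", "telemetry-parquet",
    ])
    day += timedelta(days=1)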
Assignee: nobody → amiyaguchi
Points: --- → 2
Priority: -- → P1
I ran the first backfill on 2017-09-21 with 10 nodes for 8 hours, but the process got stuck on 20170615 after 3.5 hours of processing.

I'll be running this again today (2017-10-05) with 30 nodes. If it gets stuck again, the Spark context may have to be restarted after each day in the main processing loop, similar to how the ExperimentAnalysisView does it [1] (a sketch of that pattern follows below). Alternatively, I can drive the backfill from a script outside the job.

[1] https://github.com/mozilla/telemetry-batch-view/commit/eabdece4f1bcddf3093ed2ceed246fc867f20e8f#diff-f0a72dc1efa84cdd5d8ec4e13ed89fc1
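For illustration, the restart-per-day pattern mentioned above looks roughly like the loop below. The real job is Scala in telemetry-batch-view; this is a PySpark sketch of the same idea, and the S3 paths and per-day logic are placeholders:

# Sketch only: tear down and recreate the Spark session between days so a
# single wedged day does not stall the rest of the backfill. The real job is
# Scala (telemetry-batch-view); paths and per-day logic here are placeholders.
from datetime import date, timedelta
from pyspark.sql import SparkSession

def process_day(spark, ds):
    # Placeholder: read the nested day and write it back out; the real job
    # flattens the per-engine records before writing sync_flat_summary.
    df = spark.read.parquet(
        "s3://telemetry-parquet/sync_summary/v2/submission_date_s3=%s/" % ds)
    df.write.mode("overwrite").parquet(
        "s3://telemetry-parquet/sync_flat_summary/v1/submission_date_s3=%s/" % ds)

day, end = date(2017, 6, 1), date(2017, 9, 30)
while day <= end:
    ds = day.strftime("%Y%m%d")
    spark = SparkSession.builder.appName("sync_flat_backfill").getOrCreate()
    try:
        process_day(spark, ds)
    finally:
        spark.stop()  # fresh context for the next day
    day += timedelta(days=1)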
The backfill from 20170601 to present is done; it should show up on the next run of parquet2hive. The backfill ran at about 15 minutes per day of data (5 minutes of processing, 10 minutes of I/O) at 30 machines. It is more efficient to run this via Airflow, since running 6 jobs in parallel uses the same amount of resources while completing faster.

I'll schedule the rest of the backfill for the year early next week via Airflow.
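For reference, a minimal sketch of what scheduling the remaining range (20170101-20170531) through Airflow could look like. The DAG id, operator choice, and spark-submit command are assumptions; the production DAG in telemetry-airflow uses its own EMR operators:

# Sketch only: a daily DAG whose history can be backfilled with the Airflow
# CLI. DAG id, operator, and the spark-submit command are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "amiyaguchi",
    "start_date": datetime(2017, 1, 1),  # earliest day to backfill
    "retries": 2,
}

dag = DAG("sync_flat_view_backfill", default_args=default_args,
          schedule_interval="@daily")

sync_flat_summary = BashOperator(
    task_id="sync_flat_summary",
    # {{ ds_nodash }} is the scheduled date as YYYYMMDD
    bash_command=("spark-submit --class com.mozilla.telemetry.views.SyncFlatView "
                  "telemetry-batch-view.jar "
                  "--from {{ ds_nodash }} --to {{ ds_nodash }} "
                  "--bucket telemetry-parquet"),
    dag=dag,
)

# The remaining range would then be backfilled with, for example:
#   airflow backfill -s 2017-01-01 -e 2017-05-31 sync_flat_view_backfill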
I've noticed there is no longer a sync_flat_summary table in re:dash (Presto)... do we need to wait until the rest of the backfill is done?
Flags: needinfo?(loines)
Depends on: 1406984
The dataset shouldn't have disappeared as a result of this backfill (and should still be accessible via ATMO). I've filed a separate bug because I don't have access to the relevant logs.
I think this can be closed now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General