Backfill 1% of data from telemetry-sample as an early step
Categories
(Data Platform and Tools :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: mreid, Assigned: klukas)
This unblocks migration of things currently dependent on the Longitudinal dataset, as well as being a low-cost test of our import pathway.
Comment 1•5 years ago
I started the initial import yesterday via GCS transfer service. It is about 40% done so may finish over the weekend. Data is being written to gs://moz-fx-data-prod-data/telemetry-sample-2/.
Comment 2•5 years ago
The import job finished in 52 hours. This process (if extrapolated to 100% via simple multiplication) would take about 10 times longer to import than we're expecting, so I'm going to run a separate test on a single day of telemetry-3, to be detailed elsewhere. At any rate, the sample data is available now for developing the backfill procedure. I'm assuming this bug encompasses more than the AWS->GCP import, so I'm leaving it open.
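The extrapolation in this comment is simple arithmetic, sketched below. The 52 hours and the 1% sample size come from the comment itself. The implied expectation is only back-derived from the "about 10 times longer" remark, and the whole calculation assumes transfer time scales linearly with data volume.

```python
# Rough extrapolation of the 1% import time to the full dataset.
# Assumption: transfer time scales linearly with data volume.
SAMPLE_HOURS = 52      # observed duration of the 1% import
SAMPLE_PERCENT = 1     # telemetry-sample holds 1% of the data

full_hours = SAMPLE_HOURS * 100 / SAMPLE_PERCENT
full_days = full_hours / 24

# "About 10 times longer than expected" implies roughly this expectation.
implied_expected_hours = full_hours / 10

print(f"Extrapolated full import: {full_hours:.0f} hours (~{full_days:.0f} days)")
print(f"Implied expectation: ~{implied_expected_hours:.0f} hours")
```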
Comment 3•5 years ago (Assignee)
Late last week, we ran a series of jobs for this and populated backfill-test-252723.test_ingestion_1pct.telemetry__main_v4. We now need to validate that output, which should already be fully deduplicated per day and should reach back to 2018-11-01.
Errors are in backfill-test-252723.test_ingestion_1pct.error.
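One way to check the "fully deduplicated per day" property is to look for document IDs that appear more than once within the same day. The sketch below is only illustrative: it works on in-memory (day, document ID) pairs, and the function name and sample rows are hypothetical rather than taken from the actual validation jobs.

```python
from collections import Counter, defaultdict


def duplicates_per_day(rows):
    """Return {day: sorted document IDs seen more than once on that day}.

    `rows` is an iterable of (day, document_id) pairs (hypothetical shape).
    """
    counts = defaultdict(Counter)
    for day, doc_id in rows:
        counts[day][doc_id] += 1

    result = {}
    for day, counter in counts.items():
        repeated = sorted(doc_id for doc_id, n in counter.items() if n > 1)
        if repeated:
            result[day] = repeated
    return result


# Made-up sample rows for illustration only.
sample = [
    ("2019-08-30", "a"),
    ("2019-08-30", "b"),
    ("2019-08-30", "a"),  # duplicate within the same day
    ("2019-08-31", "a"),  # same ID on another day: not a per-day duplicate
]
dupes = duplicates_per_day(sample)
print(dupes)
```

An empty result would indicate that no document ID repeats within any single day.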
Comment 4•5 years ago (Assignee)
:relud has been validating a day of this data vs. what's in main_summary.
He found 217 messages in main_summary that are missing from the heka import, but all of them show reasonable validation errors:
216 have a negative sessionLength and 1 has "timezoneOffset":539.55, so it seems correct that they were thrown out.
There is one anomaly not yet explained:
13578523-d76c-43f0-963b-2e7d9a903e0b on 2019-08-30 doesn't exist in main_summary, but does exist in the 1pct table.
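The comparison described here amounts to a two-way set difference over document IDs for a single day. A minimal sketch, assuming the IDs from each table have already been exported into Python collections; the function name and the placeholder IDs are hypothetical and this is not the query :relud actually ran.

```python
def compare_ids(main_summary_ids, backfill_ids):
    """Report document IDs present in one source but not the other."""
    main_summary_ids = set(main_summary_ids)
    backfill_ids = set(backfill_ids)
    return {
        "only_in_main_summary": main_summary_ids - backfill_ids,
        "only_in_backfill": backfill_ids - main_summary_ids,
    }


# Placeholder IDs for illustration only.
diff = compare_ids(
    main_summary_ids=["id-1", "id-2", "id-3"],
    backfill_ids=["id-2", "id-3", "id-4"],
)
print(diff)
```

IDs that appear only in main_summary would then be checked against the error table, while IDs that appear only in the backfill correspond to anomalies like the one noted above.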
Comment 5•5 years ago (Reporter)
The 1% backfill table now exists as: moz-fx-data-shared-prod:static.main_1pct_backfill