Bug 1428411 - Duplicate rows in some Longitudinal datasets
Opened 8 years ago; closed 6 years ago
Categories: Data Platform and Tools Graveyard :: Datasets: Longitudinal (enhancement, P1)
Tracking: not tracked
Status: RESOLVED FIXED
People: Reporter: bugzilla; Unassigned
Details
While verifying data for bug 1427749, I noticed that longitudinal v20171209 had about 1M more rows than other weeks. Comparing the approximate distinct client count against the row count confirms a ~1M disparity between the two. The parquet files for that date also include extras: a clean run has 480 files, since that's how many partitions we use in the reduce step, but there are 552 under that key.
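A minimal sketch of that check, assuming the dataset lives under an s3 prefix like the one below and has a client_id column (both the path and the column name are illustrative, not taken from this bug):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

object LongitudinalRowCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("longitudinal-row-check").getOrCreate()

    // Hypothetical location and version tag; the real bucket layout may differ.
    val path = "s3://telemetry-parquet/longitudinal/v20171209"
    val df = spark.read.parquet(path)

    // A clean run has one row per client, so the row count and the
    // (approximate) distinct client count should be nearly identical.
    val rows = df.count()
    val approxClients = df.agg(approx_count_distinct("client_id")).first().getLong(0)
    println(s"rows=$rows approx_clients=$approxClients")

    // The number of underlying parquet part files should match the number of
    // reduce partitions (480 in a clean run); extra files suggest re-run tasks.
    println(s"files=${df.inputFiles.length}")
  }
}
```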
I found the logs for that run, and it looks like the cluster lost contact with an executor at some point and re-ran its tasks. This wouldn't be an issue if we were using the Spark Dataset API's write method, since Spark would handle the retried tasks itself, but we currently use a custom write-partition-locally-and-upload flow, so duplicate tasks result in duplicate files.
The immediate fix is to figure out how many other dates are affected and re-run the job flow for those dates. The real, harder fix is to modify the job to use the Spark Dataset API. We've been talking about needing to refactor this job for a while, but this might force the issue.
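A rough sketch of what that harder fix could look like; the names (longitudinal, outputPath, numPartitions) are illustrative and not the job's actual API:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hand the output to the Dataset API instead of the custom
// write-partition-locally-and-upload flow.
def writeLongitudinal(longitudinal: DataFrame,
                      outputPath: String,
                      numPartitions: Int = 480): Unit = {
  longitudinal
    .repartition(numPartitions)   // same parallelism as the current reduce step
    .write
    .mode(SaveMode.Overwrite)     // re-running a date replaces the old output
    .parquet(outputPath)          // Spark's commit protocol only promotes files from
                                  // successful task attempts, so a lost executor and
                                  // re-run tasks can't leave duplicate part files
}
```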
For the past year, the dates where the number of distinct clients doesn't match the number of rows are:
2017-12-09
2017-11-25
2017-11-04
2017-03-18
I'm going to re-run the most recent three now, and I'll punt on the March date until the job owners are back on Monday.
The dates have been re-run successfully. There's now an alert (https://sql.telemetry.mozilla.org/alerts/30), and we're going to delete the 3/18 dataset unless someone objects.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated 6 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard