Closed Bug 1428411 Opened 8 years ago Closed 6 years ago

Duplicate rows in some Longitudinal datasets

Categories

(Data Platform and Tools Graveyard :: Datasets: Longitudinal, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Unassigned)

Details

While verifying data for bug 1427749, I noticed that longitudinal v20171209 had about 1M more rows than other weeks. Comparing the approximate distinct client count with the row count confirms a disparity of roughly 1M between the two.

Looking at the parquet files for that date, there are extra files: a clean run produces 480 files, since that's how many partitions we use in the reduce step, but there are 552 under that key. I found the logs for that run, and it looks like the cluster lost contact with an executor at some point and re-ran those tasks. This wouldn't be an issue if we were using the Spark Dataset API's write method, since Spark would handle retried tasks itself, but currently we use a custom write-partition-locally-and-upload flow, and duplicate tasks result in duplicate files.

The immediate fix is to figure out how many other dates are affected and re-run the job flow for those dates. The real, harder fix would be to modify the job to use the Spark Dataset API. We've been talking about needing to refactor this job for a while, but this might force the issue.
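The extra-files symptom can be checked mechanically: a clean run leaves exactly one parquet file per reduce partition (480 here), so any date key with more files picked up duplicates from re-run tasks. A minimal sketch in Python, assuming we already have a listing of file names per date key (in practice this would come from listing the S3 prefix; the helper name is hypothetical):

```python
# Flag longitudinal date keys whose parquet file count exceeds the
# number of reduce partitions: extra files indicate that retried
# tasks uploaded duplicate partition files.
EXPECTED_PARTITIONS = 480

def find_duplicated_dates(files_by_date):
    """files_by_date: dict mapping a date key (e.g. 'v20171209') to
    the list of parquet file names found under that key.
    Returns the date keys with more files than expected, sorted."""
    return sorted(
        date for date, files in files_by_date.items()
        if len(files) > EXPECTED_PARTITIONS
    )
```

For example, a key with 552 files (as observed for v20171209) would be flagged, while a key with exactly 480 would not.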
For the past year, the dates where the number of distinct clients doesn't match the number of rows are:

- 2017-12-09
- 2017-11-25
- 2017-11-04
- 2017-03-18

I'm going to re-run the three recent dates now, and I'll punt on the March date until the job owners are back on Monday.
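The scan for affected dates boils down to comparing, per dataset date, the total row count with the distinct client count. The real check runs as SQL against the longitudinal tables (an approximate distinct count versus a row count), but the logic can be sketched in plain Python with hypothetical data:

```python
from collections import defaultdict

def dates_with_duplicate_clients(rows):
    """rows: iterable of (date, client_id) pairs, one per dataset row.
    Returns the dates where the row count exceeds the distinct client
    count, i.e. some client_id appears more than once."""
    row_counts = defaultdict(int)   # date -> total rows
    clients = defaultdict(set)      # date -> distinct client ids
    for date, client_id in rows:
        row_counts[date] += 1
        clients[date].add(client_id)
    return sorted(d for d in row_counts if row_counts[d] > len(clients[d]))
```

Note the production query uses an approximate distinct count for speed, so a small disparity could in principle be estimation error; the ~1M gap seen here is far beyond that.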
The dates have been re-run successfully, and there is now an alert: https://sql.telemetry.mozilla.org/alerts/30. We're going to delete the 2017-03-18 dataset unless someone objects.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard