Closed Bug 1485537 Opened 7 years ago Closed 7 years ago

Error with a COUNT(DISTINCT submission_date_s3) on telemetry_heartbeat_parquet table

Categories

(Data Platform and Tools :: General, enhancement, P2)

Points: 1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgaunt, Unassigned)

Details

(Whiteboard: [DataPlatform])

https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/21134/command/21144

Compare cells 8 and 9 - line 6 leads to the error thrown in cell 9.

Per sunahsuh: "seems like for some reason that forces a read on all the columns (or at least on all the schemas)... at the very least it seems we have a file with a bad schema"
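For context, the failing pattern is roughly the sketch below. The notebook itself is not public, so the exact cell contents are an assumption here; only the table and column names come from the bug summary.

    # Minimal PySpark sketch of the failing query, assuming a Databricks/Spark
    # session where telemetry_heartbeat_parquet is registered as a table.
    # Per the quote above, COUNT(DISTINCT ...) seems to force Spark to reconcile
    # the schemas of all underlying parquet files, so a single file with an
    # incompatible type for one field breaks the whole scan.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        SELECT COUNT(DISTINCT submission_date_s3)
        FROM telemetry_heartbeat_parquet
    """).show()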
Looks like the type of one of the fields was changed from int64 to boolean in [1]. We should have created a new version of the output table to prevent this kind of error. There is data with both schemas on 20180222.

I think we should move the current parquet output to a "v2" path, and migrate any data with the "boolean" type to this path. This would include anything after 20180222, and anything on that day that has the boolean type.

:whd, does that sound reasonable?

In the meantime, a workaround is to limit queries for heartbeat data to days after 20180222.

[1] https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/125
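A minimal sketch of that workaround, assuming submission_date_s3 is a string partition column in yyyymmdd form (the bug does not state the column type explicitly):

    # Workaround sketch: restrict the scan to days after 20180222 so files
    # written with the old int64 schema are never read. Assumes
    # submission_date_s3 is a string partition column in yyyymmdd form.
    spark.sql("""
        SELECT COUNT(DISTINCT submission_date_s3)
        FROM telemetry_heartbeat_parquet
        WHERE submission_date_s3 > '20180222'
    """).show()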
Points: --- → 1
Flags: needinfo?(whd)
Priority: -- → P2
(In reply to Mark Reid [:mreid] from comment #1)
> :whd, does that sound reasonable?

That we didn't create a new dataset version for such an incompatible change sounded wrong to me, so I looked into this a bit more. From bug #1440187 it appears that d2p was first implemented on the 22nd and that its schema was incorrect. We fixed it the same day, but because the field was optional, a small number of messages were parquet-encoded with the incorrect schema, and we never noticed because other tooling doesn't seem to care.

That there was bad data remaining in the parquet output for deploy day suggests there was a partial day of data for the 22nd (even if only by 38 or so messages). Since I had scripts for automating backfill for that bug and bug #1462381 anyway, I didn't bother to verify the partiality and simply reprocessed the 22nd in its entirety.

In summary, we shouldn't need to do anything else, and I'm closing this as I've verified the query works on Databricks.
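For anyone wanting to double-check a backfill like this, one possible verification is to compare the per-day parquet schemas. The base path and the affected field are not named in the bug, so both are hypothetical in this sketch:

    # Verification sketch: read a few daily partitions separately and compare
    # their inferred schemas. The base path below is hypothetical; substitute
    # the real heartbeat parquet location.
    base = "s3://example-bucket/heartbeat/v1"
    for day in ["20180221", "20180222", "20180223"]:
        df = spark.read.parquet("{}/submission_date_s3={}".format(base, day))
        print(day, df.schema.simpleString())
    # After the 20180222 reprocessing, the field that changed type should
    # report the same type on every day; a mix of bigint and boolean would
    # indicate leftover files written with the old schema.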
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(whd)
Resolution: --- → FIXED
Component: Datasets: General → General