bigquery main_summary has duplicated rows for some days in Oct 2018
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: benwu, Unassigned)
Details
On 2018-10-18, 2018-10-21, and 2018-10-22, about a third of the clients have duplicated rows in BigQuery main_summary that aren't in the parquet dataset. This was possibly caused by an incomplete deletion before the BigQuery load.
Here's an example of a single client: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/234785/command/234790
It doesn't look like any other days in 2018 or 2019 are affected. This is causing large spikes in search volumes that are affecting search forecasting.
At this point though, because of the state of main_summary and parquet, I'm ok with leaving this as a wontfix; for search, we can copy data from the search v7 data. I'm not sure anything else relies on BigQuery main_summary from that far back. In that case we should probably document this somewhere (https://docs.telemetry.mozilla.org/concepts/analysis_gotchas.html).
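For reference, the check described above (what share of clients have duplicated rows on a given day) can be sketched roughly as follows. This is only an illustration over in-memory rows, not the notebook query linked above; the function name `duplicated_client_share` and the field names `submission_date`, `client_id`, and `document_id` are assumptions about how the rows are represented.

```python
from collections import Counter, defaultdict


def duplicated_client_share(rows):
    """Return, per day, the fraction of clients with at least one duplicated row.

    `rows` is an iterable of dicts with the (assumed) keys
    submission_date, client_id, and document_id. A row counts as
    duplicated when its document_id appears more than once on that day.
    """
    rows = list(rows)  # iterate twice: once to count, once to classify

    # Count how often each document_id appears on each day.
    doc_counts = Counter((r["submission_date"], r["document_id"]) for r in rows)

    all_clients = defaultdict(set)
    dup_clients = defaultdict(set)
    for r in rows:
        day = r["submission_date"]
        all_clients[day].add(r["client_id"])
        if doc_counts[(day, r["document_id"])] > 1:
            dup_clients[day].add(r["client_id"])

    return {day: len(dup_clients[day]) / len(all_clients[day]) for day in all_clients}
```

A day with no duplicates should come out as 0.0, so affected days stand out when the result is compared across dates.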
Comment 1•5 years ago
It should be reasonably easy to run a modified version of the copy-deduplicate script on those days of main_summary.
Comment 2•5 years ago (Reporter)
I deduped main summary for those days based on document id and things look fine. I have a backup of the pre-deduped data in another table in my own project which I'll keep around for a bit in case we find errors.
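As an illustration of deduplicating on document id, here is a minimal sketch over in-memory rows. It is not the copy-deduplicate script or the query actually run; the function name `dedupe_by_document_id` and the `document_id` key are assumptions, and keeping the first occurrence of each id is one possible choice among several.

```python
def dedupe_by_document_id(rows):
    """Keep the first row seen for each document_id, preserving input order.

    `rows` is an iterable of dicts that each carry an (assumed)
    document_id key. Later rows with an already-seen id are dropped.
    """
    seen = set()
    kept = []
    for row in rows:
        doc_id = row["document_id"]
        if doc_id in seen:
            continue  # duplicate of a row we already kept
        seen.add(doc_id)
        kept.append(row)
    return kept
```

Comparing the row count before and after gives a quick sanity check on how many duplicates were removed for a given day.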