Closed Bug 1601793 Opened 5 years ago Closed 5 years ago

bigquery main_summary has duplicated rows for some days in Oct 2018

Categories

(Data Platform and Tools :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benwu, Unassigned)

Details

On 2018-10-18, 2018-10-21, and 2018-10-22, about a third of the clients have duplicated rows in BigQuery main_summary that aren't present in the Parquet dataset. This was possibly caused by an incomplete deletion before the BigQuery load.
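For anyone wanting to check other days, here is a minimal sketch of the kind of check described above: for each day, the fraction of clients that have at least one repeated document id. It works on plain in-memory rows rather than BigQuery, and the field names `submission_date`, `client_id`, and `document_id` and the function name are assumptions for illustration, not the query actually used for this bug.

```python
from collections import Counter, defaultdict


def duplicated_client_fraction(rows):
    """Return {day: fraction of clients with at least one repeated document_id}.

    Each row is assumed to be a dict with the (hypothetical) keys
    "submission_date", "client_id", and "document_id".
    """
    # Count how many times each (day, client, document) triple appears.
    doc_counts = Counter(
        (r["submission_date"], r["client_id"], r["document_id"]) for r in rows
    )

    clients = defaultdict(set)      # all clients seen per day
    dup_clients = defaultdict(set)  # clients with a repeated document per day
    for (day, client, _doc), n in doc_counts.items():
        clients[day].add(client)
        if n > 1:
            dup_clients[day].add(client)

    return {day: len(dup_clients[day]) / len(ids) for day, ids in clients.items()}


sample = [
    {"submission_date": "2018-10-18", "client_id": "c1", "document_id": "d1"},
    {"submission_date": "2018-10-18", "client_id": "c1", "document_id": "d1"},
    {"submission_date": "2018-10-18", "client_id": "c2", "document_id": "d2"},
    {"submission_date": "2018-10-18", "client_id": "c3", "document_id": "d3"},
    {"submission_date": "2018-10-19", "client_id": "c1", "document_id": "d4"},
]
fractions = duplicated_client_fraction(sample)
```

A day whose fraction is well above zero would be a candidate for the same problem.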

Here's an example of a single client: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/234785/command/234790

It doesn't look like any other days in 2018 or 2019 are affected. This is causing large spikes in search volumes that are affecting search forecasting.

At this point, though, given the state of main_summary and Parquet, I'm OK with leaving this as a wontfix; for search, we can copy data from the search v7 data. I'm not sure anything else relies on BigQuery main_summary from that far back. In that case, we should probably document this somewhere (https://docs.telemetry.mozilla.org/concepts/analysis_gotchas.html).

It should be reasonably easy to run a modified version of the copy-deduplicate script on those days of main_summary.

I deduped main_summary for those days based on document ID, and things look fine. I have a backup of the pre-dedup data in another table in my own project, which I'll keep around for a bit in case we find errors.
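For reference, deduplicating on document id amounts to keeping one row per id. The sketch below shows that idea on in-memory rows; it is not the query that was run for this bug, and the `document_id` field name, the function name, and the choice to keep the first row seen are all assumptions.

```python
def dedupe_by_document_id(rows):
    """Keep the first row seen for each document_id, preserving input order.

    Each row is assumed to be a dict with a (hypothetical) "document_id" key.
    """
    seen = set()
    unique = []
    for row in rows:
        doc_id = row["document_id"]
        if doc_id in seen:
            # A row with this document id was already kept; drop the repeat.
            continue
        seen.add(doc_id)
        unique.append(row)
    return unique


before = [
    {"document_id": "d1", "client_id": "c1"},
    {"document_id": "d1", "client_id": "c1"},
    {"document_id": "d2", "client_id": "c2"},
]
after = dedupe_by_document_id(before)
```

Comparing row counts before and after, as with the backup table mentioned above, is a quick way to confirm how many duplicates were dropped.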

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Datasets: Main Summary → General