April 2021 explosion in baseline_clients_last_seen entries
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: klukas)
Details
Attachments
(1 file)
:loines reported that Fenix DAU went crazy over the weekend. See Slack thread.
We've tracked this down to baseline_clients_last_seen
having a huge number of entries for the org_mozilla_fenix
(nightly) channel. Firefox for Fire TV (org_mozilla_tv_firefox
) also shows a huge increase.
Assignee | ||
Comment 1•4 years ago
|
||
Something appears to change on 2021-04-11 that introduces a large number of duplicate entries.
In https://sql.telemetry.mozilla.org/queries/79630/source we plot the number of duplicate entries in April. For org_mozilla_fenix
, we appear to have a base rate of ~1500 client_ids that appear more than once in a day (which should never happen, but that's perhaps a separate investigation).
On 2021-04-11, we see the number of duplicates roughly double day over day until it reaches 40 million yesterday, at which point it's overwhelming the legitimate data. The doubling likely comes from the FULL JOIN we do in the baseline_clients_last_seen query between the most recent day of clients_daily and the previous day of clients_last_seen. That join assumes each client_id appears at most once in each table, but it could cause an amplification effect if that assumption is broken.
My theory right now is that all Glean apps that happened to have some duplicate client_ids are affected by this change in behavior on 2021-04-11. We don't see this effect in org_mozilla_firefox
because that dataset happens to have exactly zero duplicate client_id
entries.
I'm now focusing on what could explain the change in behavior on 2021-04-11.
Assignee | ||
Comment 2•4 years ago
|
||
It looks like 2021-04-12 is the day that we merged https://github.com/mozilla/bigquery-etl/pull/1946 that pulls first_seen_date
into clients_daily
and ran a backfill. After the backfill, we reran the 2021-04-11 baseline_clients_daily
task and downstream. So that's the first run that does the join with clients_first_seen
and likely explains why the behavior change starts for the 2021-04-11 partition.
As seen in https://sql.telemetry.mozilla.org/queries/79632/source the clients_first_seen
tables for org_mozilla_fenix
and org_mozilla_tv_firefox
contain duplicate entries while org_mozilla_firefox
does not.
So, the flow of this problem is roughly like this:
- The backfill of
baseline_clients_first_seen
leads to duplicates for some apps. - The
LEFT JOIN
in thebaseline_clients_daily
query withbaseline_clients_first_seen
leads to some records being duplicated every day inbaseline_clients_daily
- The
FULL JOIN
in thebaseline_clients_last_seen
table amplifies the duplication, roughly doubling the number of duplicate rows day-over-day
I think we'll need to re-initialize the baseline_clients_first_seen
tables, verify that they all contain no duplicates, and then rerun the baseline_clients_daily
and baseline_clients_last_seen
ETL for 2021-04-11 onward, and we should be clear of this problem.
Comment 3•4 years ago
|
||
It would be strange for an application like org_mozilla_tv_firefox
to have duplicates in the initial backfill (which is logically very simple and does not have a dependency on a migration table). There's probably an issue with the incremental query post 2021-04-11 that needs to be identified to ensure that this doesn't eventually come back.
Assignee | ||
Comment 4•4 years ago
|
||
I have been able to reproduce duplicates in the org_mozilla_fenix
init.sql
query. It turns out that we sometimes have nulls for sample_id
, and because we're grouping by sample_id, client_id
(which I will own is a change I requested), we can get two rows per client (one with valid sample_id and one with null).
We likely introduced a pipeline change to support calculating sample_id after org_mozilla_fenix
and org_mozilla_tv_firefox
were implemented, so it's only these older appIds that are affected.
We should be able to add a filter to reject null sample_id
and have the baseline_clients_first_seen logic work correctly.
Assignee | ||
Comment 5•4 years ago
|
||
It looks like the nulls only appear in a narrow period of days in 2019-09
.
Comment 6•4 years ago
|
||
I did a little digging into why there are duplicate values in these first seen tables -- it looks like the same client id can take on several different values for the sample id. https://sql.telemetry.mozilla.org/queries/79636/source
If the sample id is recalculated from scratch or omitted from the query, we should be able to backfill without future issue. On the other-hand, it's a bit worrisome that there are multiple sample id values in the baseline ping for the same client id.
Assignee | ||
Comment 7•4 years ago
|
||
On the other-hand, it's a bit worrisome that there are multiple sample id values in the baseline ping for the same client id.
It would definitely be interesting to track this down. But it looks like this still points to a relatively short period of time that was affected by multiple sample_id values, in 2019.
Comment 8•4 years ago
•
|
||
We should be able to add a filter to reject null sample_id and have the baseline_clients_first_seen logic work correctly.
I don't think this is strictly true, since there seem to be clients with 3 values in the sample id. There must be slightly different calculations for the sample ids that are causing the difference.
It would definitely be interesting to track this down. But it looks like this still points to a relatively short period of time that was affected by multiple sample_id values, in 2019.
This looks to be mostly true for firefox tv, although the band seems to be a bit wider going from August til December of 2019 (https://sql.telemetry.mozilla.org/queries/79640/source#197816). Looking at fenix, we see duplicates as recently as March of this year. It's increasingly rare from starting September of 2019, although occasionally present (https://sql.telemetry.mozilla.org/queries/79637/source#197814)
edit: my work with the redash plots are a bit messy and I forgot to save my changes. It looks like the query for firefox tv shows duplicates fairly recently too.
Comment 9•4 years ago
|
||
Assignee | ||
Comment 10•4 years ago
|
||
We have a fix merged now. I'm working first on org_mozilla_fenix
, then TV to do the following:
- Recreate the first_seen table
- Rerun the daily query from 2021-04-11 to present
- Rerun the last_seen query from 2021-04-11 to present
After those are taken care of, I will check over other apps to ensure there are no more cases of dupes in clients_first_seen.
Assignee | ||
Comment 11•4 years ago
|
||
The only other affected dataset was org_mozilla_fenix_nightly_derived
. That has also now been backfilled.
I've also rerun the firefox_nondesktop_exact_mau28
queries and mobile_desktop_usage
. These all look normal now. There are a few other downstream tables I need to rerun before calling this closed.
Assignee | ||
Comment 12•4 years ago
|
||
Backfills (including prophet forecasts) are now finished.
Assignee | ||
Updated•4 years ago
|
Description
•