Closed Bug 1707640 Opened 3 years ago Closed 3 years ago

April 2021 explosion in baseline_clients_last_seen entries

Categories

(Data Platform and Tools :: General, defect, P1)

defect
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

Details

Attachments

(1 file)

:loines reported that Fenix DAU went crazy over the weekend. See Slack thread.

We've tracked this down to baseline_clients_last_seen having a huge number of entries for the org_mozilla_fenix (nightly) channel. Firefox for Fire TV (org_mozilla_tv_firefox) also shows a huge increase.

Something appears to change on 2021-04-11 that introduces a large number of duplicate entries.

In https://sql.telemetry.mozilla.org/queries/79630/source we plot the number of duplicate entries in April. For org_mozilla_fenix, we appear to have a base rate of ~1500 client_ids that appear more than once in a day (which should never happen, but that's perhaps a separate investigation).

On 2021-04-11, we see the number of duplicates roughly double day over day until it reaches 40 million yesterday, at which point it's overwhelming the legitimate data. The doubling likely comes from the FULL JOIN we do in the baseline_clients_last_seen query between the most recent day of clients_daily and the previous day of clients_last_seen. That join assumes each client_id appears at most once in each table, but it could cause an amplification effect if that assumption is broken.

My theory right now is that all Glean apps that happened to have some duplicate client_ids are affected by this change in behavior on 2021-04-11. We don't see this effect in org_mozilla_firefox because that dataset happens to have exactly zero duplicate client_id entries.

I'm now focusing on what could explain the change in behavior on 2021-04-11.

It looks like 2021-04-12 is the day that we merged https://github.com/mozilla/bigquery-etl/pull/1946 that pulls first_seen_date into clients_daily and ran a backfill. After the backfill, we reran the 2021-04-11 baseline_clients_daily task and downstream. So that's the first run that does the join with clients_first_seen and likely explains why the behavior change starts for the 2021-04-11 partition.

As seen in https://sql.telemetry.mozilla.org/queries/79632/source the clients_first_seen tables for org_mozilla_fenix and org_mozilla_tv_firefox contain duplicate entries while org_mozilla_firefox does not.

So, the flow of this problem is roughly like this:

  1. The backfill of baseline_clients_first_seen leads to duplicates for some apps.
  2. The LEFT JOIN in the baseline_clients_daily query with baseline_clients_first_seen leads to some records being duplicated every day in baseline_clients_daily
  3. The FULL JOIN in the baseline_clients_last_seen table amplifies the duplication, roughly doubling the number of duplicate rows day-over-day

I think we'll need to re-initialize the baseline_clients_first_seen tables, verify that they all contain no duplicates, and then rerun the baseline_clients_daily and baseline_clients_last_seen ETL for 2021-04-11 onward, and we should be clear of this problem.

It would be strange for an application like org_mozilla_tv_firefox to have duplicates in the initial backfill (which is logically very simple and does not have a dependency on a migration table). There's probably an issue with the incremental query post 2021-04-11 that needs to be identified to ensure that this doesn't eventually come back.

I have been able to reproduce duplicates in the org_mozilla_fenix init.sql query. It turns out that we sometimes have nulls for sample_id, and because we're grouping by sample_id, client_id (which I will own is a change I requested), we can get two rows per client (one with valid sample_id and one with null).

We likely introduced a pipeline change to support calculating sample_id after org_mozilla_fenix and org_mozilla_tv_firefox were implemented, so it's only these older appIds that are affected.

We should be able to add a filter to reject null sample_id and have the baseline_clients_first_seen logic work correctly.

It looks like the nulls only appear in a narrow period of days in 2019-09.

I did a little digging into why there are duplicate values in these first seen tables -- it looks like the same client id can take on several different values for the sample id. https://sql.telemetry.mozilla.org/queries/79636/source

If the sample id is recalculated from scratch or omitted from the query, we should be able to backfill without future issue. On the other-hand, it's a bit worrisome that there are multiple sample id values in the baseline ping for the same client id.

On the other-hand, it's a bit worrisome that there are multiple sample id values in the baseline ping for the same client id.

It would definitely be interesting to track this down. But it looks like this still points to a relatively short period of time that was affected by multiple sample_id values, in 2019.

We should be able to add a filter to reject null sample_id and have the baseline_clients_first_seen logic work correctly.

I don't think this is strictly true, since there seem to be clients with 3 values in the sample id. There must be slightly different calculations for the sample ids that are causing the difference.

It would definitely be interesting to track this down. But it looks like this still points to a relatively short period of time that was affected by multiple sample_id values, in 2019.

This looks to be mostly true for firefox tv, although the band seems to be a bit wider going from August til December of 2019 (https://sql.telemetry.mozilla.org/queries/79640/source#197816). Looking at fenix, we see duplicates as recently as March of this year. It's increasingly rare from starting September of 2019, although occasionally present (https://sql.telemetry.mozilla.org/queries/79637/source#197814)

edit: my work with the redash plots are a bit messy and I forgot to save my changes. It looks like the query for firefox tv shows duplicates fairly recently too.

We have a fix merged now. I'm working first on org_mozilla_fenix, then TV to do the following:

  • Recreate the first_seen table
  • Rerun the daily query from 2021-04-11 to present
  • Rerun the last_seen query from 2021-04-11 to present

After those are taken care of, I will check over other apps to ensure there are no more cases of dupes in clients_first_seen.

The only other affected dataset was org_mozilla_fenix_nightly_derived. That has also now been backfilled.

I've also rerun the firefox_nondesktop_exact_mau28 queries and mobile_desktop_usage. These all look normal now. There are a few other downstream tables I need to rerun before calling this closed.

Backfills (including prophet forecasts) are now finished.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Points: 2 → 3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: