Closed Bug 1581556 Opened 5 years ago Closed 5 years ago

Fenix reporting 50% more clients on metrics ping than baseline ping

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m10])

Attachments

(3 files)

See this plot: https://sql.telemetry.mozilla.org/queries/64836/source#165294

The baseline ping is supposed to be sent every session, and the metrics ping is supposed to be sent once per day. We'd therefore expect the count of distinct clients sending the baseline ping (on a daily basis) to be at least as large as the count for the metrics ping, since any client active enough to send a metrics ping should also have sent one or more baseline pings. Instead, we're seeing a large cohort that is present only in the metrics ping every day.

New query: https://sql.telemetry.mozilla.org/queries/64837/source#165296

This is showing that there is a cohort of about 10k that is consistently reporting both the metrics and baseline ping every day.

Significantly, there is a cohort of ~5k that has only ever reported the metrics ping (measured over all time, so if a client later reports a baseline ping, it moves out of this group).

There's a similar-sized cohort that only reports the metrics ping on a given day, but has reported the baseline ping at some point.

As we'd expect, there is a cohort of users who send the baseline ping on a day, but no metrics ping (this is how we designed the pings).

For the record: something weird happened on August 20, when baseline pings dropped heavily, falling below the metrics ping.

Current suspect is the code that should ensure we do not send a "metrics" ping if the app isn't used; that change was included in android-components v9.0.0, which shipped on Aug 20.

Even when we find the culprit and settle into some new normal, we should still explain why 50% of clients appear to be invisible when counted via "baseline" pings. Engagement numbers are a Big Deal, so we need to understand this phenomenon.

Looks like it's still happening with new telemetry versions: https://sql.telemetry.mozilla.org/queries/64838/source?p_version_undefined=10.0.1#165302

Okay, something very interesting. Across all versions I've looked at (from the query in Comment 4), there is an initial period where baseline > metrics. Could it be possible that the metrics ping is still sent on days when the app is not used?
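For context, here is a minimal Kotlin sketch of the kind of guard that change is meant to add. The names are hypothetical and this is not the actual android-components implementation, just the intended behaviour: only submit a metrics ping when a day has elapsed and the app was actually used since the last submission.

    import java.util.concurrent.TimeUnit

    // Hypothetical simplification of the metrics-ping guard (not the real
    // android-components code): send the ping only if a day has elapsed AND
    // the app was foregrounded since the last send.
    class MetricsPingGuard(private var lastSentAtMillis: Long = 0L) {
        private var usedSinceLastSend = false

        // Called from the app's foreground lifecycle hook.
        fun onAppForegrounded() {
            usedSinceLastSend = true
        }

        fun shouldSendMetricsPing(nowMillis: Long): Boolean {
            val dayElapsed = nowMillis - lastSentAtMillis >= TimeUnit.DAYS.toMillis(1)
            return dayElapsed && usedSinceLastSend
        }

        fun markSent(nowMillis: Long) {
            lastSentAtMillis = nowMillis
            usedSinceLastSend = false
        }
    }

If the "used since last send" state is never set or never reset, the guard degenerates and a ping can go out for a day with no usage.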

Priority: -- → P1
Whiteboard: [telemetry:glean-rs:m10]
Assignee: nobody → mdroettboom

I don't really have an explanation for why the baseline ping drops off at a much higher rate than the metrics ping.

However, I do think I have a rock-solid explanation for why there is a precipitous drop-off in general. Around 08/20, a no-op experiment called fenix-test-2019-08-05 was deployed to Fenix. About 2/3 of the population is randomly enrolled in this no-op experiment. Because the name is kebab-case, it unfortunately doesn't validate against the experiment-name schema in the Glean pipeline schemas. Therefore, pings from anyone enrolled in the experiment aren't being entered into the database.
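To make the mismatch concrete, here is a small Kotlin illustration; the regex is an assumption standing in for the actual pattern in the pipeline schemas (which evidently does not allow hyphens), not a copy of it:

    fun main() {
        // Assumed experiment-id pattern: letters, digits, '.' and '_' only.
        // The real pattern lives in mozilla-pipeline-schemas; this is a stand-in.
        val assumedPattern = Regex("^[a-zA-Z0-9._]+$")

        println(assumedPattern.matches("fenix-test-2019-08-05"))   // false -> ping fails validation
        println(assumedPattern.matches("fenix_test_2019_08_05"))   // true  -> would validate
    }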

Query to confirm this: https://console.cloud.google.com/bigquery?project=moz-fx-data-shared-prod&folder&organizationId=442341870013&j=bq:US:bquxjob_388c40f3_16d410b70f5&page=queryresults

I suggest we either (1) loosen the schema for experiment names or (2) rename this experiment (and be careful about all experiments going forward). I think I'd prefer (1) as the simplest option.

Since this issue hit at almost exactly the same time as the fix to prevent sending metrics pings every day, it's hard to say whether it explains all or just some of the weirdness. I suggest we fix this ASAP and watch to see whether things go back to normal. Since it's a pipeline-only change, we don't have to wait for android-components to ship in Fenix.

A comment in the Fenix source code also offers some explanation as to why baseline dropped much faster than metrics:

        // When the `fenix-test-2019-08-05` experiment is active, record its branch in Glean
        // telemetry. This will be used to validate that the experiment system correctly enrolls
        // clients and segments them into branches. Note that this will not take effect the first
        // time the application has launched, since there won't be enough time for the experiments
        // library to get a list of experiments. It will take effect the second time the
        // application is launched.
See Also: → 1581554

While this is a real issue, I'm not sure this is the underlying issue for this bug (though it is certainly a bug).

Here is the BQ data for the recent version: https://sql.telemetry.mozilla.org/queries/64838/source?p_version_undefined=10.0.1&p_version_64838=10.0.1#165302
Here is the AWS data for the recent version: https://sql.telemetry.mozilla.org/queries/64894/source?p_version_64894_64894=10.0.1#165404

Athena is showing 76% higher client counts per day for the metrics ping than the baseline ping.

Flags: needinfo?(mdroettboom)

Thanks. Indeed, let's keep this bug open then. There must be something else going on as well.

The first a-c version that exhibits this is 7.0.0. This points to this commit as a possible culprit:

https://github.com/mozilla-mobile/android-components/commit/54a0a58762b0b1db3df13b7088b1abc5449753e6

Of course, the bug could also have been introduced in Fenix, but nothing obvious pops out there in that timeframe, though I don't think I've exhausted all options.

Flags: needinfo?(mdroettboom)

As a note, the schemas fix for GCP attached to this bug has not rolled out on the normal daily deploy schedule due to an unrelated issue [1].

[1] https://github.com/mozilla/mozilla-schema-generator/issues/70

I've tracked down a likely candidate.

Android LifecycleObservers must be registered on the main thread; otherwise there is a race condition that may cause them not to be registered and/or to unregister other observers. Also see the implementation, which has a non-atomic push/operate/pop sequence of events. Fenix initializes Glean off the main thread, and Glean then registers its lifecycle observers off the main thread.
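A minimal sketch of the shape of the fix, assuming we just need to guarantee the registration call runs on the main looper (this is the general Android pattern, not necessarily the exact patch that landed):

    import android.os.Handler
    import android.os.Looper
    import androidx.lifecycle.LifecycleObserver
    import androidx.lifecycle.ProcessLifecycleOwner

    // Hop to the main thread before touching the process lifecycle registry,
    // so registration cannot race with other observers.
    fun registerObserverOnMainThread(observer: LifecycleObserver) {
        if (Looper.myLooper() == Looper.getMainLooper()) {
            // Already on the main thread: register directly.
            ProcessLifecycleOwner.get().lifecycle.addObserver(observer)
        } else {
            // Glean was initialized off the main thread: post the registration
            // to the main looper instead of calling it from this thread.
            Handler(Looper.getMainLooper()).post {
                ProcessLifecycleOwner.get().lifecycle.addObserver(observer)
            }
        }
    }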

This bug is the source of all of the significant Fenix Sentry issues involving Glean.

This is a likely explanation for what we are seeing in the data. Glean has two lifecycle observers. Failing to register one of them would disable the baseline ping. Failing to register the other would cause metrics pings to be sent more than once between runs of the application. The combination of these two is probably enough to explain the problem, though it's hard to compare the scale of Sentry events to the scale of the problems in the data.

We seem to have no more Sentry errors of this type after merging the fix into Fenix nightly:

https://sentry.prod.mozaws.net/operations/fenix-nightly/?query=is%3Aunresolved+glean

Also, overall volumes are back up:

https://sql.telemetry.mozilla.org/queries/65046/source#165707

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
