invalid_state errors in networking.nss_initialization and networking.loading_certs_task in Fenix (and, to a lesser extent, in Firefox Desktop)
Categories: Toolkit :: Telemetry, defect, P1
People: Reporter: chutten; Assigned: chutten
Attachments (4 files)
Four text/x-phabricator-request requests, 48 bytes each; the latter two carry phab-bot: approval-mozilla-beta+.
On recent Nightlies, Fenix is showing 700k invalid_state errors per day on networking.nss_initialization and networking.loading_certs_task (over half of clients), and FOG (Firefox Desktop) is showing about 250/day (between a fifth and a fourth of clients).

These metrics are recently-API-migrated Scalars of the same names from bug 1923028, specifically in this patch, where you'll notice it was I who recommended it be a timespan instead of a counter and where we built a RAII AutoGleanTimer to ensure the APIs are called when appropriate (i.e., when the metric is in a valid state).
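For reference, a hedged sketch of the guard's shape (the real AutoGleanTimer lives in the linked patch; this reconstruction is illustrative, not the actual implementation):

```cpp
#include "mozilla/Attributes.h"  // MOZ_RAII

// Start() fires on construction and Stop() on destruction, so the pair
// can't be mismatched even on early returns.
template <typename TimespanMetric>
class MOZ_RAII AutoGleanTimer final {
 public:
  explicit AutoGleanTimer(const TimespanMetric& aMetric) : mMetric(aMetric) {
    mMetric.Start();
  }
  ~AutoGleanTimer() { mMetric.Stop(); }

  AutoGleanTimer(const AutoGleanTimer&) = delete;
  AutoGleanTimer& operator=(const AutoGleanTimer&) = delete;

 private:
  const TimespanMetric& mMetric;
};
```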
Interesting things on first look:
- Such a difference (three orders of magnitude!) between Android and Desktop, wow.
- Both metrics are reporting essentially identical numbers and proportions of errors, suggesting a fault in common.
- InvalidState errors are recorded if (per the docs) start is called twice in a row without being cancelled; if (per the code) the metric is stopped without being started; and if the API is used properly but a value is already present in the db (in which case the old value persists). Each case appears in the sketch below.
- It would be a useful task, as part of the work of this bug, to enumerate these additional undocumented possibilities in the docs.
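To make the three cases concrete, here is a hedged sketch using the C++ timespan API (Start/Stop/Cancel); the comments restate the docs-and-code reading above rather than anything newly verified:

```cpp
// Assumes the generated Glean header for networking metrics is included
// (the exact include path varies by component).
auto& metric = mozilla::glean::networking::nss_initialization;

metric.Start();
metric.Start();  // (a) started twice without Cancel(): invalid_state
metric.Stop();   // the running timer stops; a value is now in the db

metric.Stop();   // (b) stopped without a running timer: invalid_state

metric.Start();
metric.Stop();   // (c) API used properly, but a value is already present for
                 //     this ping: invalid_state, and the first value persists
```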
This bug is about:
- Ensuring the data flowing to the Scalar is of the same character as before the API was migrated to timespan
- Figuring out which state is invalid, and what to do about it
- Documenting the additional invalid states

(( My intuition is that this is a "the API is being used properly, but multiple times per ping" sort of thing, meaning that the data will be relayed to the Scalar as usual and only the Glean data will be affected (because it'll use the first value instead of the most recent value). ))
Comment 1 • chutten (Assignee) • 17 days ago
Since both of these metrics are set once per nsNSSComponent init (nss_initialization here, loading_certs_task here), which should happen at most once per application session, I think my intuition is bearing out. This would also help explain why there are so many more reports from Android: on Android it is far more likely than on Desktop to have multiple app sessions per "metrics" ping.
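For illustration, the shape of the call site (the function name below is a hypothetical stand-in; the real calls live in nsNSSComponent's initialization):

```cpp
// One Start()/Stop() pair per NSS init, i.e. at most once per application
// session. If two app sessions fall inside the same "metrics" ping window,
// the second session's Stop() hits case (c) from comment 0 and only the
// first session's duration is kept by Glean.
void InitializeNssForSession() {  // hypothetical stand-in for the real init
  AutoGleanTimer timer(mozilla::glean::networking::nss_initialization);
  // ... load NSS, kick off the loading-certs task, etc. ...
}  // ~AutoGleanTimer calls Stop() here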
Comment 2 • chutten (Assignee) • 17 days ago
The character of the data being reported by the Scalar versions of networking.nss_initialization|loading_certs_task has perhaps changed, but not in a way that I'm sure we can take credit for. Someone sent us a 2^32-1 sample on October 17 that we don't see happening again after the API migration on Oct 19. Ditto some earlier, lower spikes on Sept 9, 13, and 19, which might be legitimate but are rare, so we don't yet know whether there's been a change.

To put it another way: the migration of the API for those Scalars does not appear to have caused the data to change. So that's nice.

See the stmo query on sums, glam for nss_initialization, and glam for loading_certs_task.
Comment 3 • chutten (Assignee) • 17 days ago
It is unclear from looking at GLAM's Glean distributions whether the differences in summary aggregations are statistically significant. The values, overall shape, and client volume all roughly match, but not exactly. (We wouldn't expect them to be exact given the vast differences in the scheduling of "main" pings and "metrics" pings, so "roughly match" is honestly what we'd expect.)

The number of samples is wildly off, though. On GLAM, Legacy Telemetry reports 16,717 clients and 44,478 samples (as of 2024-10-21 17:17:37), while Glean reports 15,455 clients and 607,897 samples (as of 2024-10-21 17:00:00). Double-checking the figure with SQL shows that this is an error in GLAM, which I've now reported.

By SQL, the sample counts are roughly as we'd expect: lower from "metrics" pings because we send fewer of them (the ratio of samples to clients is close to 1:1), and higher from "main" pings because we send so many more of them (a ratio closer to 2.5:1; e.g. the 44,478 samples over 16,717 clients above works out to about 2.7 samples per client).

So it looks as though the data's mostly correct, despite the Glean data being subject to invalid_state errors affecting at least a fifth of the population each day; which, again, is consistent with the idea that it's reporting the first value. On Desktop, anyway. Alas, we have no easy comparison for Android because Legacy Telemetry doesn't work on Fenix. The shape of the data looks good, though: a little slower, but normally distributed.
Comment 4 • chutten (Assignee) • 17 days ago
I am satisfied with moving forward on solutions based on the theory that these errors are due to multiple recordings per "metrics" ping. The question now becomes: what do we do about this?

Glean timespan metrics continue to be the only Glean metric type that considers "being set more than once" to be an error. This is on purpose (see bug 1797828), and probably desirable: if this weren't a mirror metric, it'd be very helpful for the networking folks to know that some (and, on Android, a lot) of their data is being thrown away because this is more properly instrumented on the "metrics" ping as a timing_distribution, not a timespan.

At the very least, one thing we should do here is write into the GIFFT docs that timespan should not be used in most circumstances because of this. quantity with unit: <time unit> is to be preferred in situations where it's not an error for the code to execute multiple times within a ping's measurement window and it's okay to just take the most recent sample.
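A sketch of the call site under that approach, assuming the metric is redefined in metrics.yaml as type quantity (the millisecond unit and wrapper function are assumptions for illustration):

```cpp
#include "mozilla/TimeStamp.h"

// quantity's Set() overwrites any prior value, so running more than once per
// "metrics" ping isn't an error: the most recent sample wins, matching the
// mirrored Scalar's last-write semantics.
void TimedNssInit() {  // hypothetical wrapper
  auto start = mozilla::TimeStamp::Now();
  // ... NSS setup ...
  double ms = (mozilla::TimeStamp::Now() - start).ToMilliseconds();
  mozilla::glean::networking::nss_initialization.Set(
      static_cast<int64_t>(ms));
}
```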
I think that's going to be my solution to the code here, too, as it will handle the invalid_state errors. Additionally, I'll add timing_distribution metrics so that all information communicated via Legacy Telemetry is also communicated via Glean (they'll be named the same as the API-migrated metrics, but made plural).
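And a sketch of that plural companion, assuming a timing_distribution named networking.nss_initializations per the pluralization plan above:

```cpp
// Every session's sample is accumulated into the distribution, so nothing is
// lost no matter how many app sessions share one "metrics" ping.
auto timerId = mozilla::glean::networking::nss_initializations.Start();
// ... NSS setup ...
mozilla::glean::networking::nss_initializations.StopAndAccumulate(
    std::move(timerId));
```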
I'm also going to file a follow-up bug to audit the API-migrated Scalars for other situations like these. Not only are there more timespans in the tree, but there are also many other Scalars that we expect to be set only once per "main" ping but multiple times per "metrics" ping. Though the data transmitted via Legacy Telemetry is definitely correct in that case, it may result in an incomplete picture in Glean.
Comment 5 • chutten (Assignee) • 16 days ago

Comment 6 • chutten (Assignee) • 16 days ago
Comment 8 • chutten (Assignee) • 16 days ago
I've softened my stance on adding a timing_distribution here, to make these look more like the general case of API-migrated Scalars. For the purposes of migrating Scalars from Legacy Telemetry to Glean APIs, we're going to specifically descope handling the "Scalars set once per session" case because:
- This problem is broader than just an API migration: changing the schedule of the "metrics" ping is a possible solution, but not something we can contemplate quickly.
- The data in Legacy Telemetry is unaffected.
- This case is not always obvious to detect.
- There are already Glean metrics in-tree that do this, seemingly on purpose (e.g.), casting doubt on whether a one-size-fits-all approach would even work in the simpler case that doesn't involve GIFFT mirroring.
- The documentation will be updated to supply information and guidance for anyone looking into GIFFT-mirroring these sorts of collections.

The next steps here are to verify that this change does indeed solve the invalid_state errors and, if so, to request uplift.
Comment 9 • 15 days ago

bugherder:
https://hg.mozilla.org/mozilla-central/rev/f040beb9eb4e
https://hg.mozilla.org/mozilla-central/rev/e39fc9374407
https://hg.mozilla.org/mozilla-central/rev/4f7833cdcbcd
Comment 10 • chutten (Assignee) • 14 days ago
This plot is trending the right way, but the nature of the Glean Error Reporting Mechanism (and Nightly) is that errors can continue to be reported for a little while, so I'll give it one more day.
Comment 11 • chutten (Assignee) • 13 days ago
Error counts have cratered satisfactorily quickly. Time to request uplift.
Comment 12 • chutten (Assignee) • 13 days ago
Original Revision: https://phabricator.services.mozilla.com/D227233
Comment 13 • chutten (Assignee) • 13 days ago
Original Revision: https://phabricator.services.mozilla.com/D227234
Comment 14 • 13 days ago
beta Uplift Approval Request
- User impact if declined: none
- Code covered by automated testing: no
- Fix verified in Nightly: yes
- Needs manual QE test: no
- Steps to reproduce for manual QE testing: n/a
- Risk associated with taking this patch: low
- Explanation of risk level: Received data has been validated
- String changes made/needed: n/a
- Is Android affected?: yes
Comment 15 • 13 days ago

uplift