invalid_state errors in networking.nss_initialization and networking.loading_certs_task in Fenix (and, to a lesser extent, in Firefox Desktop)
Categories: Toolkit :: Telemetry, defect, P1
People: Reporter: chutten; Assigned: chutten
Attachments (4 files)
Four text/x-phabricator-request requests, 48 bytes each; the latter two carry phab-bot: approval-mozilla-beta+.
On recent Nightlies, Fenix is showing 700k invalid_state errors per day on networking.nss_initialization and networking.loading_certs_task (over half of clients), and FOG (Firefox Desktop) is showing about 250/day (between a fifth and a fourth of clients).

These metrics are recently-API-migrated Scalars of the same names from bug 1923028, specifically in this patch, where you'll notice it was I who recommended it be a timespan instead of a counter and where we built a RAII AutoGleanTimer to ensure the APIs are called when appropriate (i.e., when the metric is in a valid state).
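For reference, a hedged sketch of the guard's shape (the real AutoGleanTimer lives in the linked patch; this reconstruction is illustrative, not the actual implementation):

```cpp
#include "mozilla/Attributes.h"  // MOZ_RAII

// Start() fires on construction and Stop() on destruction, so the pair
// can't be mismatched even on early returns.
template <typename TimespanMetric>
class MOZ_RAII AutoGleanTimer final {
 public:
  explicit AutoGleanTimer(const TimespanMetric& aMetric) : mMetric(aMetric) {
    mMetric.Start();
  }
  ~AutoGleanTimer() { mMetric.Stop(); }

  AutoGleanTimer(const AutoGleanTimer&) = delete;
  AutoGleanTimer& operator=(const AutoGleanTimer&) = delete;

 private:
  const TimespanMetric& mMetric;
};
```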
Interesting things on first look:
- Such a difference (three orders of magnitude!) between Android and Desktop, wow.
- Both metrics are reporting essentially identical numbers and proportions of errors, suggesting a fault in common.
- InvalidState errors are recorded if (per the docs) start is called twice in a row without being cancelled; if (per the code) the metric is stopped without being started; and if the API is used properly but a value is already present in the db (in which case the old value persists). Each case appears in the sketch below.
- It would be a useful task, as part of the work of this bug, to enumerate these additional undocumented possibilities in the docs.
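To make the three cases concrete, here is a hedged sketch using the C++ timespan API (Start/Stop/Cancel); the comments restate the docs-and-code reading above rather than anything newly verified:

```cpp
// Assumes the generated Glean header for networking metrics is included
// (the exact include path varies by component).
auto& metric = mozilla::glean::networking::nss_initialization;

metric.Start();
metric.Start();  // (a) started twice without Cancel(): invalid_state
metric.Stop();   // the running timer stops; a value is now in the db

metric.Stop();   // (b) stopped without a running timer: invalid_state

metric.Start();
metric.Stop();   // (c) API used properly, but a value is already present for
                 //     this ping: invalid_state, and the first value persists
```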
This bug is about:
- Ensuring the data flowing to the Scalar is of the same character as before the API was migrated to timespan
- Figuring out which state is invalid, and what to do about it
- Documenting the additional invalid states

(( My intuition is that this is a "the API is being used properly, but multiple times per ping" sort of thing, meaning that the data will be relayed to the Scalar as usual and only the Glean data will be affected (because it'll use the first value instead of the most recent value). ))
Comment 1 • chutten (Assignee) • 17 days ago
Since both of these metrics are set once per nsNSSComponent init (nss_initialization here, loading_certs_task here), which should happen at most once per application session, I think my intuition is bearing out. This would also help explain why there are so many more reports from Android: on Android it is far more likely than on Desktop to have multiple app sessions per "metrics" ping.
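For illustration, the shape of the call site (the function name below is a hypothetical stand-in; the real calls live in nsNSSComponent's initialization):

```cpp
// One Start()/Stop() pair per NSS init, i.e. at most once per application
// session. If two app sessions fall inside the same "metrics" ping window,
// the second session's Stop() hits case (c) from comment 0 and only the
// first session's duration is kept by Glean.
void InitializeNssForSession() {  // hypothetical stand-in for the real init
  AutoGleanTimer timer(mozilla::glean::networking::nss_initialization);
  // ... load NSS, kick off the loading-certs task, etc. ...
}  // ~AutoGleanTimer calls Stop() here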
Comment 2 • chutten (Assignee) • 17 days ago
The character of the data being reported by the Scalar versions of networking.nss_initialization|loading_certs_task has perhaps changed, but not in a way that I'm sure we can take credit for. Someone sent us a 2^32-1 sample on October 17 that we don't see happening again after the API migration on Oct 19. Ditto some earlier, lower spikes on Sept 9, 13, and 19, which might be legitimate but are rare, so we don't yet know whether there's been a change.

To put it another way: the migration of the API for those Scalars does not appear to have caused the data to change. So that's nice.

See the stmo query on sums, glam for nss_initialization, and glam for loading_certs_task.
Comment 3 • chutten (Assignee) • 17 days ago
It is unclear from looking at GLAM's Glean distributions whether the differences in summary aggregations are statistically significant. The values, overall shape, and client volume all roughly match, but not exactly. (We wouldn't expect them to be exact given the vast differences in the scheduling of "main" pings and "metrics" pings, so "roughly match" is honestly what we'd expect.)

The number of samples is wildly off, though. On GLAM, Legacy Telemetry reports 16,717 clients and 44,478 samples (as of 2024-10-21 17:17:37), while Glean reports 15,455 clients and 607,897 samples (as of 2024-10-21 17:00:00). Double-checking the figure with SQL shows that this is an error in GLAM, which I've now reported.

By SQL, the sample counts are roughly as we'd expect: lower from "metrics" pings because we send fewer of them (the ratio of samples to clients is close to 1:1), and higher from "main" pings because we send so many more of them (a ratio closer to 2.5:1; e.g. the 44,478 samples over 16,717 clients above works out to about 2.7 samples per client).

So it looks as though the data's mostly correct, despite the Glean data being subject to invalid_state errors affecting at least a fifth of the population each day; which, again, is consistent with the idea that it's reporting the first value. On Desktop, anyway. Alas, we have no easy comparison for Android because Legacy Telemetry doesn't work on Fenix. The shape of the data looks good, though: a little slower, but normally distributed.
Comment 4 • chutten (Assignee) • 17 days ago
I am satisfied with moving forward on solutions based on the theory that these errors are due to multiple recordings per "metrics" ping. The question now becomes: what do we do about this?

Glean timespan metrics continue to be the only Glean metric type that considers "being set more than once" to be an error. This is on purpose (see bug 1797828), and probably desirable: if this weren't a mirror metric, it'd be very helpful for the networking folks to know that some (and, on Android, a lot) of their data is being thrown away because this is more properly instrumented on the "metrics" ping as a timing_distribution, not a timespan.

At the very least, one thing we should do here is write into the GIFFT docs that timespan should not be used in most circumstances because of this. quantity with unit: <time unit> is to be preferred in situations where it's not an error for the code to execute multiple times within a ping's measurement window and it's okay to just take the most recent sample.
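A sketch of the call site under that approach, assuming the metric is redefined in metrics.yaml as type quantity (the millisecond unit and wrapper function are assumptions for illustration):

```cpp
#include "mozilla/TimeStamp.h"

// quantity's Set() overwrites any prior value, so running more than once per
// "metrics" ping isn't an error: the most recent sample wins, matching the
// mirrored Scalar's last-write semantics.
void TimedNssInit() {  // hypothetical wrapper
  auto start = mozilla::TimeStamp::Now();
  // ... NSS setup ...
  double ms = (mozilla::TimeStamp::Now() - start).ToMilliseconds();
  mozilla::glean::networking::nss_initialization.Set(
      static_cast<int64_t>(ms));
}
```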
I think that's going to be my solution to the code here, too, as it will handle the invalid_state errors. Additionally, I'll add timing_distribution metrics so that all information communicated via Legacy Telemetry is also communicated via Glean (they'll be named the same as the API-migrated metrics, but made plural).
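And a sketch of that plural companion, assuming a timing_distribution named networking.nss_initializations per the pluralization plan above:

```cpp
// Every session's sample is accumulated into the distribution, so nothing is
// lost no matter how many app sessions share one "metrics" ping.
auto timerId = mozilla::glean::networking::nss_initializations.Start();
// ... NSS setup ...
mozilla::glean::networking::nss_initializations.StopAndAccumulate(
    std::move(timerId));
```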
I'm also going to file a follow-up bug to audit the API-migrated Scalars for other situations like these. Not only are there more timespans in the tree, but there are also many other Scalars that we expect to be set only once per "main" ping but multiple times per "metrics" ping. Though the data transmitted via Legacy Telemetry is definitely correct in that case, it may result in an incomplete picture in Glean.
Comment 5 • chutten (Assignee) • 16 days ago

Comment 6 • chutten (Assignee) • 16 days ago
Comment 8 • chutten (Assignee) • 16 days ago
I've softened my stance on adding a timing_distribution here, to make these look more like the general case of API-migrated Scalars. For the purposes of migrating Scalars from Legacy Telemetry to Glean APIs, we're going to specifically descope handling the "Scalars set once per session" case because:
- This problem is broader than just an API migration: changing the schedule of the "metrics" ping is a possible solution, but not something we can contemplate quickly.
- The data in Legacy Telemetry is unaffected.
- This case is not always obvious to detect.
- There are already Glean metrics in-tree that do this, seemingly on purpose (e.g.), casting doubt on whether a one-size-fits-all approach would even work in the simpler case that doesn't involve GIFFT mirroring.
- The documentation will be updated to supply information and guidance for anyone looking into GIFFT-mirroring these sorts of collections.

The next steps here are to verify that this change does indeed solve the invalid_state errors and, if so, to request uplift.
Comment 9 • 15 days ago

bugherder:
https://hg.mozilla.org/mozilla-central/rev/f040beb9eb4e
https://hg.mozilla.org/mozilla-central/rev/e39fc9374407
https://hg.mozilla.org/mozilla-central/rev/4f7833cdcbcd
Comment 10 • chutten (Assignee) • 14 days ago
This plot is trending the right way, but the nature of the Glean Error Reporting Mechanism (and Nightly) is that errors can continue to be reported for a little while, so I'll give it one more day.
Comment 11 • chutten (Assignee) • 13 days ago
Error counts have cratered satisfactorily quickly. Time to request uplift.
Comment 12 • chutten (Assignee) • 13 days ago
Original Revision: https://phabricator.services.mozilla.com/D227233
Comment 13 • chutten (Assignee) • 13 days ago
Original Revision: https://phabricator.services.mozilla.com/D227234
Comment 14 • 13 days ago
beta Uplift Approval Request
- User impact if declined: none
- Code covered by automated testing: no
- Fix verified in Nightly: yes
- Needs manual QE test: no
- Steps to reproduce for manual QE testing: n/a
- Risk associated with taking this patch: low
- Explanation of risk level: Received data has been validated
- String changes made/needed: n/a
- Is Android affected?: yes
Comment 15 • 13 days ago

uplift