Bug 1927501 Comment 0 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

Chris H-C :chutten

on 2024-10-28 08:55:30 PDT

On recent Nightlies, Fenix is [showing](https://mozilla.cloud.looker.com/dashboards/382?Channel=nightly&Submission+Date=14+day+ago+for+14+day&Build+Date+%28Timestamp%29=) 700k invalid state errors per day on `networking.nss_initialization` and `networking.loading_cert_tasks` (over half of clients) and FOG is [showing](https://mozilla.cloud.looker.com/dashboards/694?Client+Info+App+Channel=nightly&Submission+Date=7+day+ago+for+7+day&Sample+ID=0&Build+Date+%28Datetime%29=) about 250/day (between a fifth and a fourth of clients).

These metrics are recently API-migrated Scalars of the same names from bug 1923028, specifically in [this patch](https://phabricator.services.mozilla.com/D225979), where you'll notice it was I who recommended they be `timespan`s instead of `counter`s and where we built a RAII `AutoGleanTimer` to ensure the APIs are called when appropriate (i.e., when the metric is in a _valid_ state).

Interesting things on first look:
* Such a difference (three orders of magnitude!) between Android and Desktop, wow
* Both metrics are reporting essentially identical numbers and proportions of errors, suggesting a fault in common
* InvalidState errors are recorded in three cases: ([by docs](https://mozilla.github.io/glean/book/reference/metrics/timespan.html#recorded-errors)) if start is called twice in a row without being cancelled, ([by code](https://searchfox.org/glean/source/glean-core/src/metrics/timespan.rs)) if it is stopped without having been started, and (also by code) if the API is used properly but a value is already present in the db (the old value persists).
    * It would be a useful task to enumerate these additional undocumented possibilities in the docs as part of the work of this bug.

This bug is about:
* Ensuring the data flowing to the Scalar is of the same character as before the API was migrated to `timespan`
* Figuring out which state is invalid, and what to do about it
* Documenting the additional invalid states

(( My intuition is that this is a "the API is being used properly, but multiple times per ping" sort of thing, meaning that the data will be relayed to the Scalar per usual and only the Glean data will be affected (because it'll use the first value instead of the most recent value). ))

Revision 1 by

Chris H-C :chutten

on 2024-10-28 09:14:53 PDT

On recent Nightlies, Fenix is [showing](https://mozilla.cloud.looker.com/dashboards/382?Channel=nightly&Submission+Date=14+day+ago+for+14+day&Build+Date+%28Timestamp%29=) 700k invalid state errors per day on `networking.nss_initialization` and `networking.loading_certs_task` (over half of clients) and FOG is [showing](https://mozilla.cloud.looker.com/dashboards/694?Client+Info+App+Channel=nightly&Submission+Date=7+day+ago+for+7+day&Sample+ID=0&Build+Date+%28Datetime%29=) about 250/day (between a fifth and a fourth of clients).

These metrics are recently API-migrated Scalars of the same names from bug 1923028, specifically in [this patch](https://phabricator.services.mozilla.com/D225979), where you'll notice it was I who recommended they be `timespan`s instead of `counter`s and where we built a RAII `AutoGleanTimer` to ensure the APIs are called when appropriate (i.e., when the metric is in a _valid_ state).

Interesting things on first look:
* Such a difference (three orders of magnitude!) between Android and Desktop, wow
* Both metrics are reporting essentially identical numbers and proportions of errors, suggesting a fault in common
* InvalidState errors are recorded in three cases: ([by docs](https://mozilla.github.io/glean/book/reference/metrics/timespan.html#recorded-errors)) if start is called twice in a row without being cancelled, ([by code](https://searchfox.org/glean/source/glean-core/src/metrics/timespan.rs)) if it is stopped without having been started, and (also by code) if the API is used properly but a value is already present in the db (the old value persists).
    * It would be a useful task to enumerate these additional undocumented possibilities in the docs as part of the work of this bug.
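
To make the three invalid states listed above concrete, here is a rough toy model in Python. It is only a sketch of the behaviour described in this comment, not Glean's implementation: `ToyTimespan`, its `clock` parameter, and `submit_ping` are all hypothetical names invented for illustration.

```python
import time


class ToyTimespan:
    """Hypothetical stand-in for a timespan metric (not Glean's real code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable so the demo is deterministic
        self._start = None             # None means "no timer running"
        self.stored = None             # value waiting to go out in the next ping
        self.invalid_state_errors = 0

    def start(self):
        if self._start is not None:
            # Case 1: started again while a timer is already running.
            self.invalid_state_errors += 1
            return
        self._start = self._clock()

    def stop(self):
        if self._start is None:
            # Case 2: stopped without having been started.
            self.invalid_state_errors += 1
            return
        elapsed = self._clock() - self._start
        self._start = None
        if self.stored is not None:
            # Case 3: correct start/stop, but a value is already stored.
            # The old value persists and the new one is discarded.
            self.invalid_state_errors += 1
            return
        self.stored = elapsed

    def cancel(self):
        self._start = None

    def submit_ping(self):
        # Sending the ping clears the stored value for the next period.
        value, self.stored = self.stored, None
        return value


# Two well-formed start/stop pairs within one ping period.
ticks = iter([0, 5, 10, 12])
metric = ToyTimespan(clock=lambda: next(ticks))
metric.start(); metric.stop()   # records 5
metric.start(); metric.stop()   # would record 2, but 5 is already stored
print(metric.stored, metric.invalid_state_errors)  # prints "5 1"
```

In this model the third case matches the "used properly, but multiple times per ping" scenario: every call is paired correctly, yet each span after the first in a ping period counts one error and the first value is the one that is kept.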

This bug is about:
* Ensuring the data flowing to the Scalar is of the same character as before the API was migrated to `timespan`
* Figuring out which state is invalid, and what to do about it
* Documenting the additional invalid states

(( My intuition is that this is a "the API is being used properly, but multiple times per ping" sort of thing, meaning that the data will be relayed to the Scalar per usual and only the Glean data will be affected (because it'll use the first value instead of the most recent value). ))
