fog.initialization invalid state errors affect more than 40% of clients
Categories
(Toolkit :: Telemetry, task, P4)
People
(Reporter: chutten, Unassigned)
Details
(Whiteboard: [telemetry:fog:m?])
As seen on the FOG Monitoring Dashboard, the proportion of clients experiencing invalid state errors in the fog.initialization metric (which measures how long Glean+FOG initialization takes) is about 46% each day, across channels.
This will require investigation.
A first theory -- that the dispatcher queue is full, so the start() never happens but the stop() does -- doesn't seem to hold: our queue overflows only really started in March, while the proportion of invalid state errors has been stable since at least mid-February.
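The theory above can be sketched as a toy model. This is not Glean's or FOG's actual dispatcher code: `PreInitDispatcher`, `Timespan`, and `simulate` are hypothetical names made up for illustration, and the sketch assumes that tasks launched before initialization are queued up to a fixed bound and silently dropped past it, while tasks launched after initialization run immediately.

```python
from collections import deque


class PreInitDispatcher:
    """Toy dispatcher: queues tasks until flush(), dropping any past max_queue."""

    def __init__(self, max_queue):
        self.max_queue = max_queue
        self.queue = deque()
        self.flushed = False
        self.overflow_count = 0

    def launch(self, task):
        if self.flushed:
            task()  # after init, tasks run right away
        elif len(self.queue) < self.max_queue:
            self.queue.append(task)
        else:
            self.overflow_count += 1  # queue is full: the task is dropped

    def flush(self):
        """Simulate the end of initialization: run everything queued so far."""
        while self.queue:
            self.queue.popleft()()
        self.flushed = True


class Timespan:
    """Toy timing metric: stop() without a prior start() is an invalid state."""

    def __init__(self):
        self.started = False
        self.invalid_state_errors = 0

    def start(self):
        self.started = True

    def stop(self):
        if not self.started:
            self.invalid_state_errors += 1
            return
        self.started = False


def simulate(max_queue, tasks_before_start):
    """Return (overflow_count, invalid_state_errors) for one hypothetical client."""
    dispatcher = PreInitDispatcher(max_queue)
    metric = Timespan()
    for _ in range(tasks_before_start):
        dispatcher.launch(lambda: None)  # unrelated pre-init work
    dispatcher.launch(metric.start)  # dropped if the queue is already full
    dispatcher.flush()
    dispatcher.launch(metric.stop)  # always runs, since init has finished
    return dispatcher.overflow_count, metric.invalid_state_errors
```

In this model, `simulate(3, 3)` drops the `start()` and records one invalid state error, while `simulate(3, 2)` records none. If the theory were the whole story, invalid state errors would only appear alongside overflows, which is what the timing mismatch described above argues against.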
Comment 1 (Reporter)•2 years ago
See also bug 1716847, where things are problematic for glean.baseline.duration. On Desktop, that proportion is higher than 50%, so if you happen to figure that one out at the same time, well, that'd be nice...
Comment 2•2 years ago
Update on investigation of this.
Looking at volumes, it appears that Chutten's original premise has a good likelihood of being a contributor, though not the sole explanation: https://mozilla.cloud.looker.com/dashboards/872?Sample+ID=0&Submission+Date=2022%2F02%2F01+to+2022%2F03%2F17&Label=
There is more overlap than we would expect, and as such it is at least worth considering addressing preinit queue overflow.
In fact, when we look at FOG clients using Glean >=51.0.0 we notice that in a given day the metrics ping reports about 4% of clients having at least 1 overflow: https://sql.telemetry.mozilla.org/queries/87750/source
Learnings so far:
- Pre-init queue uses a bounded value to avoid the possibility of unbounded accumulation of measures in the case Glean never finishes initializing
- It is not trivial to expose the length of that bounded queue from the SDK
- It would be hard to do a Nimbus experiment even if that weren't the case, because Nimbus needs runtime-configurable values and that value must be known at compile time
- Passing initialization state back to FOG to do an experiment would be tough (i.e. increase the default size and have a "target" group where we just drop anything past the old value on the floor) because, without being able to guarantee that Glean has yet to initialize, the experiment would not be valid
- Bounded queue sizes to reduce pre-init overflows to a smaller level:
  - 0.5 percent (i.e. 0.005): 5_000
  - 0.1 percent (i.e. 0.001): 18_000
  - 0.01 percent: don't ask
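One way to read the queue sizes above is as quantiles of the per-client pre-init task count: the bound for a given target is the smallest size that at most that fraction of clients would exceed. A minimal sketch follows, assuming such per-client counts were available (the learnings above note that exposing the queue length from the SDK is not trivial). The `smallest_bound` helper and the sample counts are hypothetical and are not the telemetry behind the figures in this bug.

```python
import math


def smallest_bound(preinit_task_counts, target_overflow_rate):
    """Smallest queue bound that at most target_overflow_rate of clients exceed.

    preinit_task_counts: one hypothetical task count per client.
    target_overflow_rate: a fraction, e.g. 0.005 for 0.5 percent.
    """
    counts = sorted(preinit_task_counts)
    n = len(counts)
    if n == 0:
        raise ValueError("need at least one client")
    # Number of clients we are willing to let overflow.
    allowed = math.floor(target_overflow_rate * n)
    if allowed >= n:
        return 0  # every client may overflow, so no capacity is required
    # Every client except the `allowed` largest must fit within the bound.
    return counts[n - allowed - 1]


# Made-up counts for ten clients, one of which queues far more than the rest.
sample_counts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
bound_for_10_percent = smallest_bound(sample_counts, 0.1)
bound_for_0_percent = smallest_bound(sample_counts, 0.0)
```

With a long-tailed distribution like the made-up one here, tightening the target forces the bound up to the size of the largest outliers, which is consistent with the required size growing quickly as the target rate shrinks.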
Comment 3 (Reporter)•2 years ago
Increasing the preinit queue to 10^6 in bug 1796258 did not reduce the number of glean.baseline.duration invalid_state errors (though it did reduce the number of overflows quite nicely). I think we may need to leave baseline's duration to bug 1716847 and hope that, for the fog.initialization errors, bug 1797619 will take care of things.
Comment 4•7 months ago
The bug assignee is inactive on Bugzilla, so the assignee is being reset.