Open Bug 1780035 Opened 2 years ago Updated 7 months ago

fog.initialization invalid state errors affect more than 40% of clients

Categories

(Toolkit :: Telemetry, task, P4)

People

(Reporter: chutten, Unassigned)

Details

(Whiteboard: [telemetry:fog:m?])

As seen on the FOG Monitoring Dashboard, the proportion of clients experiencing invalid state errors in the fog.initialization metric (which measures how long Glean+FOG initialization takes) is about 46% each day, across channels.

This will require investigation.

A first theory was that the dispatcher queue is full, so the start() doesn't happen but the stop() does. That doesn't seem to hold: our queue only really started overflowing in March, and the proportion of invalid state errors has been stable since at least mid-February.
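
For anyone picking this up, here is a minimal sketch of the mechanism behind that theory. Everything in it (ToyTimespan, ToyPreInitQueue, simulate) is a hypothetical stand-in written for illustration, not the Glean or FOG API. It assumes a stop() with no matching start() is counted as an invalid state error, and that tasks launched while the bounded queue is full are dropped.

```python
from collections import deque


class ToyTimespan:
    """Hypothetical stand-in for a timespan metric; not the Glean API."""

    def __init__(self):
        self.start_time = None
        self.value = None
        self.invalid_state_errors = 0

    def start(self, now):
        if self.start_time is not None:
            # Starting a timer that is already running is an invalid state.
            self.invalid_state_errors += 1
            return
        self.start_time = now

    def stop(self, now):
        if self.start_time is None:
            # Stopping a timer that never started is an invalid state.
            self.invalid_state_errors += 1
            return
        self.value = now - self.start_time
        self.start_time = None


class ToyPreInitQueue:
    """Hypothetical bounded queue that drops tasks once it is full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.tasks = deque()
        self.overflow_count = 0

    def launch(self, task):
        if len(self.tasks) >= self.max_size:
            self.overflow_count += 1  # the task is dropped, not queued
            return
        self.tasks.append(task)

    def flush(self):
        # Run every queued task in order, as would happen once init completes.
        while self.tasks:
            self.tasks.popleft()()


def simulate(max_size, tasks_before_start):
    """Return (invalid_state_errors, overflow_count) for one toy client."""
    metric = ToyTimespan()
    queue = ToyPreInitQueue(max_size)
    for _ in range(tasks_before_start):
        queue.launch(lambda: None)  # unrelated early work
    queue.launch(lambda: metric.start(0))  # dropped if the queue is full
    queue.flush()
    metric.stop(5)  # runs directly after the flush
    return metric.invalid_state_errors, queue.overflow_count
```

In this toy model, a client whose queue still has room records a normal duration, while a client whose queue filled up before start() was launched gets exactly one overflow and one invalid state error. That pairing is why overflow timing and error timing would be expected to move together if the theory were the whole story.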

Assignee: nobody → pmcmanis
Status: NEW → ASSIGNED

See also bug 1716847 where things are problematic for glean.baseline.duration. On Desktop, that proportion is higher than 50% so if you happen to figure that one out at the same time, well, that'd be nice...

See Also: → 1716847
Depends on: 1790894

Update on investigation of this.

Looking at volumes, it appears that Chutten's original premise has a good likelihood of being a contributor, though not the sole explanation: https://mozilla.cloud.looker.com/dashboards/872?Sample+ID=0&Submission+Date=2022%2F02%2F01+to+2022%2F03%2F17&Label=

There is more overlap than we would expect, and as such it is at least worth considering addressing the preinit queue overflow.

In fact, when we look at FOG clients using Glean >=51.0.0 we notice that in a given day the metrics ping reports about 4% of clients having at least 1 overflow: https://sql.telemetry.mozilla.org/queries/87750/source

Learnings so far:

  • Pre-init queue uses a bounded value to avoid the possibility of unbounded accumulation of measures in the case Glean never finishes initializing
  • It is not trivial to expose the length of that bounded queue from the SDK
  • It would be hard to do a Nimbus experiment even if that weren't the case, because Nimbus needs runtime-configurable values and that value must be known at compile time
  • Passing initialization state back to FOG to run an experiment would be tough (i.e. increase the default size and have a "target" group where we just drop anything past the old value on the floor), because without a guarantee that Glean has yet to initialize, the experiment would not be valid
  • Bounded queue sizes needed to reduce the proportion of clients with pre-init overflows to a given level:
    • 0.5 percent (i.e. 0.005): 5_000
    • 0.1 percent (i.e. 0.001): 18_000
    • 0.01 percent: don't ask
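
The size-versus-overflow trade-off in the last bullet can be read as a quantile question. The sketch below is only an illustration of that reading, not how the numbers above were produced: it assumes a per-client sample of how many tasks were dispatched before init is available, which the second bullet suggests is not easy to get. The function name min_queue_size and the toy_counts values are made up.

```python
import math


def min_queue_size(pre_init_task_counts, target_overflow_fraction):
    """Smallest bound such that at most the target fraction of clients overflow.

    A client is treated as overflowing when its pre-init task count
    exceeds the bound.
    """
    if not pre_init_task_counts:
        raise ValueError("need at least one client")
    ordered = sorted(pre_init_task_counts)
    n = len(ordered)
    # Number of clients that are allowed to overflow.
    allowed = math.floor(n * target_overflow_fraction)
    if allowed >= n:
        return 0  # every client may overflow, so any bound satisfies the target
    # The bound must cover every client except the `allowed` largest ones.
    return ordered[n - allowed - 1]


# Made-up per-client counts, purely for illustration.
toy_counts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(min_queue_size(toy_counts, 0.1))
```

A long-tailed distribution of pre-init task counts would explain why each further reduction in the overflow rate costs a disproportionately larger queue.
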
See Also: → 1790872
Depends on: 1797494
See Also: → 1797828
See Also: 1716847

Increasing the preinit queue to 10^6 in bug 1796258 did not reduce the number of glean.baseline.duration invalid_state errors (though it did reduce the number of overflows quite nicely). I think we may need to leave baseline's duration to bug 1716847 and, for the fog.initialization errors, hope that bug 1797619 will take care of things.

See Also: → 1797619

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: pmcmanis → nobody
Status: ASSIGNED → NEW
Priority: -- → P4