Open Bug 1780035 Opened 2 years ago Updated 7 months ago

fog.initialization invalid state errors affect more than 40% of clients

Status:

NEW

People

(Reporter: chutten, Unassigned)

References

Details

(Whiteboard: [telemetry:fog:m?])

Chris H-C :chutten

Reporter

Description

•

2 years ago

•

Edited

As seen on the FOG Monitoring Dashboard, the proportion of clients experiencing invalid state errors in the fog.initialization metric (which measures how long Glean+FOG initialization takes) is about 46% each day, across channels.

This will require investigation.

A first theory was that the dispatcher queue is full, so the start() never happens but the stop() does. That doesn't seem to hold: our queue overflows only really started in March, and the proportion of invalid state errors has been stable since at least mid-February.
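To make the first theory concrete, here is a minimal toy model of it. This is a sketch under stated assumptions, not Glean or FOG code: `ToyTimer`, `ToyPreInitQueue`, and `simulate` are hypothetical names, and the assumption is that start() is queued before initialization while stop() runs after the queue has drained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyTimer:
    """Hypothetical stand-in for a timer-like metric (not the Glean type)."""
    started_at: Optional[float] = None
    value: Optional[float] = None
    invalid_state_errors: int = 0

    def start(self, now: float) -> None:
        self.started_at = now

    def stop(self, now: float) -> None:
        if self.started_at is None:
            # stop() without a matching start(): count an error, record nothing.
            self.invalid_state_errors += 1
            return
        self.value = now - self.started_at
        self.started_at = None


@dataclass
class ToyPreInitQueue:
    """Hypothetical bounded queue of tasks recorded before init completes."""
    max_len: int
    tasks: List[Callable[[], None]] = field(default_factory=list)
    overflow_count: int = 0

    def launch(self, task: Callable[[], None]) -> None:
        if len(self.tasks) >= self.max_len:
            # Queue is full: the task is dropped and only counted.
            self.overflow_count += 1
            return
        self.tasks.append(task)

    def flush(self) -> None:
        # Initialization finished: run everything that was queued.
        for task in self.tasks:
            task()
        self.tasks.clear()


def simulate(max_len: int, tasks_before_start: int) -> ToyTimer:
    """Queue some unrelated tasks, then start(); stop() runs after the flush."""
    timer = ToyTimer()
    queue = ToyPreInitQueue(max_len=max_len)
    for _ in range(tasks_before_start):
        queue.launch(lambda: None)
    queue.launch(lambda: timer.start(now=0.0))  # dropped if the queue is full
    queue.flush()
    timer.stop(now=1.5)
    return timer
```

In this model, `simulate(max_len=10, tasks_before_start=10)` drops the start() and ends with one invalid state error and no recorded value, while `simulate(max_len=10, tasks_before_start=3)` records a value normally. It only illustrates the mechanism; it says nothing about how often that happens in the field.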

Chris H-C :chutten

Reporter

Updated

•

2 years ago

Assignee: nobody → pmcmanis

Status: NEW → ASSIGNED

Chris H-C :chutten

Reporter

Comment 1

•

2 years ago

See also bug 1716847 where things are problematic for glean.baseline.duration. On Desktop, that proportion is higher than 50% so if you happen to figure that one out at the same time, well, that'd be nice...

Updated

•

2 years ago

Depends on: 1790894

Perry McManis [:perry.mcmanis]

Comment 2

•

2 years ago

Update on investigation of this.

Looking at volumes, it appears that Chutten's original premise has a good likelihood of being a contributor, though not the sole explanation: https://mozilla.cloud.looker.com/dashboards/872?Sample+ID=0&Submission+Date=2022%2F02%2F01+to+2022%2F03%2F17&Label=

There is more overlap than we would expect, and as such it is at least worth considering addressing preinit queue overflow.

In fact, when we look at FOG clients using Glean >=51.0.0 we notice that in a given day the metrics ping reports about 4% of clients having at least 1 overflow: https://sql.telemetry.mozilla.org/queries/87750/source

Learnings so far:

- The pre-init queue is bounded so that measurements cannot accumulate without limit if Glean never finishes initializing.
- It is not trivial to expose the length of that bounded queue from the SDK.
- Even if it were, a Nimbus experiment would be hard: Nimbus needs runtime-configurable values, and the queue bound must be known at compile time.
- Passing initialization state back to FOG to run an experiment (i.e. increase the default size and have a "target" group where we just drop anything past the old value on the floor) would be tough: without being able to guarantee that Glean has yet to initialize, the experiment would not be valid.
- Bounded queue sizes needed to reduce pre-init overflows to a given level:
  - 0.5 percent (i.e. 0.005): 5_000
  - 0.1 percent (i.e. 0.001): 18_000
  - 0.01 percent: don't ask
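One way to arrive at queue sizes like the ones above is to treat the bound as a quantile of per-client pre-init task counts. The sketch below is an assumption about the method, not the query that was actually run: `min_queue_bound` is a hypothetical helper, and it assumes a client overflows when it queues more tasks than the bound.

```python
import math
from typing import Sequence


def min_queue_bound(pre_init_task_counts: Sequence[int],
                    target_overflow_rate: float) -> int:
    """Smallest bound such that at most `target_overflow_rate` of clients overflow.

    Assumption: a client overflows when its pre-init task count exceeds the bound.
    """
    if not pre_init_task_counts:
        raise ValueError("need at least one client")
    if not 0.0 <= target_overflow_rate < 1.0:
        raise ValueError("target_overflow_rate must be in [0, 1)")

    counts = sorted(pre_init_task_counts)
    n = len(counts)
    # Number of clients we tolerate overflowing (rounded down, so conservative).
    allowed = math.floor(target_overflow_rate * n)
    # Every client at or below this position in sorted order fits in the queue.
    return counts[n - allowed - 1]
```

With made-up counts `[10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]`, a 10% target gives a bound of 90 and a 0% target gives 1000. A long tail like that would be consistent with the bound growing quickly as the target rate shrinks.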

Chris H-C :chutten

Reporter

Updated

•

2 years ago

Updated

•

2 years ago

Depends on: 1797494

Chris H-C :chutten

Reporter

Comment 3

Increasing the preinit queue to 10^6 in bug 1796258 did not reduce the number of glean.baseline.duration invalid_state errors (though it did reduce the number of overflows quite nicely). I think we may need to leave baseline's duration to bug 1716847 and, for the fog.initialization errors, hope that bug 1797619 will take care of things.

Chris H-C :chutten

Reporter

Updated

•

2 years ago

Comment 4

•

7 months ago

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: pmcmanis → nobody

Status: ASSIGNED → NEW

Chris H-C :chutten

Reporter

Updated

•

7 months ago

Priority: -- → P4


Categories

(Toolkit :: Telemetry, task, P4)
