Investigate precipitous increase in `glean.restarted` InvalidState errors starting in Fx111 (Glean v52)
Categories
(Toolkit :: Telemetry, defect, P1)
Tracking
| Release | Tracking | Status |
|---|---|---|
| firefox-esr102 | --- | unaffected |
| firefox109 | --- | unaffected |
| firefox110 | --- | unaffected |
| firefox111 | --- | fixed |
People
(Reporter: chutten|PTO, Assigned: chutten|PTO)
References
(Regression)
Details
(Keywords: regression)
Attachments
(1 file)
FOG Monitoring at the Glean SDK Weekly meeting noticed that we're getting a precipitous increase in InvalidState errors on the `glean.restarted` event metric (showing up in 70% of clients and climbing).
(Nothing so far on Firefox for Android, suggesting it might be Desktop-specific.)
(We do see some of this from Dec 18 onwards in Mozilla VPN, but at least an order of magnitude less. This may mean that it's a Glean SDK problem, not a FOG-specific problem, but it's too early to tell.)
`event` metrics tend not to report InvalidState or InvalidValue in general, but `glean.restarted` does because of its role in cross-application-session event timestamp ordering (bug 1716725). Specifically, InvalidState on `glean.restarted` happens iff, while collecting all events for a Custom Ping across perhaps multiple application runs, any of the following holds:
- The `glean.startup.date` event extra on `glean.restarted` is present, but not parseable.
- The execution counter (a monotonically increasing value denoting the application session's order in sequence) changes without there being a `glean.restarted` event in between.
- The event timestamp within a given run somehow went backwards.
The resulting event stream aims to remain consistent in the face of these errors, so the events should still be safe to use in all the normal ways.
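The three conditions above can be sketched as a small checker. This is an illustrative simplification, not glean-core's actual code: the event dict layout (`name`, `timestamp`, `execution_counter`, `extra`), the function name, and the use of ISO 8601 parsing for the date are all assumptions made for the example.

```python
from datetime import datetime

RESTARTED = "glean.restarted"


def count_invalid_state(events):
    """Count InvalidState conditions in an ordered list of event dicts.

    Hypothetical event shape: {"name", "timestamp", "execution_counter",
    "extra" (optional dict of strings)}.
    """
    errors = 0
    cur_ec = None   # execution counter of the run being walked
    last_ts = None  # last timestamp seen within that run
    for ev in events:
        if ev["name"] == RESTARTED:
            # Condition 1: the date extra is present but not parseable.
            date = ev.get("extra", {}).get("glean.startup.date")
            if date is not None:
                try:
                    datetime.fromisoformat(date)
                except ValueError:
                    errors += 1
            # A restart legitimately starts a new run.
            cur_ec = ev["execution_counter"]
            last_ts = None
            continue
        if cur_ec is None:
            cur_ec = ev["execution_counter"]
        elif ev["execution_counter"] != cur_ec:
            # Condition 2: counter changed with no restart event in between.
            errors += 1
            cur_ec = ev["execution_counter"]
            last_ts = None
        # Condition 3: timestamps went backwards within a run.
        if last_ts is not None and ev["timestamp"] < last_ts:
            errors += 1
        last_ts = ev["timestamp"]
    return errors
```

A well-formed stream (events from run 0, a `glean.restarted` with a parseable date, then events from run 1) yields zero errors under this sketch.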
This bug is about investigating what's going on:
- This is happening to enough of the population that it should be reproducible locally. Use that to nail down which variety of InvalidState we're dealing with.
- Is this bug in the glean-core impl of `event` metrics in Custom Pings, or is it FOG-specific?
- Are the resulting error-ing event streams consistent?
- Should this block a planned uplift of Glean v52?
Comment 1•2 years ago (Assignee)
(A reminder to self that, while testing this, Activity Stream's telemetry pref (`browser.newtabpage.activity-stream.telemetry`) must be switched to true.)
Local testing suggests that every cross-app-session "newtab" ping is exhibiting InvalidState errors of kind Inconsistent execution counter (i.e. this line). Specifically, it's expecting 0.
...oh, I think I see it. We trim initial glean.restarted events before we read their execution_counter, which means that for every execution after the first, the first event in this loop will have a non-zero execution_counter, but because it's the first iteration of the loop cur_ec will still be 0.
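A minimal reproduction of that reasoning, under stated assumptions: the event dict layout and both function names are hypothetical, and the loop is a simplification rather than the glean-core implementation. Seeding `cur_ec` from the first remaining event is just one of the possible fixes, used here to show the contrast.

```python
RESTARTED = "glean.restarted"


def trim_leading_restarts(events):
    """Drop glean.restarted events at the very start of the stored list."""
    i = 0
    while i < len(events) and events[i]["name"] == RESTARTED:
        i += 1
    return events[i:]


def counter_errors(events, seed_from_first_event):
    """Count execution-counter mismatches after trimming leading restarts.

    seed_from_first_event=False mimics the described behaviour (cur_ec
    starts at 0); True shows one possible fix (hypothetical).
    """
    events = trim_leading_restarts(events)
    if not events:
        return 0
    cur_ec = events[0]["execution_counter"] if seed_from_first_event else 0
    errors = 0
    for ev in events:
        if ev["name"] == RESTARTED:
            cur_ec = ev["execution_counter"]
            continue
        if ev["execution_counter"] != cur_ec:
            errors += 1
            cur_ec = ev["execution_counter"]
    return errors


# Storage as it might look in the second application run: the leading
# restart marker is trimmed, so the first "normal" event has counter 1.
stored = [
    {"name": RESTARTED, "execution_counter": 1},
    {"name": "newtab.opened", "execution_counter": 1},
    {"name": "newtab.closed", "execution_counter": 1},
]
buggy = counter_errors(stored, seed_from_first_event=False)
fixed = counter_errors(stored, seed_from_first_event=True)
```

With `cur_ec` left at 0 the first event after the trim is flagged as a mismatch; with the seed taken from that event, no error is recorded. The event names `newtab.opened` and `newtab.closed` are placeholders.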
Next steps:
- Write a test to assert no errors when we have storage of `glean.restarted` events and "normal" events with non-zero `execution_counter`.
- Maybe properly initialize `cur_ec`, or maybe remove the InvalidState error on mismatch, to fix the test.
Alas, because these errors happen during event snapshotting, they're sent in the pings following the pings that actually exhibit the error (metrics, including error metrics, are snapshotted before events). So there may be no more straightforward way to validate that this is indeed the problem we're facing than landing, releasing, and vendoring the fix.
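The one-ping delay can be illustrated with a toy model. Everything here (the class, its fields, the ping dict shape) is hypothetical and only mirrors the ordering described above: metrics are snapshotted first, then events, and any error found while snapshotting events lands in the counter for the next ping.

```python
class ToyPingAssembler:
    """Toy model of the snapshot ordering; not the Glean SDK's API."""

    def __init__(self, events_are_inconsistent):
        self.events_are_inconsistent = events_are_inconsistent
        self.invalid_state_errors = 0  # error metric waiting to be sent

    def assemble_ping(self):
        # Step 1: snapshot metrics, including error metrics.
        metrics = {"invalid_state": self.invalid_state_errors}
        self.invalid_state_errors = 0
        # Step 2: snapshot events; an error found now is too late for
        # this ping's metrics and waits for the next one.
        if self.events_are_inconsistent:
            self.invalid_state_errors += 1
        return {
            "metrics": metrics,
            "events_had_error": self.events_are_inconsistent,
        }


assembler = ToyPingAssembler(events_are_inconsistent=True)
first = assembler.assemble_ping()
assembler.events_are_inconsistent = False
second = assembler.assemble_ping()
```

In this model the first ping carries the problematic events but reports no error, and the second ping reports the error while its own events are fine.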
Comment 2•2 years ago
Comment 3•2 years ago
Comment 4•2 years ago
:chutten firefox-111 is set to affected for this regression.
The attached PR was merged; is there additional work required for 111?
Comment 5•2 years ago (Assignee)
There sure is. Lemme find it. bug 1812615 will bring these changes over. Should I mark bug 1812615 as `111: affected` to track?
Comment 6•2 years ago
(In reply to Chris H-C :chutten from comment #5)
> There sure is. Lemme find it. bug 1812615 will bring these changes over. Should I mark bug 1812615 as `111: affected` to track?

Yes, and thanks for the info.