Spike in unrecoverable network errors for Firefox Desktop Nightly
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: janerik, Assigned: chutten)
References
Details
(Keywords: leave-open)
Attachments
(4 files)
Description
Query: https://sql.telemetry.mozilla.org/queries/85601/#211984
Spike from ~11k to ~49k "unrecoverable" network errors, on Nightly only.
Assignee
Comment 1•3 years ago
This correlates even more heavily with build date than with submission date, which suggests something landed in the first Nightly of June 18th.
Reporter
Comment 2•3 years ago
The Glean update landed on 2022-06-15: https://hg.mozilla.org/mozilla-central/rev/b726eab21f86
Assignee
Comment 3•3 years ago
Pushlog for code that landed after the last June 17 build but before the first June 18 build: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d7d4a8c914f7&tochange=eda29d58035f
Looking at that, I notice a certain :chutten landed a medium-volume ping (the "newtab" ping, between "baseline" and "events" in volume), which might be the culprit, inasmuch as trying to submit a ping all the time might have uncovered times where we react to ping submission with unrecoverable failures.
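To make that hypothesis concrete, here is a minimal sketch of defining and submitting a custom ping at session end with the Glean Rust Language Bindings. This is illustration only: the ping name and the session-end hook are hypothetical, the constructor arguments assume the 2022-era PingType::new signature (name, include_client_id, send_if_empty, reason_codes), and the real "newtab" ping in Firefox Desktop is declared in pings.yaml rather than in Rust.

    // Sketch only: assumes the 2022-era glean Rust Language Bindings, where
    // PingType::new takes (name, include_client_id, send_if_empty, reason_codes).
    // The ping name and on_newtab_session_end() are hypothetical.
    use glean::private::PingType;
    use once_cell::sync::Lazy;

    static NEWTAB_LIKE_PING: Lazy<PingType> = Lazy::new(|| {
        PingType::new(
            "newtab-like", // hypothetical stand-in for "newtab"
            true,          // include_client_id
            false,         // send_if_empty: only submit when there is data
            vec![],        // no reason codes declared (see comment 11)
        )
    });

    fn on_newtab_session_end() {
        // Every session end queues an upload attempt; if the session ends at
        // shutdown, the HTTP request can fail and be recorded as unrecoverable.
        NEWTAB_LIKE_PING.submit(None);
    }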
Assignee
Comment 4•3 years ago
A theory that sounded good but didn't pan out: the schema deploy for the "newtab" ping didn't happen until the early hours of Tuesday. Maybe these errors are the pings being rejected for being unknown?
If that were the case, we'd see a sharp drop in error counts when the schema deploy happened, and no such drop appears in this per-hour (live tables) time series. So even if the theory is true, it doesn't explain what we're seeing.
Reporter
Comment 5•3 years ago
Side note: I don't think we ever reject pings on the edge, as long as they follow the spec. So even if the schema isn't deployed, they will be accepted and only dropped later.
Comment 6•3 years ago
:chutten wrote in Slack:
From what I can tell, though, this has not been shown to be causing problems to anything except the completeness of the newtab data collection, so we may not need to take any action at all. I would like confirmation that this isn't causing problems for the pipeline and confidence that these errors are only affecting newtab pings first.
As far as I can tell [1], this isn't causing any problems in the ingestion pipeline, and I agree that taking no action at all is probably acceptable here. I only checked operational metrics and didn't verify whether this is affecting only newtab pings.
[1] Cursory inspection of https://console.cloud.google.com/monitoring?project=moz-fx-data-ingesti-prod-579d&timeDomain=1h with some emphasis on HTTP responses
Reporter
Comment 7•3 years ago
Reporter
Comment 9•3 years ago
Assigning to chutten, as he'll do the pref-flip experiment.
Assignee
Comment 10•3 years ago
The experiment is live, targeting 100% of Nightly 103. See the Monitoring Dashboard and the network error live data comparison.
Assignee
Comment 11•3 years ago
The current theory is that these neterrs were caused by newtab sessions ending as a result of shutdown. To test this hypothesis, you'll need to ensure that networking isn't lying and pretending success without actually trying to send a ping (docs). The way that makes the most sense to me is setting a debug tag. Then you'll want logging on so that neterrs (and successes) will log from this line.
I do this via:
    GLEAN_DEBUG_VIEW_TAG="chutten-newtab-err" RUST_LOG="glean,fog" ./mach run
Fun things I've learned so far:
- [2022-06-24T21:01:12Z ERROR glean_core::metrics::ping] Invalid reason code startup for ping newtab
  Turns out pings with events in them can be sent with reasons that aren't declared for them. We should file and fix this in the SDK, and in the meantime add a reason "startup" to the "newtab" ping (see the sketch after this list).
- Debug builds don't have problems sending "newtab" pings around shutdown (it doesn't appear to try)... but this might be because of all the
  [2022-06-24T21:03:43Z INFO glean_core::dispatcher::global] Failed to launch a task on the queue. Discarding task.
  that happens at shutdown (probably due to all the threadstats instrumentation happening).
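For the first bullet, a minimal sketch of the interim fix, under the same 2022-era Rust Language Bindings assumptions as the sketch in comment 3. In Firefox the reasons for "newtab" actually live in pings.yaml, and the reason names other than "startup" here are illustrative; the point is that once the reason code is declared, submitting with it stops tripping the "Invalid reason code" error.

    // Sketch only, same 2022-era RLB assumptions as the earlier sketch; the
    // real change would be made in pings.yaml, not in Rust.
    use glean::private::PingType;
    use once_cell::sync::Lazy;

    static NEWTAB_LIKE_PING: Lazy<PingType> = Lazy::new(|| {
        PingType::new(
            "newtab-like",
            true,
            false,
            // Declaring "startup" is the interim fix suggested above; the SDK
            // issue (undeclared reasons slipping through) still needs filing.
            vec!["newtab_session_end".into(), "startup".into()],
        )
    });

    fn submit_at_startup() {
        // With "startup" declared, this no longer logs
        // "Invalid reason code startup for ping newtab".
        NEWTAB_LIKE_PING.submit(Some("startup"));
    }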
I'd retest on an opt build, but I'm already past my EOW. Good hunting to whoever picks this up (if they do before I get back).
Comment 12•3 years ago
bugherder
Comment 13•3 years ago
Assignee
Comment 14•3 years ago
Resolving this as fixed. All follow-up work is being tracked in other bugs or by the Incident Manager.