Closed Bug 1776201 Opened 3 years ago Closed 3 years ago

Spike in unrecoverable network errors for Firefox Desktop Nightly

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: janerik, Assigned: chutten)

References

Details

(Keywords: leave-open)

Attachments

(4 files)

Query: https://sql.telemetry.mozilla.org/queries/85601/#211984

Spike from ~11k to ~49k "unrecoverable" network errors on Nightly only.

Attached image by_buildid

Correlated even more heavily with build than with submission date. Suggests something landed in the first Nightly of June 18th.

The Glean update landed on 2022-06-15: https://hg.mozilla.org/mozilla-central/rev/b726eab21f86

Pushlog for code that landed after the last June 17 build but before the first June 18 build: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d7d4a8c914f7&tochange=eda29d58035f

Looking at that, I notice a certain :chutten landed the "newtab" ping, a medium-volume ping (between "baseline" and "events" in volume), which might be the culprit. Trying to submit a ping that often might have uncovered times where we react to ping submission with unrecoverable failures.

A theory that was good but didn't pan out: The schema deploy for the "newtab" ping didn't happen until the early hours of Tuesday. Maybe these errors are the pings being rejected for being unknown?

If that were the case we'd see a sharp drop in error counts when the schema deploy happened, and that isn't seen in this per-hour (live tables) time series. So even if true, it doesn't explain what we're seeing.
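A minimal sketch of the step-change check described above, assuming the per-hour error counts have already been pulled into (hour, count) pairs. The function name, the hour indexing, and the sample numbers are all hypothetical and are not taken from the live tables.

```python
from statistics import mean


def after_to_before_ratio(hourly_counts, deploy_hour):
    """Compare mean hourly error counts after a deploy to those before it.

    hourly_counts: list of (hour_index, error_count) pairs (hypothetical shape).
    deploy_hour: hour index at which the schema deploy happened (hypothetical).
    A ratio well below 1.0 would support the "rejected for being unknown"
    theory; a ratio near 1.0 would not.
    """
    before = [count for hour, count in hourly_counts if hour < deploy_hour]
    after = [count for hour, count in hourly_counts if hour >= deploy_hour]
    if not before or not after:
        raise ValueError("need data on both sides of the deploy hour")
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("no errors before the deploy, ratio is undefined")
    return mean(after) / baseline


# Made-up illustration data: a flat series, like the one seen in this bug.
flat_series = [(hour, 100) for hour in range(6)]
flat_ratio = after_to_before_ratio(flat_series, deploy_hour=3)
```

A flat series like the made-up one here gives a ratio of 1.0, which is the shape that rules the theory out as an explanation.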

Side note: I don't think we ever reject pings on the edge, as long as they follow the spec. So even if the schema isn't deployed they will be accepted and only later dropped.

:chutten wrote in slack:

From what I can tell, though, this has not been shown to be causing problems to anything except the completeness of the newtab data collection, so we may not need to take any action at all. I would like confirmation that this isn't causing problems for the pipeline and confidence that these errors are only affecting newtab pings first.

As far as I can tell [1], this isn't causing any problems in the ingestion pipeline, and I agree that taking no action at all is probably acceptable here. I only checked operational metrics and didn't verify whether this is affecting only newtab pings.

[1] cursory inspection of https://console.cloud.google.com/monitoring?project=moz-fx-data-ingesti-prod-579d&timeDomain=1h with some emphasis on http responses

Assignee: nobody → jrediger
Status: NEW → ASSIGNED
Attachment #9282762 - Attachment description: WIP: Bug 1776201 - Consider only some errors unrecoverable, others as recoverable. r?chutten! → WIP: Bug 1776201 - Consider only some errors unrecoverable, others as recoverable.
Keywords: leave-open
Attachment #9282762 - Attachment description: WIP: Bug 1776201 - Consider only some errors unrecoverable, others as recoverable. → Bug 1776201 - Consider only some errors unrecoverable, others as recoverable. r?chutten!
Attachment #9282762 - Attachment description: Bug 1776201 - Consider only some errors unrecoverable, others as recoverable. r?chutten! → Bug 1776201 - Consider only some errors unrecoverable, others as recoverable. r?Dexter
Pushed by jrediger@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/99dedbe96a85 Consider only some errors unrecoverable, others as recoverable. r=Dexter
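For readers skimming the thread, here is a rough sketch of the idea in the patch title: only an explicit set of failures is treated as unrecoverable, and everything else is treated as recoverable so the upload can be tried again. This is not the landed code. The outcome names, the error-kind strings, and the choice of which kinds count as permanent are all hypothetical.

```python
from enum import Enum


class UploadOutcome(Enum):
    SUCCESS = "success"
    RECOVERABLE = "recoverable"      # caller may retry the upload later
    UNRECOVERABLE = "unrecoverable"  # caller gives up on this ping


# Hypothetical error kinds that retrying cannot fix.
PERMANENT_ERROR_KINDS = frozenset({"malformed_url", "unsupported_scheme"})


def classify_upload_error(error_kind):
    """Map a network error kind (or None on success) to an upload outcome.

    Only kinds listed in PERMANENT_ERROR_KINDS are unrecoverable. Unknown
    kinds default to recoverable, which is the design choice this sketch
    assumes from the patch title.
    """
    if error_kind is None:
        return UploadOutcome.SUCCESS
    if error_kind in PERMANENT_ERROR_KINDS:
        return UploadOutcome.UNRECOVERABLE
    return UploadOutcome.RECOVERABLE
```

With this shape, a transient failure such as a hypothetical "timeout" kind lands in the recoverable bucket instead of inflating the unrecoverable count.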

Assigning to chutten as he'll do the pref-flip experiment

Assignee: jrediger → chutten

The experiment is live, targeting 100% of Nightly 103. See the Monitoring Dashboard and the Network error live data comparison.

Current theory is that these neterrs were caused by newtab sessions ending as a result of shutdown. To test this hypothesis you'll need to ensure that networking isn't lying and pretending success without actually trying to send a ping (docs). The way that makes the most sense to me is setting a debug tag. Then you'll want logging on so that neterrs (and successes) will log from this line.

I do this via

GLEAN_DEBUG_VIEW_TAG="chutten-newtab-err" RUST_LOG="glean,fog" ./mach run

Fun things I've learned so far

  1. [2022-06-24T21:01:12Z ERROR glean_core::metrics::ping] Invalid reason code startup for ping newtab -- Turns out pings with events in them can be sent with reasons that aren't declared for them. We should file and fix this in the SDK, and in the meantime add a "startup" reason to the "newtab" ping.
  2. Debug builds don't have problems sending "newtab" pings around shutdown (they don't appear to try)... but this might be because of all the "[2022-06-24T21:03:43Z INFO glean_core::dispatcher::global] Failed to launch a task on the queue. Discarding task." messages that happen at shutdown (probably due to all the threadstats instrumentation happening).

I'd retest on opt build, but I'm already past my EOW. Good hunting to whoever picks this up (if they do before I get back)

Priority: P3 → P2
Priority: P2 → P1
Whiteboard: [telemetry:glean-rs:m?]
See Also: → 1777233

Resolving this as fixed. All follow-up work is being tracked in other bugs or by the Incident Manager.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED