Spike in unrecoverable network errors for Firefox Desktop Nightly
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: janerik, Assigned: chutten)
References
Details
(Keywords: leave-open)
Attachments
(4 files)
Description
Query: https://sql.telemetry.mozilla.org/queries/85601/#211984
Spike from ~11k to ~49k "unrecoverable" network errors, on Nightly only.
Assignee
Comment 1•3 years ago
This correlates even more heavily with build date than with submission date, which suggests something landed in the first Nightly of June 18th.
Reporter
Comment 2•3 years ago
The Glean update landed on 2022-06-15: https://hg.mozilla.org/mozilla-central/rev/b726eab21f86
Assignee
Comment 3•3 years ago
Pushlog for code that landed after the last June 17 build but before the first June 18 build: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d7d4a8c914f7&tochange=eda29d58035f
Looking at that, I notice a certain :chutten landed a medium-volume ping (the "newtab" ping, between "baseline" and "events" in volume), which might be the culprit, inasmuch as trying to submit a ping all the time might have uncovered times where we react to ping submission with unrecoverable failures.
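To make that hypothesis concrete, here is a minimal sketch of defining and submitting a custom ping at session end with the Glean Rust Language Bindings. This is illustration only: the ping name and the session-end hook are hypothetical, the constructor arguments assume the 2022-era PingType::new signature (name, include_client_id, send_if_empty, reason_codes), and the real "newtab" ping in Firefox Desktop is declared in pings.yaml rather than in Rust.

    // Sketch only: assumes the 2022-era glean Rust Language Bindings, where
    // PingType::new takes (name, include_client_id, send_if_empty, reason_codes).
    // The ping name and on_newtab_session_end() are hypothetical.
    use glean::private::PingType;
    use once_cell::sync::Lazy;

    static NEWTAB_LIKE_PING: Lazy<PingType> = Lazy::new(|| {
        PingType::new(
            "newtab-like", // hypothetical stand-in for "newtab"
            true,          // include_client_id
            false,         // send_if_empty: only submit when there is data
            vec![],        // no reason codes declared (see comment 11)
        )
    });

    fn on_newtab_session_end() {
        // Every session end queues an upload attempt; if the session ends at
        // shutdown, the HTTP request can fail and be recorded as unrecoverable.
        NEWTAB_LIKE_PING.submit(None);
    }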
Assignee
Comment 4•3 years ago
A theory that sounded good but didn't pan out: the schema deploy for the "newtab" ping didn't happen until the early hours of Tuesday. Maybe these errors are the pings being rejected for being unknown?
If that were the case, we'd see a sharp drop in error counts when the schema deploy happened, and no such drop appears in this per-hour (live tables) time series. So even if the theory is true, it doesn't explain what we're seeing.
Reporter
Comment 5•3 years ago
Side note: I don't think we ever reject pings on the edge, as long as they follow the spec. So even if the schema isn't deployed, they will be accepted and only dropped later.
Comment 6•3 years ago
:chutten wrote in Slack:
From what I can tell, though, this has not been shown to be causing problems to anything except the completeness of the newtab data collection, so we may not need to take any action at all. I would like confirmation that this isn't causing problems for the pipeline and confidence that these errors are only affecting newtab pings first.
As far as I can tell [1], this isn't causing any problems in the ingestion pipeline, and I agree that taking no action at all is probably acceptable here. I only checked operational metrics and didn't verify whether this is affecting only newtab pings.
[1] Cursory inspection of https://console.cloud.google.com/monitoring?project=moz-fx-data-ingesti-prod-579d&timeDomain=1h with some emphasis on HTTP responses
Reporter
Comment 7•3 years ago
Reporter
Comment 9•3 years ago
Assigning to chutten, as he'll do the pref-flip experiment.
Assignee
Comment 10•3 years ago
The experiment is live, targeting 100% of Nightly 103. See the Monitoring Dashboard and the network error live data comparison.
Assignee
Comment 11•3 years ago
The current theory is that these neterrs were caused by newtab sessions ending as a result of shutdown. To test this hypothesis, you'll need to ensure that networking isn't lying and pretending success without actually trying to send a ping (docs). The way that makes the most sense to me is setting a debug tag. Then you'll want logging on so that neterrs (and successes) will log from this line.
I do this via:
    GLEAN_DEBUG_VIEW_TAG="chutten-newtab-err" RUST_LOG="glean,fog" ./mach run
Fun things I've learned so far:
- [2022-06-24T21:01:12Z ERROR glean_core::metrics::ping] Invalid reason code startup for ping newtab
  Turns out pings with events in them can be sent with reasons that aren't declared for them. We should file and fix this in the SDK, and in the meantime add a reason "startup" to the "newtab" ping (see the sketch after this list).
- Debug builds don't have problems sending "newtab" pings around shutdown (it doesn't appear to try)... but this might be because of all the
  [2022-06-24T21:03:43Z INFO glean_core::dispatcher::global] Failed to launch a task on the queue. Discarding task.
  that happens at shutdown (probably due to all the threadstats instrumentation happening).
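For the first bullet, a minimal sketch of the interim fix, under the same 2022-era Rust Language Bindings assumptions as the sketch in comment 3. In Firefox the reasons for "newtab" actually live in pings.yaml, and the reason names other than "startup" here are illustrative; the point is that once the reason code is declared, submitting with it stops tripping the "Invalid reason code" error.

    // Sketch only, same 2022-era RLB assumptions as the earlier sketch; the
    // real change would be made in pings.yaml, not in Rust.
    use glean::private::PingType;
    use once_cell::sync::Lazy;

    static NEWTAB_LIKE_PING: Lazy<PingType> = Lazy::new(|| {
        PingType::new(
            "newtab-like",
            true,
            false,
            // Declaring "startup" is the interim fix suggested above; the SDK
            // issue (undeclared reasons slipping through) still needs filing.
            vec!["newtab_session_end".into(), "startup".into()],
        )
    });

    fn submit_at_startup() {
        // With "startup" declared, this no longer logs
        // "Invalid reason code startup for ping newtab".
        NEWTAB_LIKE_PING.submit(Some("startup"));
    }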
I'd retest on an opt build, but I'm already past my EOW. Good hunting to whoever picks this up (if they do before I get back).
Comment 12•3 years ago
bugherder
Comment 13•3 years ago
Assignee
Comment 14•3 years ago
Resolving this as fixed. All follow-up work is being tracked in other bugs or by the Incident Manager.