Closed Bug 1777233 Opened 3 years ago Closed 2 years ago

Manage ping submission during shutdown a bit better (FOG)

Categories

(Toolkit :: Telemetry, enhancement)

enhancement

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox116 --- fixed

People

(Reporter: Dexter, Assigned: perry.mcmanis)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

In bug 1776201 we found that if we submit FOG pings during shutdown, we'll file to send them most likely because the network layer was already shut down.

We currently shut down FOG on XPCOMShutdown, but we have some other phases we could potentially fit into.

Legacy telemetry, for example, will reject any ping submission/recording happening after AppShutdownTelemetry.

Things we should figure out:

  • when should we stop doing networking?
  • should we only allow ping submission (with upload disable) after networking stops working and until XPCOMShutdown?
See Also: → 1776201

Anything after (or during?) shutdown phase AppShutdownNetTeardown will fail. See logic around tooLateToSend in TelemetrySend where we respond to the JS topic version of the shutdown phase by declaring that everything attempting to be sent from then on is to pend and be stored.

Similar logic could be worked into viaduct_uploader.rs. We have other topic observers in FOG Rust and RunOnShutdown shutdown phase listeners in C++ which could communicate to the uploader that we're too late to send.

A question for Jan-Erik when he returns, and Alessio for today is: Would the Glean SDK like an API for integrators to communicate network availability to? Or is responding to upload requests with recoverable errors the right way to do this?

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten|PTO (back June 30) from comment #1)

Similar logic could be worked into viaduct_uploader.rs. We have other topic observers in FOG Rust and RunOnShutdown shutdown phase listeners in C++ which could communicate to the uploader that we're too late to send.

This sounds like a sensible thing to do. If we're too late to send, there's no reason why we should go through the troubles of attempting to use network :)

We should probably record some data about this, but not sure if this would strictly fit a "network error" definition.

A question for Jan-Erik when he returns, and Alessio for today is: Would the Glean SDK like an API for integrators to communicate network availability to?

Not sure. This feels like an integration-specific issue that should be solved within the FOG layer, given the complexity of the Firefox shutdown.

Or is responding to upload requests with recoverable errors the right way to do this?

Maybe, but it indeed feels "odd".

Flags: needinfo?(alessio.placitelli)

With Jan-Erik back I can ni? him : )

Because I said I could communicate to the uploader we're too late to send... but the only way to do that is to say each attempted upload is a (drumroll) Recoverable Network Failure.

So FOG does need some way to communicate to the SDK that this particular upload request wasn't tried, please try later, but also don't record this as a network failure.

Either that, or we keep it as-is and treat recoverable network failures as "fine" (ie, don't try to fix them).

Flags: needinfo?(jrediger)

Right now there's no way for the uploader to communicate that it knows it can't upload anymore.
So even if it communicates once that it encountered a recoverable network failure, the uploader manager will poll again and give it a new task.

I think it would be worthwhile adding a return value that allows it to stop processing in these cases.

Flags: needinfo?(jrediger)
Depends on: 1780806

With the rollout of Fx 103 at 100% we're seeing a marked increase in http "0"s that's being tracked in https://mozilla-hub.atlassian.net/browse/DSRE-928. These represent submissions where a client initiated an HTTP request but from GCP's perspective the client disconnected before the server could fully respond and e.g. a 200 response could be returned to the client. These appear to be, just as in bug #1776201, mostly newtab pings.

I suspect this is related to the work in this bug and other followup bugs from the incident investigation. My assumption is that we'd ideally like to minimize 0s via client behavior changes and that work on this bug might decrease the number of 0s we're seeing, but that any pings that are interrupted in this way will eventually be resent and so it's not particularly concerning that we're seeing so many disconnects. If this doesn't sound like an accurate characterization of the current state please let me know and I will prioritize server side investigation of the issue.

EDIT: :chutten has confirmed this is likely not a server side issue and the corresponding Jira ticket has been closed

Next steps: Wait for bug 1780806 to get into a release, vendor that release, figure out the best way to communicate we're in/after netteardown (note that PastShutdownPhase must be called on the main thread which we are most definitely not on when inside ViaductUploader, so we're looking at either an observer like this or registering during init a C++ closure to run at shutdown like this (notice the guard for ensuring we're not already beyond the phase we're registering a shutdown handler for) plus a little state.), then if we're in/after net teardown respond with Done all the time.

Assignee: nobody → pmcmanis
See Also: → 1816190
Pushed by pmcmanis@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a0fa68309a2e Improve behavior around Ping Submission during shutdown r=chutten
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch
Blocks: 1845124
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: