Manage ping submission during shutdown a bit better (FOG)
Categories
(Toolkit :: Telemetry, enhancement)
Tracking
()
Tracking | Status | |
---|---|---|
firefox116 | --- | fixed |
People
(Reporter: Dexter, Assigned: perry.mcmanis)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
In bug 1776201 we found that if we submit FOG pings during shutdown, we'll file to send them most likely because the network layer was already shut down.
We currently shut down FOG on XPCOMShutdown, but we have some other phases we could potentially fit into.
Legacy telemetry, for example, will reject any ping submission/recording happening after AppShutdownTelemetry.
Things we should figure out:
- when should we stop doing networking?
- should we only allow ping submission (with upload disable) after networking stops working and until XPCOMShutdown?
Comment 1•3 years ago
|
||
Anything after (or during?) shutdown phase AppShutdownNetTeardown
will fail. See logic around tooLateToSend
in TelemetrySend
where we respond to the JS topic version of the shutdown phase by declaring that everything attempting to be sent from then on is to pend and be stored.
Similar logic could be worked into viaduct_uploader.rs
. We have other topic observers in FOG Rust and RunOnShutdown
shutdown phase listeners in C++ which could communicate to the uploader that we're too late to send.
A question for Jan-Erik when he returns, and Alessio for today is: Would the Glean SDK like an API for integrators to communicate network availability to? Or is responding to upload requests with recoverable errors the right way to do this?
Reporter | ||
Comment 2•3 years ago
|
||
(In reply to Chris H-C :chutten|PTO (back June 30) from comment #1)
Similar logic could be worked into
viaduct_uploader.rs
. We have other topic observers in FOG Rust andRunOnShutdown
shutdown phase listeners in C++ which could communicate to the uploader that we're too late to send.
This sounds like a sensible thing to do. If we're too late to send, there's no reason why we should go through the troubles of attempting to use network :)
We should probably record some data about this, but not sure if this would strictly fit a "network error" definition.
A question for Jan-Erik when he returns, and Alessio for today is: Would the Glean SDK like an API for integrators to communicate network availability to?
Not sure. This feels like an integration-specific issue that should be solved within the FOG layer, given the complexity of the Firefox shutdown.
Or is responding to upload requests with recoverable errors the right way to do this?
Maybe, but it indeed feels "odd".
Comment 3•3 years ago
|
||
With Jan-Erik back I can ni? him : )
Because I said I could communicate to the uploader we're too late to send... but the only way to do that is to say each attempted upload is a (drumroll) Recoverable Network Failure.
So FOG does need some way to communicate to the SDK that this particular upload request wasn't tried, please try later, but also don't record this as a network failure.
Either that, or we keep it as-is and treat recoverable network failures as "fine" (ie, don't try to fix them).
Comment 4•3 years ago
|
||
Right now there's no way for the uploader to communicate that it knows it can't upload anymore.
So even if it communicates once that it encountered a recoverable network failure, the uploader manager will poll again and give it a new task.
I think it would be worthwhile adding a return value that allows it to stop processing in these cases.
Comment 5•3 years ago
•
|
||
With the rollout of Fx 103 at 100% we're seeing a marked increase in http "0"s that's being tracked in https://mozilla-hub.atlassian.net/browse/DSRE-928. These represent submissions where a client initiated an HTTP request but from GCP's perspective the client disconnected before the server could fully respond and e.g. a 200 response could be returned to the client. These appear to be, just as in bug #1776201, mostly newtab pings.
I suspect this is related to the work in this bug and other followup bugs from the incident investigation. My assumption is that we'd ideally like to minimize 0s via client behavior changes and that work on this bug might decrease the number of 0s we're seeing, but that any pings that are interrupted in this way will eventually be resent and so it's not particularly concerning that we're seeing so many disconnects. If this doesn't sound like an accurate characterization of the current state please let me know and I will prioritize server side investigation of the issue.
EDIT: :chutten has confirmed this is likely not a server side issue and the corresponding Jira ticket has been closed
Comment 6•2 years ago
|
||
Next steps: Wait for bug 1780806 to get into a release, vendor that release, figure out the best way to communicate we're in/after netteardown (note that PastShutdownPhase
must be called on the main thread which we are most definitely not on when inside ViaductUploader
, so we're looking at either an observer like this or registering during init a C++ closure to run at shutdown like this (notice the guard for ensuring we're not already beyond the phase we're registering a shutdown handler for) plus a little state.), then if we're in/after net teardown respond with Done
all the time.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 7•2 years ago
|
||
Comment 9•2 years ago
|
||
bugherder |
Description
•