Crash in [@ AsyncShutdownTimeout | profile-before-change-telemetry | CrashManager: submitting Glean crash ping(s)]
Categories
(Toolkit :: Telemetry, defect, P1)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox-esr115 | --- | unaffected |
| firefox-esr140 | --- | unaffected |
| firefox147 | --- | unaffected |
| firefox148 | --- | unaffected |
| firefox149 | --- | wontfix |
| firefox150 | + | fixed |
| firefox151 | --- | fixed |
People
(Reporter: aryx, Assigned: afranchuk)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files)
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-beta+
|
Details | Review |
169 crash reports from 146 installs of Firefox 149.0a1, oldest build ID 20260128215503.
Chris, do you know what started this?
Crash report: https://crash-stats.mozilla.org/report/index/febd5ebd-fc75-4e94-9034-23b5b0260219
MOZ_CRASH Reason:
[Parent 4130, Main Thread] ###!!! ABORT: file resource://gre/modules/CrashManager.sys.mjs:73
Top 10 frames:
0 XUL MOZ_CrashSequence(void*, long) mfbt/Assertions.h:237
0 XUL MOZ_Crash(char const*, int, char const*) mfbt/Assertions.h:375
0 XUL Abort(char const*) xpcom/base/nsDebugImpl.cpp:529
1 XUL NS_DebugBreak xpcom/base/nsDebugImpl.cpp:0
2 XUL nsDebugImpl::Abort(char const*, int) xpcom/base/nsDebugImpl.cpp:129
3 XUL NS_InvokeByIndex
4 XUL CallMethodHelper::Invoke() js/xpconnect/src/XPCWrappedNative.cpp:1619
4 XUL CallMethodHelper::Call() js/xpconnect/src/XPCWrappedNative.cpp:1173
4 XUL XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) js/xpconnect/src/XPCWrappedNative.cpp:1119
5 XUL XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) js/xpconnect/src/XPCWrappedNativeJSOps.cpp:966
Comment 1•4 months ago
|
||
Timeline lines up with bug 1989439
| Assignee | ||
Comment 2•4 months ago
|
||
Yes, that's from the new crash ping submission flow. I'll think about what may be a better approach. I have considered launching the crash reporter in the background at startup to submit any pending pings; if we do that, it wouldn't be a big deal to simply kill the process in these cases, because it would submit them later. We could also potentially disown the process, too (it is very unlikely to hang around stuck; in these cases the external process is probably just blocked on Glean shutdown, waiting for pings to be sent).
Updated•4 months ago
|
| Assignee | ||
Comment 3•4 months ago
|
||
FWIW, this is very likely an issue due to a combination of factors, each of which adds latency:
- An external program is now used for sending crash pings.
- The external program only sends one ping (and then exits).
- The external program invokes (at least at first) a firefox background task to use the network stack for sending the ping.
- The external program interacts with its own glean store, so it locks the store (only one instance of a program may run at a time).
I speculate that this is getting hit when there are a lot of crashes in rapid succession, creating a chain of latency. There are a few things we can do to mitigate this.
There's also a separate discussion to be had about whether we should care very much about this sort of case, too. While ideally we get all crash information, this sort of case is more than likely a single crash cause occurring repeatedly (often in a background process), so we are trying to preserve and send a whole bunch of near-identical crashes.
Comment 4•4 months ago
|
||
I think the thesis that we see a lot of (equal) pings might actually be wrong. The tl;dr is probably that the child process used to send the pings may block on anything also the parent could block (slow file IO, network, ...) and may just never finish. Of course we have 0 insight in the state of the client process when this happens, but experience from other shutdown hangs show that even simplest OS operations can block endlessly during shutdown.
I actually started from thinking about adding a crash annotation with the initial amount of pings to send. But:
I took a closer look at the code and the crash data. Here's what I found:
The 87% single-blocker case is the main problem
Looking at all 82 crashes with this signature via SuperSearch, the
async_shutdown_timeoutannotation reveals:71 crashes (87%) have exactly 1 blocker — a single crash ping submission is timing out
11 crashes (13%) have 2–13 blockers (queue buildup from multiple crashes)
Note: We seem to create the shutdown blockers all at once, so the chain of blockers in the hang annotation shows us the number of pending pings at the moment of the hang. Numbers are low, most of the times just one.
So while the unbounded queue (comment 3) makes things worse in some cases, even a single
sendGleanPingcall can exceed the shutdown timeout. The queue pileup isn't the primary driver.Why a single ping can block shutdown
Each
sendGleanPingcall spawns a child process that does significant sequential work:
flock(LOCK_EX)/LockFileExoncrashping.pid— blocks if another instance holds it
Glean DB init (file I/O)
Submit ping +glean::shutdown()— waits for network upload
Unlock + exitAny of these can be slow, especially on Windows where file I/O can block for extended periods (AV scanning, disk encryption, indexing, slow HDD). We also have no visibility into what the child is doing — the blocker state is always
(none).Silent data loss already happens today
The event file gets deleted as soon as
_handleEventFilePayloadreturnsEVENT_FILE_SUCCESS, butsendGleanPingis not awaited at that point. If the child process fails (network, Glean issue, crash), the ping is silently lost with no retry. So the current approach blocks shutdown and can lose data anyway.
It is important to mention that this loss seems to be limited to one ping, though, IIUC. Disowning the process and going on could actually make us loose all, so I do not think this is an option.
Suggested approach: decouple ping submission from shutdown entirely
Instead of blocking shutdown on the child process, move to a file-based handoff:
When processing an event file, write the annotations to a pending-pings directory (e.g.
pending-pings/<crash-id>.json)
I assume we originally wanted to avoid to do this from the parent, as it might block, like all file IO. So maybe loosing a ping once in a while is just fine...
Spawn one child process (not one per crash) pointing at that directory, and don't block on it
The child iterates over pending files, inits Glean once, sends all pings, and deletes each pending file only after successful submission
At next startup, if there are leftover files inpending-pings/, spawn the child againThis way shutdown never blocks on crash ping submission, and no data is lost — a pending file only disappears when the ping was actually sent. Failed submissions survive on disk for the next session to retry.
I am not sure there is real urge to do anything, until numbers stay relatively low. But the current architecture feels a bit incomplete if our goal is to get every possible ping over the wire. Alternatively to the above I could also imagine a world where piping the data to the child makes actually sense to avoid further disk IO, but then the child should really avoid ANY disk IO, and I do not think we can easily achieve that. So the above suggestion seems easier to achieve (and btw this would mimic to some extent a background deleter for origins we have elsewhere).
| Assignee | ||
Comment 5•4 months ago
|
||
I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it would normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit. I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point).
We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the crashping crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data).
One of the main benefits of using Glean here was supposed to be that it would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.
Comment 6•4 months ago
•
|
||
(In reply to Alex Franchuk [:afranchuk] from comment #5)
I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it would normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit.
It shows all active blockers and there are a few reports where this is visible for this signature.
I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point).
Good point. So it would be safe to disown the process once we now it persisted the data locally to the Glean DB (which might also block, so until then loss could happen, IIUC).
We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the
crashpingcrate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data).
That sounds like a good plan. We may indeed be able to find better moments than shutdown to hand over the JSON to the crashreporter, like through the UserIdleService.
One of the main benefits of using Glean here was supposed to be that it would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.
I think Glean being the right choice holds, we must "just" make sure that things arrive in the right Glean DB in time without shutdown interfering. My gut feeling tells me that disk IO is a likely culprit, like we often see in QM or other shutdown hangs. As said, preparing all things up to that point while we are alive could help. I'd want to avoid to do more work directly at startup, we should use later pauses.
Comment 7•4 months ago
•
|
||
Just an additional note on what I assume we see here, that is late content process crashes close to or during parent shutdown whose processing bleeds into or happens during parent shutdown:
Recapping my understanding of what's actually happening here:
For child process crashes, the ping submission is triggered immediately when the crash occurs:
CrashService.addCrash()→ minidump analysis →CrashManager.addCrash()→sendGleanPing()spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.Note that
maybeRunMinidumpAnalyzeris skipped whengQuittingis true, but the rest of the flow — reading the.extrafile, computing the hash, spawning the ping child — still runs during shutdown.So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.
This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.
So I wonder if there is much we can improve in the flow, as most likely the vast majority of crashes are already processed before the parent enters shutdown. This may explain the relatively low numbers here.
Updated•4 months ago
|
Updated•4 months ago
|
| Assignee | ||
Comment 8•4 months ago
|
||
We could try to use the existing event file infrastructure to temporarily persist shutdown crash events. When we start up again, they will be read and pings will be sent. Note: I saw one or two crashes that had a stack indicating the crash was being sent from this code path at shutdown, however all the rest were from normal crash handling code paths, so taking this approach would address most crashes (assuming they don't get blocked on IO writing the event files, that is).
On the other hand, it's important to remember that at this point we already have the crash data persisted. Our code is reading the extra file and dump files that are already on disk. So it might be even easier for us to simply not make any attempt at sending crash pings during shutdown, and instead send them later when we find the pending crashes. We have code that looks for pending crashes to prompt users (e.g. "you have unsubmitted crash reports"). We could do something similar for crash pings, though it would occur silently, and we'd need to add some bit of bookkeeping to know that a ping was sent, which would be separate from the bookkeeping done for reports.
Updated•4 months ago
|
Comment 9•4 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 desktop browser crashes on nightly
:afranchuk, could you consider increasing the severity of this top-crash bug?
For more information, please visit BugBot documentation.
Comment 10•4 months ago
|
||
The bug is marked as tracked for firefox149 (beta) and tracked for firefox150 (nightly). However, the bug still has low severity.
:Dexter, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 11•4 months ago
|
||
Hey Chris, would you please check comment 10? Thanks!
| Assignee | ||
Comment 13•4 months ago
•
|
||
I'm confused by bugbot: this has very few crashes. I'm not sure what I'm doing wrong here, but I can't get it to come up on crash-stats in the list of top crashers for Nightly/Beta at all (I've expanded filters as much as possible). I do not see how this is a top-10 crash.
| Assignee | ||
Comment 14•4 months ago
|
||
By the way, looking at the crashes on crash-stats: it appears that it is dominated by Tahoe installs, and almost all of the other crashes are also various macOS versions. This implies that either
- there is some additional slowness introduced by macOS (which might be related to what's happening in bug 2019539), and/or
- macOS more frequently experiences crashes in Nightly at shutdown.
| Assignee | ||
Comment 15•4 months ago
|
||
Also note that, so far, there aren't any crash reports from beta. The crash reports we have received are also mostly (60%) from users with automatic submission enabled, which is more common in our Nightly population. :jstutte had theorized out-of-band that the crashes during shutdown (which are subsequently causing the async shutdown timeout in ping submission) are content process IPC-ShutdownKill crashes. If this is the case, these will only be crashes in Nightly, and not in beta/release. If we continue to see no crashes in beta, it may corroborate this idea.
| Assignee | ||
Comment 16•4 months ago
|
||
WRT comment 13: it seems as if the top crashers list may exclude some signatures? In supersearch, I do see this signature, albeit it is ranked 20th (but I imagine bugbot is set to ignore some of the signatures here, so that may increase the rank).
Comment 17•4 months ago
|
||
I think around merge day statistics are a bit shuffled due to there not being data for the new version numbers, not sure if that can still have some effect today.
| Assignee | ||
Comment 18•4 months ago
|
||
It appears there are some beta crashes. This implies there are other shutdown crashes besides IPC ShutDownKill that are triggering this subsequent crash.
| Comment hidden (duplicate) |
Comment 20•3 months ago
|
||
Running the following crash test locally as part of the WebDriver test suite always causes a timeout, due to a shutdown hang in CrashManager while it tries to submit Glean crash pings.
mach wpt --headless --setpref="remote.log.level=Trace" --webdriver-arg=-vv testing/web-platform/mozilla/tests/webdriver/harness/detect_crash.py
Here an excerpt from the log output:
0:19.27 INFO STDOUT: WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"CrashManager: submitting Glean crash ping(s)","state":"(none)","filename":"resource://gre/modules/CrashManager.sys.mjs","lineNumber":73,"stack":["resource://gre/modules/CrashManager.sys.mjs:sendGleanPing:73","resource://gre/modules/CrashManager.sys.mjs:_sendCrashPing:829","resource://gre/modules/CrashManager.sys.mjs:addCrash/promise<:536"]}] Barrier: profile-before-change-telemetry
0:36.00 TEST_END: TIMEOUT, expected OK
Alex, I assume that this is the same problem? Note that for WebDriver we have Telemetry turned off.
| Assignee | ||
Comment 21•3 months ago
|
||
Yes, that's likely the same issue. We should short-circuit the logic for testing: the CrashManager will invoke the crash reporter client to send the ping, but the crash reporter client will never actually send a ping if tests are being run. So we could detect a testing environment and skip launching the client entirely.
Comment 22•3 months ago
|
||
Note that in terms of WebDriver we set the following environment variables to control the behavior of the crash reporter:
.env("MOZ_CRASHREPORTER", "1")
.env("MOZ_CRASHREPORTER_NO_REPORT", "1")
.env("MOZ_CRASHREPORTER_SHUTDOWN", "1");
Maybe that's helpful as well.
Updated•3 months ago
|
Comment 23•3 months ago
|
||
REO for 150: Is this a user perceived crash (that we'd like to get into 150)?
| Assignee | ||
Comment 24•3 months ago
•
|
||
This is user perceived: the crash reporter will come up when Firefox is shutting down. I double-checked crash pings, and the counts there are roughly the same as the report counts, which is nice. So that confirms this is fairly low-volume. I've still been mulling over a solution here, and I'll try to put up a patch soon (with intent to uplift to 150 beta).
Comment 25•3 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
| Assignee | ||
Comment 26•3 months ago
|
||
We will send them upon the next restart.
Updated•3 months ago
|
Updated•2 months ago
|
Comment 27•2 months ago
|
||
:afranchuk just a friendly reminder that RC builds next week
| Assignee | ||
Comment 28•2 months ago
|
||
I have not landed this yet because it will make uplifting bug 2026650 more difficult, but I suppose both need to get into beta, so I can do this one first. Bug 2026650 just backed out for an odd reason, so it might take a bit to figure it out.
Comment 29•2 months ago
|
||
Comment 30•2 months ago
|
||
| bugherder | ||
Comment 31•2 months ago
|
||
The patch landed in nightly and beta is affected.
:afranchuk, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- See https://wiki.mozilla.org/Release_Management/Requesting_an_Uplift for documentation on how to request an uplift.
- If no, please set
status-firefox150towontfix.
For more information, please visit BugBot documentation.
Comment 32•2 months ago
|
||
firefox-beta Uplift Approval Request
- User impact if declined/Reason for urgency: More crashes at shutdown, and missed crash pings from crashes during shutdown.
- Code covered by automated testing?: yes
- Fix verified in Nightly?: yes
- Needs manual QE testing?: no
- Steps to reproduce for manual QE testing:
- Risk associated with taking this patch: low
- Explanation of risk level: This is logically a somewhat simple change. In the worst case (some weird interaction with the crash store, perhaps), we will miss some pings, but in most cases the logic remains the same as before.
- String changes made/needed?: No
- Is Android affected?: no
| Assignee | ||
Comment 33•2 months ago
|
||
We will send them upon the next restart.
Original Revision: https://phabricator.services.mozilla.com/D291323
| Assignee | ||
Updated•2 months ago
|
Comment 34•2 months ago
|
||
looks like there are still some crashes in nightly, can you take a look before we uplift this?
| Assignee | ||
Comment 35•2 months ago
|
||
There is still the possibility of some crashes, because we may be uploading pings and then shutdown occurs. These patches fix the case where we are already in shutdown and then a crash occurs (which causes a ping to be sent, which immediately crashes again). The former is a different issue, indicating that the external ping submission takes too long. Now this is making me wonder whether there should be an AsyncShutdown blocker registered at all: this can just as easily be a fire-and-forget process (we don't need to do anything with the response). A somewhat related precedent: the crash helper is also left to itself to shutdown (we simply drop our communication channel with it).
Updated•2 months ago
|
Updated•2 months ago
|
Comment 36•2 months ago
|
||
| uplift | ||
Description
•