Closed Bug 2017910 Opened 4 months ago Closed 2 months ago

Crash in [@ AsyncShutdownTimeout | profile-before-change-telemetry | CrashManager: submitting Glean crash ping(s)]

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
151 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr140 --- unaffected
firefox147 --- unaffected
firefox148 --- unaffected
firefox149 --- wontfix
firefox150 + fixed
firefox151 --- fixed

People

(Reporter: aryx, Assigned: afranchuk)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

169 crash reports from 146 installs of Firefox 149.0a1, oldest build ID 20260128215503.

Chris, do you know what started this?

Crash report: https://crash-stats.mozilla.org/report/index/febd5ebd-fc75-4e94-9034-23b5b0260219

MOZ_CRASH Reason:

[Parent 4130, Main Thread] ###!!! ABORT: file resource://gre/modules/CrashManager.sys.mjs:73

Top 10 frames:

0  XUL  MOZ_CrashSequence(void*, long)  mfbt/Assertions.h:237
0  XUL  MOZ_Crash(char const*, int, char const*)  mfbt/Assertions.h:375
0  XUL  Abort(char const*)  xpcom/base/nsDebugImpl.cpp:529
1  XUL  NS_DebugBreak  xpcom/base/nsDebugImpl.cpp:0
2  XUL  nsDebugImpl::Abort(char const*, int)  xpcom/base/nsDebugImpl.cpp:129
3  XUL  NS_InvokeByIndex
4  XUL  CallMethodHelper::Invoke()  js/xpconnect/src/XPCWrappedNative.cpp:1619
4  XUL  CallMethodHelper::Call()  js/xpconnect/src/XPCWrappedNative.cpp:1173
4  XUL  XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)  js/xpconnect/src/XPCWrappedNative.cpp:1119
5  XUL  XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)  js/xpconnect/src/XPCWrappedNativeJSOps.cpp:966
Flags: needinfo?(chutten)

Timeline lines up with bug 1989439

Flags: needinfo?(chutten) → needinfo?(afranchuk)
Keywords: regression
Regressed by: 1989439

Yes, that's from the new crash ping submission flow. I'll think about what may be a better approach. I have considered launching the crash reporter in the background at startup to submit any pending pings; if we do that, it wouldn't be a big deal to simply kill the process in these cases, because it would submit them later. We could also potentially disown the process, too (it is very unlikely to hang around stuck; in these cases the external process is probably just blocked on Glean shutdown, waiting for pings to be sent).

Assignee: nobody → afranchuk
Status: NEW → ASSIGNED
Flags: needinfo?(afranchuk)

FWIW, this is very likely an issue due to a combination of factors, each of which adds latency:

  1. An external program is now used for sending crash pings.
  2. The external program only sends one ping (and then exits).
  3. The external program invokes (at least at first) a firefox background task to use the network stack for sending the ping.
  4. The external program interacts with its own glean store, so it locks the store (only one instance of a program may run at a time).

I speculate that this is getting hit when there are a lot of crashes in rapid succession, creating a chain of latency. There are a few things we can do to mitigate this.

There's also a separate discussion to be had about whether we should care very much about this sort of case, too. While ideally we get all crash information, this sort of case is more than likely a single crash cause occurring repeatedly (often in a background process), so we are trying to preserve and send a whole bunch of near-identical crashes.

I think the thesis that we see a lot of (equal) pings might actually be wrong. The tl;dr is probably that the child process used to send the pings may block on anything also the parent could block (slow file IO, network, ...) and may just never finish. Of course we have 0 insight in the state of the client process when this happens, but experience from other shutdown hangs show that even simplest OS operations can block endlessly during shutdown.

I actually started from thinking about adding a crash annotation with the initial amount of pings to send. But:

I took a closer look at the code and the crash data. Here's what I found:

The 87% single-blocker case is the main problem

Looking at all 82 crashes with this signature via SuperSearch, the async_shutdown_timeout annotation reveals:

71 crashes (87%) have exactly 1 blocker — a single crash ping submission is timing out
11 crashes (13%) have 2–13 blockers (queue buildup from multiple crashes)

Note: We seem to create the shutdown blockers all at once, so the chain of blockers in the hang annotation shows us the number of pending pings at the moment of the hang. Numbers are low, most of the times just one.

So while the unbounded queue (comment 3) makes things worse in some cases, even a single sendGleanPing call can exceed the shutdown timeout. The queue pileup isn't the primary driver.

Why a single ping can block shutdown

Each sendGleanPing call spawns a child process that does significant sequential work:

flock(LOCK_EX)/LockFileEx on crashping.pid — blocks if another instance holds it
Glean DB init (file I/O)
Submit ping + glean::shutdown() — waits for network upload
Unlock + exit

Any of these can be slow, especially on Windows where file I/O can block for extended periods (AV scanning, disk encryption, indexing, slow HDD). We also have no visibility into what the child is doing — the blocker state is always (none).

Silent data loss already happens today

The event file gets deleted as soon as _handleEventFilePayload returns EVENT_FILE_SUCCESS, but sendGleanPing is not awaited at that point. If the child process fails (network, Glean issue, crash), the ping is silently lost with no retry. So the current approach blocks shutdown and can lose data anyway.

It is important to mention that this loss seems to be limited to one ping, though, IIUC. Disowning the process and going on could actually make us loose all, so I do not think this is an option.

Suggested approach: decouple ping submission from shutdown entirely

Instead of blocking shutdown on the child process, move to a file-based handoff:

When processing an event file, write the annotations to a pending-pings directory (e.g. pending-pings/<crash-id>.json)

I assume we originally wanted to avoid to do this from the parent, as it might block, like all file IO. So maybe loosing a ping once in a while is just fine...

Spawn one child process (not one per crash) pointing at that directory, and don't block on it
The child iterates over pending files, inits Glean once, sends all pings, and deletes each pending file only after successful submission
At next startup, if there are leftover files in pending-pings/, spawn the child again

This way shutdown never blocks on crash ping submission, and no data is lost — a pending file only disappears when the ping was actually sent. Failed submissions survive on disk for the next session to retry.

I am not sure there is real urge to do anything, until numbers stay relatively low. But the current architecture feels a bit incomplete if our goal is to get every possible ping over the wire. Alternatively to the above I could also imagine a world where piping the data to the child makes actually sense to avoid further disk IO, but then the child should really avoid ANY disk IO, and I do not think we can easily achieve that. So the above suggestion seems easier to achieve (and btw this would mimic to some extent a background deleter for origins we have elsewhere).

I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it would normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit. I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point).

We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the crashping crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data).

One of the main benefits of using Glean here was supposed to be that it would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.

(In reply to Alex Franchuk [:afranchuk] from comment #5)

I just looked at the annotations for the crashes in the past month, and they all seem like single-blockers to me, unless I am missing some subtlety. I didn't realize that it would show more of them (or rather, I wasn't sure whether it was condensing them to a single entry). If it would normally show a list of them, then I definitely agree: it seems like there is no build-up of crashes, and they are simply taking too long to submit.

It shows all active blockers and there are a few reports where this is visible for this signature.

I may try to add something to get visibility into what it's doing at that point. If, for instance, it is always stuck sending the pings, it is safe to kill it (the pings will already be persisted in the Glean DB at that point).

Good point. So it would be safe to disown the process once we now it persisted the data locally to the Glean DB (which might also block, so until then loss could happen, IIUC).

We will eventually move to crash pings being sent by the crash helper. This will have a number of advantages, like initing a Glean DB once and sending pings as necessary from a single, long-lived process. It will also always be spawned at startup, so it will be able to handle retries if pings didn't previously send. I don't know whether the crash helper is yet ready to take on this responsibility (though :gsvelto could weigh in on that), but the crashping crate was made with that transition in mind, so when the time comes it should be as simple as a boilerplate Glean setup and a single additional function call in the crash helper (the crash ping crate needs only the extras JSON data).

That sounds like a good plan. We may indeed be able to find better moments than shutdown to hand over the JSON to the crashreporter, like through the UserIdleService.

One of the main benefits of using Glean here was supposed to be that it would handle management of pending pings and reliable sending, but to do so we need to get the pings into the store in the first place (and that is clearly made more difficult by needing the store to be managed by the crash reporter client). It would be disappointing to duplicate that logic manually, when we just removed the code doing so for Legacy Telemetry logic, but I will think it over further.

I think Glean being the right choice holds, we must "just" make sure that things arrive in the right Glean DB in time without shutdown interfering. My gut feeling tells me that disk IO is a likely culprit, like we often see in QM or other shutdown hangs. As said, preparing all things up to that point while we are alive could help. I'd want to avoid to do more work directly at startup, we should use later pauses.

Just an additional note on what I assume we see here, that is late content process crashes close to or during parent shutdown whose processing bleeds into or happens during parent shutdown:

Recapping my understanding of what's actually happening here:

For child process crashes, the ping submission is triggered immediately when the crash occurs: CrashService.addCrash() → minidump analysis → CrashManager.addCrash()sendGleanPing() spawns the child process and registers the shutdown blocker. This is the real-time path, not deferred aggregation of old event files.

Note that maybeRunMinidumpAnalyzer is skipped when gQuitting is true, but the rest of the flow — reading the .extra file, computing the hash, spawning the ping child — still runs during shutdown.

So the likely scenario is: content processes crash close to or during shutdown (not unusual — the parent tears down IPC channels, child processes may crash as a result), the ping child gets spawned at that moment, and it needs to do flock + Glean init + DB write + network upload + Glean shutdown while the parent is counting down the shutdown timeout. That's a race it often loses on slow systems, especially Windows with AV/disk encryption overhead.

This is consistent with the data: 87% of these hangs have exactly 1 blocker — a single content process crash near shutdown, not a backlog.

So I wonder if there is much we can improve in the flow, as most likely the vast majority of crashes are already processed before the parent enters shutdown. This may explain the relatively low numbers here.

Severity: -- → S2
Priority: -- → P1
Severity: S2 → S4

We could try to use the existing event file infrastructure to temporarily persist shutdown crash events. When we start up again, they will be read and pings will be sent. Note: I saw one or two crashes that had a stack indicating the crash was being sent from this code path at shutdown, however all the rest were from normal crash handling code paths, so taking this approach would address most crashes (assuming they don't get blocked on IO writing the event files, that is).

On the other hand, it's important to remember that at this point we already have the crash data persisted. Our code is reading the extra file and dump files that are already on disk. So it might be even easier for us to simply not make any attempt at sending crash pings during shutdown, and instead send them later when we find the pending crashes. We have code that looks for pending crashes to prompt users (e.g. "you have unsubmitted crash reports"). We could do something similar for crash pings, though it would occur silently, and we'd need to add some bit of bookkeeping to know that a ping was sent, which would be separate from the bookkeeping done for reports.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:afranchuk, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(afranchuk)
Keywords: topcrash

The bug is marked as tracked for firefox149 (beta) and tracked for firefox150 (nightly). However, the bug still has low severity.

:Dexter, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(alessio.placitelli)

Hey Chris, would you please check comment 10? Thanks!

Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)

I'll let :afranchuk decide

Flags: needinfo?(chutten)

I'm confused by bugbot: this has very few crashes. I'm not sure what I'm doing wrong here, but I can't get it to come up on crash-stats in the list of top crashers for Nightly/Beta at all (I've expanded filters as much as possible). I do not see how this is a top-10 crash.

Flags: needinfo?(afranchuk)

By the way, looking at the crashes on crash-stats: it appears that it is dominated by Tahoe installs, and almost all of the other crashes are also various macOS versions. This implies that either

  1. there is some additional slowness introduced by macOS (which might be related to what's happening in bug 2019539), and/or
  2. macOS more frequently experiences crashes in Nightly at shutdown.

Also note that, so far, there aren't any crash reports from beta. The crash reports we have received are also mostly (60%) from users with automatic submission enabled, which is more common in our Nightly population. :jstutte had theorized out-of-band that the crashes during shutdown (which are subsequently causing the async shutdown timeout in ping submission) are content process IPC-ShutdownKill crashes. If this is the case, these will only be crashes in Nightly, and not in beta/release. If we continue to see no crashes in beta, it may corroborate this idea.

WRT comment 13: it seems as if the top crashers list may exclude some signatures? In supersearch, I do see this signature, albeit it is ranked 20th (but I imagine bugbot is set to ignore some of the signatures here, so that may increase the rank).

I think around merge day statistics are a bit shuffled due to there not being data for the new version numbers, not sure if that can still have some effect today.

It appears there are some beta crashes. This implies there are other shutdown crashes besides IPC ShutDownKill that are triggering this subsequent crash.

Running the following crash test locally as part of the WebDriver test suite always causes a timeout, due to a shutdown hang in CrashManager while it tries to submit Glean crash pings.

mach wpt --headless --setpref="remote.log.level=Trace" --webdriver-arg=-vv testing/web-platform/mozilla/tests/webdriver/harness/detect_crash.py

Here an excerpt from the log output:

 0:19.27 INFO STDOUT: WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"CrashManager: submitting Glean crash ping(s)","state":"(none)","filename":"resource://gre/modules/CrashManager.sys.mjs","lineNumber":73,"stack":["resource://gre/modules/CrashManager.sys.mjs:sendGleanPing:73","resource://gre/modules/CrashManager.sys.mjs:_sendCrashPing:829","resource://gre/modules/CrashManager.sys.mjs:addCrash/promise<:536"]}] Barrier: profile-before-change-telemetry
 0:36.00 TEST_END: TIMEOUT, expected OK

Alex, I assume that this is the same problem? Note that for WebDriver we have Telemetry turned off.

Flags: needinfo?(afranchuk)

Yes, that's likely the same issue. We should short-circuit the logic for testing: the CrashManager will invoke the crash reporter client to send the ping, but the crash reporter client will never actually send a ping if tests are being run. So we could detect a testing environment and skip launching the client entirely.

Flags: needinfo?(afranchuk)

Note that in terms of WebDriver we set the following environment variables to control the behavior of the crash reporter:

            .env("MOZ_CRASHREPORTER", "1")
            .env("MOZ_CRASHREPORTER_NO_REPORT", "1")
            .env("MOZ_CRASHREPORTER_SHUTDOWN", "1");

Maybe that's helpful as well.

REO for 150: Is this a user perceived crash (that we'd like to get into 150)?

Flags: needinfo?(afranchuk)

This is user perceived: the crash reporter will come up when Firefox is shutting down. I double-checked crash pings, and the counts there are roughly the same as the report counts, which is nice. So that confirms this is fairly low-volume. I've still been mulling over a solution here, and I'll try to put up a patch soon (with intent to uplift to 150 beta).

Flags: needinfo?(afranchuk)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

We will send them upon the next restart.

Attachment #9562382 - Attachment description: WIP: Bug 2017910 - Prevent crash pings during shutdown r=gsvelto → Bug 2017910 - Prevent crash pings during shutdown r=gsvelto

:afranchuk just a friendly reminder that RC builds next week

Flags: needinfo?(afranchuk)

I have not landed this yet because it will make uplifting bug 2026650 more difficult, but I suppose both need to get into beta, so I can do this one first. Bug 2026650 just backed out for an odd reason, so it might take a bit to figure it out.

Flags: needinfo?(afranchuk)
Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 151 Branch

The patch landed in nightly and beta is affected.
:afranchuk, is this bug important enough to require an uplift?

For more information, please visit BugBot documentation.

Flags: needinfo?(afranchuk)

firefox-beta Uplift Approval Request

  • User impact if declined/Reason for urgency: More crashes at shutdown, and missed crash pings from crashes during shutdown.
  • Code covered by automated testing?: yes
  • Fix verified in Nightly?: yes
  • Needs manual QE testing?: no
  • Steps to reproduce for manual QE testing:
  • Risk associated with taking this patch: low
  • Explanation of risk level: This is logically a somewhat simple change. In the worst case (some weird interaction with the crash store, perhaps), we will miss some pings, but in most cases the logic remains the same as before.
  • String changes made/needed?: No
  • Is Android affected?: no
Attachment #9569287 - Flags: approval-mozilla-beta?

We will send them upon the next restart.

Original Revision: https://phabricator.services.mozilla.com/D291323

Flags: needinfo?(afranchuk)

looks like there are still some crashes in nightly, can you take a look before we uplift this?

Flags: needinfo?(afranchuk)

There is still the possibility of some crashes, because we may be uploading pings and then shutdown occurs. These patches fix the case where we are already in shutdown and then a crash occurs (which causes a ping to be sent, which immediately crashes again). The former is a different issue, indicating that the external ping submission takes too long. Now this is making me wonder whether there should be an AsyncShutdown blocker registered at all: this can just as easily be a fire-and-forget process (we don't need to do anything with the response). A somewhat related precedent: the crash helper is also left to itself to shutdown (we simply drop our communication channel with it).

Flags: needinfo?(afranchuk)
Attachment #9569287 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: