Shutdownhang in backgroundupdate task loading JS modules
Categories
(Toolkit :: Background Tasks, defect)
Tracking
()
People
(Reporter: jesup, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: crash)
Crash Data
We're getting a lot of shutdownhangs with nsZipArchive sitting in memcmp, comparing names.
The implication is that we're (i-?) looping in ::GetItem(), comparing names, and never returning (or never within the time shutdown will wait - 30s?)
This implies a loop when iterating item->next.
While we've had crashes there for a long time, it appears that the frequency increased considerably around late September
Crash report: https://crash-stats.mozilla.org/report/index/1e831b3f-a087-4b6a-986d-ecfbc0221103
MOZ_CRASH Reason: Shutdown hanging at step AppShutdownConfirmed. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 VCRUNTIME140.dll memcmp
1 xul.dll nsZipArchive::GetItem modules/libjar/nsZipArchive.cpp:428
2 xul.dll nsJAR::GetInputStreamWithSpec modules/libjar/nsJAR.cpp:312
3 xul.dll nsJAR::GetInputStream modules/libjar/nsJAR.cpp:291
4 xul.dll nsJARInputThunk::Init modules/libjar/nsJARChannel.cpp:125
5 xul.dll nsJARChannel::CreateJarInput modules/libjar/nsJARChannel.cpp:287
6 xul.dll nsJARChannel::Open modules/libjar/nsJARChannel.cpp:1065
7 xul.dll mozJSSubScriptLoader::ReadStencil js/xpconnect/loader/mozJSSubScriptLoader.cpp:227
7 xul.dll mozJSSubScriptLoader::DoLoadSubScriptWithOptions js/xpconnect/loader/mozJSSubScriptLoader.cpp:455
8 xul.dll mozJSSubScriptLoader::LoadSubScript js/xpconnect/loader/mozJSSubScriptLoader.cpp:297
Reporter | ||
Comment 1•2 years ago
|
||
Also: all the crashes in the last 6 months are windows
Reporter | ||
Comment 2•2 years ago
|
||
The rise seems to be correlated with 105+
Reporter | ||
Comment 3•2 years ago
|
||
This hang also spiked at the same time: [@ shutdownhang | xtolong ] in nsZipArchive, also only on windows.
Reporter | ||
Comment 4•2 years ago
|
||
The xtolong crashes all appear to be in initing ziparchives (GetDataOffset())
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Comment 5•2 years ago
•
|
||
We took a look at a crash report: https://crash-stats.mozilla.org/report/index/1e831b3f-a087-4b6a-986d-ecfbc0221103#tab-rawdump
And found some weird values for aFileName:
- aFilename <Error reading characters of string.> const nsTSubstring<char> &
- Length {mLength=0x5b5f5e6c } mozilla::detail::nsTStringLengthStorage<char>
mLength 0x5b5f5e6c unsigned int
Flags 0xc25d mozilla::detail::StringDataFlags
- [Raw View] {...} const nsTSubstring<char>
Reporter | ||
Comment 6•2 years ago
•
|
||
There are spikes in the incidence rate on the day of or day after each release (Apr 12, Mar 15, Feb 15, etc), of 3-5x. This implies it's related to updates
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Comment 7•2 years ago
|
||
All, or virtually all, are Windows crashes. All appear to have BackgroundTaskName = backgroundupdate. about 1/2 are in AppShutdownConfirmed, the others are in various other phases.
The fact that they're all backgroundupdate (and all windows) is very interesting....
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Comment 8•2 years ago
|
||
Nick - There is a large set of shutdown hangs for the backgroundupdate task, all (or virtually all) on Windows. The ones on this bug all seem to be loading JS modules, but that's what I was searching for since that's what this bug was originally looking at.
A wider search of backgroundupdate crashes shows a huge number of shutdown hangs:
https://crash-stats.mozilla.org/search/?background_task_name=%3Dbackgroundupdate&product=Firefox&date=%3E%3D2023-05-02T00%3A46%3A00.000Z&date=%3C2023-05-09T00%3A46%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Something like 36-40K crashes in the last week alone (this signatures on this bug (with ModuleLoader/nsJAR on the stack) are only around 500 of those).
I wonder if the problems reproing them (such as bug 1768344) were because it seems it most often crashes when there's a release (see the graph above).
Comment 9•2 years ago
•
|
||
Interestingly we made a quite good job in avoiding AsyncShutdownTimeout for background tasks, see for example bug 1820517, bug 1814104, bug 1799421. The remaining hangs are probably "things that start and were never even designed to interact with shutdown" such that the terminator catches them.
I think we have several paths here (and might want all):
- Mitigate the impact by having an (arbitrary) timeout at the end of the background task before starting the shutdown sequence of the process. This is an easy hack but obviously not really nice. And we should check the
Shutdown Reason
annotation if we see many cases of OS shutdown (my gut feeling is: no, but it is not yet aggregatable, so we should take a closer look) - delaying process end could delay OS shutdown. But I assume thebackgroundupdate
is something we run anytime during normal operations, IIUC. - Redesign how startup actually works, making sure that we have async shutdown blockers for every piece that starts up.
- Have a closer look at what things are actually needed to startup for a background task. My current assumption is that many things are not needed at all, but there might be hidden dependencies hard to solve. Especially on Windows I'd check if we start GPU processes or similar, or accessibility features or other heavy COM objcects. (Disclaimer: I have no idea if any of these examples really happens.)
Given that 2. and 3. are desirable but hard, I'd start from 1.
Edit regarding 1.: I assume here in runBackgroundTaskNamed
could be a good place to add a waiting promise.
Reporter | ||
Updated•2 years ago
|
Comment 10•2 years ago
|
||
I am not sure it is fair to move this to background tasks. The JS module loading could be more robust in blocking shutdown correctly, as well. The bug I made this depend on is a mitigation only.
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #8)
Nick - There is a large set of shutdown hangs for the backgroundupdate task, all (or virtually all) on Windows.
backgroundupdate
is only scheduled on Windows, so this is expected.
The ones on this bug all seem to be loading JS modules, but that's what I was searching for since that's what this bug was originally looking at.
A wider search of backgroundupdate crashes shows a huge number of shutdown hangs:
https://crash-stats.mozilla.org/search/?background_task_name=%3Dbackgroundupdate&product=Firefox&date=%3E%3D2023-05-02T00%3A46%3A00.000Z&date=%3C2023-05-09T00%3A46%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signatureSomething like 36-40K crashes in the last week alone (this signatures on this bug (with ModuleLoader/nsJAR on the stack) are only around 500 of those).
I wonder if the problems reproing them (such as bug 1768344) were because it seems it most often crashes when there's a release (see the graph above).
I think the spikes with releases is almost certainly that when there are actually updates to download and apply, more stuff happens in the background update task.
Everything here looks like the class of issue that Jens details in https://bugzilla.mozilla.org/show_bug.cgi?id=1798904#c9: large swathes of Gecko (and Firefox) don't handle shutdown cleanly. The background update task loads a subset of Firefox code and generally exits very quickly: we're just not robust in that situation.
I will counter the weekly stats a little bit: 40k crashes sounds high but the background update task talks to ~100M distinct background update clients a week (and many clients run the task many times, that number is just one I have in my head). So I'd certainly like to see progress on this because these issues can bite browsing Firefox users too, but I don't think it's very high priority.
(In reply to Jens Stutte [:jstutte] from comment #9)
Interestingly we made a quite good job in avoiding AsyncShutdownTimeout for background tasks, see for example bug 1820517, bug 1814104, bug 1799421. The remaining hangs are probably "things that start and were never even designed to interact with shutdown" such that the terminator catches them.
I think we have several paths here (and might want all):
- Mitigate the impact by having an (arbitrary) timeout at the end of the background task before starting the shutdown sequence of the process. This is an easy hack but obviously not really nice. And we should check the
Shutdown Reason
annotation if we see many cases of OS shutdown (my gut feeling is: no, but it is not yet aggregatable, so we should take a closer look) - delaying process end could delay OS shutdown. But I assume thebackgroundupdate
is something we run anytime during normal operations, IIUC.
Edit regarding 1.: I assume here inrunBackgroundTaskNamed
could be a good place to add a waiting promise.
backgroundupdate
has an arbitrary short wait as it tries to let Glean uploads proceed: https://searchfox.org/mozilla-central/rev/373d05f4eabdb90a6480d5d36d983b8bff36c9d8/toolkit/mozapps/update/BackgroundTask_backgroundupdate.sys.mjs#414-416. Glean has improved greatly so this is no longer neccessary.
I'd be happy to lift this delay to a per-task-configured delay (like the per-task-configured timeout) in the tasks manager, and to extend it, if it might help. But I doubt that it would help in a non-linear fashion: we have racing startup and shutdown and racers gonna race. I'd like to see more labeling of activities at startup and shutdown to help us understand ones that are failing in the wild.
Reporter | ||
Updated•2 years ago
|
Comment 13•2 years ago
|
||
(moved the discussion from comment 12 to bug 1832089)
Updated•2 years ago
|
Description
•