[meta] Crash in [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout]
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: planetman1125, Assigned: keeler)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: meta, topcrash, Whiteboard: [tbird crash])
Crash Data
Attachments
(1 obsolete file)
Crash report: https://crash-stats.mozilla.org/report/index/640dddf9-911a-4fc5-89f9-d47810231128
MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdownThreads. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 ntdll.dll ZwWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 KERNELBASE.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::OffTheBooksCondVar::Wait xpcom/threads/CondVar.h:58
4 xul.dll mozilla::TaskController::GetRunnableForMTTask xpcom/threads/TaskController.cpp:600
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1133
4 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:481
5 xul.dll mozilla::SpinEventLoopUntil xpcom/threads/SpinEventLoopUntil.h:176
5 xul.dll nsThreadPool::ShutdownWithTimeout xpcom/threads/nsThreadPool.cpp:470
Comment 1•1 year ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
For more information, please visit BugBot documentation.
Comment 2•1 year ago
The severity field is not set for this bug.
:nika, could you have a look please?
For more information, please visit BugBot documentation.
Comment 3•1 year ago
Unfortunately this crash is overly generic, and corresponds to a few different causes. These crashes are all due to the main thread hanging while waiting for some nsThreadPool to shut down in the background, with different threadpools being waited on in different cases (though the signature is not walking up enough frames here to capture which threadpools are hanging).
Scanning through a few of the specific reports, I've noticed that they tend to fall into one of two major categories, though there are obviously some outliers:
StreamTransport shutdown hangs (clientcerts)
- These hangs are occurring during nsStreamTransportService threadpool shutdown, and are generally due to a single STS thread still being active.
- These crashes tend to occur within a backgroundtask process, with the most common task seeming to be defaultagent.
  - Unlike normal Firefox processes, backgroundtask processes tend to live for a much shorter amount of time (just enough to perform a single task), meaning that a long-running operation which would normally complete before the Firefox process enters shutdown may not have completed before quitting.
- Exactly where the crash happens seems to vary, but it often is occurring within LoadLoadableCertsTask::Run(), specifically within the rust osclientcerts module.
  - In some cases, the code is hanging due to an outstanding mpsc recv call, though the osclientcerts thread appears to be present and active.
Examples:
- https://crash-stats.mozilla.org/report/index/5f21807c-f559-40d5-9ef5-dca270231218#allthreads
  - The osclientcerts thread appears to be partway through executing the open_session call
- https://crash-stats.mozilla.org/report/index/e4672058-e671-4f8a-91eb-b530a0231218#allthreads
- https://crash-stats.mozilla.org/report/index/f919c4e9-4d0e-4974-ab05-647de0231218#allthreads
  - Appears to be in C_Initialize instead
- https://crash-stats.mozilla.org/report/index/2797240b-53f3-44fb-a4df-a2ab40231218#allthreads
  - Unclear what the osclientcerts code is doing.
- https://crash-stats.mozilla.org/report/index/e6cf7120-eea5-4541-a9a4-08e9f0231218#allthreads
  - Also in C_Initialize
- https://crash-stats.mozilla.org/report/index/0af04780-0fec-43db-85be-0589e0231218#allthreads
  - In C_GetInfo
- https://crash-stats.mozilla.org/report/index/c191f84a-3538-48b9-b513-ea2ab0231218#allthreads
- https://crash-stats.mozilla.org/report/index/4ae99336-94b2-4549-b569-782b00231218#allthreads
  - No rust on the stack, but is within LoadLoadableCertsTask::Run
  - Unlike the others in this section I've noticed, this was during a backgroundupdate backgroundtask
- https://crash-stats.mozilla.org/report/index/b070b369-3759-4c97-9bf1-5fc6b0231218#allthreads
  - No rust on the stack, but is within LoadLoadableCertsTask::Run
  - Also in a backgroundupdate backgroundtask
Printer-related Background IO Thread Pool shutdown hangs
- These hangs are occurring during the shared BackgroundThreadPool shutdown, and are usually due to one or more threads in the BgIOThreadPool being blocked.
- Unlike the osclientcerts crashes, these appear to be happening in normal Firefox processes, not backgroundtask processes.
- When the XUL caller is visible, it appears to most frequently be nsPrinterListWin::Printers(), though in some cases the stack is full of opaque PrintConfig.dll frames, so we don't have a great backtrace, and it could be some other caller.
  - The dispatches presumably originate from https://searchfox.org/mozilla-central/rev/91cc8848427fdbbeb324e6ca56a0d08d32d3c308/widget/nsPrinterListBase.cpp#61-67
nsPrinterListWin::Printers Examples:
- https://crash-stats.mozilla.org/report/index/ba76613a-cf36-473f-9073-a4afc0231218#allthreads
- https://crash-stats.mozilla.org/report/index/1bfe13ba-7989-4670-a0f7-ee6670231218#allthreads
PrintConfig.dll Examples:
- https://crash-stats.mozilla.org/report/index/3b612e00-75a2-49c4-83aa-372800231218#allthreads
- https://crash-stats.mozilla.org/report/index/405ed665-798b-4569-985d-d860a0231218#allthreads
- https://crash-stats.mozilla.org/report/index/3a261ae3-d640-4186-abeb-084020231218#allthreads
Others
- https://crash-stats.mozilla.org/report/index/38f23fc5-1a15-48f5-8e37-362f20231218#tab-details
  - StreamTransport hang which is not in a backgroundtask and appears to have no connection to osclientcerts - appears to be in OsReauthenticator
- https://crash-stats.mozilla.org/report/index/00debc63-ca71-4a81-96a2-438e20231218#allthreads
  - BackgroundThreadPool hang when trying to pin the app to the taskbar. No thread names, but background thread appears to be Thread 12.
Leaving a ni? for :emilio for the printer hangs and :dkeeler for the osclientcerts hangs.
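The hang shape described in this comment (a main thread waiting, with a deadline, for a background worker that is stuck in some slow call) can be sketched roughly as follows. This is only an illustration in plain Rust std, not how nsThreadPool::ShutdownWithTimeout is implemented; the function name finishes_before_deadline is hypothetical and a single spawned thread stands in for a whole pool.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for "shut down a pool, but only wait so long".
/// Runs `job` on a background thread and waits at most `timeout` for it
/// to report completion. Returns true if the job finished in time and
/// false if the wait expired (the case a shutdown watchdog would flag).
fn finishes_before_deadline<F>(job: F, timeout: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        job();
        // The receiver may already have given up; ignore a failed send.
        let _ = done_tx.send(());
    });
    // Blocks the calling thread until completion or until the deadline.
    done_rx.recv_timeout(timeout).is_ok()
}
```

A job that returns immediately yields true, while a job blocked in a long OS or driver call (approximated in a test by a sleep longer than the timeout) yields false. The worker is not cancelled in the second case; it simply keeps running, which matches the reports where one thread is still active when shutdown gives up.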
Comment 4•1 year ago
The print hangs don't seem super-actionable here. It seems like a Windows print API call is taking longer than expected. That call happens in a background thread, but afaict it's not under our control, and lots of these operations are not really cancellable / timeout-able, see this for example... :/
Assignee
Comment 5•1 year ago
For osclientcerts, I wonder if this could be due to bug 1745925. NSS initialization causes the osclientcerts module (as well as other sources of certificates) to be loaded on a background thread, which is not something we want to do during shutdown. Telemetry indicates this operation can take longer than 1 minute for some users (https://sql.telemetry.mozilla.org/queries/96623#238541), which would be identified as a shutdown hang if that's what's happening.
Comment 6•11 months ago
The severity field is not set for this bug.
:nika, could you have a look please?
For more information, please visit BugBot documentation.
Comment 7•11 months ago
Setting to S3, but I could be convinced the osclientcerts bugs should be higher priority, as I believe they will show the Firefox crash reporter UI to the user while they are actively using the browser due to a background process crashing, which could be a poor user experience.
Comment 8•11 months ago
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #5)
> For osclientcerts, I wonder if this could be due to bug 1745925. NSS initialization causes the osclientcerts module (as well as other sources of certificates) to be loaded on a background thread, which is not something we want to do during shutdown. Telemetry indicates this operation can take longer than 1 minute for some users (https://sql.telemetry.mozilla.org/queries/96623#238541), which would be identified as a shutdown hang if that's what's happening.
Avoiding NSS initialization during shutdown might help in this situation. Unfortunately, for very short-lived processes such as the backgroundtask processes (which are the ones crashing here), starting a 1-minute operation even during startup could still lead to a shutdown crash, as the process does not live for a full minute. If it's possible, making these operations interruptible by shutdown, or avoiding starting osclientcerts in backgroundtask processes, might be a more reliable solution.
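As a rough illustration of what "interruptible by shutdown" could mean for the outstanding mpsc recv case from comment 3, here is a minimal sketch. It assumes a shared shutdown flag and a polling interval, both of which are hypothetical; recv_unless_shutdown is not an existing osclientcerts function.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Waits for a value on `rx`, re-checking `shutting_down` every `poll`.
/// Returns None if shutdown was requested or the sender was dropped,
/// so the caller is never parked in an unbounded recv().
fn recv_unless_shutdown<T>(
    rx: &Receiver<T>,
    shutting_down: &AtomicBool,
    poll: Duration,
) -> Option<T> {
    loop {
        if shutting_down.load(Ordering::Acquire) {
            return None;
        }
        match rx.recv_timeout(poll) {
            Ok(value) => return Some(value),
            // Nothing arrived yet: loop around and check the flag again.
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return None,
        }
    }
}
```

This only bounds the wait on the receiving side. It would not help if the thread on the other end of the channel is itself blocked inside an OS call such as the C_Initialize cases listed above.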
Assignee
Comment 9•11 months ago
Assignee
Comment 10•11 months ago
Right now, there's not really a way for osclientcerts to stop loading when shutdown starts, but we can definitely avoid loading it in backgroundtask processes. My one concern with that is if the backgroundtask needs to do network i/o but the connection is via a proxy or something that requires client authentication. Do backgroundtasks tend to rely on the network?
Comment 11•11 months ago
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)
> Right now, there's not really a way for osclientcerts to stop loading when shutdown starts, but we can definitely avoid loading it in backgroundtask processes. My one concern with that is if the backgroundtask needs to do network i/o but the connection is via a proxy or something that requires client authentication. Do backgroundtasks tend to rely on the network?
I believe backgroundtasks are sometimes used to interact with the network, yes. The main task which is encountering this issue (defaultagent) is a Windows background scheduled task collecting information and submitting it to telemetry about what browser the user has set as their OS default (https://firefox-source-docs.mozilla.org/toolkit/mozapps/defaultagent/default-browser-agent/index.html).
If this is required for networking such as for sending pings like this, perhaps we need to find some other solution? It's unclear to me how we are starting shutdown before osclientcerts has loaded if we need it to send the ping though.
Assignee
Comment 12•11 months ago
Yeah, looking at this some more, I don't think osclientcerts is directly the issue here. Loading that library should take almost no time (it doesn't do anything right away).
Bug 1745925 is seeming like a better place to start, again. However, that led to bug 1745043, so maybe we could just start with not dispatching the background task to load loadable certs if we're in shutdown.
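The "don't dispatch if we're in shutdown" idea can be sketched as a guard in front of the dispatch. This is a simplified illustration with a hypothetical shutdown flag and a plain thread spawn standing in for a dispatch to a background pool; it is not the actual patch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread::{self, JoinHandle};

/// Starts `task` on a background thread unless shutdown has begun.
/// Returns None when the task was skipped, so the caller can tell that
/// the work was never started.
fn dispatch_unless_shutting_down<F>(
    shutting_down: &AtomicBool,
    task: F,
) -> Option<JoinHandle<()>>
where
    F: FnOnce() + Send + 'static,
{
    if shutting_down.load(Ordering::Acquire) {
        // Shutdown is already underway: do not start new slow work.
        return None;
    }
    Some(thread::spawn(task))
}
```

Note that a check like this narrows the window but does not close it: shutdown can still begin right after the flag is read, and a task dispatched shortly before shutdown (as in the short-lived backgroundtask case) is unaffected.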
Comment 13•10 months ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 20 desktop browser crashes on release (startup)
For more information, please visit BugBot documentation.
Assignee
Comment 14•9 months ago
I recently landed bug 1881117, which might improve things here.
Comment 15•9 months ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 16•7 months ago
This is back in topcrash territory for Firefox 126, 127, and 128.
Comment 17•7 months ago
I just hit this on a Win 2016 Server Standard VM when I had Exchange Admin Center open and was doing some tasks: https://crash-stats.mozilla.org/report/index/bp-f3c40371-a9de-47dd-a1aa-210200240520
Looking closer at Thread 2, could this be a Trend Micro-related issue?
Comment 18•5 months ago
Something landed in Nightly 129 (build 20240621100955) which may have fixed the issue (or moved the crash signature).
Dana, if there is a fix and you can help figure out what it was, I wonder if it might be upliftable to 128 beta.
Assignee
Comment 19•5 months ago
Well, depending on what timezone that timestamp is, could it be bug 1895110?
Comment 20•5 months ago
IIUC this signature is the specific case of bug 1505660 when shutting down the BackgroundEventTarget, resulting in the nested pool shutdown signature. I do not want to merge them, but conceptually they are very similar.
Comment 21•5 months ago
Looking today at some signatures:
597431b6-60b0-4965-afd2-f69b10240711 shows the StreamTrans pool being stuck inside RemoveProfileRecursion when calling into Windows for some file removal. The recursion depth suggests that there is quite a lot to remove. This can be either the effect of a profile cleanup during start bleeding into shutdown or an explicit remove from the profiles UI (the other calls to nsToolkitProfile::Remove seem to not run in the background). This specific instance happens only 3min after start, which might point to the first case? Maybe ProfileResetCleanup should have an async shutdown blocker on an earlier phase.
5c8ef667-1393-43c5-b387-4ca440240711 is another instance of bug 1904206, apparently, there seem to be many more of these. We should monitor if bug 1900837 improved these, for now it looks quite good.
0ecc4d66-2a5a-489a-9981-776330240710 is an instance of "Printer-related Background IO Thread Pool shutdown hangs" from comment 3.
6dd015ad-8cf8-4cd1-8003-f28ff0240710 a StreamTrans thread is stuck inside the nsNetworkLinkService in calculateNetworkIdInternal. There could be a chance that by the time that the runnable executes the network has gone away for system shutdown and we are waiting for some looong timeout from the system?
b0903153-c176-44bb-b3e5-d7aa60240710 is another instance of "BackgroundThreadPool hang when trying to pin the app to the taskbar." from comment 3.
fd6a08e9-21d7-407d-a544-bc9000240703 is a case where we want to ensureLoggedIn which is executed async on the StreamTrans pool. I think we could have some IsInOrBeyond checks there to reduce the probability for those.
(tbc)
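To make the IsInOrBeyond idea concrete, here is a minimal sketch of an ordered shutdown-phase tracker. The phase names and the PhaseTracker type are hypothetical and simplified; they only model the idea of comparing the current phase against a threshold and are not Gecko's actual shutdown phases or API.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical, simplified shutdown phases in the order they are reached.
#[derive(Clone, Copy)]
#[repr(u8)]
enum Phase {
    Running = 0,
    ShutdownRequested = 1,
    ThreadsShuttingDown = 2,
    Final = 3,
}

/// Tracks the furthest phase reached so far; phases never move backwards.
struct PhaseTracker {
    current: AtomicU8,
}

impl PhaseTracker {
    fn new() -> Self {
        PhaseTracker { current: AtomicU8::new(Phase::Running as u8) }
    }

    /// Moves forward to `phase`; a request for an earlier phase is a no-op.
    fn advance_to(&self, phase: Phase) {
        self.current.fetch_max(phase as u8, Ordering::AcqRel);
    }

    /// True once the tracker has reached `phase` or any later phase.
    fn is_in_or_beyond(&self, phase: Phase) -> bool {
        self.current.load(Ordering::Acquire) >= phase as u8
    }
}
```

A runnable that is about to start slow work would check something like tracker.is_in_or_beyond(Phase::ShutdownRequested) first and bail out early if it returns true. As noted in the comment, this reduces the probability of a hang rather than eliminating it, since shutdown can still start after the check.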
Comment 22•3 months ago