[meta] Crash in [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout]
Categories
(Core :: XPCOM, defect)
People
(Reporter: planetman1125, Assigned: keeler)
References
(Depends on 2 open bugs, Blocks 1 open bug)
Details
(Keywords: meta, topcrash, Whiteboard: [tbird crash])
Attachments
(1 obsolete file)
Crash report: https://crash-stats.mozilla.org/report/index/640dddf9-911a-4fc5-89f9-d47810231128
MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdownThreads. Something is blocking the main-thread.
Top 10 frames of crashing thread:
0 ntdll.dll ZwWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 KERNELBASE.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll mozilla::OffTheBooksCondVar::Wait xpcom/threads/CondVar.h:58
4 xul.dll mozilla::TaskController::GetRunnableForMTTask xpcom/threads/TaskController.cpp:600
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1133
4 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:481
5 xul.dll mozilla::SpinEventLoopUntil xpcom/threads/SpinEventLoopUntil.h:176
5 xul.dll nsThreadPool::ShutdownWithTimeout xpcom/threads/nsThreadPool.cpp:470
Comment 1•1 year ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
For more information, please visit BugBot documentation.
Comment 2•1 year ago
The severity field is not set for this bug.
:nika, could you have a look please?
For more information, please visit BugBot documentation.
Comment 3•1 year ago
Unfortunately this crash is overly generic, and corresponds to a few different causes. These crashes are all due to the main thread hanging while waiting for some nsThreadPool to shut down in the background, with different threadpools being waited on in different cases (though the signature is not walking up enough frames here to capture which threadpools are hanging).
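To make the pattern concrete, here is a minimal C++ sketch of a bounded pool shutdown: the caller waits for in-flight tasks, but only up to a deadline. TinyPool and its methods are hypothetical stand-ins built on the standard library. This is not nsThreadPool, and the real main thread spins its event loop while waiting rather than blocking on a condition variable.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a thread pool (NOT nsThreadPool).
class TinyPool {
 public:
  // Runs `task` on its own thread and tracks it as in flight.
  void Dispatch(std::function<void()> task) {
    assert(task && "task must be callable");
    {
      std::lock_guard<std::mutex> lock(mState->mutex);
      ++mState->active;
    }
    // The thread keeps the shared state alive, so it may safely outlive
    // the TinyPool object if shutdown gives up waiting for it.
    std::thread([state = mState, task = std::move(task)] {
      task();
      std::lock_guard<std::mutex> lock(state->mutex);
      --state->active;
      state->idle.notify_all();
    }).detach();
  }

  // Waits until no task is in flight, or until `timeout` elapses.
  // Returns false if tasks are still running at the deadline.
  bool ShutdownWithTimeout(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mState->mutex);
    return mState->idle.wait_for(lock, timeout,
                                 [this] { return mState->active == 0; });
  }

 private:
  struct State {
    std::mutex mutex;
    std::condition_variable idle;
    int active = 0;
  };
  std::shared_ptr<State> mState = std::make_shared<State>();
};
```

In this sketch a single task that never returns is enough to make ShutdownWithTimeout report failure, which mirrors the "single thread still active" cases described below.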
Scanning through a few of the specific reports, I've noticed that they tend to fall into one of two major categories, though there are obviously some outliers:
StreamTransport shutdown hangs (clientcerts)
- These hangs are occurring during nsStreamTransportService threadpool shutdown, and are generally due to a single STS thread still being active.
- These crashes tend to occur within a backgroundtask process, with the most common task seeming to be defaultagent.
  - Unlike normal Firefox processes, backgroundtask processes tend to live for a much shorter amount of time (just enough to perform a single task), meaning that a long-running operation which would normally complete before the Firefox process enters shutdown may not have completed before quitting.
- Exactly where the crash happens seems to vary, but it often is occurring within LoadLoadableCertsTask::Run(), specifically within the rust osclientcerts module.
  - In some cases, the code is hanging due to an outstanding mpsc recv call, though the osclientcerts thread appears to be present and active.
Examples:
- https://crash-stats.mozilla.org/report/index/5f21807c-f559-40d5-9ef5-dca270231218#allthreads
  - The osclientcerts thread appears to be partway through executing the open_session call
- https://crash-stats.mozilla.org/report/index/e4672058-e671-4f8a-91eb-b530a0231218#allthreads
- https://crash-stats.mozilla.org/report/index/f919c4e9-4d0e-4974-ab05-647de0231218#allthreads
  - Appears to be in C_Initialize instead
- https://crash-stats.mozilla.org/report/index/2797240b-53f3-44fb-a4df-a2ab40231218#allthreads
  - Unclear what the osclientcerts code is doing.
- https://crash-stats.mozilla.org/report/index/e6cf7120-eea5-4541-a9a4-08e9f0231218#allthreads
  - Also in C_Initialize
- https://crash-stats.mozilla.org/report/index/0af04780-0fec-43db-85be-0589e0231218#allthreads
  - In C_GetInfo
- https://crash-stats.mozilla.org/report/index/c191f84a-3538-48b9-b513-ea2ab0231218#allthreads
- https://crash-stats.mozilla.org/report/index/4ae99336-94b2-4549-b569-782b00231218#allthreads
  - No rust on the stack, but is within LoadLoadableCertsTask::Run
  - Unlike the others in this section I've noticed, this was during a backgroundupdate backgroundtask
- https://crash-stats.mozilla.org/report/index/b070b369-3759-4c97-9bf1-5fc6b0231218#allthreads
  - No rust on the stack, but is within LoadLoadableCertsTask::Run
  - Also in a backgroundupdate backgroundtask
Printer-related Background IO Thread Pool shutdown hangs
- These hangs are occurring during the shared BackgroundThreadPool shutdown, and are usually due to one or more threads in the BgIOThreadPool being blocked.
- Unlike the osclientcerts crashes, these appear to be happening in normal Firefox processes, not backgroundtask processes.
- When the XUL caller is visible, it appears to most frequently be nsPrinterListWin::Printers(), though in some cases the stack is full of opaque PrintConfig.dll frames, so we don't have a great backtrace, and it could be some other caller.
  - The dispatches presumably originate from https://searchfox.org/mozilla-central/rev/91cc8848427fdbbeb324e6ca56a0d08d32d3c308/widget/nsPrinterListBase.cpp#61-67
nsPrinterListWin::Printers Examples:
- https://crash-stats.mozilla.org/report/index/ba76613a-cf36-473f-9073-a4afc0231218#allthreads
- https://crash-stats.mozilla.org/report/index/1bfe13ba-7989-4670-a0f7-ee6670231218#allthreads
PrintConfig.dll Examples:
- https://crash-stats.mozilla.org/report/index/3b612e00-75a2-49c4-83aa-372800231218#allthreads
- https://crash-stats.mozilla.org/report/index/405ed665-798b-4569-985d-d860a0231218#allthreads
- https://crash-stats.mozilla.org/report/index/3a261ae3-d640-4186-abeb-084020231218#allthreads
Others
- https://crash-stats.mozilla.org/report/index/38f23fc5-1a15-48f5-8e37-362f20231218#tab-details
  - StreamTransport hang which is not in a backgroundtask and appears to have no connection to osclientcerts - appears to be in OsReauthenticator
- https://crash-stats.mozilla.org/report/index/00debc63-ca71-4a81-96a2-438e20231218#allthreads
  - BackgroundThreadPool hang when trying to pin the app to the taskbar. No thread names, but background thread appears to be Thread 12.
Leaving a ni? for :emilio for the printer hangs and :dkeeler for the osclientcerts hangs.
Comment 4•1 year ago
The print hangs don't seem super-actionable here. It seems like a Windows print API call is taking longer than expected. That call happens on a background thread, but afaict its duration is not under our control, and lots of these operations are not really cancellable / timeout-able, see this for example... :/
Comment 5 (Assignee)•1 year ago
For osclientcerts, I wonder if this could be due to bug 1745925. NSS initialization causes the osclientcerts module (as well as other sources of certificates) to be loaded on a background thread, which is not something we want to do during shutdown. Telemetry indicates this operation can take longer than 1 minute for some users (https://sql.telemetry.mozilla.org/queries/96623#238541), which would be identified as a shutdown hang if that's what's happening.
Comment 6•1 year ago
The severity field is not set for this bug.
:nika, could you have a look please?
For more information, please visit BugBot documentation.
Comment 7•1 year ago
Setting to S3, but I could be convinced the osclientcerts bugs should be higher priority, as I believe they will show the Firefox crash reporter UI to the user while they are actively using the browser due to a background process crashing, which could be a poor user experience.
Comment 8•1 year ago
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #5)
For osclientcerts, I wonder if this could be due to bug 1745925. NSS initialization causes the osclientcerts module (as well as other sources of certificates) to be loaded on a background thread, which is not something we want to do during shutdown. Telemetry indicates this operation can take longer than 1 minute for some users (https://sql.telemetry.mozilla.org/queries/96623#238541), which would be identified as a shutdown hang if that's what's happening.
Avoiding NSS initialization during shutdown might help in this situation. Unfortunately, for very short-lived processes such as the backgroundtask processes (which are the ones crashing here), starting a 1-minute operation even during startup could still lead to a shutdown crash, as the process does not live for a full minute. If it's possible, making these operations interruptible by shutdown, or avoiding starting osclientcerts in backgroundtask processes, might be a more reliable solution.
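As a rough illustration of what "interruptible by shutdown" could mean, the C++ sketch below splits a long operation into steps and checks a flag before each one. ShutdownFlag and RunInterruptible are hypothetical names, not Gecko APIs, and the approach only helps when the work can actually be split up; a single blocking OS or library call cannot be abandoned this way.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical flag that shutdown code would set; not a Gecko API.
using ShutdownFlag = std::atomic<bool>;

// Runs up to `step_count` steps, checking the flag before each one.
// Returns the number of steps that actually ran.
std::size_t RunInterruptible(std::size_t step_count,
                             const std::function<void(std::size_t)>& step,
                             const ShutdownFlag& shutting_down) {
  assert(step && "step must be callable");
  std::size_t completed = 0;
  for (std::size_t i = 0; i < step_count; ++i) {
    if (shutting_down.load(std::memory_order_acquire)) {
      break;  // Shutdown started: abandon the remaining work.
    }
    step(i);
    ++completed;
  }
  return completed;
}
```

The latency of such a check is bounded by the longest single step, so the design only pays off if no individual step can itself block for a long time.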
Comment 9 (Assignee)•1 year ago
Comment 10 (Assignee)•1 year ago
Right now, there's not really a way for osclientcerts to stop loading when shutdown starts, but we can definitely avoid loading it in backgroundtask processes. My one concern with that is if the backgroundtask needs to do network i/o but the connection is via a proxy or something that requires client authentication. Do backgroundtasks tend to rely on the network?
Comment 11•1 year ago
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)
Right now, there's not really a way for osclientcerts to stop loading when shutdown starts, but we can definitely avoid loading it in backgroundtask processes. My one concern with that is if the backgroundtask needs to do network i/o but the connection is via a proxy or something that requires client authentication. Do backgroundtasks tend to rely on the network?
I believe backgroundtasks are sometimes used to interact with the network, yes. The main task which is encountering this issue (defaultagent) is a Windows scheduled background task which collects information about which browser the user has set as their OS default and submits it to telemetry (https://firefox-source-docs.mozilla.org/toolkit/mozapps/defaultagent/default-browser-agent/index.html).
If this is required for networking such as for sending pings like this, perhaps we need to find some other solution? It's unclear to me how we are starting shutdown before osclientcerts has loaded if we need it to send the ping though.
Comment 12 (Assignee)•1 year ago
Yeah, looking at this some more, I don't think osclientcerts is directly the issue here. Loading that library should take almost no time (it doesn't do anything right away).
Bug 1745925 is seeming like a better place to start, again. However, that led to bug 1745043, so maybe we could just start with not dispatching the background task to load loadable certs if we're in shutdown.
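The "don't dispatch if we're in shutdown" idea can be sketched in C++ as a guard at the dispatch site. LoadScheduler and its methods are hypothetical and only illustrate the shape of the check; they are not the actual NSS or XPCOM code paths.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical scheduler; the names are illustrative, not Gecko APIs.
class LoadScheduler {
 public:
  // Marks shutdown as started; later dispatches are refused.
  void BeginShutdown() {
    std::lock_guard<std::mutex> lock(mMutex);
    mShuttingDown = true;
  }

  // Queues `task` unless shutdown has begun. Returns whether it was queued.
  bool MaybeDispatch(std::function<void()> task) {
    assert(task && "task must be callable");
    std::lock_guard<std::mutex> lock(mMutex);
    if (mShuttingDown) {
      return false;  // Never start slow background work during shutdown.
    }
    mQueue.push_back(std::move(task));
    return true;
  }

  // Runs everything queued so far and returns how many tasks ran.
  std::size_t RunQueued() {
    std::vector<std::function<void()>> tasks;
    {
      std::lock_guard<std::mutex> lock(mMutex);
      tasks.swap(mQueue);
    }
    for (auto& task : tasks) {
      task();
    }
    return tasks.size();
  }

 private:
  std::mutex mMutex;
  bool mShuttingDown = false;
  std::vector<std::function<void()>> mQueue;
};
```

Taking the same lock for the flag and the queue means a task cannot slip in between the shutdown check and the enqueue. The guard does nothing for work that was already dispatched before shutdown began, which is the short-lived-process concern raised in comment 8.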
Comment 13•1 year ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 20 desktop browser crashes on release (startup)
For more information, please visit BugBot documentation.
Comment 14 (Assignee)•1 year ago
I recently landed bug 1881117, which might improve things here.
Comment 15•1 year ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Comment 16•1 year ago
This is back in topcrash territory for Firefox 126, 127, and 128.
Comment 17•1 year ago
I just hit this on a Win 2016 Server Standard VM when I had Exchange Admin Center open and was doing some tasks: https://crash-stats.mozilla.org/report/index/bp-f3c40371-a9de-47dd-a1aa-210200240520
Looking closer at Thread 2, could this be a Trend Micro-related issue?
Comment 18•1 year ago
Something landed in Nightly 129 (build 20240621100955) which may have fixed the issue (or moved the crash signature).
Dana, if there is a fix and you can help figure out what it was, I wonder if it might be upliftable to 128 beta.
Comment 19 (Assignee)•1 year ago
Well, depending on what timezone that timestamp is, could it be bug 1895110?
Comment 20•1 year ago
IIUC this signature is the specific case of bug 1505660 when shutting down the BackgroundEventTarget, resulting in the nested pool shutdown signature. I do not want to merge them, but conceptually they are very similar.
Comment 21•1 year ago
Looking today at some signatures:
597431b6-60b0-4965-afd2-f69b10240711 shows the StreamTrans pool being stuck inside RemoveProfileRecursion when calling into Windows for some file removal. The recursion depth suggests that there is quite a lot to remove. This can be either the effect of a profile cleanup during startup bleeding into shutdown or an explicit remove from the profiles UI (the other calls to nsToolkitProfile::Remove seem to not run in the background). This specific instance happens only 3 min after start, which might indicate the former? Maybe ProfileResetCleanup should have an async shutdown blocker on an earlier phase.
5c8ef667-1393-43c5-b387-4ca440240711 is another instance of bug 1904206, apparently, there seem to be many more of these. We should monitor if bug 1900837 improved these, for now it looks quite good.
0ecc4d66-2a5a-489a-9981-776330240710 is an instance of "Printer-related Background IO Thread Pool shutdown hangs" from comment 3.
6dd015ad-8cf8-4cd1-8003-f28ff0240710 shows a StreamTrans thread stuck inside the nsNetworkLinkService in calculateNetworkIdInternal. There is a chance that by the time the runnable executes, the network has gone away for system shutdown and we are waiting for some looong timeout from the system?
b0903153-c176-44bb-b3e5-d7aa60240710 is another instance of "BackgroundThreadPool hang when trying to pin the app to the taskbar." from comment 3.
fd6a08e9-21d7-407d-a544-bc9000240703 is a case where we want to ensureLoggedIn, which is executed async on the StreamTrans pool. I think we could have some IsInOrBeyond checks there to reduce the probability of those.
(tbc)
Comment 22•1 year ago
Comment 23•7 months ago
Got this crash just now: https://crash-stats.mozilla.org/report/index/e9b476a1-44f9-4d64-bde8-c10be0250422#tab-bugzilla
Comment 24•6 months ago
Comment 25•6 months ago
(In reply to Mayank Bansal from comment #23)
Got this crash just now: https://crash-stats.mozilla.org/report/index/e9b476a1-44f9-4d64-bde8-c10be0250422#tab-bugzilla
(In reply to Mayank Bansal from comment #24)
And again: https://crash-stats.mozilla.org/report/index/9496912a-f350-4518-a692-52f5e0250502#tab-bugzilla
Both these instances hang during the background thread pool shutdown while nsPACMan::GetPACFromDHCP wants to run its payload on the background thread.
Bug 1937367 changed something there very recently and added the hop through the background pool while blocking on a monitor until a certain timeout. That sounds like we can see sporadically hangs inside GetOption and we tried to escape them via that timeout. But the blocking event remains active on the pool and blocks then potentially our shutdown.
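The effect described here, where the timeout releases the waiting caller but not the pool thread, can be reproduced with a small C++ sketch. RunWithTimeout and gBusyWorkers are hypothetical names used only for illustration; this does not model nsPACMan, the monitor used in bug 1937367, or the background thread pool itself.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Number of hypothetical workers currently occupied by an operation.
std::atomic<int> gBusyWorkers{0};

// Runs `blocking_op` on a separate thread and waits at most `timeout` for it.
// Returns true if it finished in time. On timeout the caller moves on, but
// the worker stays occupied until `blocking_op` itself returns.
bool RunWithTimeout(std::function<void()> blocking_op,
                    std::chrono::milliseconds timeout) {
  assert(blocking_op && "blocking_op must be callable");
  struct Call {
    std::mutex mutex;
    std::condition_variable finished;
    bool done = false;
  };
  auto call = std::make_shared<Call>();

  ++gBusyWorkers;
  std::thread([call, op = std::move(blocking_op)] {
    op();  // May block for an arbitrary time, e.g. inside an OS call.
    {
      std::lock_guard<std::mutex> lock(call->mutex);
      call->done = true;
    }
    call->finished.notify_all();
    --gBusyWorkers;
  }).detach();

  std::unique_lock<std::mutex> lock(call->mutex);
  return call->finished.wait_for(lock, timeout,
                                 [&call] { return call->done; });
}
```

If RunWithTimeout returns false, gBusyWorkers stays above zero until the operation unblocks on its own, so anything that later waits for all workers to go idle (such as a pool shutdown) is still exposed to the original hang.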
Comment 26•6 months ago
(In reply to Jens Stutte [:jstutte] from comment #25)
Bug 1937367 changed something there very recently and added the hop through the background pool while blocking on a monitor until a certain timeout. That sounds like we can see sporadically hangs inside GetOption and we tried to escape them via that timeout. But the blocking event remains active on the pool and blocks then potentially our shutdown.
Does dispatching the event with NS_DISPATCH_EVENT_MAY_BLOCK avoid the shutdown hang?
GetOption hangs in a windows library call - which is why we dispatched it to a different thread - so there's not much we can do to break out of that call.
For hanging DNS threads we solved this by using nsIThreadPool::shutdownWithTimeout - not sure if that's an option for background threads.
Comment 27•6 months ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #26)
(In reply to Jens Stutte [:jstutte] from comment #25)
Bug 1937367 changed something there very recently and added the hop through the background pool while blocking on a monitor until a certain timeout. That sounds like we can see sporadically hangs inside GetOption and we tried to escape them via that timeout. But the blocking event remains active on the pool and blocks then potentially our shutdown.
Does dispatching the event with NS_DISPATCH_EVENT_MAY_BLOCK avoid the shutdown hang?
GetOption hangs in a windows library call - which is why we dispatched it to a different thread - so there's not much we can do to break out of that call.
It just moves the problem to a different thread pool (which has a few more threads).
For hanging DNS threads we solved this by using nsIThreadPool::shutdownWithTimeout - not sure if that's an option for background threads.
Well, I think the major risk here is not the shutdown hang but to have filled up the pool with enough blocking events to block all threads during normal operations?
Edit: That might be what bug 1964030 experiences.