Open Bug 1866944 Opened 1 year ago Updated 3 months ago

[meta] Crash in [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout]

Categories

(Core :: XPCOM, defect)

Firefox 120
Desktop
Windows 11
defect

Tracking

()

ASSIGNED
Tracking Status
firefox126 --- affected
firefox127 --- affected
firefox128 --- affected

People

(Reporter: planetman1125, Assigned: keeler)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: meta, topcrash, Whiteboard: [tbird crash])

Crash Data

Attachments

(1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/640dddf9-911a-4fc5-89f9-d47810231128

MOZ_CRASH Reason: Shutdown hanging at step XPCOMShutdownThreads. Something is blocking the main-thread.

Top 10 frames of crashing thread:

0  ntdll.dll  ZwWaitForAlertByThreadId  
1  ntdll.dll  RtlSleepConditionVariableSRW  
2  KERNELBASE.dll  SleepConditionVariableSRW  
3  mozglue.dll  mozilla::detail::ConditionVariableImpl::wait  mozglue/misc/ConditionVariable_windows.cpp:50
4  xul.dll  mozilla::OffTheBooksCondVar::Wait  xpcom/threads/CondVar.h:58
4  xul.dll  mozilla::TaskController::GetRunnableForMTTask  xpcom/threads/TaskController.cpp:600
4  xul.dll  nsThread::ProcessNextEvent  xpcom/threads/nsThread.cpp:1133
4  xul.dll  NS_ProcessNextEvent  xpcom/threads/nsThreadUtils.cpp:481
5  xul.dll  mozilla::SpinEventLoopUntil  xpcom/threads/SpinEventLoopUntil.h:176
5  xul.dll  nsThreadPool::ShutdownWithTimeout  xpcom/threads/nsThreadPool.cpp:470
Component: General → Untriaged

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release
  • Top 20 desktop browser crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash
Component: Untriaged → XPCOM
Product: Firefox → Core

The severity field is not set for this bug.
:nika, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(nika)

Unfortunately this crash is overly generic, and corresponds to a few different causes. These crashes are all due to the main thread hanging while waiting for some nsThreadPool to shut down in the background, with different threadpools being waited on in different cases (though the signature is not walking up enough frames here to capture which threadpools are hanging).

Scanning through a few of the specific reports, I've noticed that they tend to fall into one of two major categories, though there are obviously some outliers:

StreamTransport shutdown hangs (clientcerts)

  • These hangs are occurring during nsStreamTransportService threadpool shutdown, and are generally due to a single STS thread still being active.
  • These crashes tend to occur within a backgroundtask process, with the most common task seeming to be defaultagent.
    • Unlike normal Firefox processes, backgroundtask processes tend to live for a much shorter amount of time (just enough to perform a single task), meaning that a long-running operation which would normally complete before the Firefox process enters shutdown may not have completed before quitting.
  • Exactly where the crash happens seems to vary, but it often is occurring within LoadLoadableCertsTask::Run(), specifically within the rust osclientcerts module.
    • In some cases, the code is hanging due to an outstanding mpsc recv call, though the osclientcerts thread appears to be present and active.

Examples:

Printer-related Background IO Thread Pool shutdown hangs

nsPrinterListWin::Printers Examples:

PrintConfig.dll Examples:

Others


Leaving a ni? for :emilio for the printer hangs and :dkeeler for the osclientcerts hangs.

Flags: needinfo?(nika)
Flags: needinfo?(emilio)
Flags: needinfo?(dkeeler)

The print hangs don't seem super-actionable here, it seems like a windows print API call is taking longer than expected, which happens in a background thread, but that's afaict not under our control, and lots of these operations are not really cancellable / timeout-able, see this for example... :/

Flags: needinfo?(emilio)

For osclientcerts, I wonder if this could be due to bug 1745925. NSS initialization causes the osclientcerts module (as well as other sources of certificates) to be loaded on a background thread, which is not something we want to do during shutdown. Telemetry indicates this operation can take longer than 1 minute for some users (https://sql.telemetry.mozilla.org/queries/96623#238541), which would be identified as a shutdown hang if that's what's happening.

Flags: needinfo?(dkeeler)

The severity field is not set for this bug.
:nika, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(nika)

Setting to S3, but I could be convinced the osclientcerts bugs should be higher priority, as I believe they will show the Firefox crash reporter UI to the user while they are actively using the browser due to a background process crashing, which could be a poor user experience.

Severity: -- → S3
Flags: needinfo?(nika)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #5)

For osclientcerts, I wonder if this could be due to bug 1745925. NSS initialization causes the osclientcerts module (as well as other sources of certificates) to be loaded on a background thread, which is not something we want to do during shutdown. Telemetry indicates this operation can take longer than 1 minute for some users (https://sql.telemetry.mozilla.org/queries/96623#238541), which would be identified as a shutdown hang if that's what's happening.

Avoiding NSS initialization during shutdown might help in this situation. Unfortunately, for very short-lived processes such as the backgroundtask processes (which are the ones crashing here), starting a 1-minute operation even during startup could still lead to a shutdown crash, as the process does not live for a full minute. If it's possible, doing something like making these operations interruptable by shutdown or avoiding starting osclientcerts in backgroundtask processes might be a more reliable solution if it's possible.

Flags: needinfo?(dkeeler)
Assignee: nobody → dkeeler
Status: NEW → ASSIGNED

Right now, there's not really a way for osclientcerts to stop loading when shutdown starts, but we can definitely avoid loading it in backgroundtask processes. My one concern with that is if the backgroundtask needs to do network i/o but the connection is via a proxy or something that requires client authentication. Do backgroundtasks tend to rely on the network?

Flags: needinfo?(dkeeler)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

Right now, there's not really a way for osclientcerts to stop loading when shutdown starts, but we can definitely avoid loading it in backgroundtask processes. My one concern with that is if the backgroundtask needs to do network i/o but the connection is via a proxy or something that requires client authentication. Do backgroundtasks tend to rely on the network?

I believe backgroundtasks are sometimes used to interact with the network, yes. The main task which is encountering this issue (defaultagent) is a windows background scheduled task collecting information and submitting it to telemetry about what browser the user has set as their OS default (https://firefox-source-docs.mozilla.org/toolkit/mozapps/defaultagent/default-browser-agent/index.html).

If this is required for networking such as for sending pings like this, perhaps we need to find some other solution? It's unclear to me how we are starting shutdown before osclientcerts has loaded if we need it to send the ping though.

Flags: needinfo?(dkeeler)

Yeah, looking at this some more, I don't think osclientcerts is directly the issue here. Loading that library should take almost no time (it doesn't do anything right away).

Bug 1745925 is seeming like a better place to start, again. However, that led to bug 1745043, so maybe we could just start with not dispatching the background task to load loadable certs if we're in shutdown.

Flags: needinfo?(dkeeler)
Attachment #9370777 - Attachment is obsolete: true

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release (startup)

For more information, please visit BugBot documentation.

I recently landed bug 1881117, which might improve things here.

See Also: → 1881117

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Whiteboard: [tbird crash]

This is back in topcrash territory for Firefox 126, 127, and 128.

I just hit this on a Win 2016 Server Standard VM when I had Exchange Admin Center open and was doing some tasks: https://crash-stats.mozilla.org/report/index/bp-f3c40371-a9de-47dd-a1aa-210200240520

Looking closer at Thread 2, could this be a Trend Micro-related issue?

See Also: → 1895110

Something landed in Nightly 129 (build 20240621100955) which may have fixed the issue (or moved the crash signature)
dana, if there is a fix and you can help figure out what it was, I wonder if it might be upliftable to 128 beta.

Flags: needinfo?(dkeeler)

Well, depending on what timezone that timestamp is, could it be bug 1895110?

Flags: needinfo?(dkeeler)
Crash Signature: [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout] → [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | nsThreadPool::ShutdownWithTimeout]
See Also: → 1904206

IIUC this signature is the specific case of bug 1505660 when shutting down the BackgroundEventTarget, resulting in the nested pool shutdown signature. I do not want to merge them, but conceptually they are very similar.

Blocks: 1505660

Looking today at some signatures:

597431b6-60b0-4965-afd2-f69b10240711 shows the StreamTrans pool being stuck inside RemoveProfileRecursion when calling into Windows for some file removal. The recursion depth suggest that there is quite something to remove. This can be either the effect of a profile cleanup during start bleeding into shutdown or an explicit remove from the profiles UI (the other calls to nsToolkitProfile::Remove seem to not run in the background). This specific instance happens only 3min after start, which might indicate the first? Maybe ProfileResetCleanup should have an async shutdown blocker on an earlier phase.

5c8ef667-1393-43c5-b387-4ca440240711 is another instance of bug 1904206, apparently, there seem to be many more of these. We should monitor if bug 1900837 improved these, for now it looks quite good.

0ecc4d66-2a5a-489a-9981-776330240710 is an instance of "Printer-related Background IO Thread Pool shutdown hangs" from comment 3.

6dd015ad-8cf8-4cd1-8003-f28ff0240710 a StreamTrans thread is stuck inside the nsNetworkLinkService in calculateNetworkIdInternal. There could be a chance that by the time that the runnable executes the network has gone away for system shutdown and we are waiting for some looong timeout from the system?

b0903153-c176-44bb-b3e5-d7aa60240710 is another instance of "BackgroundThreadPool hang when trying to pin the app to the taskbar." from comment 3.

fd6a08e9-21d7-407d-a544-bc9000240703 is a case where we want to ensureLoggedIn which is executed async on the StreamTrans pool. I think we could have some IsInOrBeyondchecks there to reduce the probability for those.

(tbc)

Keywords: meta
Summary: Crash in [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout] → [meta] Crash in [@ shutdownhang | mozilla::SpinEventLoopUntil | nsThreadPool::ShutdownWithTimeout]
Depends on: 1907539
Depends on: 1907804
See Also: → 1581079
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: