Open Bug 1633342 Opened 4 years ago Updated 8 days ago

[meta] Crash in [mozilla::net::nsHttpConnectionMgr::Shutdown] and other net related places. Shutdown hang.

Categories

(Core :: Networking, defect, P3)

Tracking

()

Tracking Status
firefox78 --- wontfix
firefox79 --- wontfix
firefox80 - wontfix
firefox113 --- wontfix

People

(Reporter: jstutte, Unassigned)

References

(Depends on 3 open bugs, Blocks 4 open bugs)

Details

(5 keywords, Whiteboard: [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged][necko-monitor])

Crash Data

+++ This bug was initially created as a clone of Bug #1435343 +++

Extracts the cases 1.1 to 1.4 from comment 70 of bug 1435343.

Crash Signature: [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging] [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::Spi… → [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ] …

Cleaned dependencies and blocker as I expect all these to be quite outdated.

No longer blocks: 988872, 1425323
Summary: Crash in [mozilla::net::nsHttpConnectionMgr::Shutdown] and other net related places. Shutdown. → Crash in [mozilla::net::nsHttpConnectionMgr::Shutdown] and other net related places. Shutdown hang.

Restoring dependencies for active bugs for further investigation.

Blocks: 1435343
No longer depends on: 1435343

I am not sure about the (causing) component here at all.

No longer blocks: 1435343
Component: DOM: Workers → DOM: Core & HTML
Depends on: 1435343
Flags: needinfo?(amarchesini)
Blocks: 1435343
No longer depends on: 1435343

The cause is similar for each of those components: during shutdown, many components spin the event loop when they receive notification X ("xpcom-shutdown" for instance, but we have a few others). They do so because they want to receive IPC calls, or because they want to wait until a subcomponent/object is released. Often this spinning never terminates, and because of this, other components do not receive the same notification X.

For instance, https://bugzilla.mozilla.org/show_bug.cgi?id=1435962#c0 describes what happens for mozilla::net::nsHttpConnectionMgr::Shutdown().

The fix is to remove the spinning of the event loop in every component that does it during the shutdown because that can trigger race conditions.
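
To illustrate the mechanism described above, here is a minimal single-threaded C++ sketch. Everything in it (`SketchEventLoop`, `NotifyShutdown`, the observers A and B) is a hypothetical stand-in and not Gecko code; it only shows how one observer spinning on a condition that never becomes true keeps the next observer from receiving the same notification.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded event loop; a stand-in, not Gecko's nsThread.
struct SketchEventLoop {
  std::deque<std::function<void()>> queue;

  void Dispatch(std::function<void()> task) { queue.push_back(std::move(task)); }

  // Runs one pending task; returns false when nothing is queued.
  bool ProcessNextEvent() {
    if (queue.empty()) return false;
    auto task = std::move(queue.front());
    queue.pop_front();
    task();
    return true;
  }

  // Nested spin: keeps processing events until done() holds. A real loop
  // would block here forever; the sketch gives up once the queue is empty.
  bool SpinUntil(const std::function<bool()>& done) {
    while (!done()) {
      if (!ProcessNextEvent()) return false;
    }
    return true;
  }
};

// Delivers one notification to observers A and B in order and returns a log
// of what happened. If A spins on a condition that never becomes true, B is
// never notified.
inline std::vector<std::string> NotifyShutdown(bool firstObserverSpins) {
  SketchEventLoop loop;
  std::vector<std::string> log;
  bool released = false;  // models a subcomponent that is never released

  std::vector<std::function<bool()>> observers = {
      [&] {
        log.push_back("A notified");
        if (!firstObserverSpins) return true;
        return loop.SpinUntil([&] { return released; });
      },
      [&] {
        log.push_back("B notified");
        return true;
      },
  };

  for (auto& observe : observers) {
    if (!observe()) {
      log.push_back("hang");
      break;
    }
  }
  return log;
}
```

Calling `NotifyShutdown(true)` yields a log in which B never appears, while `NotifyShutdown(false)` notifies both observers.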

Flags: needinfo?(amarchesini)
See Also: → 1611094
See Also: → 1633469
Severity: critical → N/A
Type: defect → task
Keywords: meta
Summary: Crash in [mozilla::net::nsHttpConnectionMgr::Shutdown] and other net related places. Shutdown hang. → [meta] Crash in [mozilla::net::nsHttpConnectionMgr::Shutdown] and other net related places. Shutdown hang.
Type: task → defect
Severity: N/A → S3
Severity: S3 → --
Component: DOM: Core & HTML → Networking
Priority: P2 → --
See Also: → 1630231
Severity: -- → S1
Priority: -- → P3
Whiteboard: [DWS_NEXT][stockwell unknown][tbird topcrash] → [DWS_NEXT][stockwell unknown][tbird topcrash][necko-triaged]

S1 or S2 bugs need an assignee - could you find someone for this bug?

Flags: needinfo?(nhnguyen)
Assignee: nobody → juhsu
Flags: needinfo?(nhnguyen)

:junior - do you think you will have a patch on this S1 bug in 79?

Flags: needinfo?(juhsu)

No, this has been lingering for years. We fought with it very hard before but failed to find a solution (see also bug 1158189).
That's the reason we set it to P3.
An STR would be very helpful, but a fix in 79 is not to be expected.

Flags: needinfo?(juhsu)
Regressions: 1648553
Crash Signature: ] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] → ] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpC…

(In reply to Junior [:junior] from comment #7)

An STR would be very helpful

Hi Junior, I do not expect to have clear STR for this anytime soon, but :baku had in comment 4 a plausible call to action:

The fix is to remove the spinning of the event loop in every component that does it during the shutdown because that can trigger race conditions.

Though I have very limited understanding of this, it sounds very plausible to me that we should avoid this, and not only for the hangs (the event loop spinning on the main thread can lead sometimes to racy off-thread deconstruction of objects, too, it seems). So if you were able to identify those places of event loop spinning in your components, you could file a (set of) bug(s) to improve that aspect and see, if this helps also here.

Flags: needinfo?(juhsu)
No longer regressions: 1648553

We spin the event loop because the socket thread can't finish its job due to a hanging PR_Poll on the socket thread.
bug 1435962 comment 3 indicates a possible network driver bug, and all crashes happen on Windows.

We can't simply remove the event loop spinning without closing connections when we receive the offline notification during the shutdown process.
I don't have a way to remove the event loop spinning off the top of my head (and I believe we can't.)

Flags: needinfo?(juhsu)

nsHttpConnectionMgr::Shutdown() must not spin the event loop because that is called from nsIObserver::Observe(). Spinning the event loop permits calls to nsIObserver::Remove(), for example, which is forbidden during notifications, so Observe() implementations must not spin the event loop.

It is also called from a destructor. I don't know what ensures that nsHttpHandlers are destroyed only at times when it is safe to run arbitrary code.

Anything that must spin an event loop should dispatch a runnable to do that so that it happens at a safe time.
(I'm not clear whether or not that would be sufficient to resolve the issues here.)
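
As a rough illustration of the "dispatch a runnable" suggestion above, the following C++ sketch uses a hypothetical `SketchObserverService` (not the real observer service API). It counts how often a forbidden operation runs while a notification is still in flight, once when the observer re-enters directly and once when it defers the work to a queued task.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for an observer service plus main-thread queue.
struct SketchObserverService {
  std::deque<std::function<void()>> pending;  // deferred "runnables"
  bool notifying = false;
  int unsafeCalls = 0;  // forbidden operations seen during a notification

  void Dispatch(std::function<void()> task) { pending.push_back(std::move(task)); }

  // Stand-in for an operation that must not run while notifying.
  void RemoveObserver() {
    if (notifying) ++unsafeCalls;
  }

  void Notify(const std::function<void(SketchObserverService&)>& observe) {
    notifying = true;
    observe(*this);
    notifying = false;
    // Deferred work runs only after the notification has completed.
    while (!pending.empty()) {
      auto task = std::move(pending.front());
      pending.pop_front();
      task();
    }
  }
};

// Models re-entrancy: the forbidden call happens inside the notification,
// as it could when an observer spins a nested event loop.
inline int UnsafeCallsWhenReentering() {
  SketchObserverService service;
  service.Notify([](SketchObserverService& s) { s.RemoveObserver(); });
  return service.unsafeCalls;
}

// Models the suggestion: the observer only queues the work and returns.
inline int UnsafeCallsWhenDeferred() {
  SketchObserverService service;
  service.Notify([](SketchObserverService& s) {
    SketchObserverService* self = &s;
    s.Dispatch([self] { self->RemoveObserver(); });
  });
  return service.unsafeCalls;
}
```

The deferred variant never performs the forbidden call during the notification. Whether that alone would resolve the hangs discussed here remains open, as noted above.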

Added a macOS-specific signature.

Crash Signature: [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ] … → [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] […

We can't simply remove the event loop spinning without closing connections when we receive the offline notification during the shutdown process.
I don't have a way to remove the event loop spinning off the top of my head (and I believe we can't.)

I'm not super familiar with this code but can we do the following?

  1. ConnectionManager has a boolean flag: mShuttingDown(false). When ::Shutdown() is called, we set it to true. No extra connections are accepted.
  2. A runnable + nsIAsyncShutdownBlocker is dispatched to complete the operation.
  3. When OnMsgShutdownConfirm() is called, the blocker is removed and the shutdown can continue.
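
The three steps above could be sketched roughly as follows. This is a simplified, single-threaded C++ model under stated assumptions: `SketchConnectionManager`, the plain queue standing in for the socket thread, and the `removeBlocker` callback standing in for an async shutdown blocker are all hypothetical and do not use the real Gecko interfaces.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

using SketchQueue = std::deque<std::function<void()>>;

class SketchConnectionManager {
 public:
  // Step 1: once shutdown has started, no extra connections are accepted.
  bool AcceptConnection() {
    if (mShuttingDown) return false;
    ++mOpenConnections;
    return true;
  }

  // Step 2: returns immediately; the remaining work is queued as a runnable.
  // Step 3: the queued work removes the blocker once it has finished.
  void Shutdown(SketchQueue& socketQueue, std::function<void()> removeBlocker) {
    if (mShuttingDown) return;
    mShuttingDown = true;
    socketQueue.push_back([this, done = std::move(removeBlocker)] {
      mOpenConnections = 0;  // stand-in for closing all connections
      done();
    });
  }

  int OpenConnections() const { return mOpenConnections; }

 private:
  bool mShuttingDown = false;
  int mOpenConnections = 0;
};

struct SketchOutcome {
  bool acceptedBefore;
  bool acceptedAfter;
  bool blockedRightAfterShutdown;
  bool blockedAtEnd;
  int openAtEnd;
};

inline SketchOutcome RunShutdownSketch() {
  SketchQueue socketQueue;
  SketchConnectionManager manager;
  bool blockerActive = true;  // models a registered shutdown blocker

  SketchOutcome outcome{};
  outcome.acceptedBefore = manager.AcceptConnection();
  manager.Shutdown(socketQueue, [&blockerActive] { blockerActive = false; });
  outcome.acceptedAfter = manager.AcceptConnection();
  outcome.blockedRightAfterShutdown = blockerActive;

  // Later, the queued work runs and lifts the blocker.
  while (!socketQueue.empty()) {
    auto task = std::move(socketQueue.front());
    socketQueue.pop_front();
    task();
  }
  outcome.blockedAtEnd = blockerActive;
  outcome.openAtEnd = manager.OpenConnections();
  return outcome;
}
```

In this model `Shutdown()` never waits: shutdown stays blocked only until the queued work has run, which is the non-blocking behaviour the proposal aims for.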
Flags: needinfo?(juhsu)

(In reply to Andrea Marchesini [:baku] from comment #12)

We can't simply remove the event loop spinning without closing connections when we receive the offline notification during the shutdown process.
I don't have a way to remove the event loop spinning off the top of my head (and I believe we can't.)

I'm not super familiar with this code but can we do the following?

  1. ConnectionManager has a boolean flag: mShuttingDown(false). When ::Shutdown() is called, we set it to true. No extra connections are accepted.
  2. A runnable + nsIAsyncShutdownBlocker is dispatched to complete the operation.
  3. When OnMsgShutdownConfirm() is called, the blocker is removed and the shutdown can continue.

It looks like a non-blocking way to do the same thing and could work. I'll investigate if we can do this.
Keep the ni? to follow. Thanks, baku!

Flags: needinfo?(juhsu)
Flags: needinfo?(juhsu)
Crash Signature: __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] → __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask]
Crash Signature: __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask] → __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask] [@ sh…
Crash Signature: shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] → shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ]

:nhi Triaging as REO for 79 - is this intended to be looked at for 79? It looks like this issue has gotten worse in beta.

Flags: needinfo?(nhnguyen)

Hello Kim,
Thanks for weighing in and pointing out the issues in 79. It helps to identify some regressions.

I took a look at the first 50 crashes with the stack @shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown in beta.
I believe the HTTP3 and nss crashes are new and not covered in bug 1158189.
These crashes could be reduced to <20% once we have those fixed.
Another 10% of crashes are from hanging in poll.
bug 1435962 comment 3 shows it could be a network driver bug.
We could address the last mile later. However, there is not much we can do about the hanging Poll.

I don't have a strong feeling that it would be solved by not spinning the main thread.
The HTTP3 ones are mostly owing to slow string operations, and nss lives on a non-main thread.
If we don't spin the main thread loop, it will still crash or hang elsewhere.
I'll ni? dragana, who has chased these crashes for a long time, in the next comment for more input.
And I'll file the corresponding bugs for each component.

I list the crash reports which I triaged at the bottom of this comment.
I also took another look at the crashes in nightly. All 7 crashes are H3 with different symbols.

I also took a look at the release crashes.

To me, focusing on beta could be a good idea at this stage.

As for the other crash signatures,
@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask and the other mozilla::TaskController::GetRunnableForMTTask signatures are mostly crashes in QuotaManager (some of them even without a socket thread). Only a few of them are about H3 and nss, which are the known cases.

@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait is about CacheFileIOManager::Shutdown, which is a different issue.

shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread is interesting. Half of them are known nss issues. The other half are deadlocks in the shutdown thread for ssl.
Here are some reports:
https://crash-stats.mozilla.org/report/index/a071e81d-0955-497f-bc3d-036210200708#allthreads
https://crash-stats.mozilla.org/report/index/011da31f-1aeb-46cd-a81f-eac870200708#allthreads
https://crash-stats.mozilla.org/report/index/3d063176-b728-437b-a4c9-c00410200708#allthreads

This is the analysis for @shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown in beta.

Flags: needinfo?(juhsu)

We're not fixing this for 79. We will try to fix what we can for 80.

Flags: needinfo?(nhnguyen)
Depends on: 1651564
Depends on: 1651565

I've looked into some crash reports from comment 16.
We do have some new culprits causing the shutdown hangs (bug 1651564, bug 1651565).
However, removing the main thread spin loop does not seem to solve them.

What do you think, Dragana?

Flags: needinfo?(dd.mozilla)
Depends on: 1029213

(In reply to Junior [:junior] from comment #18)

However, removing the main thread spin loop does not seem to solve them.

Is there a reason the spin loop can't be an async shutdown blocker?

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #19)

(In reply to Junior [:junior] from comment #18)

However, removing the main thread spin loop does not seem to solve them.

Is there a reason the spin loop can't be an async shutdown blocker?

We need to be very careful about this. We need to make sure some things happen in a certain order: transactions need to be canceled before some other things happen, etc. We can explore this, but that change will not fix the crashes.

(In reply to Junior [:junior] from comment #18)

I've looked into some crash reports from comment 16.
We do have some new culprits causing the shutdown hangs (bug 1651564, bug 1651565).
However, removing the main thread spin loop does not seem to solve them.

What do you think, Dragana?

I will ask for an uplift; the patch is very isolated and does not touch any code except http3, which is disabled by default.

Flags: needinfo?(dd.mozilla)
Crash Signature: shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] → shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ shutdownhang | __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ IPCError-browser | ShutDownKill | __psynch_cvwai…

Just had a Thunderbird bug with NSS/SSL issues:
Bug 1655068
Crash in [@ shutdownhang | nssList_Remove | nssCertificateStore_RemoveCertLOCKED | nssCertificate_Destroy | NSSCertificate_Destroy | CERT_DestroyCertificate | ssl_DestroySID | SSL_ClearSessionCache | mozilla::ShutdownXPCOM]

[Tracking Requested - why for this release]:
these shutdownhangs seem to get more common during the 80.0b cycle again - the various crash signatures containing mozilla::TaskController::GetRunnableForMTTask now account for 8% of all browser crashes there.

Crash Signature: __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] → __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask] [@ shutdownhang | trunc | mozilla::TaskController::GetRunnableForMTTask]
Depends on: 1656992

(In reply to [:philipp] from comment #24)

[Tracking Requested - why for this release]:
these shutdownhangs seem to get more common during the 80.0b cycle again - the various crash signatures containing mozilla::TaskController::GetRunnableForMTTask now account for 8% of all browser crashes there.

I did a quick investigation on the reports for GetRunnableForMTTask, which contains

Depends on: 1620157
No longer depends on: 1656992
Depends on: 1656992

I took a closer look at the #4 crash in beta 80.4b, which is @ shutdownhang | mozilla::TaskController::GetRunnableForMTTask

It shows 46 crashes for 4.59%.
I would say the signature mozilla::TaskController::GetRunnableForMTTask goes well beyond networking.
It includes

  • deadlock in nsThread::Shutdown
  • unfinished spineventloop

Actually, I investigated 30 of these reports, and only 1 crash belongs to necko.
Around half of them are QuotaManager, and a quarter of them are nsThread::Shutdown.

As for bug 1656992, bug 1656992 comment 5 shows that it doesn't matter with beta.

Here's the triaged list.
Bug 1542485 QuotaManager: 14
https://crash-stats.mozilla.org/report/index/9f6b7760-93e0-4faa-a51d-c05d70200805
https://crash-stats.mozilla.org/report/index/da9b07bf-7e42-477a-b142-8727f0200805

Bug 1629669 Bug 1505660 nsThread::Shutdown 7
https://crash-stats.mozilla.org/report/index/dc459fc0-e2ee-4aa7-977b-b6a830200805
https://crash-stats.mozilla.org/report/index/a553423b-171a-4c04-978e-bb8620200805
https://crash-stats.mozilla.org/report/index/ca81d0ae-b048-484b-8a32-287c70200805
https://crash-stats.mozilla.org/report/index/6968de7a-4aa2-4f86-9567-2f4b30200805

js::jit 3
https://crash-stats.mozilla.org/report/index/484c4a97-230a-4db7-8b55-e50020200805
https://crash-stats.mozilla.org/report/index/48982758-5bc2-4ef2-9810-336310200805
https://crash-stats.mozilla.org/report/index/1c06bbc4-68e7-4a0f-9cbd-5fcd90200805

NSS nssTrustDomain 2
https://crash-stats.mozilla.org/report/index/cd70867d-b0de-40a3-8274-b6d1f0200805#allthreads
https://crash-stats.mozilla.org/report/index/39a412e6-cd15-4caa-81c4-a7fc60200805#allthreads

spineventloop in PreferencesWriter::Flush 2
https://crash-stats.mozilla.org/report/index/3dd5e055-51da-41c5-b006-b0cb20200805
https://crash-stats.mozilla.org/report/index/28c20ba4-0006-44cb-9dcf-f7eee0200805

workerinternals::RuntimeService::Cleanup
https://crash-stats.mozilla.org/report/index/ae84df84-0da1-4261-bbd6-001c30200805#allthreads
Here are more examples with different signatures

PR_POLL
https://crash-stats.mozilla.org/report/index/667e805a-5033-4af2-b2b9-d8bfb0200805#allthreads

(In reply to Junior [:junior] from comment #27)

Just found bug 1500861.
Do you think ShutdownWithTimeout could resolve the nsThread::Shutdown hangs like the following examples, valentin?

https://crash-stats.mozilla.org/report/index/dc459fc0-e2ee-4aa7-977b-b6a830200805
https://crash-stats.mozilla.org/report/index/a553423b-171a-4c04-978e-bb8620200805
https://crash-stats.mozilla.org/report/index/ca81d0ae-b048-484b-8a32-287c70200805
https://crash-stats.mozilla.org/report/index/6968de7a-4aa2-4f86-9567-2f4b30200805

I don't think it's easy to make it work in all cases. It's mostly meant to be used with nsIThreadPool specifically when doing blocking tasks, and I don't know if that's the case here.

If we're looking at nsHttpConnectionMgr::Shutdown, we could make SpinEventLoopUntil work until a timeout expires.
This might work unless there's an event in the loop that is blocked.
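
A minimal C++ sketch of the timeout idea above, assuming a hypothetical helper `SpinUntilOrTimeout` over a plain task queue (this is not the signature of the real `SpinEventLoopUntil`). It also shows the caveat mentioned above: the deadline is only checked between events, so an event that blocks still prevents the loop from returning.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>
#include <thread>
#include <utility>

using SketchQueue = std::deque<std::function<void()>>;

// Spins the (hypothetical) queue until done() holds or the timeout expires.
// Returns true if the condition was met, false on timeout.
inline bool SpinUntilOrTimeout(SketchQueue& queue,
                               const std::function<bool()>& done,
                               std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done()) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    if (queue.empty()) {
      // Nothing to run yet; wait briefly before re-checking the condition.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
      continue;
    }
    auto task = std::move(queue.front());
    queue.pop_front();
    task();  // if this call blocks, the deadline check is never reached
  }
  return true;
}

// Models waiting for a shutdown confirmation that may or may not arrive.
inline bool ShutdownConfirmedInTime(bool confirmationArrives) {
  SketchQueue queue;
  bool confirmed = false;
  if (confirmationArrives) {
    queue.push_back([&confirmed] { confirmed = true; });
  }
  return SpinUntilOrTimeout(
      queue, [&confirmed] { return confirmed; }, std::chrono::milliseconds(20));
}
```

With the confirmation queued the helper returns true right away; without it the helper gives up after the (arbitrarily chosen) 20 ms instead of spinning indefinitely.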

Flags: needinfo?(valentin.gosu)
Crash Signature: __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask] [@ shutdownhang | trunc | mozilla::TaskController::GetRunnableForMTTask] → __psynch_cvwait | <name omitted> | mozilla::TaskController::GetRunnableForMTTask ] [@ shutdownhang | __psynch_cvwait | mozilla::TaskController::GetRunnableForMTTask] [@ shutdownhang | trunc | mozilla::TaskController::GetRunnableForMTTask] [@ shutdownh…
See Also: → 1658729

Removing the generic signatures that stop at mozilla::TaskController::GetRunnableForMTTask after Bug 1658729, and adding more specific signatures that include mozilla::net::nsHttpConnectionMgr::Shutdown.

Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __pthread_cond_wa… → mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __pthread_cond_…
Assignee: CuveeHsu → nobody

Regarding comment 30, I just filed bug 1660950.

Flags: needinfo?(amarchesini)
See Also: → 1660950
Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | PR_CallOnceWithArg | mozilla::net::nsHttpConnectionMgr::Shutdown ] → mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | PR_CallOnceWithArg | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportSer…
Severity: S1 → S2

Looking at mozilla::net::ShutdownEvent::PostAndWait I see:

    rv = CacheFileIOManager::gInstance->mIOThread->Dispatch(
        this,
        CacheIOThread::WRITE);  // When writes and closing of handles is done
    MOZ_ASSERT(NS_SUCCEEDED(rv));

    TimeDuration waitTime = TimeDuration::FromSeconds(1);
    while (!mNotified) {
       ...

Shouldn't we return here if the mIOThread->Dispatch did not succeed without even entering the while loop (instead of just doing MOZ_ASSERT)?
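
The early return asked about above could look roughly like this. It is a simplified sketch with hypothetical stand-ins (`SketchIOThread`, `SketchShutdownEvent`, an iteration cap in place of the timed wait), not the actual `ShutdownEvent::PostAndWait` code.

```cpp
#include <cassert>
#include <functional>

enum class SketchResult { Ok, Failed };

// Hypothetical IO thread that either runs events or rejects them.
struct SketchIOThread {
  bool acceptsEvents;
  SketchResult Dispatch(const std::function<void()>& event) {
    if (!acceptsEvents) return SketchResult::Failed;
    event();  // the sketch runs the event synchronously
    return SketchResult::Ok;
  }
};

struct SketchShutdownEvent {
  bool notified = false;
  int waitIterations = 0;

  // Returns true if the IO thread confirmed the shutdown.
  bool PostAndWait(SketchIOThread& ioThread, int maxWaitIterations) {
    SketchResult rv = ioThread.Dispatch([this] { notified = true; });
    if (rv != SketchResult::Ok) {
      // Nobody will ever set `notified`, so do not enter the wait loop.
      return false;
    }
    while (!notified && waitIterations < maxWaitIterations) {
      ++waitIterations;  // stands in for a timed wait on a condition variable
    }
    return notified;
  }
};

inline int WaitIterationsAfterFailedDispatch() {
  SketchIOThread ioThread{false};
  SketchShutdownEvent event;
  bool confirmed = event.PostAndWait(ioThread, 1000);
  return confirmed ? -1 : event.waitIterations;
}

inline bool ConfirmedAfterSuccessfulDispatch() {
  SketchIOThread ioThread{true};
  SketchShutdownEvent event;
  return event.PostAndWait(ioThread, 1000);
}
```

In the failed-dispatch case the wait loop is skipped entirely, so the caller is not left waiting for a notification that cannot arrive.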

Flags: needinfo?(valentin.gosu)

" Version: 44 Branch"

Should this be changed?

Version: 44 Branch → Trunk

(In reply to Jens Stutte [:jstutte] (REO for FF 81) from comment #34)

Shouldn't we return here if the mIOThread->Dispatch did not succeed without even entering the while loop (instead of just doing MOZ_ASSERT)?

That's a great point. I'll submit a patch in a separate bug. I'm not sure the connection manager/socket thread waits on the cache thread, so it's not likely to make an impact on this bug (unless I'm missing the code path that does so).

Flags: needinfo?(valentin.gosu)
Crash Signature: [@ shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread] [@ shutdownhang | mozilla::net::ShutdownEvent::PostAndWait] [@ shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown] [… → [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | mozi…
Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] → mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging]
Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging] → mozilla::net::nsHttpConnectionMgr::Shutdown ]
Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] → mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging]
Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ mozilla::dom::workers::RuntimeService::CrashIfHanging] → mozilla::net::nsHttpConnectionMgr::Shutdown ]

Looking at one crash from mozilla::net::ShutdownEvent::PostAndWait, I see the Socket Thread stuck here, probably waiting for some data to arrive, such that the shutdown event posted here never even starts being processed and thus this SpinEventLoopUntil does not return before the timeout.

Cleaning up signatures.

FF Active
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
shutdownhang | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozglue.dll | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozilla::net::ShutdownEvent::PostAndWait
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
shutdownhang | ntdll.dll | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | PR_CallOnceWithArg | mozilla::net::nsHttpConnectionMgr::Shutdown

Thunderbird only active
shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown

Inactive / unsupported versions only
shutdownhang | __psynch_cvwait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | _PR_MD_WAIT_CV | _PR_WaitCondVar | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | kernelbase.dll | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | ntdll.dll | kernel32.dll | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | ntdll.dll | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown

Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __psynch_cvwait | <… → <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ s…

Sorting signatures by frequency

Signature: Count
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | mozilla::net::nsHttpConnectionMgr::Shutdown: 3269
shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown: 2380
shutdownhang | mozilla::net::ShutdownEvent::PostAndWait: 1625
shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread: 1315
shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread: 240
shutdownhang | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown: 164
shutdownhang | mozglue.dll | mozilla::net::nsHttpConnectionMgr::Shutdown: 109
shutdownhang | PR_CallOnceWithArg | mozilla::net::nsHttpConnectionMgr::Shutdown: 73
shutdownhang | ntdll.dll | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown: 39
shutdownhang | __pthread_cond_wait | <name omitted> | mozilla::net::nsHttpConnectionMgr::Shutdown: 2

Looking at shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread.

In all the reports I clicked on, the SocketThread is stuck while shutting down the SSL Cert threadpool.

Looking at the shutdown function, it seems we shut down the threads in the order we created them (and wait for each single thread before we loop). In the first three reports I clicked on, SSL Cert #1 is still alive, and I am assuming that it is processing some long-lasting event when the shutdown event comes in, such that we never get to process the shutdown event.

In two cases it is stuck in nsNSSComponent::BlockUntilLoadableCertsLoaded(), in the other case mozilla::psm::NSSCertDBTrustDomain::GetCertTrust seems stuck.

For this stack trace I see the SSL Cert threads stuck in nsNSSComponent::CheckForSmartCardChanges() security/manager/ssl/nsNSSComponent.cpp:900
Unfortunately I don't know if we can make them break out any faster - AFAIK this calls into the driver of these smart cards, which may be slow or badly written.

In this specific case I think we could use ShutdownWithTimeout in StopSSLServerCertVerificationThreads.
Not the most elegant fix, but it should improve the shutdown case.

Depends on: 1674410

So I clicked on the first 15 reports (on latest versions) of @ shutdownhang mozilla::TaskController::GetRunnableForMTTask mozilla::net::nsHttpConnectionMgr::Shutdown .

They all are doing something on the socket thread which probably prevents the shutdown event from being processed.
They fall into 3 buckets, it seems:

  1. _PR_MD_PR_POLL(PRPollDesc*, int, unsigned int)
    It might be worth checking whether we use unsuitably long timeouts on the poll.

  2. nsSocketTransportService::DetachSocket

  3. nsHttpConnection::EnsureNPNComplete

It is not clear, if we are just unlucky that there is no more time left on the shutdown timer when we send the shutdown event or if those events on the socket thread are really much slower than they should be or even blocking.

(In reply to Jens Stutte [:jstutte] from comment #43)

It is not clear, if we are just unlucky that there is no more time left on the shutdown timer when we send the shutdown event or if those events on the socket thread are really much slower than they should be or even blocking.

I proposed this patch on bug 1505660 that adds some more shutdown phases explicitly to the shutdown watchdog logic. That might mitigate the case of a late start of the network shutdown due to previous delays as the timer will be reset more often.

Looking at some of the nsHttpConnectionMgr::Shutdown hangs.

IIUC, the intended sequence is as follows:

  1. We call nsHttpConnectionMgr::Shutdown() on the main thread
  2. This dispatches OnMsgShutdown to the socket thread passing a boolean. After clearing mSocketThreadTarget and setting mIsShuttingDown we enter a SpinEventLoopUntil that waits for the passed boolean to flip.
  3. OnMsgShutdown closes everything and dispatches OnMsgShutdownConfirm to the same socket thread with the same boolean.
  4. OnMsgShutdownConfirm finally sets that boolean.

What seems to happen is that the socket process never reaches the OnMsgShutdown event, doing different things that are already in the queue. As we are in the same process, we might just want to share the mIsShuttingDown information directly with the socket thread and abort any event processing there immediately.

This is obviously not a good solution once we have the socket process, but still it might paper over some of the hangs for now.
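
The idea of sharing the shutdown flag directly with the socket thread can be modeled as below. This is a deterministic, single-threaded C++ sketch with a hypothetical `SketchSocketThread` type; the real socket thread, `OnMsgShutdown` and `OnMsgShutdownConfirm` are only represented by queue entries.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model of the socket thread's event queue.
struct SketchSocketThread {
  struct Event {
    bool isShutdownEvent;
    std::function<void()> run;
  };

  std::atomic<bool> shuttingDown{false};  // flag shared with the main thread
  std::deque<Event> queue;
  int regularEventsRun = 0;
  bool shutdownConfirmed = false;

  void QueueRegularEvent() {
    queue.push_back({false, [this] { ++regularEventsRun; }});
  }

  // Main-thread side: set the shared flag, then queue the shutdown event
  // behind whatever is already pending.
  void Shutdown() {
    shuttingDown.store(true);
    queue.push_back({true, [this] { shutdownConfirmed = true; }});
  }

  // Socket-thread side: when honoring the flag, pending regular events are
  // skipped so the shutdown event is reached without running them.
  void Drain(bool honorSharedFlag) {
    while (!queue.empty()) {
      Event event = std::move(queue.front());
      queue.pop_front();
      if (honorSharedFlag && shuttingDown.load() && !event.isShutdownEvent) {
        continue;
      }
      event.run();
    }
  }
};

// Returns how many regular events ran before shutdown was confirmed,
// or -1 if shutdown was never confirmed.
inline int RegularEventsRunBeforeShutdown(bool honorSharedFlag) {
  SketchSocketThread thread;
  for (int i = 0; i < 3; ++i) thread.QueueRegularEvent();
  thread.Shutdown();
  thread.Drain(honorSharedFlag);
  return thread.shutdownConfirmed ? thread.regularEventsRun : -1;
}
```

The sketch cannot model an event that is already blocked inside a system call when the flag is set; it only covers events still waiting in the queue.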

(In reply to Jens Stutte [:jstutte] from comment #45)

Looking at some of the nsHttpConnectionMgr::Shutdown hangs.

IIUC, the intended sequence is as follows:

  1. We call nsHttpConnectionMgr::Shutdown() on the main thread
  2. This dispatches OnMsgShutdown to the socket thread passing a boolean. After clearing mSocketThreadTarget and setting mIsShuttingDown we enter a SpinEventLoopUntil that waits for the passed boolean to flip.
  3. OnMsgShutdown closes everything and dispatches OnMsgShutdownConfirm to the same socket thread with the same boolean.
  4. OnMsgShutdownConfirm finally sets that boolean.

What seems to happen is that the socket process never reaches the OnMsgShutdown event, doing different things that are already in the queue. As we are in the same process, we might just want to share the mIsShuttingDown information directly with the socket thread and abort any event processing there immediately.

This is obviously not a good solution once we have the socket process, but still it might paper over some of the hangs for now.

nsSocketTransportService gets information about a shutdown in a different way from nsIOService, by calling gIOService->IsNetTearingDown().
As soon as gIOService->IsNetTearingDown() is true, the nsSocketTransportService: 1) does not call PR_Poll (check is here), 2) does not create new sockets (here), 3) starts leaking sockets (not closing them, to avoid calling PR_Close here), 4) does not call PR_ConnectContinue here, etc. It also has some logic to try to wake up PR_Poll.

Most of these crashes are in PR_Poll, PR_Close, PR_Connect and PR_ConnectContinue, and before we call these functions we check gIOService->IsNetTearingDown(). I assume that the socket thread is already hanging in one of these functions when shutdown is called.

There is some increase in the volume of this hangs.

I had a look at some hangs (about 50 of them) and the most hangs are in PR_Close for the UDP sockets. Recently, we rolled out QUIC and this explains that there are more hangs with UDP sockets (previously there were non or almost non). I have not found any other new hang signature except this one, but that was only 3 out of 50.

Maybe UDP sockets hang more often in PR_Close than TCP sockets. I found these related bugs:
Bug 1124880 and
This Chrome bug

Got this 2x in an hour. Both times I was watching a youtube video, that being most of my network traffic, then networking in Firefox stopped working (networking seemed fine in other applications). Quit firefox and get this shutdown hang.

See Also: → foxstuck

(In reply to Timothy Nikkel (:tnikkel) from comment #48)

> Got this 2x in an hour. Both times I was watching a youtube video, that being most of my network traffic, then networking in Firefox stopped working (networking seemed fine in other applications). Quit firefox and get this shutdown hang.

Also just got this multiple times over the past few hours, and was watching YouTube at the time of the first occurrence. Ran across this bugzilla # via the about:crashes related links after I filed #1749920, and I'm not sure whether that should actually be duped to this since the shutdown hang is a symptom of the real problem (networking died).

Updating the signatures with the 10 most frequent ones.

Crash Signature: [@ shutdownhang | mozilla::TaskController::GetRunnableForMTTask | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread ] [@ shutdownhang | __psynch_cvwait | _pthread_cond_wait | pthread_cond_signal_thread_np | <name omitted> | <nam… → [@ shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __psynch_cvwait | _pthread_cond_wait | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | __pthread_cond_wait | mozilla::TaskController::GetRunnableForMTTask …

Note that the graph above shows an unreasonable +400k cases on January 13, which I cannot confirm when I repeat the query in crash-stats.

Those are from the foxstuck incident.

I think this is an old known issue, not a regression.

Keywords: regression

Thunderbird is no longer a significant presence in any of these signatures.

Whiteboard: [DWS_NEXT][stockwell unknown][tbird topcrash][necko-triaged] → [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged]

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Crash Signature: | ntdll.dll | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | kernelbase.dll | mozglue.dll | mozilla::net::nsHttpConnectionMgr::Shutdown ] → | ntdll.dll | kernelbase.dll | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | kernelbase.dll | mozglue.dll | mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | mozilla::SpinEventLoopUntil | mozilla::net::nsHttpConnectionMg…

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit auto_nag documentation.

Keywords: topcrash
Whiteboard: [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged] → [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged][necko-priority-review]]
Whiteboard: [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged][necko-priority-review]] → [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged]

The socket process should really help here.
We could just kill the socket process instead of waiting for the sockets to close.

Depends on: socket-proc
Whiteboard: [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged] → [DWS_NEXT][stockwell unknown][tbird crash][necko-triaged][necko-monitor]

keyword: Perf ?

¡Hola y'all!

Happy 🌮 Tuesday!

Crashed like

bp-d6f1b34d-91a0-486c-b5a7-3e42b0230328

on 113.0a1 (2023-03-28) (64-bit)

Updating flags FWIW.

¡Gracias!
Alex

(In reply to Worcester12345 from comment #58)

> keyword: Perf ?

Crash Signature: mozilla::net::nsHttpConnectionMgr::Shutdown ] → mozilla::net::nsHttpConnectionMgr::Shutdown ] [@ shutdownhang | <name omitted> | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown ]

This really spiked up in a recent Nightly. I guess that could be related to bug 1863491, as the massive hangs there would probably result in shutdown hangs, too.

The trend seems to be stable recently.
