1704391 - Crash in [@ shutdownhang | nsThreadManager::Shutdown]

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

4 years ago

All crashes on Windows, single crashes per build ID starting in early March, 2+ crashes starting with 20210407094544.

Crash report: https://crash-stats.mozilla.org/report/index/500b3371-f6b6-4e1f-9d9f-315900210411

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
5 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:548
6 xul.dll nsThreadManager::Shutdown xpcom/threads/nsThreadManager.cpp:459
7 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:653
8 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1668
9 xul.dll XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:5556

Jens Stutte [:jstutte]

Comment 1

•

4 years ago

•

Edited

The direction of duping is not 100% clear to me, but this seems to be the same as bug 1505660 and/or bug 1524247.

Basically we get into late XPCOM shutdown and the main thread hangs here, which makes us wait here until all threads' close promises resolve. In the stack traces I looked at, all threads seemed to be idling when this happens, which would indicate that something goes wrong.

This kind of crashes always existed, but I see some recent changes around canceling pending delayed runnables in bug 1695580 that might be related to the latest spikes so ni also in the hope that :pehrsons knows more about this code.

Flags: needinfo?(apehrson)

Jens Stutte [:jstutte]

Updated

•

4 years ago

Comment 2

•

4 years ago

•

Edited

Well, if we hang on nsThreadManager::Shutdown, that means we're past the (sync) nsThreadManager::CancelBackgroundDelayedRunnables already, so I don't see how bug 1695580 could be the direct cause of these hangs.

But looking at the report in comment 0, BackgroundThreadPool #18 is blocked on a mutex in nsTimer::Cancel, presumably here.

I see this pattern in other 89 crashes for this signature too.
Bug 1704159 looks related, with a similar pattern for its 89 crashes.
It seems likely that bug 1695580 has regressed something here.

Could bug 1683404 be related too? nsTimerImpl::Shutdown() happens before nsThreadManager::Shutdown() and we've clearly caught a timer being interacted with while doing nsThreadManager::Shutdown().

My main questions are,
What timer is this -- is it one from a DelayedRunnable?
If so, how did we slip by nsThreadManager::CancelBackgroundDelayedRunnables()?

The only hole I can see in CancelBackgroundDelayedRunnables logic atm is if something creates a background TaskQueue after CancelBackgroundDelayedRunnables. If a DelayedDispatch is then done on that TaskQueue, we'd dispatch a runnable to clean that up during TaskQueue::BeginShutdown() instead, but for background TaskQueues this happens after nsTimerImpl::Shutdown(). I'm not sure how this case should behave, but I can try to verify this by inserting some TaskQueue creations and delayed dispatches locally.

I can also plug this hole and hope that it gives us more info.

Assignee: nobody → apehrson

Flags: needinfo?(apehrson)

Keywords: regression

Regressed by: 1695580

Updated

•

4 years ago

Has Regression Range: --- → yes

BugBot [:suhaib / :marco/ :calixte]

Comment 3

•

4 years ago

Set release status flags based on info from the regressing bug 1695580

status-firefox87: --- → unaffected

status-firefox88: --- → unaffected

status-firefox-esr78: --- → unaffected

Andreas Pehrson [:pehrsons]

Comment 4

•

4 years ago

KrisWright will take this on as I am going on a leave.

Assignee: apehrson → kwright

Pascal Chevrel:pascalc

Comment 5

•

4 years ago

Tracking as the volume is much higher in 89 beta than in 88

status-firefox90: --- → affected

tracking-firefox89: --- → +

Pascal Chevrel:pascalc

Comment 6

•

4 years ago

The volume is massive on 89 beta, clearly something regressed in this cycle or the recent crash stats work shows that we have a top crasher we didn't know about. We should fix that before shipping 89 so I am marking that as a release blocker.

Severity: -- → S1

tracking-firefox89: + → blocking

Priority: -- → P1

Jens Stutte [:jstutte]

Comment 7

•

4 years ago

I spent another half hour with the the report in comment 0.

To me this looks like a UAF starting from nsTimerImpl::CancelImpl on gThread which apparently is never nulled out. In fact, the "Timer Thread" is not among the remaining active threads in the crash dump. So we just happen to find a random piece of memory for mMonitor which is then interpreted as being locked.

I assume this was always there but some change made it more probable to happen now.

ni to :KrisWright as you are the current assignee.

Group: core-security

Flags: needinfo?(kwright)

Andrew McCreight [:mccr8] (PTO until 12-2)

Updated

•

4 years ago

Group: core-security → dom-core-security

Keywords: csectype-uaf

Jens Stutte [:jstutte]

Comment 8

•

4 years ago

(In reply to Jens Stutte [:jstutte] from comment #7)

To me this looks like a UAF starting from nsTimerImpl::CancelImpl on gThread which apparently is never nulled out.

This might be a misread, there is NS_RELEASE(gThread).

In fact, the "Timer Thread" is not among the remaining active threads in the crash dump. So we just happen to find a random piece of memory for mMonitor which is then interpreted as being locked.

But the timer thread is definitely missing. Is there a way to end it without passing through nsTimerImpl::Shutdown ?

Kris Wright :KrisWright

Assignee

Comment 9

•

4 years ago

I've been playing with this for a while and i think I'm getting a better picture of what's going on. I noticed in the past that we have no way to stop useless timers from being created between timer shutdown and thread shutdown. Say the background thread was created for a dispatch received sometime between the two - we just might run into a missing timer. If my theory is correct then it explains why I was able to get a shutdown crash with the following steps:

Create a backround TaskQueue between cancelling delayed runnables and thread pool shutdown
Dispatch any runnable on the TaskQueue
Wait for hang

The hang I got only happens if you create the background TaskQueue after calling CancelBackgroundDelayedRunnables. I can't guarantee they are in fact related but this follows :pehrsons theories from earlier and the stacks and crash signature I get from these hangs also appears to also be waiting on threads to join. Based on where we are stuck I think these both are related.

Why this may happen on windows and not on any other platform is a bit out of my depth. I'm still wondering, how did we get here? I have just one thought of what might be happening:

The background event target receives a dispatch after timer shutdown but before thread pool shutdown
Background event target spawns a new thread (BackgroundThreadPool #18 in report from Comment 0) to service the runnable which I am guessing came from an escaped DelayedRunnable that only dispatches on windows
The scenario above happens

I would know for sure if I knew what the last runnable was that ran on the pool, but I don't believe we retain that info in any meaningful way.

This begs the question, how do we fix it? I think we should fix this from the background task queues. I think a fix looks something like this:
-Disallow creation and dispatch of background taskqueues after gXPCOMThreadsShutDown
-Either handle background dispatches we do receive on BackgroundEventTarget::Dispatch synchronously on whatever thread it's been called on or dispatch specifically to main thread. We need to dispatch these regardless and can't just swallow shutdown dispatches because we need to dispatch runnables to shut down task queues.

And in case i am way off the mark and just discovered some other, unrelated hang while looking for the problem, I also think we should prevent this timer hang from happening in the first place. I think of how we use gXPCOMThreadsShutDown to prevent some dispatch after threads have begun shutting down, maybe we should have a similar flag for timers (or outright use gXPCOMThreadsShutDown for this too, though I think we still want to allow timers to fire until actual timer shutdown)

I am working on a fix for each right now so I should have something up shortly.

Flags: needinfo?(kwright)

Kris Wright :KrisWright

Assignee

Comment 10

•

4 years ago

(In reply to Kris Wright :KrisWright from comment #9)

-Disallow creation and dispatch of background taskqueues after gXPCOMThreadsShutDown

I played with this some more and I was a bit off. the problem is only with task queues created after gXPCOMThreadsShutDown and you don't have to dispatch anything off of them except for the cleanup tasks they dispatch during shutdown in order to cancel delayed runnables. So I am more interested in why they can't shut down like the rest of task queues. Regardless I think being able to create task queues at this point in shut down is a bug and needs to be fixed.

Kris Wright :KrisWright

Assignee

Comment 11

•

4 years ago

Attached file Bug 1704391 - Address timers cancelled after timer shutdown. (obsolete) — Details

There is a chance that a timer will be cancelled by its holder after the timer thread and all of the timers have been cleaned up. Doing this can result in a hang on the mutex lock, likely because the mutex is now pointing to something else. I'm not completely familiar with the inner workings of our timers so I'm not sure what the best fix is here. I decided to follow a pattern in our threading shutdown routines and flag the timer shutdown so the timers can't attempt to cancel past a certain point where they would be shutdown anyway. This is tricky because there are timers created during XPCOM shutdown all the way up until we shutdown the timer thread. Checking for the timer thread outside of the lock could also result in a race with timer shutdown. This value is an attempt to address this without breaking timers shutdown.

Kris Wright :KrisWright

Assignee

Comment 12

•

4 years ago

Attached file Bug 1704391 - Disallow creation of new TaskQueues after we have begun threads shutdown (obsolete) — Details

Phabricator Automation

Comment 13

•

4 years ago

Comment on attachment 9218535 [details]
Bug 1704391 - Disallow creation of new TaskQueues after we have begun threads shutdown

Revision D113449 was moved to bug 1707924. Setting attachment 9218535 [details] to obsolete.

Attachment #9218535 - Attachment is obsolete: true

Kris Wright :KrisWright

Assignee

Comment 14

•

4 years ago

(In reply to Kris Wright :KrisWright from comment #10)

I played with this some more and I was a bit off. the problem is only with task queues created after gXPCOMThreadsShutDown and you don't have to dispatch anything off of them except for the cleanup tasks they dispatch during shutdown in order to cancel delayed runnables. So I am more interested in why they can't shut down like the rest of task queues. Regardless I think being able to create task queues at this point in shut down is a bug and needs to be fixed.

After some discussion we came to the conclusion that the TaskQueue hang, while it looked a bit similar to me at first glance, is unrelated to the timer access after shutdown and I'll spin it off into its own bug (bug 1707924) and will leave the attempts to prevent timer issues here.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 15

•

4 years ago

Landed: https://hg.mozilla.org/integration/autoland/rev/1d91ad1c6c6ec1d817f264bd97b094c15e43735d

Backed out for multiple failures in xpcom/threads/nsTimerImpl.cpp:

https://hg.mozilla.org/integration/autoland/rev/eae6689dd31910794b188d2553a268a5f58fcb7b

Logs:
https://treeherder.mozilla.org/logviewer?job_id=338256027&repo=autoland&lineNumber=2784
https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland

Flags: needinfo?(kwright)

Kris Wright :KrisWright

Assignee

Comment 16

•

4 years ago

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #15)

Landed: https://hg.mozilla.org/integration/autoland/rev/1d91ad1c6c6ec1d817f264bd97b094c15e43735d

Backed out for multiple failures in xpcom/threads/nsTimerImpl.cpp:

https://hg.mozilla.org/integration/autoland/rev/eae6689dd31910794b188d2553a268a5f58fcb7b

Logs:
https://treeherder.mozilla.org/logviewer?job_id=338256027&repo=autoland&lineNumber=2784
https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland

Looks like this is happening more frequently than I thought based on https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland&lineNumber=2115 which is pretty good news for debugging.

These are mostly coming out of the same code (the message pump) and it has no way to know if a timer still exists. Probably still coming from an escaped delayed runnable, though I wish I knew what code was dispatching it. We will probably have to rework the message loop somewhat.

Flags: needinfo?(kwright)

Kris Wright :KrisWright

Assignee

Comment 17

•

4 years ago

I've played with this a bit more and reproduced this assertion failure in lieu of a UAF a few times and it looks like whatever is using this delayed work timer schedules work before we cancel our delayed runnables, but this timer persists after timer shutdown. Maybe MessagePump needs its own way to exit during timer shutdown.

Julien Cristau [:jcristau]

Comment 18

•

4 years ago

Regressing bug backed out of beta for 89: https://hg.mozilla.org/releases/mozilla-beta/rev/21a23370449e01dcb4388f99bcd3778923db3def

status-firefox89: affected → fixed

Julien Cristau [:jcristau]

Comment 19

•

4 years ago

Tracking as a blocker for 90 to make sure we have a fix or backout before June.

tracking-firefox90: --- → blocking

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Keywords: sec-moderate

Kris Wright :KrisWright

Assignee

Comment 20

•

4 years ago

It looks like this data race might be involved after all. I think the last of the DelayedRunnables to finish could possibly overlap with timer shutdown which increases the frequency of hangs of this type.

Andrew McCreight [:mccr8] (PTO until 12-2)

Updated

•

4 years ago

Comment 21

•

3 years ago

Regressor backed out of beta again for 90:
https://hg.mozilla.org/releases/mozilla-beta/rev/3dabe43c4982a4759af9195a9eaf02d1062304eb

status-firefox90: affected → fixed

status-firefox91: --- → affected

tracking-firefox91: --- → blocking

BugBot [:suhaib / :marco/ :calixte]

Comment 22

•

3 years ago

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:KrisWright, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)

Flags: needinfo?(kwright)

Kris Wright :KrisWright

Assignee

Comment 23

•

3 years ago

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #22)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:KrisWright, could you have a look please?
For more information, please visit auto_nag documentation.

This patch might be made obsolete if other patches sufficiently reduce crash volume.

Flags: needinfo?(nika)

Flags: needinfo?(kwright)

Julien Cristau [:jcristau]

Comment 24

•

3 years ago

No crashes so far in beta 91, so that's a good sign.

Jens Stutte [:jstutte]

Comment 25

•

3 years ago

(In reply to Julien Cristau [:jcristau] from comment #24)

No crashes so far in beta 91, so that's a good sign.

Am I reading bug 1695580 right that the regressor is still backed out? Maybe we should decide, if there is a path forward for that change on that bug and re-open it) ?

Flags: needinfo?(kwright)

Julien Cristau [:jcristau]

Comment 26

•

3 years ago

Bug 1695580 was backed out of beta 89 and 90, as far as I can tell. It's still in 91+.

Jens Stutte [:jstutte]

Comment 27

•

3 years ago

Oh, I just referred to the status "Accepted" on all patches in that bug which confuses me.

Kris Wright :KrisWright

Assignee

Comment 28

•

3 years ago

Bug 1683404 fixed the existing shutdown race that DelayedRunnable was tripping more frequently. I checked beta and AFAICT we carried over DelayedRunnable shutdown changes into 91.

Flags: needinfo?(kwright)

Phabricator Automation

Updated

•

3 years ago

Attachment #9218534 - Attachment is obsolete: true

Pascal Chevrel:pascalc

Comment 29

•

3 years ago

I am not seeing any crash in 91 with this signature and it is a sec-moderate, so no need to have it marking a blocking the release. I see one crash in 92 nightly.

tracking-firefox91: blocking → ---

Pascal Chevrel:pascalc

Updated

•

3 years ago

status-firefox91: affected → wontfix

status-firefox92: --- → affected

status-firefox-esr78: unaffected → ---

Ryan VanderMeulen [:RyanVM]

Comment 30

•

3 years ago

AFAICT, the crash rate now is what it was prior to the regression back in April. Calling this fixed by bug 1683404.

Group: dom-core-security → core-security-release

Status: NEW → RESOLVED

Closed: 3 years ago

status-firefox91: wontfix → fixed

status-firefox92: affected → ---

status-firefox-esr78: --- → unaffected

status-firefox-esr91: --- → fixed

Depends on: 1683404

Resolution: --- → FIXED

Target Milestone: --- → 91 Branch

Daniel Veditz [:dveditz]

Updated

•

2 years ago

Group: core-security-release

Bug 1704391 - Address timers cancelled after timer shutdown. 4 years ago Kris Wright :KrisWright 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1704391 - Disallow creation of new TaskQueues after we have begun threads shutdown 4 years ago Kris Wright :KrisWright 48 bytes, text/x-phabricator-request		Details \| Review