Closed Bug 1704391 Opened 3 years ago Closed 3 years ago

Crash in [@ shutdownhang | nsThreadManager::Shutdown]

Categories

(Core :: XPCOM, defect, P1)

All
Windows
defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- fixed
firefox87 --- unaffected
firefox88 --- unaffected
firefox89 blocking fixed
firefox90 blocking fixed
firefox91 --- fixed

People

(Reporter: aryx, Assigned: KrisWright)

References

(Regression)

Details

(4 keywords)

Crash Data

Attachments

(2 obsolete files)

All crashes on Windows, single crashes per build ID starting in early March, 2+ crashes starting with 20210407094544.

Crash report: https://crash-stats.mozilla.org/report/index/500b3371-f6b6-4e1f-9d9f-315900210411

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForAlertByThreadId 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernelbase.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
5 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:548
6 xul.dll nsThreadManager::Shutdown xpcom/threads/nsThreadManager.cpp:459
7 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:653
8 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1668
9 xul.dll XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:5556

The direction of duping is not 100% clear to me, but this seems to be the same as bug 1505660 and/or bug 1524247.

Basically we get into late XPCOM shutdown and the main thread hangs here, which makes us wait here until all threads' close promises resolve. In the stack traces I looked at, all threads seemed to be idling when this happens, which would indicate that something goes wrong.

This kind of crashes always existed, but I see some recent changes around canceling pending delayed runnables in bug 1695580 that might be related to the latest spikes so ni also in the hope that :pehrsons knows more about this code.

Flags: needinfo?(apehrson)
See Also: → 1505660, 1524247

Well, if we hang on nsThreadManager::Shutdown, that means we're past the (sync) nsThreadManager::CancelBackgroundDelayedRunnables already, so I don't see how bug 1695580 could be the direct cause of these hangs.

But looking at the report in comment 0, BackgroundThreadPool #18 is blocked on a mutex in nsTimer::Cancel, presumably here.

I see this pattern in other 89 crashes for this signature too.
Bug 1704159 looks related, with a similar pattern for its 89 crashes.
It seems likely that bug 1695580 has regressed something here.

Could bug 1683404 be related too? nsTimerImpl::Shutdown() happens before nsThreadManager::Shutdown() and we've clearly caught a timer being interacted with while doing nsThreadManager::Shutdown().

My main questions are,
What timer is this -- is it one from a DelayedRunnable?
If so, how did we slip by nsThreadManager::CancelBackgroundDelayedRunnables()?

The only hole I can see in CancelBackgroundDelayedRunnables logic atm is if something creates a background TaskQueue after CancelBackgroundDelayedRunnables. If a DelayedDispatch is then done on that TaskQueue, we'd dispatch a runnable to clean that up during TaskQueue::BeginShutdown() instead, but for background TaskQueues this happens after nsTimerImpl::Shutdown(). I'm not sure how this case should behave, but I can try to verify this by inserting some TaskQueue creations and delayed dispatches locally.

I can also plug this hole and hope that it gives us more info.

Assignee: nobody → apehrson
Flags: needinfo?(apehrson)
Keywords: regression
Regressed by: 1695580
See Also: → 1704159
Has Regression Range: --- → yes

Set release status flags based on info from the regressing bug 1695580

KrisWright will take this on as I am going on a leave.

Assignee: apehrson → kwright

Tracking as the volume is much higher in 89 beta than in 88

The volume is massive on 89 beta, clearly something regressed in this cycle or the recent crash stats work shows that we have a top crasher we didn't know about. We should fix that before shipping 89 so I am marking that as a release blocker.

Severity: -- → S1
Priority: -- → P1

I spent another half hour with the the report in comment 0.

To me this looks like a UAF starting from nsTimerImpl::CancelImpl on gThread which apparently is never nulled out. In fact, the "Timer Thread" is not among the remaining active threads in the crash dump. So we just happen to find a random piece of memory for mMonitor which is then interpreted as being locked.

I assume this was always there but some change made it more probable to happen now.

ni to :KrisWright as you are the current assignee.

Group: core-security
Flags: needinfo?(kwright)
Group: core-security → dom-core-security
Keywords: csectype-uaf

(In reply to Jens Stutte [:jstutte] from comment #7)

To me this looks like a UAF starting from nsTimerImpl::CancelImpl on gThread which apparently is never nulled out.

This might be a misread, there is NS_RELEASE(gThread).

In fact, the "Timer Thread" is not among the remaining active threads in the crash dump. So we just happen to find a random piece of memory for mMonitor which is then interpreted as being locked.

But the timer thread is definitely missing. Is there a way to end it without passing through nsTimerImpl::Shutdown ?

I've been playing with this for a while and i think I'm getting a better picture of what's going on. I noticed in the past that we have no way to stop useless timers from being created between timer shutdown and thread shutdown. Say the background thread was created for a dispatch received sometime between the two - we just might run into a missing timer. If my theory is correct then it explains why I was able to get a shutdown crash with the following steps:

  • Create a backround TaskQueue between cancelling delayed runnables and thread pool shutdown
  • Dispatch any runnable on the TaskQueue
  • Wait for hang

The hang I got only happens if you create the background TaskQueue after calling CancelBackgroundDelayedRunnables. I can't guarantee they are in fact related but this follows :pehrsons theories from earlier and the stacks and crash signature I get from these hangs also appears to also be waiting on threads to join. Based on where we are stuck I think these both are related.

Why this may happen on windows and not on any other platform is a bit out of my depth. I'm still wondering, how did we get here? I have just one thought of what might be happening:

  • The background event target receives a dispatch after timer shutdown but before thread pool shutdown
  • Background event target spawns a new thread (BackgroundThreadPool #18 in report from Comment 0) to service the runnable which I am guessing came from an escaped DelayedRunnable that only dispatches on windows
  • The scenario above happens

I would know for sure if I knew what the last runnable was that ran on the pool, but I don't believe we retain that info in any meaningful way.

This begs the question, how do we fix it? I think we should fix this from the background task queues. I think a fix looks something like this:
-Disallow creation and dispatch of background taskqueues after gXPCOMThreadsShutDown
-Either handle background dispatches we do receive on BackgroundEventTarget::Dispatch synchronously on whatever thread it's been called on or dispatch specifically to main thread. We need to dispatch these regardless and can't just swallow shutdown dispatches because we need to dispatch runnables to shut down task queues.

And in case i am way off the mark and just discovered some other, unrelated hang while looking for the problem, I also think we should prevent this timer hang from happening in the first place. I think of how we use gXPCOMThreadsShutDown to prevent some dispatch after threads have begun shutting down, maybe we should have a similar flag for timers (or outright use gXPCOMThreadsShutDown for this too, though I think we still want to allow timers to fire until actual timer shutdown)

I am working on a fix for each right now so I should have something up shortly.

Flags: needinfo?(kwright)

(In reply to Kris Wright :KrisWright from comment #9)

-Disallow creation and dispatch of background taskqueues after gXPCOMThreadsShutDown

I played with this some more and I was a bit off. the problem is only with task queues created after gXPCOMThreadsShutDown and you don't have to dispatch anything off of them except for the cleanup tasks they dispatch during shutdown in order to cancel delayed runnables. So I am more interested in why they can't shut down like the rest of task queues. Regardless I think being able to create task queues at this point in shut down is a bug and needs to be fixed.

There is a chance that a timer will be cancelled by its holder after the timer thread and all of the timers have been cleaned up. Doing this can result in a hang on the mutex lock, likely because the mutex is now pointing to something else. I'm not completely familiar with the inner workings of our timers so I'm not sure what the best fix is here. I decided to follow a pattern in our threading shutdown routines and flag the timer shutdown so the timers can't attempt to cancel past a certain point where they would be shutdown anyway. This is tricky because there are timers created during XPCOM shutdown all the way up until we shutdown the timer thread. Checking for the timer thread outside of the lock could also result in a race with timer shutdown. This value is an attempt to address this without breaking timers shutdown.

Comment on attachment 9218535 [details]
Bug 1704391 - Disallow creation of new TaskQueues after we have begun threads shutdown

Revision D113449 was moved to bug 1707924. Setting attachment 9218535 [details] to obsolete.

Attachment #9218535 - Attachment is obsolete: true

(In reply to Kris Wright :KrisWright from comment #10)

I played with this some more and I was a bit off. the problem is only with task queues created after gXPCOMThreadsShutDown and you don't have to dispatch anything off of them except for the cleanup tasks they dispatch during shutdown in order to cancel delayed runnables. So I am more interested in why they can't shut down like the rest of task queues. Regardless I think being able to create task queues at this point in shut down is a bug and needs to be fixed.

After some discussion we came to the conclusion that the TaskQueue hang, while it looked a bit similar to me at first glance, is unrelated to the timer access after shutdown and I'll spin it off into its own bug (bug 1707924) and will leave the attempts to prevent timer issues here.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #15)

Landed: https://hg.mozilla.org/integration/autoland/rev/1d91ad1c6c6ec1d817f264bd97b094c15e43735d

Backed out for multiple failures in xpcom/threads/nsTimerImpl.cpp:

https://hg.mozilla.org/integration/autoland/rev/eae6689dd31910794b188d2553a268a5f58fcb7b

Logs:
https://treeherder.mozilla.org/logviewer?job_id=338256027&repo=autoland&lineNumber=2784
https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland

Looks like this is happening more frequently than I thought based on https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland&lineNumber=2115 which is pretty good news for debugging.

These are mostly coming out of the same code (the message pump) and it has no way to know if a timer still exists. Probably still coming from an escaped delayed runnable, though I wish I knew what code was dispatching it. We will probably have to rework the message loop somewhat.

Flags: needinfo?(kwright)

I've played with this a bit more and reproduced this assertion failure in lieu of a UAF a few times and it looks like whatever is using this delayed work timer schedules work before we cancel our delayed runnables, but this timer persists after timer shutdown. Maybe MessagePump needs its own way to exit during timer shutdown.

Tracking as a blocker for 90 to make sure we have a fix or backout before June.

It looks like this data race might be involved after all. I think the last of the DelayedRunnables to finish could possibly overlap with timer shutdown which increases the frequency of hangs of this type.

See Also: → 1711282

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:KrisWright, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(nika)
Flags: needinfo?(kwright)

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #22)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:KrisWright, could you have a look please?
For more information, please visit auto_nag documentation.

This patch might be made obsolete if other patches sufficiently reduce crash volume.

Flags: needinfo?(nika)
Flags: needinfo?(kwright)

No crashes so far in beta 91, so that's a good sign.

(In reply to Julien Cristau [:jcristau] from comment #24)

No crashes so far in beta 91, so that's a good sign.

Am I reading bug 1695580 right that the regressor is still backed out? Maybe we should decide, if there is a path forward for that change on that bug and re-open it) ?

Flags: needinfo?(kwright)

Bug 1695580 was backed out of beta 89 and 90, as far as I can tell. It's still in 91+.

Oh, I just referred to the status "Accepted" on all patches in that bug which confuses me.

Bug 1683404 fixed the existing shutdown race that DelayedRunnable was tripping more frequently. I checked beta and AFAICT we carried over DelayedRunnable shutdown changes into 91.

Flags: needinfo?(kwright)
Attachment #9218534 - Attachment is obsolete: true

I am not seeing any crash in 91 with this signature and it is a sec-moderate, so no need to have it marking a blocking the release. I see one crash in 92 nightly.

AFAICT, the crash rate now is what it was prior to the regression back in April. Calling this fixed by bug 1683404.

Group: dom-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 3 years ago
Depends on: 1683404
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: