Crash in [@ shutdownhang | nsThreadManager::Shutdown]
Categories: Core :: XPCOM, defect, P1

Tracking
Branch | Tracking | Status
---|---|---
firefox-esr78 | --- | unaffected
firefox-esr91 | --- | fixed
firefox87 | --- | unaffected
firefox88 | --- | unaffected
firefox89 | blocking | fixed
firefox90 | blocking | fixed
firefox91 | --- | fixed

People: Reporter: aryx, Assigned: KrisWright
References: Regression
Details: 4 keywords
Crash Data
Attachments: 2 obsolete files
All crashes are on Windows: single crashes per build ID starting in early March, and 2+ crashes per build ID starting with build 20210407094544.
Crash report: https://crash-stats.mozilla.org/report/index/500b3371-f6b6-4e1f-9d9f-315900210411
MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
Top 10 frames of crashing thread:
0 ntdll.dll NtWaitForAlertByThreadId
1 ntdll.dll RtlSleepConditionVariableSRW
2 kernelbase.dll SleepConditionVariableSRW
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_windows.cpp:50
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
5 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:548
6 xul.dll nsThreadManager::Shutdown xpcom/threads/nsThreadManager.cpp:459
7 xul.dll mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:653
8 xul.dll ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1668
9 xul.dll XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:5556
Comment 1•4 years ago
The direction of duping is not 100% clear to me, but this seems to be the same as bug 1505660 and/or bug 1524247.
Basically we get into late XPCOM shutdown and the main thread hangs here, which makes us wait until all threads' close promises resolve. In the stack traces I looked at, all threads seemed to be idle when this happens, which would indicate that something went wrong.
This kind of crash has always existed, but I see some recent changes around canceling pending delayed runnables in bug 1695580 that might be related to the latest spikes, so needinfo also in the hope that :pehrsons knows more about this code.
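To make the hang mechanism concrete, here is a minimal standalone sketch (standard C++ with hypothetical names, not the actual nsThreadManager code) of the pattern the stack shows: the main thread blocks in its shutdown loop until every worker has signalled completion, so one unresolved signal means waiting forever.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Simplified stand-in for the "spin the event loop until all threads have
// joined" pattern. If one worker never signals completion, the main thread
// waits here indefinitely and the shutdown watchdog eventually reports a
// shutdownhang crash.
int main() {
  std::mutex lock;
  std::condition_variable cv;
  int pending = 3;  // number of worker threads still shutting down

  std::vector<std::thread> workers;
  for (int i = 0; i < 3; ++i) {
    workers.emplace_back([&] {
      // ... drain remaining events, then report completion ...
      std::lock_guard<std::mutex> guard(lock);
      --pending;
      cv.notify_one();
    });
  }

  {
    // Equivalent of the NS_ProcessNextEvent loop in nsThreadManager::Shutdown
    // (frames 4-6 in comment 0): block until every worker has resolved its
    // "shutdown promise".
    std::unique_lock<std::mutex> guard(lock);
    cv.wait(guard, [&] { return pending == 0; });
  }

  for (auto& w : workers) w.join();
}
```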
Updated•4 years ago
Comment 2•4 years ago
Well, if we hang on nsThreadManager::Shutdown, that means we're past the (sync) nsThreadManager::CancelBackgroundDelayedRunnables already, so I don't see how bug 1695580 could be the direct cause of these hangs.
But looking at the report in comment 0, BackgroundThreadPool #18 is blocked on a mutex in nsTimer::Cancel, presumably here.
I see this pattern in other 89 crashes for this signature too.
Bug 1704159 looks related, with a similar pattern for its 89 crashes.
It seems likely that bug 1695580 has regressed something here.
Could bug 1683404 be related too? nsTimerImpl::Shutdown() happens before nsThreadManager::Shutdown() and we've clearly caught a timer being interacted with while doing nsThreadManager::Shutdown().
My main questions are:
- What timer is this -- is it one from a DelayedRunnable?
- If so, how did we slip by nsThreadManager::CancelBackgroundDelayedRunnables()?
The only hole I can see in CancelBackgroundDelayedRunnables logic atm is if something creates a background TaskQueue after CancelBackgroundDelayedRunnables. If a DelayedDispatch is then done on that TaskQueue, we'd dispatch a runnable to clean that up during TaskQueue::BeginShutdown() instead, but for background TaskQueues this happens after nsTimerImpl::Shutdown(). I'm not sure how this case should behave, but I can try to verify this by inserting some TaskQueue creations and delayed dispatches locally.
I can also plug this hole and hope that it gives us more info.
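To make that hole concrete, here is a minimal standalone sketch (standard C++ with hypothetical "Sketch" names, not the Gecko implementation) of how a snapshot-style cancellation pass misses a queue created after it runs:

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Simplified illustration of the hole described above: a cancellation pass
// only sees the task queues that exist when it runs, so a queue created
// afterwards keeps its delayed work (and the timer backing it) alive.
struct TaskQueueSketch {
  std::vector<std::function<void()>> delayed;  // pending delayed runnables
  void CancelDelayed() { delayed.clear(); }
};

std::mutex gLock;
std::vector<std::shared_ptr<TaskQueueSketch>> gQueues;

std::shared_ptr<TaskQueueSketch> CreateBackgroundQueueSketch() {
  std::lock_guard<std::mutex> lock(gLock);
  gQueues.push_back(std::make_shared<TaskQueueSketch>());
  return gQueues.back();
}

void CancelBackgroundDelayedRunnablesSketch() {
  // Cancels delayed work on every queue that exists *right now*.
  std::lock_guard<std::mutex> lock(gLock);
  for (auto& q : gQueues) q->CancelDelayed();
}

int main() {
  auto early = CreateBackgroundQueueSketch();
  early->delayed.push_back([] { /* cancelled in time */ });

  CancelBackgroundDelayedRunnablesSketch();

  // The hole: this queue is created after the cancellation pass, so any
  // delayed runnable it schedules survives into the later, timer-less
  // phases of shutdown.
  auto late = CreateBackgroundQueueSketch();
  late->delayed.push_back([] { /* escapes cancellation */ });
}
```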
Updated•4 years ago
Comment 3•4 years ago
Set release status flags based on info from the regressing bug 1695580
Comment 4•4 years ago
KrisWright will take this on as I am going on leave.
Comment 5•4 years ago
Tracking as the volume is much higher in 89 beta than in 88
Comment 6•4 years ago
The volume is massive on 89 beta; clearly something regressed in this cycle, or the recent crash-stats work shows that we have a top crasher we didn't know about. We should fix that before shipping 89, so I am marking this as a release blocker.
Comment 7•4 years ago
I spent another half hour with the report in comment 0.
To me this looks like a UAF starting from nsTimerImpl::CancelImpl on gThread, which apparently is never nulled out. In fact, the "Timer Thread" is not among the remaining active threads in the crash dump. So we just happen to find a random piece of memory for mMonitor, which is then interpreted as being locked.
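A minimal standalone sketch (standard C++ with hypothetical names, not the actual nsTimerImpl code) of the suspected pattern, assuming the pointer really is left dangling as described:

```cpp
#include <mutex>

// Hypothetical, simplified stand-ins; not the real TimerThread/nsTimerImpl.
struct TimerThreadSketch {
  std::mutex mMonitor;  // stands in for the monitor CancelImpl acquires
};

TimerThreadSketch* gThread = nullptr;  // global pointer, like gThread in nsTimerImpl.cpp

void TimerShutdownSketch() {
  delete gThread;
  // Suspected problem: if the pointer is not cleared here, it dangles over
  // freed memory for the rest of shutdown.
  // gThread = nullptr;
}

void LateCancelSketch() {
  // A cancel arriving after shutdown would then lock whatever bytes now live
  // where mMonitor used to be; if they happen to look like a held lock, the
  // calling thread blocks forever -- the hang seen in these reports.
  if (gThread) {                                          // dangling but non-null
    std::lock_guard<std::mutex> lock(gThread->mMonitor);  // use-after-free
  }
}

int main() {
  gThread = new TimerThreadSketch();
  TimerShutdownSketch();
  // LateCancelSketch();  // intentionally not called: this path is undefined behavior
}
```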
I assume this was always there but some change made it more probable to happen now.
ni to :KrisWright as you are the current assignee.
Updated•4 years ago
Comment 8•4 years ago
(In reply to Jens Stutte [:jstutte] from comment #7)
> To me this looks like a UAF starting from nsTimerImpl::CancelImpl on gThread, which apparently is never nulled out.

This might be a misread, there is NS_RELEASE(gThread).

> In fact, the "Timer Thread" is not among the remaining active threads in the crash dump. So we just happen to find a random piece of memory for mMonitor, which is then interpreted as being locked.

But the timer thread is definitely missing. Is there a way to end it without passing through nsTimerImpl::Shutdown?
Assignee
Comment 9•4 years ago
I've been playing with this for a while and I think I'm getting a better picture of what's going on. I noticed in the past that we have no way to stop useless timers from being created between timer shutdown and thread shutdown. Say the background thread was created for a dispatch received sometime between the two - we just might run into a missing timer. If my theory is correct then it explains why I was able to get a shutdown crash with the following steps:
- Create a background TaskQueue between cancelling delayed runnables and thread pool shutdown
- Dispatch any runnable on the TaskQueue
- Wait for hang
The hang I got only happens if you create the background TaskQueue after calling CancelBackgroundDelayedRunnables. I can't guarantee they are in fact related, but this follows :pehrsons' theories from earlier, and the stacks and crash signature I get from these hangs also appear to be waiting on threads to join. Based on where we are stuck I think these both are related.
Why this may happen on Windows and not on any other platform is a bit out of my depth. I'm still wondering, how did we get here? I have just one thought about what might be happening:
- The background event target receives a dispatch after timer shutdown but before thread pool shutdown
- The background event target spawns a new thread (BackgroundThreadPool #18 in the report from comment 0) to service the runnable, which I am guessing came from an escaped DelayedRunnable that only dispatches on Windows
- The scenario above happens
I would know for sure if I knew what the last runnable was that ran on the pool, but I don't believe we retain that info in any meaningful way.
This raises the question: how do we fix it? I think we should fix this from the background task queues. I think a fix looks something like this:
- Disallow creation and dispatch of background TaskQueues after gXPCOMThreadsShutDown
- Either handle background dispatches we do receive on BackgroundEventTarget::Dispatch synchronously on whatever thread it's called on, or dispatch them specifically to the main thread (sketched below). We need to dispatch these regardless and can't just swallow shutdown dispatches, because we need to dispatch runnables to shut down task queues.
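A rough sketch of the second bullet, assuming a global shutdown flag like gXPCOMThreadsShutDown (standard C++ with hypothetical "Sketch" names, not the real BackgroundEventTarget code):

```cpp
#include <atomic>
#include <functional>
#include <iostream>
#include <vector>

// Once threads have begun shutting down, a late background dispatch is run
// synchronously on the calling thread instead of being handed to the pool,
// so cleanup runnables still execute but no new background threads get spun
// up this late in shutdown.
std::atomic<bool> gXPCOMThreadsShutDownSketch{false};
std::vector<std::function<void()>> gBackgroundPoolSketch;  // stands in for the pool

void BackgroundDispatchSketch(std::function<void()> runnable) {
  if (gXPCOMThreadsShutDownSketch.load()) {
    // Late dispatch: do not swallow it (task-queue shutdown depends on it
    // running), but do not involve the background pool either.
    runnable();
    return;
  }
  // Normal path: hand the runnable to the background thread pool.
  gBackgroundPoolSketch.push_back(std::move(runnable));
}

int main() {
  BackgroundDispatchSketch([] { std::cout << "queued for the pool\n"; });
  gXPCOMThreadsShutDownSketch = true;
  BackgroundDispatchSketch([] { std::cout << "ran inline during shutdown\n"; });
}
```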
And in case I am way off the mark and just discovered some other, unrelated hang while looking for the problem, I also think we should prevent this timer hang from happening in the first place. Thinking of how we use gXPCOMThreadsShutDown to prevent some dispatches after threads have begun shutting down, maybe we should have a similar flag for timers (or outright use gXPCOMThreadsShutDown for this too, though I think we still want to allow timers to fire until actual timer shutdown).
I am working on a fix for each right now so I should have something up shortly.
Assignee
Comment 10•4 years ago
(In reply to Kris Wright :KrisWright from comment #9)
> - Disallow creation and dispatch of background TaskQueues after gXPCOMThreadsShutDown

I played with this some more and I was a bit off. The problem is only with task queues created after gXPCOMThreadsShutDown, and you don't have to dispatch anything off of them except for the cleanup tasks they dispatch during shutdown in order to cancel delayed runnables. So I am more interested in why they can't shut down like the rest of the task queues. Regardless, I think being able to create task queues at this point in shutdown is a bug and needs to be fixed.
Assignee
Comment 11•4 years ago
There is a chance that a timer will be cancelled by its holder after the timer thread and all of the timers have been cleaned up. Doing this can result in a hang on the mutex lock, likely because the mutex is now pointing to something else. I'm not completely familiar with the inner workings of our timers so I'm not sure what the best fix is here. I decided to follow a pattern in our threading shutdown routines and flag the timer shutdown so the timers can't attempt to cancel past a certain point where they would be shut down anyway. This is tricky because there are timers created during XPCOM shutdown all the way up until we shut down the timer thread. Checking for the timer thread outside of the lock could also result in a race with timer shutdown. This value is an attempt to address this without breaking timer shutdown.
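A simplified sketch of this flag idea (standard C++ with hypothetical names; the actual patch is against nsTimerImpl.cpp and has to coordinate the check with the timer monitor to avoid the race mentioned above):

```cpp
#include <atomic>
#include <mutex>

// Once timer shutdown has run, cancels become no-ops instead of reaching for
// a timer thread that no longer exists.
std::atomic<bool> gTimersShutDownSketch{false};

struct TimerThreadSketch {
  std::mutex mMonitor;
  void RemoveTimer() { std::lock_guard<std::mutex> lock(mMonitor); /* ... */ }
};

TimerThreadSketch* gThreadSketch = nullptr;

void TimerShutdownSketch() {
  gTimersShutDownSketch = true;  // from here on, cancels become no-ops
  delete gThreadSketch;
  gThreadSketch = nullptr;
}

void CancelSketch() {
  // A cancel arriving after timer shutdown does nothing: the timer is being
  // (or has been) cleaned up anyway, so there is no thread to reach for.
  // Note this sketch glosses over the shutdown race described in the comment.
  if (gTimersShutDownSketch.load()) {
    return;
  }
  if (gThreadSketch) {
    gThreadSketch->RemoveTimer();
  }
}

int main() {
  gThreadSketch = new TimerThreadSketch();
  CancelSketch();        // normal cancel, goes through the timer thread
  TimerShutdownSketch();
  CancelSketch();        // late cancel, safely ignored
}
```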
Assignee
Comment 12•4 years ago
Comment 13•4 years ago
Comment on attachment 9218535 [details]
Bug 1704391 - Disallow creation of new TaskQueues after we have begun threads shutdown
Revision D113449 was moved to bug 1707924. Setting attachment 9218535 [details] to obsolete.
Assignee
Comment 14•4 years ago
(In reply to Kris Wright :KrisWright from comment #10)
> I played with this some more and I was a bit off. The problem is only with task queues created after gXPCOMThreadsShutDown, and you don't have to dispatch anything off of them except for the cleanup tasks they dispatch during shutdown in order to cancel delayed runnables. So I am more interested in why they can't shut down like the rest of the task queues. Regardless, I think being able to create task queues at this point in shutdown is a bug and needs to be fixed.
After some discussion we came to the conclusion that the TaskQueue hang, while it looked a bit similar to me at first glance, is unrelated to the timer access after shutdown. I'll spin it off into its own bug (bug 1707924) and will leave the attempts to prevent timer issues here.
Reporter
Comment 15•4 years ago
Landed: https://hg.mozilla.org/integration/autoland/rev/1d91ad1c6c6ec1d817f264bd97b094c15e43735d
Backed out for multiple failures in xpcom/threads/nsTimerImpl.cpp:
https://hg.mozilla.org/integration/autoland/rev/eae6689dd31910794b188d2553a268a5f58fcb7b
Logs:
https://treeherder.mozilla.org/logviewer?job_id=338256027&repo=autoland&lineNumber=2784
https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland
Assignee
Comment 16•4 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #15)
> Landed: https://hg.mozilla.org/integration/autoland/rev/1d91ad1c6c6ec1d817f264bd97b094c15e43735d
> Backed out for multiple failures in xpcom/threads/nsTimerImpl.cpp:
> https://hg.mozilla.org/integration/autoland/rev/eae6689dd31910794b188d2553a268a5f58fcb7b
> Logs:
> https://treeherder.mozilla.org/logviewer?job_id=338256027&repo=autoland&lineNumber=2784
> https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland
Looks like this is happening more frequently than I thought based on https://treeherder.mozilla.org/logviewer?job_id=338255869&repo=autoland&lineNumber=2115 which is pretty good news for debugging.
These are mostly coming out of the same code (the message pump) and it has no way to know if a timer still exists. Probably still coming from an escaped delayed runnable, though I wish I knew what code was dispatching it. We will probably have to rework the message loop somewhat.
Assignee
Comment 17•4 years ago
I've played with this a bit more and reproduced this assertion failure in lieu of a UAF a few times. It looks like whatever is using this delayed-work timer schedules work before we cancel our delayed runnables, but the timer persists after timer shutdown. Maybe MessagePump needs its own way to exit during timer shutdown.
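A rough sketch of that idea, assuming a timers-shut-down flag the pump could consult (standard C++ with hypothetical names, not the real MessagePump API):

```cpp
#include <atomic>
#include <chrono>
#include <iostream>

// Give the message pump its own exit condition so it stops (re)arming its
// delayed-work timer once timer shutdown has happened.
std::atomic<bool> gTimersShutDownSketch{false};

void ScheduleDelayedWorkSketch(std::chrono::milliseconds delay) {
  if (gTimersShutDownSketch.load()) {
    // Timer machinery is gone; arming a timer now would leave a stale timer
    // behind (the assertion/UAF pattern seen in the backout logs).
    std::cout << "skipping delayed work after timer shutdown\n";
    return;
  }
  std::cout << "arming delayed-work timer for " << delay.count() << " ms\n";
}

int main() {
  ScheduleDelayedWorkSketch(std::chrono::milliseconds(16));
  gTimersShutDownSketch = true;
  ScheduleDelayedWorkSketch(std::chrono::milliseconds(16));
}
```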
Comment 18•4 years ago
Regressing bug backed out of beta for 89: https://hg.mozilla.org/releases/mozilla-beta/rev/21a23370449e01dcb4388f99bcd3778923db3def
Comment 19•4 years ago
Tracking as a blocker for 90 to make sure we have a fix or backout before June.
Updated•4 years ago
Assignee
Comment 20•4 years ago
It looks like this data race might be involved after all. I think the last of the DelayedRunnables to finish could possibly overlap with timer shutdown, which increases the frequency of hangs of this type.
Comment 21•3 years ago
Regressor backed out of beta again for 90:
https://hg.mozilla.org/releases/mozilla-beta/rev/3dabe43c4982a4759af9195a9eaf02d1062304eb
Comment 22•3 years ago
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:KrisWright, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee
Comment 23•3 years ago
(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #22)
> There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
> :KrisWright, could you have a look please?
> For more information, please visit auto_nag documentation.
This patch might be made obsolete if other patches sufficiently reduce crash volume.
Comment 24•3 years ago
No crashes so far in beta 91, so that's a good sign.
Comment 25•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #24)
> No crashes so far in beta 91, so that's a good sign.

Am I reading bug 1695580 right that the regressor is still backed out? Maybe we should decide on that bug whether there is a path forward for that change, and re-open it?
Comment 26•3 years ago
Bug 1695580 was backed out of beta 89 and 90, as far as I can tell. It's still in 91+.
Comment 27•3 years ago
Oh, I just referred to the status "Accepted" on all patches in that bug, which confuses me.
Assignee
Comment 28•3 years ago
Bug 1683404 fixed the existing shutdown race that DelayedRunnable was tripping more frequently. I checked beta and AFAICT we carried over the DelayedRunnable shutdown changes into 91.
Updated•3 years ago
Comment 29•3 years ago
I am not seeing any crashes in 91 with this signature, and it is a sec-moderate, so no need to keep it marked as blocking the release. I see one crash in 92 nightly.
Updated•3 years ago
Comment 30•3 years ago
AFAICT, the crash rate now is what it was prior to the regression back in April. Calling this fixed by bug 1683404.
Updated•2 years ago