Closed Bug 1434856 Opened 2 years ago Closed 2 years ago

Firefox hang @ __psynch_mutexwait

(Core :: XPCOM, defect, P1)

Tracking Status
firefox-esr52 --- unaffected
firefox58 - wontfix
firefox59 + fixed
firefox60 + fixed

(Reporter: miker, Assigned: froydnj)

(Keywords: hang, regression)

(2 files)
Firefox was hanging so I sent SIGABRT producing this crash report:

I honestly can't give you STR because the hangs appear to be completely random.
Hey Nathan, do you have any ideas on this one?
Flags: needinfo?(nfroyd)
Product: Firefox → Core
[Tracking Requested - why for this release]: A randomly-occurring deadlock is no fun for users.

Normally for hangs in a crash report, you look at all the threads to figure out which ones are *not* hanging.  Usually those are the ones not in __psynch_cvwait (that is, waiting on a condition variable): the exact function will depend on the OS in the crash report.  Once you figure out what threads aren't hanging, it's usually pretty obvious what's going on, because there's usually only one or two threads, and the call stacks will provide a good clue as to what's happening.

With this crash report, though, you have a lot of threads not hanging, and all of the threads seem to be going through _pthread_mutex_lock_slow / __psynch_mutexwait.  The slow path for mutex locking is usually when you have a contended mutex, and I'm going to assume that all of the threads hitting the slow path are actually contending for the same mutex.  I'm going to further assume that the mutex in question is the mutex guarding the main thread's event queue, since the main thread appears to be stuck dispatching events to its own queue.  (It's certainly possible that several threads are actually touching different mutexes, but I don't think that changes the analysis much.)

But beyond that...?  We have:

1. The main thread (Thread 0);
2. The IPC I/O thread (I think, Thread 2);
3. The timer thread (Thread 3);
4. A socket transport service thread (Thread 4);
5. An (the?) OMTC thread (Thread 23);
6. Several thread pool threads, presumably all the same thread pool (?) (Threads 71, 72, 73).

all hammering the main thread.  What could they possibly be doing?

Oh, oh, look at the main thread's stack.  We have:

24 	XUL 	nsCOMPtr_base::assign_from_qi(nsQueryInterface, nsID const&) 	xpcom/base/nsCOMPtr.cpp:14
25 	XUL 	mozilla::PrioritizedEventQueue<mozilla::EventQueue>::PutEvent(already_AddRefed<nsIRunnable>&&, mozilla::EventPriority, mozilla::BaseAutoLock<mozilla::Mutex> const&) 	xpcom/base/nsCOMPtr.h:541
26 	XUL 	mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::PutEventInternal(already_AddRefed<nsIRunnable>&&, mozilla::EventPriority, mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::NestedSink*) 	xpcom/threads/ThreadEventQueue.cpp:99
27 	XUL 	<name omitted> 	xpcom/threads/ThreadEventTarget.cpp:174
28 	XUL 	nsThread::DispatchFromScript(nsIRunnable*, unsigned int) 	xpcom/threads/nsThread.cpp:632
40 	XUL 	nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJSClass.cpp:1317
41 	XUL 	PrepareAndDispatch 	xpcom/reflect/xptcall/md/unix/xptcstubs_x86_64_darwin.cpp:122
42 	XUL 	SharedStub 	
43 	XUL 	nsUrlClassifierLookupCallback::LookupComplete(nsTArray<mozilla::safebrowsing::LookupResult>*) 	toolkit/components/url-classifier/nsUrlClassifierDBService.cpp:1125
44 	XUL 	UrlClassifierLookupCallbackProxy::LookupCompleteRunnable::Run() 	toolkit/components/url-classifier/nsUrlClassifierProxies.cpp:300

So we have the url-classifier signaling that lookup is done, and calling through xpconnect to some sort of observer.  That observer is implemented in JS and winds up dispatching an event to the main thread (that's frame #28, nsThread::DispatchFromScript).  So we're going to lock the mutex for the main thread's event queue.  But we need to QI inside of PrioritizedEventQueue::PutEvent, probably here:

The runnable we're dispatching (and calling QI on) is implemented in JS.  So we have to go through xpconnect machinery to do the QI, and somewhere in there, the JS engine decides it's a good time to do a GC:

 3 	XUL 	mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::PutEventInternal(already_AddRefed<nsIRunnable>&&, mozilla::EventPriority, mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::NestedSink*) 	xpcom/threads/Mutex.h:65
4 	XUL 	nsThread::IdleDispatch(already_AddRefed<nsIRunnable>) 	xpcom/threads/nsThread.cpp:825
5 	XUL 	NS_IdleDispatchToThread(already_AddRefed<nsIRunnable>&&, unsigned int, nsIThread*) 	xpcom/threads/nsThreadUtils.cpp:310
6 	XUL 	XPCJSRuntime::DispatchDeferredDeletion(bool, bool) 	js/xpconnect/src/XPCJSRuntime.cpp:138
7 	XUL 	DOMGCSliceCallback(JSContext*, JS::GCProgress, JS::GCDescription const&) 	dom/base/nsJSEnvironment.cpp:2406
8 	XUL 	js::gcstats::Statistics::endSlice() 	js/src/gc/Statistics.cpp:1091
9 	XUL 	js::gc::GCRuntime::gcCycle(bool, js::SliceBudget&, JS::gcreason::Reason) 	js/src/gc/Statistics.h:408
10 	XUL 	js::gc::GCRuntime::collect(bool, js::SliceBudget, JS::gcreason::Reason) 	js/src/jsgc.cpp:7460
11 	XUL 	js::gc::GCRuntime::startGC(JSGCInvocationKind, JS::gcreason::Reason, long long) 	js/src/jsgc.cpp:7542

Notice frame #3 here: we're dispatching an event to the main thread's event queue...but we've already locked the main thread's event queue some 15 frames earlier.  We're therefore trying to reacquire the lock the main thread is already holding, which is a deadlock.  That also conveniently explains why all the other threads are stuck in the slow path to acquire the main thread's lock for the main thread's event queue: the main thread is holding it waaaaaay longer than it ought to, because it can't release the lock.

Assuming that's what's going on, what should we do here?  We're going to have to figure out how to do that QI *not* under the lock, I think, and we're going to have to figure out how to do it without slowing down event dispatch too much.
Component: General → XPCOM
Flags: needinfo?(nfroyd)
Priority: -- → P1
Mike, is it possible for you to attach a debugger to the hanging process the next time this happens, so we can verify that comment 2 is what's happening?
Flags: needinfo?(mratcliffe)
Otherwise, we might enter JS, decide to GC, and deadlock because we were
trying to dispatch tasks to the main thread's event queue while holding
the lock for the event queue.

The only sad part about this is that all subclasses of SynchronizedEventQueue
need to do the nsIRunnablePriority dance.  Fortunately, we don't have that
many, and the only other subclass, SchedulerEventQueue, deals with priorities
in a different way (or not at all?).
Attachment #8947558 - Flags: review?(erahm)
[Tracking Requested - why for this release]: See comment 2; it's possible this could be a ride-along for a chemspill or something.  It'd be hard to get data on how often this is happening in the field, because we don't actually crash and therefore we don't have any visibility into how often it happens.  But it'd be super-annoying for our users...
Assignee: nobody → nfroyd
Keywords: regression
Comment on attachment 8947558 [details] [diff] [review]
move runnable prioritization checks outside of event queue locks

Review of attachment 8947558 [details] [diff] [review]:

Nice fix! Small nits, but nothing major.

I wonder if we could add an xpcshell test for this. Something like: create a JS runnable that triggers a GC from its QI implementation (if we have those privileges), post the runnable, make sure everything runs okay. I'm not sure that's even feasible though.

::: xpcom/threads/PrioritizedEventQueue.cpp
@@ +36,5 @@
>                                               const MutexAutoLock& aProofOfLock)
>  {
>    // Double check the priority with a QI.
>    RefPtr<nsIRunnable> event(aEvent);
>    EventPriority priority = aPriority;

I guess we could get rid of this, but you'd have to change everything below too so I'm kind of meh on it.

::: xpcom/threads/ThreadEventQueue.cpp
@@ +82,5 @@
>    LeakRefPtr<nsIRunnable> event(Move(aEvent));
>    nsCOMPtr<nsIThreadObserver> obs;
>    {
> +    // Double check the priority with a QI.  Do this outside the lock, so

Not your wording, but might be nice to rephrase this, something like: "Check if we should override the priority with a QI"
Attachment #8947558 - Flags: review?(erahm) → review+
(In reply to Nathan Froyd [:froydnj] from comment #3)
> Mike, is it possible for you to attach a debugger to the hanging process the
> next time this happens, so we can verify that comment 2 is what's happening?

I see that there is already a patch but I am more than happy to attach a debugger next time this happens... I am assuming you need a full stack trace.
Flags: needinfo?(mratcliffe)
(In reply to Mike Ratcliffe [:miker] [:mratcliffe] [:mikeratcliffe] from comment #8)
> (In reply to Nathan Froyd [:froydnj] from comment #3)
> > Mike, is it possible for you to attach a debugger to the hanging process the
> > next time this happens, so we can verify that comment 2 is what's happening?
> I see that there is already a patch but I am more than happy to attach a
> debugger next time this happens... I am assuming you need a full stack trace.

A stack trace from all active threads would be nice (it might be higher quality than the stacks from the crash report), but the really interesting thing would be to confirm that:

a) all the threads involved are trying to acquire the same mutex; and
b) that mutex is the mutex for the main thread's event queue; and
c) that mutex was already locked by the main thread,

just to confirm the hypothesis from comment 2.  If the threads were trying to lock different mutexes, that is also interesting information.

Eric's suggestion of a test in comment 7 is a good idea, though; I'm going to try that and see if it reproduces the hang.
Tracking for 59/60 - sounds like we would want to verify this once it lands and uplift to at least beta 59.
This is a test, as Eric suggested, that attempts to dispatch an event during a
JS-implemented QI.  We could have used Cu.schedulePreciseGC here, but using a
plain event was simpler, and keeps the specifics of the GC out of the picture
while fulfilling the spirit of the original bug.

Eric r+'d this over IRC.  The test deadlocks without the earlier patch in this bug, and succeeds with it.  I'll be committing the test along with the patch, but am recording it here for posterity.
Attachment #8947937 - Flags: review+
Pushed by
move runnable prioritization checks outside of event queue locks; r=erahm
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Comment on attachment 8947558 [details] [diff] [review]
move runnable prioritization checks outside of event queue locks

[This approval request is for both patches in the bug; they landed as a single commit.]

Approval Request Comment
[Feature/Bug causing the regression]: Quantum DOM-related refactorings.
[User impact if declined]: Random lockups of the Firefox process.
[Is this code covered by automated tests?]: Yes, a test was added in this bug.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: Small amount of code moving around, and we added a test to demonstrate the fix works.
[String changes made/needed]: None.
Attachment #8947558 - Flags: approval-mozilla-beta?
Flags: in-testsuite+
Comment on attachment 8947558 [details] [diff] [review]
move runnable prioritization checks outside of event queue locks

Let's get this on Beta for 59b7 so we can get some feedback before later in the week when we're deciding on 58.0.2 ride-alongs.
Attachment #8947558 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Ritu, fyi this might turn out to be good for a future 58 dot release.
Flags: needinfo?(rkothari)
The fix doesn't look entirely risk free and it hadn't stabilized enough on beta yet so we decided to push 58.0.2 without it.
Flags: needinfo?(rkothari)
You need to log in before you can comment on or make changes to this bug.