OK, so I tracked down what's going on with the gtest failures, at least; I managed to reproduce them under rr on Linux with chaos mode. I expect the other failures are similar, though I haven't managed to reproduce them locally yet.
We're running the test at https://searchfox.org/mozilla-central/rev/696cf3b52a9aa04aa5013008a8b448904a298356/xpcom/tests/gtest/TestThreads.cpp#99-126 which basically creates 50 threads, joins them all, repeats that 1000 times. We hang partway through the test.
That happens because we are running on the main thread in the test and attempt to join an
nsThread by calling
Shutdown on it. That posts a runnable to that thread, which shuts down and posts a runnable back to us. We, in the meantime, block while waiting for that runnable via the
SpinEventLoopUntil call under
nsThread::Shutdown. In our event loop spin we call into
PrioritizedEventQueue::GetEvent and if it returns null we sleep until an event appears. In this case it returns null, we sleep, and never wake up.
And that happens because we call
GetEvent and discover that there is no event queued so far. So we go to run this code:
and while we have the mutex unlocked is exactly when the shutting-down thread queues its
nsThreadShutdownAckEvent on us. So we return null and sleep waiting for more events to appear, but that never happens. So yes, the
MutexAutoUnlock thing is just fishy...
Without my changes this does not fail because the code before my changes looks like this:
ClearIdleToken can end up doing
MutexAutoUnlock in the code without my changes, but
ClearIdleToken only does it if we have
mIdleRequestId, and that only happens in content processes, while this test is running in a parent process. And
EnsureIsPaused only does the unlock if
mActive is true (which in this case it's not) and
mIdleScheduler is non-null (but in this case it's null, of course). So without my changes this test can never reach the
Of course in an actual Firefox run, if a content process ever joins a thread while not in shutdown, it could in fact hit that exact hang if nothing else posts any other runnables to the main-thread event queue. Which seems like a pretty unlikely situation....
Anyway, in terms of fixing this, apart from getting rid from the
MutexAutoUnlock bits completely, which is the long-term plan what I could do for now is replace all the "proof of unlock" arguments with a
Mutex& argument and doing the
MutexAutoUnlock calls within IdlePeriodState in places that exactly correspond to the current
MutexAutoUnlock calls. That will at least restore the status quo.
I pushed https://treeherder.mozilla.org/#/jobs?repo=try&revision=204d00c2281d701a578d233dfc55b2764b2ce426 with a patch to that effect, for the moment. If that generally sounds reasonable, I'll put it up in phabricator...