No stack, but only a couple ways to get to this assert. Seems very likely caused by extra messaging added by bug 1013571.
Will look at this after I return from PTO. Assigning now so I don't forget. Also, I may do some try debugging when I have time this week since it takes so long.
Got it to trigger in this try: https://tbpl.mozilla.org/?tree=Try&rev=5b48de1b8921 We seem to hit the assert when we schedule WorkerPrivate::NotifyInternal() to run on the worker thread from the finalizer via NotifyPrivate().
I think this is probably enough for me to force the race in the windows debugger when I get back next week. I'm doing a few more triggers to try to reproduce again to verify it happens in the same place. That first reproduction was at about 1-in-50 triggers or 2% repro rate.
Finally got this to reproduce again as described in comment 6. It took ~200 triggers for 2 total reproductions.
I haven't been able to force this in the debugger yet, but my current theory is something like this: 1) Worker thread runs WorkerThreadPrimaryRunnable. 2) Worker thread enters event spin loop in SynchronouslyCreatePBackground() 3) Main thread cycle collector delete SharedWorker triggering scheduling of worker thread notification runnable. 4) Worker thread executes cancelation process from runnable, including ClearMainEventQueue(). 5) PBackground actor creation completes on main thread while worker thread is in the middle of ClearMainEventQueue(). This would essentially place a new event in the queue between these two lines, triggering the assert: MOZ_ASSERT(mCancelAllPendingRunnables); mCancelAllPendingRunnables = false;
Theory in comment 14 is based on the fact that in this test I consistently see the worker thread in SynchronouslyCreatePBackground() event loop when the ~SharedWorker() occurs on the main thread.
Hmm, on second thought I don't think that can happen. The worker status isn't set to Running until WorkerPrivate::DoRunLoop() is entered, so ClearMainEventQueue() would not get called from SynchronouslyCreatePBackground().
Perhaps this is related to the TestPBackground that occurs between creating PBackground and starting DoRunLoop(). I believe this protocol is asynchronous, so perhaps the response is coming in just as we cancel the loop.
Created attachment 8461741 [details] [diff] [review] Close PBackground actor immediately when worker leaves Running state. (v0) Here is a possible fix. It strikes me that these two lines will always be racy in the presence of an IPC actor that can receive messages at any time: NS_ProcessPendingEvents(currentThread); MOZ_ASSERT(!NS_HasPendingEvents(currentThread)); Those lines are in WorkerPrivate::ClearMainEventQueue() and get called when the worker is closed, canceled, or killed. This patch simply shuts down the PBackground actor when we determine we must clear the event queue. The problem with this, though, is that the content script may still be running. For example, they could do something like this: close(); FetchStore.delete('mystore'); In this case the FetchStore (SW cache) would not be able to function because PBackground actor was destroyed in close(). Another alternative is to just drop the assertion because its racy with IPC in the mix. Any thoughts Ben? Here's a try to see how this patch works: https://tbpl.mozilla.org/?tree=Try&rev=07a85bbfdee1
Hmm, clearly that particular patch is flawed. Lots of errors like in bug 1040288, although its not clear if its the exact same circumstances. I'll have to investigate further.
A full try shows that all the other platforms are successful. Just b2g emulator hits bug 1040288. https://tbpl.mozilla.org/?tree=Try&rev=7d4e9373260b It would be quite fortunate if this allows me to solve bug 1040288 more easily.
The errors from the try build in comment 18 are do to trying to run ProcessLink::OnChannelConnected() after we close the actor. This causes AssertWorkerThread to fail because the message loop ID has been cleared. I think the solution here is to simply not perform OnChannelConnected() if the channel has already closed.
(In reply to Ben Kelly [:bkelly] from comment #21) > I think the solution here is to simply not perform OnChannelConnected() if > the channel has already closed. So I tried this, but in general closing the agent sooner created many more problems than it solved. While it seems a bit of a cop-out, I am leaning towards relaxing the assert instead. It just does not seem valid with IPC enabled on the thread.
I spoke with Ben on IRC this morning. He pointed out that ClearMainEventQueue() is called in two places: 1) When we first transition out of the Running state. 2) When ScheduleDeletion() is executed. We only really need to assert that the event queue is empty in the second case. So our plan is to move the assertion into ScheduleDeletion().
Created attachment 8463468 [details] [diff] [review] Move empty queue assert to ScheduleDeletion() as its no longer always valid in ClearMainEventQueue() (v0) Here is the patch that moves the assertion. https://tbpl.mozilla.org/?tree=Try&rev=e8cb5f86cd3e
Comment on attachment 8463468 [details] [diff] [review] Move empty queue assert to ScheduleDeletion() as its no longer always valid in ClearMainEventQueue() (v0) Try has 60+ triggers so far without a failure on B2G emulator debug M11.