Closed Bug 1037294 Opened 10 years ago Closed 10 years ago

Intermittent B2G debug mochitest "MOZ_Assert( 740): Assertion failure: !NS_HasPendingEvents(currentThread), at ../../../gecko/dom/workers/WorkerPrivate.cpp:4531"

Categories

(Core :: DOM: Workers, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34
Tracking Status
firefox32 --- unaffected
firefox33 --- unaffected
firefox34 --- fixed
firefox-esr24 --- unaffected
firefox-esr31 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- fixed

People

(Reporter: RyanVM, Assigned: bkelly)

References

Details

(Keywords: assertion, crash, intermittent-failure)

Attachments

(1 file, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=43561970&tree=Mozilla-Inbound

b2g_emulator_vm mozilla-inbound debug test mochitest-debug-11 on 2014-07-10 16:47:24 PDT for push e1a037c085d1
slave: tst-linux64-spot-673

17:29:31     INFO -  07-11 00:27:41.020 I/GeckoDump(  740): 732 INFO TEST-START | /tests/dom/workers/test/test_bug1002702.html
17:29:31     INFO -  07-11 00:27:41.320 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:41.600 I/Gecko   (  740): ++DOMWINDOW == 18 (0x440952b0) [pid = 740] [serial = 649] [outer = 0x440933e0]
17:29:31     INFO -  07-11 00:27:41.850 I/Gecko   (  740): [Child 740] WARNING: Failed to retarget HTML data delivery to the parser thread.: file ../../../gecko/parser/html/nsHtml5StreamParser.cpp, line 947
17:29:31     INFO -  07-11 00:27:41.920 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:43.290 I/Gecko   (  740): [Child 740] WARNING: Called close() before start()!: 'mStarted', file ../../../gecko/dom/workers/MessagePort.cpp, line 214
17:29:31     INFO -  07-11 00:27:46.210 I/Gecko   (  740): --DOMWINDOW == 17 (0x44093b20) [pid = 740] [serial = 637] [outer = 0x0] [url = http://mochi.test:8888/tests/dom/workers/test/serviceworkers/test_installation_simple.html]
17:29:31     INFO -  07-11 00:27:46.210 I/Gecko   (  740): --DOMWINDOW == 16 (0x44092ca0) [pid = 740] [serial = 635] [outer = 0x0] [url = http://mochi.test:8888/tests/dom/workers/test/serviceworkers/test_install_event.html]
17:29:31     INFO -  07-11 00:27:46.240 E/GeckoConsole(  740): [JavaScript Warning: "The character encoding of a framed document was not declared. The document may appear different if viewed without the document framing it." {file: "http://mochi.test:8888/tests/dom/workers/test/test_bug1002702.html" line: 0}]
17:29:31     INFO -  07-11 00:27:46.410 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:46.442 I/Gecko   (  740): [Child 740] WARNING: Unable to printf the requested string due to error.: file ../../../gecko/ipc/chromium/src/base/string_util.cc, line 427
17:29:31     INFO -  07-11 00:27:46.890 I/GeckoDump(  740): 733 INFO TEST-INFO | MEMORY STAT vsize after test: 136790016
17:29:31     INFO -  07-11 00:27:46.931 I/GeckoDump(  740): 734 INFO TEST-INFO | MEMORY STAT residentFast after test: 67993600
17:29:31     INFO -  07-11 00:27:46.970 I/GeckoDump(  740): 735 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 19745332
17:29:31     INFO -  07-11 00:27:47.030 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:47.420 F/MOZ_Assert(  740): Assertion failure: !NS_HasPendingEvents(currentThread), at ../../../gecko/dom/workers/WorkerPrivate.cpp:4531
No stack, but only a couple ways to get to this assert.  Seems very likely caused by extra messaging added by bug 1013571.
Blocks: 1013571
Will look at this after I return from PTO.  Assigning now so I don't forget.  Also, I may do some try debugging when I have time this week since it takes so long.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Got it to trigger in this try:

  https://tbpl.mozilla.org/?tree=Try&rev=5b48de1b8921

We seem to hit the assert when we schedule WorkerPrivate::NotifyInternal() to run on the worker thread from the finalizer via NotifyPrivate().
I think this is probably enough for me to force the race in the windows debugger when I get back next week.  I'm doing a few more triggers to try to reproduce again to verify it happens in the same place.  That first reproduction was at about 1-in-50 triggers or 2% repro rate.
Finally got this to reproduce again as described in comment 6.  It took ~200 triggers for 2 total reproductions.
I haven't been able to force this in the debugger yet, but my current theory is something like this:

1) Worker thread runs WorkerThreadPrimaryRunnable.
2) Worker thread enters event spin loop in SynchronouslyCreatePBackground()
3) Main thread cycle collector delete SharedWorker triggering scheduling of worker thread notification runnable.
4) Worker thread executes cancelation process from runnable, including ClearMainEventQueue().
5) PBackground actor creation completes on main thread while worker thread is in the middle of ClearMainEventQueue().

This would essentially place a new event in the queue between these two lines, triggering the assert:

  MOZ_ASSERT(mCancelAllPendingRunnables);
  mCancelAllPendingRunnables = false;
Theory in comment 14 is based on the fact that in this test I consistently see the worker thread in SynchronouslyCreatePBackground() event loop when the ~SharedWorker() occurs on the main thread.
Hmm, on second thought I don't think that can happen.  The worker status isn't set to Running until WorkerPrivate::DoRunLoop() is entered, so ClearMainEventQueue() would not get called from SynchronouslyCreatePBackground().
Perhaps this is related to the TestPBackground that occurs between creating PBackground and starting DoRunLoop().  I believe this protocol is asynchronous, so perhaps the response is coming in just as we cancel the loop.
Here is a possible fix.

It strikes me that these two lines will always be racy in the presence of an IPC actor that can receive messages at any time:

    NS_ProcessPendingEvents(currentThread);
    MOZ_ASSERT(!NS_HasPendingEvents(currentThread));

Those lines are in WorkerPrivate::ClearMainEventQueue() and get called when the worker is closed, canceled, or killed.

This patch simply shuts down the PBackground actor when we determine we must clear the event queue.

The problem with this, though, is that the content script may still be running.  For example, they could do something like this:

  close();
  FetchStore.delete('mystore');

In this case the FetchStore (SW cache) would not be able to function because PBackground actor was destroyed in close().

Another alternative is to just drop the assertion because its racy with IPC in the mix.

Any thoughts Ben?

Here's a try to see how this patch works:

  https://tbpl.mozilla.org/?tree=Try&rev=07a85bbfdee1
Flags: needinfo?(bent.mozilla)
Hmm, clearly that particular patch is flawed.  Lots of errors like in bug 1040288, although its not clear if its the exact same circumstances.  I'll have to investigate further.
A full try shows that all the other platforms are successful.  Just b2g emulator hits bug 1040288.

  https://tbpl.mozilla.org/?tree=Try&rev=7d4e9373260b

It would be quite fortunate if this allows me to solve bug 1040288 more easily.
The errors from the try build in comment 18 are do to trying to run ProcessLink::OnChannelConnected() after we close the actor.  This causes AssertWorkerThread to fail because the message loop ID has been cleared.

I think the solution here is to simply not perform OnChannelConnected() if the channel has already closed.
(In reply to Ben Kelly [:bkelly] from comment #21)
> I think the solution here is to simply not perform OnChannelConnected() if
> the channel has already closed.

So I tried this, but in general closing the agent sooner created many more problems than it solved.

While it seems a bit of a cop-out, I am leaning towards relaxing the assert instead.  It just does not seem valid with IPC enabled on the thread.
I spoke with Ben on IRC this morning.  He pointed out that ClearMainEventQueue() is called in two places:

1) When we first transition out of the Running state.
2) When ScheduleDeletion() is executed.

We only really need to assert that the event queue is empty in the second case.  So our plan is to move the assertion into ScheduleDeletion().
Flags: needinfo?(bent.mozilla)
Comment on attachment 8463468 [details] [diff] [review]
Move empty queue assert to ScheduleDeletion() as its no longer always valid in ClearMainEventQueue() (v0)

Try has 60+ triggers so far without a failure on B2G emulator debug M11.
Attachment #8463468 - Flags: review?(bent.mozilla)
Attachment #8463468 - Flags: review?(bent.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/0dade0bad72a
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: