Intermittent B2G debug mochitest "MOZ_Assert( 740): Assertion failure: !NS_HasPendingEvents(currentThread), at ../../../gecko/dom/workers/WorkerPrivate.cpp:4531"

RESOLVED FIXED in Firefox 34, Firefox OS v2.1

Status

()

Core
DOM: Workers
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: RyanVM, Assigned: bkelly)

Tracking

({assertion, crash, intermittent-failure})

Trunk
mozilla34
ARM
Gonk (Firefox OS)
assertion, crash, intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox32 unaffected, firefox33 unaffected, firefox34 fixed, firefox-esr24 unaffected, firefox-esr31 unaffected, b2g-v2.0 unaffected, b2g-v2.1 fixed)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=43561970&tree=Mozilla-Inbound

b2g_emulator_vm mozilla-inbound debug test mochitest-debug-11 on 2014-07-10 16:47:24 PDT for push e1a037c085d1
slave: tst-linux64-spot-673

17:29:31     INFO -  07-11 00:27:41.020 I/GeckoDump(  740): 732 INFO TEST-START | /tests/dom/workers/test/test_bug1002702.html
17:29:31     INFO -  07-11 00:27:41.320 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:41.600 I/Gecko   (  740): ++DOMWINDOW == 18 (0x440952b0) [pid = 740] [serial = 649] [outer = 0x440933e0]
17:29:31     INFO -  07-11 00:27:41.850 I/Gecko   (  740): [Child 740] WARNING: Failed to retarget HTML data delivery to the parser thread.: file ../../../gecko/parser/html/nsHtml5StreamParser.cpp, line 947
17:29:31     INFO -  07-11 00:27:41.920 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:43.290 I/Gecko   (  740): [Child 740] WARNING: Called close() before start()!: 'mStarted', file ../../../gecko/dom/workers/MessagePort.cpp, line 214
17:29:31     INFO -  07-11 00:27:46.210 I/Gecko   (  740): --DOMWINDOW == 17 (0x44093b20) [pid = 740] [serial = 637] [outer = 0x0] [url = http://mochi.test:8888/tests/dom/workers/test/serviceworkers/test_installation_simple.html]
17:29:31     INFO -  07-11 00:27:46.210 I/Gecko   (  740): --DOMWINDOW == 16 (0x44092ca0) [pid = 740] [serial = 635] [outer = 0x0] [url = http://mochi.test:8888/tests/dom/workers/test/serviceworkers/test_install_event.html]
17:29:31     INFO -  07-11 00:27:46.240 E/GeckoConsole(  740): [JavaScript Warning: "The character encoding of a framed document was not declared. The document may appear different if viewed without the document framing it." {file: "http://mochi.test:8888/tests/dom/workers/test/test_bug1002702.html" line: 0}]
17:29:31     INFO -  07-11 00:27:46.410 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:46.442 I/Gecko   (  740): [Child 740] WARNING: Unable to printf the requested string due to error.: file ../../../gecko/ipc/chromium/src/base/string_util.cc, line 427
17:29:31     INFO -  07-11 00:27:46.890 I/GeckoDump(  740): 733 INFO TEST-INFO | MEMORY STAT vsize after test: 136790016
17:29:31     INFO -  07-11 00:27:46.931 I/GeckoDump(  740): 734 INFO TEST-INFO | MEMORY STAT residentFast after test: 67993600
17:29:31     INFO -  07-11 00:27:46.970 I/GeckoDump(  740): 735 INFO TEST-INFO | MEMORY STAT heapAllocated after test: 19745332
17:29:31     INFO -  07-11 00:27:47.030 I/Gecko   (  740): [Child 740] WARNING: Transparent content with displayports can be expensive.: file ../../../gecko/layout/base/nsDisplayList.cpp, line 1371
17:29:31     INFO -  07-11 00:27:47.420 F/MOZ_Assert(  740): Assertion failure: !NS_HasPendingEvents(currentThread), at ../../../gecko/dom/workers/WorkerPrivate.cpp:4531
(Assignee)

Comment 1

3 years ago
No stack, but only a couple ways to get to this assert.  Seems very likely caused by extra messaging added by bug 1013571.
Blocks: 1013571
(Reporter)

Comment 2

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=43663335&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43667221&tree=Mozilla-Inbound
(Reporter)

Comment 3

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=43737373&tree=B2g-Inbound
(Reporter)

Comment 4

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=43741230&tree=B2g-Inbound
(Assignee)

Comment 5

3 years ago
Will look at this after I return from PTO.  Assigning now so I don't forget.  Also, I may do some try debugging when I have time this week since it takes so long.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
(Assignee)

Comment 6

3 years ago
Got it to trigger in this try:

  https://tbpl.mozilla.org/?tree=Try&rev=5b48de1b8921

We seem to hit the assert when we schedule WorkerPrivate::NotifyInternal() to run on the worker thread from the finalizer via NotifyPrivate().
(Assignee)

Comment 7

3 years ago
I think this is probably enough for me to force the race in the windows debugger when I get back next week.  I'm doing a few more triggers to try to reproduce again to verify it happens in the same place.  That first reproduction was at about 1-in-50 triggers or 2% repro rate.
(Assignee)

Comment 8

3 years ago
Finally got this to reproduce again as described in comment 6.  It took ~200 triggers for 2 total reproductions.
(Reporter)

Comment 9

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=44077367&tree=B2g-Inbound
(Reporter)

Comment 10

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=44138347&tree=Mozilla-Central
(Reporter)

Comment 11

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=44189178&tree=B2g-Inbound
(Reporter)

Comment 12

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=44235732&tree=Mozilla-Inbound
(Reporter)

Comment 13

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=44368471&tree=Fx-Team
(Assignee)

Comment 14

3 years ago
I haven't been able to force this in the debugger yet, but my current theory is something like this:

1) Worker thread runs WorkerThreadPrimaryRunnable.
2) Worker thread enters event spin loop in SynchronouslyCreatePBackground()
3) Main thread cycle collector delete SharedWorker triggering scheduling of worker thread notification runnable.
4) Worker thread executes cancelation process from runnable, including ClearMainEventQueue().
5) PBackground actor creation completes on main thread while worker thread is in the middle of ClearMainEventQueue().

This would essentially place a new event in the queue between these two lines, triggering the assert:

  MOZ_ASSERT(mCancelAllPendingRunnables);
  mCancelAllPendingRunnables = false;
(Assignee)

Comment 15

3 years ago
Theory in comment 14 is based on the fact that in this test I consistently see the worker thread in SynchronouslyCreatePBackground() event loop when the ~SharedWorker() occurs on the main thread.
(Assignee)

Comment 16

3 years ago
Hmm, on second thought I don't think that can happen.  The worker status isn't set to Running until WorkerPrivate::DoRunLoop() is entered, so ClearMainEventQueue() would not get called from SynchronouslyCreatePBackground().
(Assignee)

Comment 17

3 years ago
Perhaps this is related to the TestPBackground that occurs between creating PBackground and starting DoRunLoop().  I believe this protocol is asynchronous, so perhaps the response is coming in just as we cancel the loop.
(Assignee)

Comment 18

3 years ago
Created attachment 8461741 [details] [diff] [review]
Close PBackground actor immediately when worker leaves Running state. (v0)

Here is a possible fix.

It strikes me that these two lines will always be racy in the presence of an IPC actor that can receive messages at any time:

    NS_ProcessPendingEvents(currentThread);
    MOZ_ASSERT(!NS_HasPendingEvents(currentThread));

Those lines are in WorkerPrivate::ClearMainEventQueue() and get called when the worker is closed, canceled, or killed.

This patch simply shuts down the PBackground actor when we determine we must clear the event queue.

The problem with this, though, is that the content script may still be running.  For example, they could do something like this:

  close();
  FetchStore.delete('mystore');

In this case the FetchStore (SW cache) would not be able to function because PBackground actor was destroyed in close().

Another alternative is to just drop the assertion because its racy with IPC in the mix.

Any thoughts Ben?

Here's a try to see how this patch works:

  https://tbpl.mozilla.org/?tree=Try&rev=07a85bbfdee1
Flags: needinfo?(bent.mozilla)
(Assignee)

Comment 19

3 years ago
Hmm, clearly that particular patch is flawed.  Lots of errors like in bug 1040288, although its not clear if its the exact same circumstances.  I'll have to investigate further.
(Assignee)

Comment 20

3 years ago
A full try shows that all the other platforms are successful.  Just b2g emulator hits bug 1040288.

  https://tbpl.mozilla.org/?tree=Try&rev=7d4e9373260b

It would be quite fortunate if this allows me to solve bug 1040288 more easily.
(Assignee)

Comment 21

3 years ago
The errors from the try build in comment 18 are do to trying to run ProcessLink::OnChannelConnected() after we close the actor.  This causes AssertWorkerThread to fail because the message loop ID has been cleared.

I think the solution here is to simply not perform OnChannelConnected() if the channel has already closed.
(Assignee)

Comment 22

3 years ago
(In reply to Ben Kelly [:bkelly] from comment #21)
> I think the solution here is to simply not perform OnChannelConnected() if
> the channel has already closed.

So I tried this, but in general closing the agent sooner created many more problems than it solved.

While it seems a bit of a cop-out, I am leaning towards relaxing the assert instead.  It just does not seem valid with IPC enabled on the thread.
(Assignee)

Comment 23

3 years ago
I spoke with Ben on IRC this morning.  He pointed out that ClearMainEventQueue() is called in two places:

1) When we first transition out of the Running state.
2) When ScheduleDeletion() is executed.

We only really need to assert that the event queue is empty in the second case.  So our plan is to move the assertion into ScheduleDeletion().
Flags: needinfo?(bent.mozilla)
(Assignee)

Comment 24

3 years ago
Created attachment 8463468 [details] [diff] [review]
Move empty queue assert to ScheduleDeletion() as its no longer always valid in ClearMainEventQueue() (v0)

Here is the patch that moves the assertion.

  https://tbpl.mozilla.org/?tree=Try&rev=e8cb5f86cd3e
Attachment #8461741 - Attachment is obsolete: true
(Assignee)

Comment 25

3 years ago
Comment on attachment 8463468 [details] [diff] [review]
Move empty queue assert to ScheduleDeletion() as its no longer always valid in ClearMainEventQueue() (v0)

Try has 60+ triggers so far without a failure on B2G emulator debug M11.
Attachment #8463468 - Flags: review?(bent.mozilla)
Attachment #8463468 - Flags: review?(bent.mozilla) → review+
(Assignee)

Comment 26

3 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/0dade0bad72a
https://hg.mozilla.org/mozilla-central/rev/0dade0bad72a
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
(Reporter)

Updated

3 years ago
status-b2g-v2.0: --- → unaffected
status-b2g-v2.1: --- → fixed
status-firefox32: --- → unaffected
status-firefox33: --- → unaffected
status-firefox34: --- → fixed
(Reporter)

Updated

3 years ago
status-firefox-esr24: --- → unaffected
status-firefox-esr31: --- → unaffected
You need to log in before you can comment on or make changes to this bug.