Closed
Bug 1267819
Opened 9 years ago
Closed 6 years ago
Intermittent shutdown crashes masquerading as test_videocontrols_video_direction.html | application crashed [@ mozilla::dom::workers::WorkerPrivate::DoRunLoop] (and other tests)
Categories
(Core :: DOM: Workers, defect, P5)
Core
DOM: Workers
Tracking
()
People
(Reporter: RyanVM, Unassigned)
References
()
Details
(Keywords: assertion, crash, intermittent-failure)
Attachments
(2 files, 1 obsolete file)
7.12 KB,
patch
|
baku
:
review+
|
Details | Diff | Splinter Review |
801 bytes,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
Abridged IRC convo about it:
RyanVM: khuey: hey, I see a lot of bugs on file for intermittent [@ mozilla::dom::workers::WorkerPrivate::DoRunLoop] crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=171238&repo=ash
RyanVM: any suggestions for what could help make them actionable?
khuey: the assertion is we expect to have an event but we don't
khuey: RyanVM: does this always seem to happen during shutdown?
RyanVM: khuey: i'd be willing to make that claim from all the logs I'm looking at so far, yes
khuey: RyanVM: if we usually see it during shutdown we could be racing with a control runnable that clears the queue
khuey: RyanVM: can you send me email or needinfo me on a bug to remind me to look at this more?
URL field contains an OrangeFactor link to more failures if more logs are desired.
https://treeherder.mozilla.org/logviewer.html#?job_id=171238&repo=ash
09:17:20 INFO - Assertion failure: (NS_ProcessNextEvent(mThread, false)), at /builds/slave/ash-lx-d-000000000000000000000/build/src/dom/workers/WorkerPrivate.cpp:4603
09:17:20 INFO - #01: mozilla::dom::workers::WorkerPrivate::DoRunLoop [dom/workers/WorkerPrivate.cpp:4603]
09:17:20 INFO - #02: WorkerThreadPrimaryRunnable::Run [dom/workers/RuntimeService.cpp:2693]
09:17:20 INFO - #03: nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:989]
09:17:20 INFO - #04: NS_ProcessNextEvent [xpcom/glue/nsThreadUtils.cpp:290]
09:17:20 INFO - #05: mozilla::ipc::MessagePumpForNonMainThreads::Run [ipc/glue/MessagePump.cpp:341]
09:17:20 INFO - #06: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:230]
09:17:20 INFO - #07: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:505]
09:17:20 INFO - #08: nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:393]
09:17:21 INFO - #09: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219]
09:17:21 INFO - #10: libpthread.so.0 + 0x6d4c
09:17:21 INFO - nsStringStats
09:17:21 INFO - => mAllocCount: 49160
09:17:21 INFO - => mReallocCount: 1751
09:17:21 INFO - => mFreeCount: 49160
09:17:21 INFO - => mShareCount: 66348
09:17:21 INFO - => mAdoptCount: 4025
09:17:21 INFO - => mAdoptFreeCount: 4025
09:17:21 INFO - => Process ID: 4894, Thread ID: 2970646848
09:17:21 INFO - ExceptionHandler::GenerateDump cloned child 4941
09:17:21 INFO - ExceptionHandler::SendContinueSignalToChild sent continue signal to child
09:17:21 INFO - ExceptionHandler::WaitForContinueSignal waiting for continue signal...
09:17:21 INFO - TEST-INFO | Main app process: exit 11
09:17:21 INFO - 4811 INFO TEST-START | Shutdown
09:17:21 INFO - 4812 INFO Passed: 23
09:17:21 INFO - 4813 INFO Failed: 0
09:17:21 INFO - 4814 INFO Todo: 0
09:17:21 INFO - 4815 INFO Mode: e10s
09:17:21 INFO - 4816 INFO Slowest: 9140ms - /tests/toolkit/content/tests/widgets/test_videocontrols_standalone.html
09:17:21 INFO - 4817 INFO SimpleTest FINISHED
09:17:21 INFO - 4818 INFO TEST-INFO | Ran 1 Loops
09:17:21 INFO - 4819 INFO SimpleTest FINISHED
09:17:21 WARNING - TEST-UNEXPECTED-FAIL | toolkit/content/tests/widgets/test_videocontrols_video_direction.html | application terminated with exit code 11
09:17:21 INFO - runtests.py | Application ran for: 0:00:56.644503
09:17:21 INFO - zombiecheck | Reading PID log: /tmp/tmp1prsiNpidlog
09:17:21 INFO - ==> process 4838 launched child process 4894
09:17:21 INFO - zombiecheck | Checking for orphan process with PID: 4894
09:17:21 INFO - mozcrash Copy/paste: /builds/slave/test/build/linux32-minidump_stackwalk /tmp/tmpkgWhy1.mozrunner/minidumps/01218037-d265-e4e1-6331c53c-635bb975.dmp /builds/slave/test/build/symbols
09:17:37 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/01218037-d265-e4e1-6331c53c-635bb975.dmp
09:17:37 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/01218037-d265-e4e1-6331c53c-635bb975.extra
09:17:37 WARNING - PROCESS-CRASH | toolkit/content/tests/widgets/test_videocontrols_video_direction.html | application crashed [@ mozilla::dom::workers::WorkerPrivate::DoRunLoop]
09:17:37 INFO - Crash dump filename: /tmp/tmpkgWhy1.mozrunner/minidumps/01218037-d265-e4e1-6331c53c-635bb975.dmp
09:17:37 INFO - Operating system: Linux
09:17:37 INFO - 0.0.0 Linux 3.2.0-76-generic-pae #111-Ubuntu SMP Tue Jan 13 22:34:29 UTC 2015 i686
09:17:37 INFO - CPU: x86
09:17:37 INFO - GenuineIntel family 6 model 62 stepping 4
09:17:37 INFO - 1 CPU
09:17:37 INFO - Crash reason: SIGSEGV
09:17:37 INFO - Crash address: 0x0
09:17:37 INFO - Process uptime: not available
09:17:37 INFO - Thread 19 (crashed)
09:17:37 INFO - 0 libxul.so!mozilla::dom::workers::WorkerPrivate::DoRunLoop [WorkerPrivate.cpp:9ce31e9f90cb : 4603 + 0x27]
09:17:37 INFO - eip = 0xb304685d esp = 0x9e7fedc0 ebp = 0x9e7fee78 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0x00000001 edi = 0x9e865c00 eax = 0xb4dd3ad8 ecx = 0xb6183fd0
09:17:37 INFO - edx = 0x00000000 efl = 0x00010292
09:17:37 INFO - Found by: given as instruction pointer in context
09:17:37 INFO - 1 libxul.so!WorkerThreadPrimaryRunnable::Run [RuntimeService.cpp:9ce31e9f90cb : 2689 + 0xe]
09:17:37 INFO - eip = 0xb2fe941e esp = 0x9e7fee80 ebp = 0x9e7ff118 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0x9e8d8c00 edi = 0x9e878000
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 2 libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:9ce31e9f90cb : 989 + 0x14]
09:17:37 INFO - eip = 0xb163fa3b esp = 0x9e7ff120 ebp = 0x9e7ff198 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0xa22ec640 edi = 0xb6188b28
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 3 libxul.so!NS_ProcessNextEvent [nsThreadUtils.cpp:9ce31e9f90cb : 290 + 0x10]
09:17:37 INFO - eip = 0xb1667011 esp = 0x9e7ff1a0 ebp = 0x9e7ff1d8 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0x9e852340 edi = 0x9eb1c200
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 4 libxul.so!mozilla::ipc::MessagePumpForNonMainThreads::Run [MessagePump.cpp:9ce31e9f90cb : 340 + 0xc]
09:17:37 INFO - eip = 0xb19f24c1 esp = 0x9e7ff1e0 ebp = 0x9e7ff228 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0x9e852340 edi = 0x9eb1c200
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 5 libxul.so!MessageLoop::RunInternal [message_loop.cc:9ce31e9f90cb : 230 + 0x14]
09:17:37 INFO - eip = 0xb19ce894 esp = 0x9e7ff230 ebp = 0x9e7ff258 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0x9eb1c200 edi = 0x9eb1c200
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 6 libxul.so!MessageLoop::Run [message_loop.cc:9ce31e9f90cb : 223 + 0x8]
09:17:37 INFO - eip = 0xb19ce8ba esp = 0x9e7ff260 ebp = 0x9e7ff288 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0x9eb1c200 edi = 0x9eb1c200
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 7 libxul.so!nsThread::ThreadFunc [nsThread.cpp:9ce31e9f90cb : 391 + 0x9]
09:17:37 INFO - eip = 0xb1640318 esp = 0x9e7ff290 ebp = 0x9e7ff2d8 ebx = 0xb6183fd0
09:17:37 INFO - esi = 0xa22ec640 edi = 0x9eb1c200
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 8 libnspr4.so!_pt_root [ptthread.c:9ce31e9f90cb : 216 + 0x7]
09:17:37 INFO - eip = 0xb748ae83 esp = 0x9e7ff2e0 ebp = 0x9e7ff328 ebx = 0xb749c470
09:17:37 INFO - esi = 0x00000000 edi = 0xa22ec700
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 9 libpthread-2.15.so + 0x6d4c
09:17:37 INFO - eip = 0xb778fd4c esp = 0x9e7ff330 ebp = 0x9e7ff428 ebx = 0xb77a0ff4
09:17:37 INFO - esi = 0x00000000 edi = 0x003d0f00
09:17:37 INFO - Found by: call frame info
09:17:37 INFO - 10 libc-2.15.so + 0xeeb8e
09:17:37 INFO - eip = 0xb7590b8e esp = 0x9e7ff430 ebp = 0x00000000
09:17:37 INFO - Found by: previous frame's frame pointer
Flags: needinfo?(khuey)
Assignee: nobody → khuey
Status: NEW → ASSIGNED
Flags: needinfo?(khuey)
Attachment #8747372 -
Flags: review?(amarchesini)
ping?
Flags: needinfo?(amarchesini)
Comment 3•9 years ago
|
||
Comment on attachment 8747372 [details] [diff] [review]
Patch
Review of attachment 8747372 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/workers/WorkerPrivate.cpp
@@ +4790,5 @@
> for (;;) {
> // Run all control events now.
> + MOZ_ASSERT(!mForceKill);
> + ProcessAllControlRunnables();
> + mayContinue = !mForceKill;
This would be:
mayContinue = ProcessAllControlRunnables() != ControlRunnablesFailed;
@@ +5029,4 @@
> }
>
> bool
> WorkerPrivate::ProcessAllControlRunnablesLocked()
can you add a comment in the header, saying that this method returns true if we have at least 1 control runnable executed?
@@ +5068,5 @@
> MutexAutoUnlock unlock(mMutex);
>
> MOZ_ASSERT(event);
> if (NS_FAILED(static_cast<nsIRunnable*>(event)->Run())) {
> + mForceKill = true;
It seems that you don't care about the mForceKill state. Everywhere you set it to false after using the value. What about if this method instead a boolean returns:
enum {
NoControlRunnables,
ControlRunnablesSucceded,
ControlRunnablesFailed
} ?
@@ +5411,5 @@
> + bool processedControl = ProcessAllControlRunnablesLocked();
> + // There's no running JS, so ignore mForceKill.
> + mForceKill = false;
> +
> + if (processedControl) {
This would be:
if (processedControl != NoControlRunnables) { ...
Attachment #8747372 -
Flags: review?(amarchesini)
Updated•9 years ago
|
Flags: needinfo?(amarchesini)
Kinda verbose but *shrug*
Attachment #8747372 -
Attachment is obsolete: true
Attachment #8751402 -
Flags: review?(amarchesini)
Comment 5•9 years ago
|
||
Comment on attachment 8751402 [details] [diff] [review]
Patch with enum
Review of attachment 8751402 [details] [diff] [review]:
-----------------------------------------------------------------
Thank you!
Attachment #8751402 -
Flags: review?(amarchesini) → review+
Ryan, can you watch this over the next few days and see how it goes?
Flags: needinfo?(ryanvm)
Comment 8•9 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Reporter | ||
Comment 9•9 years ago
|
||
Still a few hits since this was merged to m-c.
https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=9325961#L6037
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3893649#L6028
Status: RESOLVED → REOPENED
Flags: needinfo?(ryanvm) → needinfo?(khuey)
Resolution: FIXED → ---
Target Milestone: mozilla49 → ---
Ok, WTF.
We're at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/WorkerPrivate.cpp#l4653. Note that that means we just passed the NS_HasPendingEvents assertion above.
We call https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/glue/nsThreadUtils.cpp#l270. The NS_GetCurrentThread call must succeed, because otherwise NS_HasPendingEvents would have returned false. So either ProcessNextEvent failed or it returned false.
We end up at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/nsThread.cpp#l964. The same thread check must pass, at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/nsThread.cpp#l1052 we set rv to NS_OK, and we don't take the only branch that modifies rv because aMayWait is false. And aResult is set to whether or not we found an event at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/nsThread.cpp#l1066. So how on earth does this fail?
Flags: needinfo?(khuey) → needinfo?(nfroyd)
![]() |
||
Comment 11•9 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #10)
> Ok, WTF.
>
> We're at
> https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/
> WorkerPrivate.cpp#l4653. Note that that means we just passed the
> NS_HasPendingEvents assertion above.
>
> We call
> https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/glue/
> nsThreadUtils.cpp#l270. The NS_GetCurrentThread call must succeed, because
> otherwise NS_HasPendingEvents would have returned false. So either
> ProcessNextEvent failed or it returned false.
>
> We end up at
> https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/
> nsThread.cpp#l964. The same thread check must pass, at
> https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/
> nsThread.cpp#l1052 we set rv to NS_OK, and we don't take the only branch
> that modifies rv because aMayWait is false. And aResult is set to whether
> or not we found an event at
> https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/
> nsThread.cpp#l1066. So how on earth does this fail?
I guess the only reasonable explanation is that:
1. NS_HasPendingEvents returns true.
2. Somebody else grabs whatever events were still in the queue.
3. NS_ProcessNextEvent returns false, because there's nothing to process.
Is that even possible? I thought nothing else encroached on a worker's event loop like that.
But even if we replaced:
MOZ_ASSERT(NS_HasPendingEvents(...))
MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...))
with just:
MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...))
(since that formulation should be equivalent, right?), then would we will run into problems at:
https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/WorkerPrivate.cpp#l4655
because somebody could pull runnables out of the queue after we had found out there were events still pending? I guess that would be a performance issue, not necessarily a correctness one.
That's only after a couple minutes of thought and not knowing workers all that well...does that make any sense?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #11)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #10)
> > Ok, WTF.
> >
> > We're at
> > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/
> > WorkerPrivate.cpp#l4653. Note that that means we just passed the
> > NS_HasPendingEvents assertion above.
> >
> > We call
> > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/glue/
> > nsThreadUtils.cpp#l270. The NS_GetCurrentThread call must succeed, because
> > otherwise NS_HasPendingEvents would have returned false. So either
> > ProcessNextEvent failed or it returned false.
> >
> > We end up at
> > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/
> > nsThread.cpp#l964. The same thread check must pass, at
> > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/
> > nsThread.cpp#l1052 we set rv to NS_OK, and we don't take the only branch
> > that modifies rv because aMayWait is false. And aResult is set to whether
> > or not we found an event at
> > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/
> > nsThread.cpp#l1066. So how on earth does this fail?
>
> I guess the only reasonable explanation is that:
>
> 1. NS_HasPendingEvents returns true.
> 2. Somebody else grabs whatever events were still in the queue.
> 3. NS_ProcessNextEvent returns false, because there's nothing to process.
>
> Is that even possible? I thought nothing else encroached on a worker's
> event loop like that.
It's not even workers ... how does one thread get its events stolen by another anywhere in Gecko? I don't see how #2 can happen.
> But even if we replaced:
>
> MOZ_ASSERT(NS_HasPendingEvents(...))
>
> MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...))
>
> with just:
>
> MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...))
>
> (since that formulation should be equivalent, right?), then would we will
> run into problems at:
I believe it should be, yes.
> https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/
> WorkerPrivate.cpp#l4655
>
> because somebody could pull runnables out of the queue after we had found
> out there were events still pending? I guess that would be a performance
> issue, not necessarily a correctness one.
We're only processing one runnable at a time, so that's effectively just checking to see if there was more than one runnable in the thread to begin with.
I have no idea how we would get to this state ... but this seems like the only thing that could cause this.
Attachment #8765660 -
Flags: review?(nfroyd)
![]() |
||
Comment 14•9 years ago
|
||
Comment on attachment 8765660 [details] [diff] [review]
Add an assertion
Review of attachment 8765660 [details] [diff] [review]:
-----------------------------------------------------------------
Sure, why not?
::: xpcom/threads/nsEventQueue.cpp
@@ +57,5 @@
> }
>
> if (aResult) {
> *aResult = mHead->mEvents[mOffsetHead++];
> + MOZ_ASSERT(*aResult);
As long as we're asserting for impossible things, do you want to add a check prior to the assignment that mOffsetHead is < EVENTS_PER_PAGE, and a check after the assignment that mOffsetHead is <= EVENTS_PER_PAGE? Maybe there's some sort of weird memory corruption going on?
Attachment #8765660 -
Flags: review?(nfroyd) → review+
![]() |
||
Comment 15•9 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #14)
> As long as we're asserting for impossible things, do you want to add a check
> prior to the assignment that mOffsetHead is < EVENTS_PER_PAGE, and a check
> after the assignment that mOffsetHead is <= EVENTS_PER_PAGE? Maybe there's
> some sort of weird memory corruption going on?
Or maybe mHead and mTail are pointing at the same |Page| structure and mOffsetHead is actually getting advanced beyond mTailOffset?
Comment 16•9 years ago
|
||
Pushed by khuey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8186a2778f9d
Add a diagnostic assertion. r=froydnj
Keywords: leave-open
Comment 17•9 years ago
|
||
bugherder |
Assignee: khuey → nobody
Comment 18•8 years ago
|
||
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
![]() |
||
Updated•8 years ago
|
Priority: P3 → P5
Comment 19•6 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:overholt, maybe it's time to close this bug?
Flags: needinfo?(overholt)
Comment 20•6 years ago
|
||
I wonder if Blake has seen anything like this recently and therefore has opinions on whether this bug should be closed or not. Lots of see-also bugs ...
Flags: needinfo?(overholt) → needinfo?(mrbkap)
Comment 21•6 years ago
|
||
As far as I can tell, there were a bunch of mochitest crashes in WorkerPrivate::DoRunLoop up until about 2 years ago and then they stopped. None of the open see-also bugs have had action (either fixes or new intermittent failures) since then, so I'm going to close this WFM until proven otherwise.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 6 years ago
Flags: needinfo?(mrbkap)
Resolution: --- → WORKSFORME
Updated•6 years ago
|
Keywords: leave-open
You need to log in
before you can comment on or make changes to this bug.
Description
•