Closed Bug 1267819 Opened 9 years ago Closed 6 years ago

Intermittent shutdown crashes masquerading as test_videocontrols_video_direction.html | application crashed [@ mozilla::dom::workers::WorkerPrivate::DoRunLoop] (and other tests)

Categories

(Core :: DOM: Workers, defect, P5)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox46 --- wontfix
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox-esr38 --- wontfix
firefox-esr45 --- affected

People

(Reporter: RyanVM, Unassigned)

References

()

Details

(Keywords: assertion, crash, intermittent-failure)

Attachments

(2 files, 1 obsolete file)

Abridged IRC convo about it: RyanVM: khuey: hey, I see a lot of bugs on file for intermittent [@ mozilla::dom::workers::WorkerPrivate::DoRunLoop] crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=171238&repo=ash RyanVM: any suggestions for what could help make them actionable? khuey: the assertion is we expect to have an event but we don't khuey: RyanVM: does this always seem to happen during shutdown? RyanVM: khuey: i'd be willing to make that claim from all the logs I'm looking at so far, yes khuey: RyanVM: if we usually see it during shutdown we could be racing with a control runnable that clears the queue khuey: RyanVM: can you send me email or needinfo me on a bug to remind me to look at this more? URL field contains an OrangeFactor link to more failures if more logs are desired. https://treeherder.mozilla.org/logviewer.html#?job_id=171238&repo=ash 09:17:20 INFO - Assertion failure: (NS_ProcessNextEvent(mThread, false)), at /builds/slave/ash-lx-d-000000000000000000000/build/src/dom/workers/WorkerPrivate.cpp:4603 09:17:20 INFO - #01: mozilla::dom::workers::WorkerPrivate::DoRunLoop [dom/workers/WorkerPrivate.cpp:4603] 09:17:20 INFO - #02: WorkerThreadPrimaryRunnable::Run [dom/workers/RuntimeService.cpp:2693] 09:17:20 INFO - #03: nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:989] 09:17:20 INFO - #04: NS_ProcessNextEvent [xpcom/glue/nsThreadUtils.cpp:290] 09:17:20 INFO - #05: mozilla::ipc::MessagePumpForNonMainThreads::Run [ipc/glue/MessagePump.cpp:341] 09:17:20 INFO - #06: MessageLoop::RunInternal [ipc/chromium/src/base/message_loop.cc:230] 09:17:20 INFO - #07: MessageLoop::Run [ipc/chromium/src/base/message_loop.cc:505] 09:17:20 INFO - #08: nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:393] 09:17:21 INFO - #09: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219] 09:17:21 INFO - #10: libpthread.so.0 + 0x6d4c 09:17:21 INFO - nsStringStats 09:17:21 INFO - => mAllocCount: 49160 09:17:21 INFO - => mReallocCount: 1751 09:17:21 INFO - => mFreeCount: 49160 09:17:21 INFO - => mShareCount: 66348 09:17:21 INFO - => mAdoptCount: 4025 09:17:21 INFO - => mAdoptFreeCount: 4025 09:17:21 INFO - => Process ID: 4894, Thread ID: 2970646848 09:17:21 INFO - ExceptionHandler::GenerateDump cloned child 4941 09:17:21 INFO - ExceptionHandler::SendContinueSignalToChild sent continue signal to child 09:17:21 INFO - ExceptionHandler::WaitForContinueSignal waiting for continue signal... 09:17:21 INFO - TEST-INFO | Main app process: exit 11 09:17:21 INFO - 4811 INFO TEST-START | Shutdown 09:17:21 INFO - 4812 INFO Passed: 23 09:17:21 INFO - 4813 INFO Failed: 0 09:17:21 INFO - 4814 INFO Todo: 0 09:17:21 INFO - 4815 INFO Mode: e10s 09:17:21 INFO - 4816 INFO Slowest: 9140ms - /tests/toolkit/content/tests/widgets/test_videocontrols_standalone.html 09:17:21 INFO - 4817 INFO SimpleTest FINISHED 09:17:21 INFO - 4818 INFO TEST-INFO | Ran 1 Loops 09:17:21 INFO - 4819 INFO SimpleTest FINISHED 09:17:21 WARNING - TEST-UNEXPECTED-FAIL | toolkit/content/tests/widgets/test_videocontrols_video_direction.html | application terminated with exit code 11 09:17:21 INFO - runtests.py | Application ran for: 0:00:56.644503 09:17:21 INFO - zombiecheck | Reading PID log: /tmp/tmp1prsiNpidlog 09:17:21 INFO - ==> process 4838 launched child process 4894 09:17:21 INFO - zombiecheck | Checking for orphan process with PID: 4894 09:17:21 INFO - mozcrash Copy/paste: /builds/slave/test/build/linux32-minidump_stackwalk /tmp/tmpkgWhy1.mozrunner/minidumps/01218037-d265-e4e1-6331c53c-635bb975.dmp /builds/slave/test/build/symbols 09:17:37 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/01218037-d265-e4e1-6331c53c-635bb975.dmp 09:17:37 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/01218037-d265-e4e1-6331c53c-635bb975.extra 09:17:37 WARNING - PROCESS-CRASH | toolkit/content/tests/widgets/test_videocontrols_video_direction.html | application crashed [@ mozilla::dom::workers::WorkerPrivate::DoRunLoop] 09:17:37 INFO - Crash dump filename: /tmp/tmpkgWhy1.mozrunner/minidumps/01218037-d265-e4e1-6331c53c-635bb975.dmp 09:17:37 INFO - Operating system: Linux 09:17:37 INFO - 0.0.0 Linux 3.2.0-76-generic-pae #111-Ubuntu SMP Tue Jan 13 22:34:29 UTC 2015 i686 09:17:37 INFO - CPU: x86 09:17:37 INFO - GenuineIntel family 6 model 62 stepping 4 09:17:37 INFO - 1 CPU 09:17:37 INFO - Crash reason: SIGSEGV 09:17:37 INFO - Crash address: 0x0 09:17:37 INFO - Process uptime: not available 09:17:37 INFO - Thread 19 (crashed) 09:17:37 INFO - 0 libxul.so!mozilla::dom::workers::WorkerPrivate::DoRunLoop [WorkerPrivate.cpp:9ce31e9f90cb : 4603 + 0x27] 09:17:37 INFO - eip = 0xb304685d esp = 0x9e7fedc0 ebp = 0x9e7fee78 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0x00000001 edi = 0x9e865c00 eax = 0xb4dd3ad8 ecx = 0xb6183fd0 09:17:37 INFO - edx = 0x00000000 efl = 0x00010292 09:17:37 INFO - Found by: given as instruction pointer in context 09:17:37 INFO - 1 libxul.so!WorkerThreadPrimaryRunnable::Run [RuntimeService.cpp:9ce31e9f90cb : 2689 + 0xe] 09:17:37 INFO - eip = 0xb2fe941e esp = 0x9e7fee80 ebp = 0x9e7ff118 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0x9e8d8c00 edi = 0x9e878000 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 2 libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:9ce31e9f90cb : 989 + 0x14] 09:17:37 INFO - eip = 0xb163fa3b esp = 0x9e7ff120 ebp = 0x9e7ff198 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0xa22ec640 edi = 0xb6188b28 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 3 libxul.so!NS_ProcessNextEvent [nsThreadUtils.cpp:9ce31e9f90cb : 290 + 0x10] 09:17:37 INFO - eip = 0xb1667011 esp = 0x9e7ff1a0 ebp = 0x9e7ff1d8 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0x9e852340 edi = 0x9eb1c200 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 4 libxul.so!mozilla::ipc::MessagePumpForNonMainThreads::Run [MessagePump.cpp:9ce31e9f90cb : 340 + 0xc] 09:17:37 INFO - eip = 0xb19f24c1 esp = 0x9e7ff1e0 ebp = 0x9e7ff228 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0x9e852340 edi = 0x9eb1c200 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 5 libxul.so!MessageLoop::RunInternal [message_loop.cc:9ce31e9f90cb : 230 + 0x14] 09:17:37 INFO - eip = 0xb19ce894 esp = 0x9e7ff230 ebp = 0x9e7ff258 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0x9eb1c200 edi = 0x9eb1c200 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 6 libxul.so!MessageLoop::Run [message_loop.cc:9ce31e9f90cb : 223 + 0x8] 09:17:37 INFO - eip = 0xb19ce8ba esp = 0x9e7ff260 ebp = 0x9e7ff288 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0x9eb1c200 edi = 0x9eb1c200 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 7 libxul.so!nsThread::ThreadFunc [nsThread.cpp:9ce31e9f90cb : 391 + 0x9] 09:17:37 INFO - eip = 0xb1640318 esp = 0x9e7ff290 ebp = 0x9e7ff2d8 ebx = 0xb6183fd0 09:17:37 INFO - esi = 0xa22ec640 edi = 0x9eb1c200 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 8 libnspr4.so!_pt_root [ptthread.c:9ce31e9f90cb : 216 + 0x7] 09:17:37 INFO - eip = 0xb748ae83 esp = 0x9e7ff2e0 ebp = 0x9e7ff328 ebx = 0xb749c470 09:17:37 INFO - esi = 0x00000000 edi = 0xa22ec700 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 9 libpthread-2.15.so + 0x6d4c 09:17:37 INFO - eip = 0xb778fd4c esp = 0x9e7ff330 ebp = 0x9e7ff428 ebx = 0xb77a0ff4 09:17:37 INFO - esi = 0x00000000 edi = 0x003d0f00 09:17:37 INFO - Found by: call frame info 09:17:37 INFO - 10 libc-2.15.so + 0xeeb8e 09:17:37 INFO - eip = 0xb7590b8e esp = 0x9e7ff430 ebp = 0x00000000 09:17:37 INFO - Found by: previous frame's frame pointer
Flags: needinfo?(khuey)
Attached patch Patch (obsolete) — Splinter Review
Assignee: nobody → khuey
Status: NEW → ASSIGNED
Flags: needinfo?(khuey)
Attachment #8747372 - Flags: review?(amarchesini)
ping?
Flags: needinfo?(amarchesini)
Comment on attachment 8747372 [details] [diff] [review] Patch Review of attachment 8747372 [details] [diff] [review]: ----------------------------------------------------------------- ::: dom/workers/WorkerPrivate.cpp @@ +4790,5 @@ > for (;;) { > // Run all control events now. > + MOZ_ASSERT(!mForceKill); > + ProcessAllControlRunnables(); > + mayContinue = !mForceKill; This would be: mayContinue = ProcessAllControlRunnables() != ControlRunnablesFailed; @@ +5029,4 @@ > } > > bool > WorkerPrivate::ProcessAllControlRunnablesLocked() can you add a comment in the header, saying that this method returns true if we have at least 1 control runnable executed? @@ +5068,5 @@ > MutexAutoUnlock unlock(mMutex); > > MOZ_ASSERT(event); > if (NS_FAILED(static_cast<nsIRunnable*>(event)->Run())) { > + mForceKill = true; It seems that you don't care about the mForceKill state. Everywhere you set it to false after using the value. What about if this method instead a boolean returns: enum { NoControlRunnables, ControlRunnablesSucceded, ControlRunnablesFailed } ? @@ +5411,5 @@ > + bool processedControl = ProcessAllControlRunnablesLocked(); > + // There's no running JS, so ignore mForceKill. > + mForceKill = false; > + > + if (processedControl) { This would be: if (processedControl != NoControlRunnables) { ...
Attachment #8747372 - Flags: review?(amarchesini)
Flags: needinfo?(amarchesini)
Attached patch Patch with enumSplinter Review
Kinda verbose but *shrug*
Attachment #8747372 - Attachment is obsolete: true
Attachment #8751402 - Flags: review?(amarchesini)
Comment on attachment 8751402 [details] [diff] [review] Patch with enum Review of attachment 8751402 [details] [diff] [review]: ----------------------------------------------------------------- Thank you!
Attachment #8751402 - Flags: review?(amarchesini) → review+
Ryan, can you watch this over the next few days and see how it goes?
Flags: needinfo?(ryanvm)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Status: RESOLVED → REOPENED
Flags: needinfo?(ryanvm) → needinfo?(khuey)
Resolution: FIXED → ---
Target Milestone: mozilla49 → ---
Ok, WTF. We're at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/WorkerPrivate.cpp#l4653. Note that that means we just passed the NS_HasPendingEvents assertion above. We call https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/glue/nsThreadUtils.cpp#l270. The NS_GetCurrentThread call must succeed, because otherwise NS_HasPendingEvents would have returned false. So either ProcessNextEvent failed or it returned false. We end up at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/nsThread.cpp#l964. The same thread check must pass, at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/nsThread.cpp#l1052 we set rv to NS_OK, and we don't take the only branch that modifies rv because aMayWait is false. And aResult is set to whether or not we found an event at https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/nsThread.cpp#l1066. So how on earth does this fail?
Flags: needinfo?(khuey) → needinfo?(nfroyd)
See Also: → 1268201
See Also: → 1269873
See Also: → 1273316
See Also: → 1273378
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #10) > Ok, WTF. > > We're at > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/ > WorkerPrivate.cpp#l4653. Note that that means we just passed the > NS_HasPendingEvents assertion above. > > We call > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/glue/ > nsThreadUtils.cpp#l270. The NS_GetCurrentThread call must succeed, because > otherwise NS_HasPendingEvents would have returned false. So either > ProcessNextEvent failed or it returned false. > > We end up at > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/ > nsThread.cpp#l964. The same thread check must pass, at > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/ > nsThread.cpp#l1052 we set rv to NS_OK, and we don't take the only branch > that modifies rv because aMayWait is false. And aResult is set to whether > or not we found an event at > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/ > nsThread.cpp#l1066. So how on earth does this fail? I guess the only reasonable explanation is that: 1. NS_HasPendingEvents returns true. 2. Somebody else grabs whatever events were still in the queue. 3. NS_ProcessNextEvent returns false, because there's nothing to process. Is that even possible? I thought nothing else encroached on a worker's event loop like that. But even if we replaced: MOZ_ASSERT(NS_HasPendingEvents(...)) MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...)) with just: MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...)) (since that formulation should be equivalent, right?), then would we will run into problems at: https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/WorkerPrivate.cpp#l4655 because somebody could pull runnables out of the queue after we had found out there were events still pending? I guess that would be a performance issue, not necessarily a correctness one. That's only after a couple minutes of thought and not knowing workers all that well...does that make any sense?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #11) > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #10) > > Ok, WTF. > > > > We're at > > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/ > > WorkerPrivate.cpp#l4653. Note that that means we just passed the > > NS_HasPendingEvents assertion above. > > > > We call > > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/glue/ > > nsThreadUtils.cpp#l270. The NS_GetCurrentThread call must succeed, because > > otherwise NS_HasPendingEvents would have returned false. So either > > ProcessNextEvent failed or it returned false. > > > > We end up at > > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/ > > nsThread.cpp#l964. The same thread check must pass, at > > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/ > > nsThread.cpp#l1052 we set rv to NS_OK, and we don't take the only branch > > that modifies rv because aMayWait is false. And aResult is set to whether > > or not we found an event at > > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/xpcom/threads/ > > nsThread.cpp#l1066. So how on earth does this fail? > > I guess the only reasonable explanation is that: > > 1. NS_HasPendingEvents returns true. > 2. Somebody else grabs whatever events were still in the queue. > 3. NS_ProcessNextEvent returns false, because there's nothing to process. > > Is that even possible? I thought nothing else encroached on a worker's > event loop like that. It's not even workers ... how does one thread get its events stolen by another anywhere in Gecko? I don't see how #2 can happen. > But even if we replaced: > > MOZ_ASSERT(NS_HasPendingEvents(...)) > > MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...)) > > with just: > > MOZ_ALWAYS_TRUE(NS_ProcessNextEvent(...)) > > (since that formulation should be equivalent, right?), then would we will > run into problems at: I believe it should be, yes. > https://hg.mozilla.org/mozilla-central/annotate/f3f2fa1d7eed/dom/workers/ > WorkerPrivate.cpp#l4655 > > because somebody could pull runnables out of the queue after we had found > out there were events still pending? I guess that would be a performance > issue, not necessarily a correctness one. We're only processing one runnable at a time, so that's effectively just checking to see if there was more than one runnable in the thread to begin with.
Attached patch Add an assertionSplinter Review
I have no idea how we would get to this state ... but this seems like the only thing that could cause this.
Attachment #8765660 - Flags: review?(nfroyd)
Comment on attachment 8765660 [details] [diff] [review] Add an assertion Review of attachment 8765660 [details] [diff] [review]: ----------------------------------------------------------------- Sure, why not? ::: xpcom/threads/nsEventQueue.cpp @@ +57,5 @@ > } > > if (aResult) { > *aResult = mHead->mEvents[mOffsetHead++]; > + MOZ_ASSERT(*aResult); As long as we're asserting for impossible things, do you want to add a check prior to the assignment that mOffsetHead is < EVENTS_PER_PAGE, and a check after the assignment that mOffsetHead is <= EVENTS_PER_PAGE? Maybe there's some sort of weird memory corruption going on?
Attachment #8765660 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #14) > As long as we're asserting for impossible things, do you want to add a check > prior to the assignment that mOffsetHead is < EVENTS_PER_PAGE, and a check > after the assignment that mOffsetHead is <= EVENTS_PER_PAGE? Maybe there's > some sort of weird memory corruption going on? Or maybe mHead and mTail are pointing at the same |Page| structure and mOffsetHead is actually getting advanced beyond mTailOffset?
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Priority: P3 → P5
The leave-open keyword is there and there is no activity for 6 months. :overholt, maybe it's time to close this bug?
Flags: needinfo?(overholt)
I wonder if Blake has seen anything like this recently and therefore has opinions on whether this bug should be closed or not. Lots of see-also bugs ...
Flags: needinfo?(overholt) → needinfo?(mrbkap)
As far as I can tell, there were a bunch of mochitest crashes in WorkerPrivate::DoRunLoop up until about 2 years ago and then they stopped. None of the open see-also bugs have had action (either fixes or new intermittent failures) since then, so I'm going to close this WFM until proven otherwise.
Status: REOPENED → RESOLVED
Closed: 9 years ago6 years ago
Flags: needinfo?(mrbkap)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: