Zombie workers should not be shown
Categories
(DevTools :: Debugger, defect, P2)
Tracking
(firefox67 fixed)
Tracking | Status | |
---|---|---|
firefox67 | --- | fixed |
People
(Reporter: jlast, Assigned: bhackett1024)
References
Details
Attachments
(5 files)
433.54 KB,
image/png
|
Details | |
134.75 KB,
image/png
|
Details | |
579 bytes,
patch
|
asuth
:
review+
|
Details | Diff | Splinter Review |
972 bytes,
patch
|
Details | Diff | Splinter Review | |
2.17 KB,
patch
|
tarek
:
superreview+
|
Details | Diff | Splinter Review |
Workers are currently shown until they are GC'd. This creates a weird effect where "Zombies" are shown in the UI.
Reporter | ||
Comment 1•6 years ago
|
||
Updated•6 years ago
|
Assignee | ||
Comment 2•6 years ago
|
||
After doing some investigation, this isn't actually related to the GC. When the worker thread shuts down the debugger is informed immediately, but the problem here is that there can be a large gap in time (multiple seconds) between a worker being terminate()ed and the actual shutdown.
When testing on a simple worker example, the problem seems to be that if the worker's event queues are empty after it enters the cancel state, it blocks until a new event comes in. The attached patch adds a test for the conditions under which the thread can shutdown, and avoids such blocking. This causes the worker to shutdown immediately after the terminate() call, updating the debugger UI.
Comment 3•6 years ago
|
||
Can you clarify situation where you were seeing this? Is this under debugger tests involving workers, or even when there's no debuggers involved? In particular, if you can point me at a test I can run to reproduce, that'd be appreciated.
Operating theory-wise: mStatus should change as a result as a result of a NotifyRunnable WorkerControlRunnable executing. So I'd suspect the problem here is that we were in a debugger runnable which had a nested event loop pushed and that's where the NotifyRunnable was processed. So we fell out of that loop and out to the top-level WorkerPrivate::DoRunLoop. The logic there doesn't check for the possibility that mStatus changed while we were in the debugger runnable, so we go back to the top of the loop and block. Thankfully we will have scheduled the GC timer, which is a control runnable, so we will notice that state change at that point.
Assignee | ||
Comment 4•6 years ago
|
||
This problem happens even when no debugger is attached. Using the debugger just makes this more obvious, since the debugger shows all workers that are currently running. To reproduce, first apply this diagnostic patch to print out the calls to worker.terminate() and the actual point where the worker terminates itself. Then visit this example worker page:
https://firefox-dev.tools/debugger-examples/examples/workers/pi.html
The Start button starts up a worker, and the Stop button terminate()s it. On my machine, pressing Start and then Stop prints out CALLED_TERMINATE, then usually waits for several seconds before printing out TERMINATING.
Assignee | ||
Comment 5•6 years ago
|
||
(In reply to Andrew Sutherland [:asuth] from comment #3)
Can you clarify situation where you were seeing this? Is this under debugger tests involving workers, or even when there's no debuggers involved? In particular, if you can point me at a test I can run to reproduce, that'd be appreciated.
Operating theory-wise: mStatus should change as a result as a result of a NotifyRunnable WorkerControlRunnable executing. So I'd suspect the problem here is that we were in a debugger runnable which had a nested event loop pushed and that's where the NotifyRunnable was processed. So we fell out of that loop and out to the top-level WorkerPrivate::DoRunLoop. The logic there doesn't check for the possibility that mStatus changed while we were in the debugger runnable, so we go back to the top of the loop and block. Thankfully we will have scheduled the GC timer, which is a control runnable, so we will notice that state change at that point.
I think the problem is related to the other check performed before terminating the worker. DoRunLoop() only starts terminating when (currentStatus != Running && !HasActiveHolders()). After the NotifyRunnable sets the status to canceling, we end up at this check but there are still holders. When the last holder is removed (stack below) it happens in a normal runnable, so we end up at the top of DoRunLoop before we have a chance to do this check again.
- frame #0: 0x0000000113d350a0 XUL
mozilla::dom::WorkerPrivate::RemoveHolder(this=0x000000012cd18800, aHolder=0x000000012c1ae1a0) at WorkerPrivate.cpp:3438 [opt] frame #1: 0x0000000113d44daf XUL
nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] nsTArrayFallibleAllocator::Free(void*) at nsTArray.h:194 [opt]
frame #2: 0x0000000113d44daa XULnsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::ShrinkCapacity(this=<unavailable>, aElemSize=8, aElemAlign=8) at nsTArray-inl.h:236 [opt] frame #3: 0x0000000113d44d93 XUL
nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] void nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::ShiftData<nsTArrayInfallibleAllocator>(this=<unavailable>, aStart=<unavailable>, aOldLen=<unavailable>, aNewLen=0, aElemSize=8, aElemAlign=8) at nsTArray-inl.h:266 [opt]
frame #4: 0x0000000113d44d5a XULnsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAtUnsafe(this=<unavailable>, aStart=<unavailable>, aCount=<unavailable>) at nsTArray.h:2238 [opt] frame #5: 0x0000000113d44d5a XUL
nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(this=0x000000012cd18800, aStart=<unavailable>, aCount=<unavailable>) at nsTArray.h:2231 [opt]
frame #6: 0x0000000113d1e247 XULmozilla::dom::WorkerHolderToken::~WorkerHolderToken() [inlined] mozilla::dom::WorkerHolder::ReleaseWorkerInternal(this=<unavailable>) at WorkerHolder.cpp:65 [opt] frame #7: 0x0000000113d1e228 XUL
mozilla::dom::WorkerHolderToken::~WorkerHolderToken() [inlined] mozilla::dom::WorkerHolder::~WorkerHolder(this=<unavailable>) at WorkerHolder.cpp:31 [opt]
frame #8: 0x0000000113d1e210 XULmozilla::dom::WorkerHolderToken::~WorkerHolderToken(this=0x000000011956f388) at WorkerHolderToken.cpp:73 [opt] frame #9: 0x0000000113d1e28e XUL
mozilla::dom::WorkerHolderToken::~WorkerHolderToken() [inlined] mozilla::dom::WorkerHolderToken::~WorkerHolderToken(this=0x000000012c1ae1a0) at WorkerHolderToken.cpp:70 [opt]
frame #10: 0x0000000113d1e289 XULmozilla::dom::WorkerHolderToken::~WorkerHolderToken(this=0x000000012c1ae1a0) at WorkerHolderToken.cpp:70 [opt] frame #11: 0x00000001135d2df1 XUL
mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] mozilla::dom::WorkerHolderToken::Release(this=<unavailable>) at WorkerHolderToken.h:68 [opt]
frame #12: 0x00000001135d2de3 XULmozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] mozilla::RefPtrTraits<mozilla::dom::WorkerHolderToken>::Release(aPtr=<unavailable>) at RefPtr.h:45 [opt] frame #13: 0x00000001135d2de3 XUL
mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] RefPtr<mozilla::dom::WorkerHolderToken>::ConstRemovingRefPtrTraits<mozilla::dom::WorkerHolderToken>::Release(aPtr=<unavailable>) at RefPtr.h:362 [opt]
frame #14: 0x00000001135d2de3 XULmozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] RefPtr<mozilla::dom::WorkerHolderToken>::assign_assuming_AddRef(this=<unavailable>) at RefPtr.h:64 [opt] frame #15: 0x00000001135d2de3 XUL
mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] RefPtr<mozilla::dom::WorkerHolderToken>::operator=(this=<unavailable>) at RefPtr.h:155 [opt]
frame #16: 0x00000001135d2de3 XULmozilla::dom::ClientManagerChild::ActorDestroy(this=0x000000012ce39d90, aReason=<unavailable>) at ClientManagerChild.cpp:20 [opt] frame #17: 0x00000001121704b1 XUL
mozilla::dom::PClientManagerChild::DestroySubtree(this=0x000000012ce39d90, why=Deletion) at PClientManagerChild.cpp:500 [opt]
frame #18: 0x000000011216ff06 XULmozilla::dom::PClientManagerChild::OnMessageReceived(this=<unavailable>, msg__=0x000000012ce8c240) at PClientManagerChild.cpp:407 [opt] frame #19: 0x0000000111e82792 XUL
mozilla::ipc::MessageChannel::DispatchAsyncMessage(this=0x000000012ce780f8, aMsg=0x000000012ce8c240) at MessageChannel.cpp:2150 [opt]
frame #20: 0x0000000111e819e7 XULmozilla::ipc::MessageChannel::DispatchMessage(this=0x000000012ce780f8, aMsg=0x000000012ce8c240) at MessageChannel.cpp:2077 [opt] frame #21: 0x0000000111e822ee XUL
mozilla::ipc::MessageChannel::MessageTask::Run(this=0x000000012ce8c1f0) at MessageChannel.cpp:1967 [opt]
frame #22: 0x0000000111979ae1 XULnsThread::ProcessNextEvent(this=0x000000012bf646d0, aMayWait=<unavailable>, aResult=0x0000700010a1962f) at nsThread.cpp:1162 [opt] frame #23: 0x000000011197c2d9 XUL
NS_ProcessNextEvent(aThread=<unavailable>, aMayWait=false) at nsThreadUtils.cpp:474 [opt]
frame #24: 0x0000000113d3243f XULmozilla::dom::WorkerPrivate::DoRunLoop(this=0x000000012cd18800, aCx=0x000000012c0a7000) at WorkerPrivate.cpp:2738 [opt] frame #25: 0x0000000113d2077d XUL
mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run(this=0x000000012cd63ac0) at RuntimeService.cpp:2289 [opt]
frame #26: 0x0000000111979ae1 XULnsThread::ProcessNextEvent(this=0x000000012bf646d0, aMayWait=<unavailable>, aResult=0x0000700010a19def) at nsThread.cpp:1162 [opt] frame #27: 0x000000011197c2d9 XUL
NS_ProcessNextEvent(aThread=<unavailable>, aMayWait=true) at nsThreadUtils.cpp:474 [opt]
frame #28: 0x0000000111e852da XULmozilla::ipc::MessagePumpForNonMainThreads::Run(this=0x000000012cd63a40, aDelegate=0x000000012c1ef100) at MessagePump.cpp:333 [opt] frame #29: 0x0000000111e48449 XUL
MessageLoop::Run() [inlined] MessageLoop::RunInternal(this=<unavailable>) at message_loop.cc:315 [opt]
frame #30: 0x0000000111e4843a XULMessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:308 [opt] frame #31: 0x0000000111e4843a XUL
MessageLoop::Run(this=<unavailable>) at message_loop.cc:290 [opt]
frame #32: 0x0000000111977a42 XULnsThread::ThreadFunc(aArg=<unavailable>) at nsThread.cpp:449 [opt] frame #33: 0x000000010fde7a3c libnss3.dylib
_pt_root(arg=0x000000012c178840) at ptthread.c:201 [opt]
Comment 6•6 years ago
|
||
Assignee | ||
Comment 9•6 years ago
|
||
Hi Tarek, I had to disable a check in the browser_test_performance_metrics.js mochitest in revision bf486c5f798b. Workers in content processes shut down immediately now, instead of waiting for multiple seconds, and it seems like this happens fast enough that the shared worker in this test never reports any performance metrics. Does this test change look OK?
Comment 10•6 years ago
|
||
Backed out changeset 592350649c0f (Bug 1522246) for browser_test_performance_metrics.js failures
Also backed out the fix: changeset bf486c5f798b (Bug 1522246) for ES Lint failure
Backout links: https://hg.mozilla.org/integration/mozilla-inbound/rev/f07eec8b4f2c820d2f1f15c0976f77cd0f7d9321
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=230648258&repo=mozilla-inbound&lineNumber=4951
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Worker duration should be positive - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Worker count should be positive - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Duration should be positive - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Should get a positive count - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | parent process sent back some events - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | example.com as a top level window - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | about:memory - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | got some memory value reported - true == true -
23:53:07 INFO - Buffered messages finished
23:53:07 INFO - TEST-UNEXPECTED-FAIL | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | We got some info from a shared worker - false == true - JS frame :: chrome://mochitests/content/browser/dom/tests/browser/perfmetrics/browser_test_performance_metrics.js :: test/< :: line 141
23:53:07 INFO - Stack trace:
23:53:07 INFO - chrome://mochitests/content/browser/dom/tests/browser/perfmetrics/browser_test_performance_metrics.js:test/<:141
23:53:07 INFO - resource://testing-common/BrowserTestUtils.jsm:withNewTab:111
23:53:07 INFO - chrome://mochitests/content/browser/dom/tests/browser/perfmetrics/browser_test_performance_metrics.js:test:63
23:53:07 INFO - chrome://mochikit/content/browser-test.js:Tester_execTest/<:1106
23:53:07 INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1134
23:53:07 INFO - chrome://mochikit/content/browser-test.js:nextTest/<:995
23:53:07 INFO - chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:803
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | This test generated at least 30 unique counters - true == true -
Comment 11•6 years ago
|
||
Comment 12•6 years ago
|
||
Comment 13•6 years ago
|
||
Hi Brian,
I am sorry but I don't think commenting out the test is the proper fix, could you make it work so we don't lose it?
I think the proper fix would be to wait until the worker sent enough info before shutting down everything.
I think this can be done by waiting a bit inside the worker itself before it's shut down
Comment 14•6 years ago
|
||
bugherder |
Assignee | ||
Comment 15•6 years ago
|
||
This patch should fix browser_test_performance_metrics.js, per the try run below:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=763aed555f879d1ea01d56f20d6cd2dd4a84763f
Comment 16•6 years ago
|
||
Comment 17•6 years ago
|
||
Comment 18•6 years ago
|
||
bugherder |
Description
•