Closed Bug 1522246 Opened 10 months ago Closed 9 months ago

Zombie workers should not be shown

Categories: DevTools :: Debugger, defect, P2

Tracking: firefox67 fixed

Status: RESOLVED FIXED in Firefox 67

People: Reporter: jlast; Assigned: bhackett

References: Blocks 1 open bug

Attachments: 5 files

Workers are currently shown until they are GC'd. This creates a weird effect where "Zombies" are shown in the UI.

Priority: -- → P2
Attached patch "patch" (Splinter Review)

After doing some investigation, this isn't actually related to the GC. When the worker thread shuts down, the debugger is informed immediately; the problem is that there can be a large gap in time (multiple seconds) between a worker being terminate()ed and the actual shutdown.

When testing on a simple worker example, the problem seems to be that if the worker's event queues are empty after it enters the Canceling state, it blocks until a new event comes in. The attached patch adds a check for the conditions under which the thread can shut down, and avoids blocking in that case. With the patch, the worker shuts down immediately after the terminate() call, updating the debugger UI.

Assignee: nobody → bhackett1024
Attachment #9043457 - Flags: review?(bugmail)

Can you clarify the situation where you were seeing this? Is this under debugger tests involving workers, or even when there are no debuggers involved? In particular, if you can point me at a test I can run to reproduce it, that'd be appreciated.

Operating theory-wise: mStatus should change as a result of a NotifyRunnable WorkerControlRunnable executing. So I'd suspect the problem here is that we were in a debugger runnable which had a nested event loop pushed, and that's where the NotifyRunnable was processed. So we fell out of that loop and out to the top-level WorkerPrivate::DoRunLoop. The logic there doesn't check for the possibility that mStatus changed while we were in the debugger runnable, so we go back to the top of the loop and block. Thankfully we will have scheduled the GC timer, which is a control runnable, so we will notice that state change at that point.

Flags: needinfo?(bhackett1024)
Attached patch "diagnostic patch" (Splinter Review)

This problem happens even when no debugger is attached. Using the debugger just makes this more obvious, since the debugger shows all workers that are currently running. To reproduce, first apply this diagnostic patch to print out the calls to worker.terminate() and the actual point where the worker terminates itself. Then visit this example worker page:

https://firefox-dev.tools/debugger-examples/examples/workers/pi.html

The Start button starts up a worker, and the Stop button terminate()s it. On my machine, pressing Start and then Stop prints out CALLED_TERMINATE, then usually waits for several seconds before printing out TERMINATING.

Flags: needinfo?(bhackett1024)

(In reply to Andrew Sutherland [:asuth] from comment #3)

> Can you clarify the situation where you were seeing this? Is this under
> debugger tests involving workers, or even when there are no debuggers
> involved? In particular, if you can point me at a test I can run to
> reproduce it, that'd be appreciated.
>
> Operating theory-wise: mStatus should change as a result of a NotifyRunnable
> WorkerControlRunnable executing. So I'd suspect the problem here is that we
> were in a debugger runnable which had a nested event loop pushed, and that's
> where the NotifyRunnable was processed. So we fell out of that loop and out
> to the top-level WorkerPrivate::DoRunLoop. The logic there doesn't check for
> the possibility that mStatus changed while we were in the debugger runnable,
> so we go back to the top of the loop and block. Thankfully we will have
> scheduled the GC timer, which is a control runnable, so we will notice that
> state change at that point.

I think the problem is related to the other check performed before terminating the worker: DoRunLoop() only starts terminating when (currentStatus != Running && !HasActiveHolders()). After the NotifyRunnable sets the status to Canceling, we reach this check while there are still holders. When the last holder is removed (stack below), that happens in a normal runnable, so we end up back at the top of DoRunLoop, blocked, before we get a chance to perform this check again.

  * frame #0: 0x0000000113d350a0 XUL`mozilla::dom::WorkerPrivate::RemoveHolder(this=0x000000012cd18800, aHolder=0x000000012c1ae1a0) at WorkerPrivate.cpp:3438 [opt]
    frame #1: 0x0000000113d44daf XUL`nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] nsTArrayFallibleAllocator::Free(void*) at nsTArray.h:194 [opt]
    frame #2: 0x0000000113d44daa XUL`nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::ShrinkCapacity(this=<unavailable>, aElemSize=8, aElemAlign=8) at nsTArray-inl.h:236 [opt]
    frame #3: 0x0000000113d44d93 XUL`nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] void nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::ShiftData<nsTArrayInfallibleAllocator>(this=<unavailable>, aStart=<unavailable>, aOldLen=<unavailable>, aNewLen=0, aElemSize=8, aElemAlign=8) at nsTArray-inl.h:266 [opt]
    frame #4: 0x0000000113d44d5a XUL`nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned long, unsigned long) [inlined] nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAtUnsafe(this=<unavailable>, aStart=<unavailable>, aCount=<unavailable>) at nsTArray.h:2238 [opt]
    frame #5: 0x0000000113d44d5a XUL`nsTArray_Impl<mozilla::dom::WorkerHolder*, nsTArrayInfallibleAllocator>::RemoveElementsAt(this=0x000000012cd18800, aStart=<unavailable>, aCount=<unavailable>) at nsTArray.h:2231 [opt]
    frame #6: 0x0000000113d1e247 XUL`mozilla::dom::WorkerHolderToken::~WorkerHolderToken() [inlined] mozilla::dom::WorkerHolder::ReleaseWorkerInternal(this=<unavailable>) at WorkerHolder.cpp:65 [opt]
    frame #7: 0x0000000113d1e228 XUL`mozilla::dom::WorkerHolderToken::~WorkerHolderToken() [inlined] mozilla::dom::WorkerHolder::~WorkerHolder(this=<unavailable>) at WorkerHolder.cpp:31 [opt]
    frame #8: 0x0000000113d1e210 XUL`mozilla::dom::WorkerHolderToken::~WorkerHolderToken(this=0x000000011956f388) at WorkerHolderToken.cpp:73 [opt]
    frame #9: 0x0000000113d1e28e XUL`mozilla::dom::WorkerHolderToken::~WorkerHolderToken() [inlined] mozilla::dom::WorkerHolderToken::~WorkerHolderToken(this=0x000000012c1ae1a0) at WorkerHolderToken.cpp:70 [opt]
    frame #10: 0x0000000113d1e289 XUL`mozilla::dom::WorkerHolderToken::~WorkerHolderToken(this=0x000000012c1ae1a0) at WorkerHolderToken.cpp:70 [opt]
    frame #11: 0x00000001135d2df1 XUL`mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] mozilla::dom::WorkerHolderToken::Release(this=<unavailable>) at WorkerHolderToken.h:68 [opt]
    frame #12: 0x00000001135d2de3 XUL`mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] mozilla::RefPtrTraits<mozilla::dom::WorkerHolderToken>::Release(aPtr=<unavailable>) at RefPtr.h:45 [opt]
    frame #13: 0x00000001135d2de3 XUL`mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] RefPtr<mozilla::dom::WorkerHolderToken>::ConstRemovingRefPtrTraits<mozilla::dom::WorkerHolderToken>::Release(aPtr=<unavailable>) at RefPtr.h:362 [opt]
    frame #14: 0x00000001135d2de3 XUL`mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] RefPtr<mozilla::dom::WorkerHolderToken>::assign_assuming_AddRef(this=<unavailable>) at RefPtr.h:64 [opt]
    frame #15: 0x00000001135d2de3 XUL`mozilla::dom::ClientManagerChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [inlined] RefPtr<mozilla::dom::WorkerHolderToken>::operator=(this=<unavailable>) at RefPtr.h:155 [opt]
    frame #16: 0x00000001135d2de3 XUL`mozilla::dom::ClientManagerChild::ActorDestroy(this=0x000000012ce39d90, aReason=<unavailable>) at ClientManagerChild.cpp:20 [opt]
    frame #17: 0x00000001121704b1 XUL`mozilla::dom::PClientManagerChild::DestroySubtree(this=0x000000012ce39d90, why=Deletion) at PClientManagerChild.cpp:500 [opt]
    frame #18: 0x000000011216ff06 XUL`mozilla::dom::PClientManagerChild::OnMessageReceived(this=<unavailable>, msg__=0x000000012ce8c240) at PClientManagerChild.cpp:407 [opt]
    frame #19: 0x0000000111e82792 XUL`mozilla::ipc::MessageChannel::DispatchAsyncMessage(this=0x000000012ce780f8, aMsg=0x000000012ce8c240) at MessageChannel.cpp:2150 [opt]
    frame #20: 0x0000000111e819e7 XUL`mozilla::ipc::MessageChannel::DispatchMessage(this=0x000000012ce780f8, aMsg=0x000000012ce8c240) at MessageChannel.cpp:2077 [opt]
    frame #21: 0x0000000111e822ee XUL`mozilla::ipc::MessageChannel::MessageTask::Run(this=0x000000012ce8c1f0) at MessageChannel.cpp:1967 [opt]
    frame #22: 0x0000000111979ae1 XUL`nsThread::ProcessNextEvent(this=0x000000012bf646d0, aMayWait=<unavailable>, aResult=0x0000700010a1962f) at nsThread.cpp:1162 [opt]
    frame #23: 0x000000011197c2d9 XUL`NS_ProcessNextEvent(aThread=<unavailable>, aMayWait=false) at nsThreadUtils.cpp:474 [opt]
    frame #24: 0x0000000113d3243f XUL`mozilla::dom::WorkerPrivate::DoRunLoop(this=0x000000012cd18800, aCx=0x000000012c0a7000) at WorkerPrivate.cpp:2738 [opt]
    frame #25: 0x0000000113d2077d XUL`mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run(this=0x000000012cd63ac0) at RuntimeService.cpp:2289 [opt]
    frame #26: 0x0000000111979ae1 XUL`nsThread::ProcessNextEvent(this=0x000000012bf646d0, aMayWait=<unavailable>, aResult=0x0000700010a19def) at nsThread.cpp:1162 [opt]
    frame #27: 0x000000011197c2d9 XUL`NS_ProcessNextEvent(aThread=<unavailable>, aMayWait=true) at nsThreadUtils.cpp:474 [opt]
    frame #28: 0x0000000111e852da XUL`mozilla::ipc::MessagePumpForNonMainThreads::Run(this=0x000000012cd63a40, aDelegate=0x000000012c1ef100) at MessagePump.cpp:333 [opt]
    frame #29: 0x0000000111e48449 XUL`MessageLoop::Run() [inlined] MessageLoop::RunInternal(this=<unavailable>) at message_loop.cc:315 [opt]
    frame #30: 0x0000000111e4843a XUL`MessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:308 [opt]
    frame #31: 0x0000000111e4843a XUL`MessageLoop::Run(this=<unavailable>) at message_loop.cc:290 [opt]
    frame #32: 0x0000000111977a42 XUL`nsThread::ThreadFunc(aArg=<unavailable>) at nsThread.cpp:449 [opt]
    frame #33: 0x000000010fde7a3c libnss3.dylib`_pt_root(arg=0x000000012c178840) at ptthread.c:201 [opt]
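The sequence described above (status flipped by a control runnable, last holder released by a normal runnable, thread then blocking until the GC timer happens to fire) can be reproduced in a small self-contained model. Everything below is invented for illustration (WorkerModel, RunDemo, the 150 ms stand-in GC timer); it is not Gecko code, only the same loop shape rebuilt on std::condition_variable:

```cpp
// Toy model (NOT Gecko code) of the pre-patch DoRunLoop behavior: the
// shutdown check (canceling && holders == 0) only runs after a *control*
// runnable is processed, so when the last holder is released from a normal
// runnable the thread blocks until something unrelated wakes it.
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

struct WorkerModel {
  std::mutex lock;
  std::condition_variable cv;
  std::queue<std::function<void()>> controlQueue, normalQueue;
  bool canceling = false;  // stand-in for mStatus != Running
  int holders = 1;         // stand-in for HasActiveHolders()

  void Dispatch(std::queue<std::function<void()>>& q, std::function<void()> r) {
    { std::lock_guard<std::mutex> g(lock); q.push(std::move(r)); }
    cv.notify_one();
  }

  void RunLoop() {
    for (;;) {
      std::function<void()> r;
      bool isControl;
      {
        std::unique_lock<std::mutex> g(lock);
        // Bug: the wait only wakes for new runnables; it never re-checks
        // whether the worker became eligible for shutdown in the meantime.
        cv.wait(g, [&] { return !controlQueue.empty() || !normalQueue.empty(); });
        isControl = !controlQueue.empty();
        auto& q = isControl ? controlQueue : normalQueue;
        r = std::move(q.front());
        q.pop();
      }
      r();
      // Like the real loop, shutdown is only considered on the control path.
      if (isControl) {
        std::lock_guard<std::mutex> g(lock);
        if (canceling && holders == 0) return;  // worker terminates
      }
    }
  }
};

// Returns how long (ms) the worker lingered after its last holder was gone.
long RunDemo() {
  WorkerModel w;
  std::thread worker([&] { w.RunLoop(); });
  // terminate(): a control runnable flips the status...
  w.Dispatch(w.controlQueue,
             [&] { std::lock_guard<std::mutex> g(w.lock); w.canceling = true; });
  // ...but the last holder is released from a *normal* runnable.
  w.Dispatch(w.normalQueue,
             [&] { std::lock_guard<std::mutex> g(w.lock); w.holders = 0; });
  auto start = std::chrono::steady_clock::now();
  // Stand-in for the GC timer: an unrelated control runnable 150 ms later.
  std::thread gcTimer([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(150));
    w.Dispatch(w.controlQueue, [] {});
  });
  worker.join();  // only returns once the "GC timer" wakes the loop
  gcTimer.join();
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::steady_clock::now() - start).count();
}
```

In this model the worker lingers for roughly the timer interval after its last holder is gone, mirroring the multi-second gap seen with the real GC timer.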
Comment on attachment 9043457 [details] [diff] [review]
patch

Review of attachment 9043457 [details] [diff] [review]:
-----------------------------------------------------------------

Aha!  The Clients API holder and the very confusing train-wreck that is how IPC runnables get into the worker event loop!  (In theory the IPC message should actually be wrapped and processed as a control runnable thanks to the parent status and the hybrid event target, but I don't think that actually works, as we can see with what's going on here.  I think we're falling back to using the XPCOM thread's event target rather than having handed the hybrid event target to the root pbackground child actor.)  Thanks very much for the stack and analysis.

So, to restate what's going on:
- DoRunLoop's waiting logic assumes that the transition from mStatus == Running to mStatus == Canceling will happen inside a control runnable and that this will decisively cause the worker to advance to Killing.
- This assumption is wrong because we can have holders that prevent us from moving to Killing.  The exact reason we have holders like this is so that IPC objects can be torn down cleanly, and in the case of the Cache API, so that they can even continue to finish up pending work before the worker is torn down.  (I believe the core of the desire for clean teardown was because at some point the receipt of messages for a dead actor on a dead thread would lead to IPC panics that would crash the browser.)
- Those holders are frequently removed from normal IPC runnables.  (Note: we do have the wacky hybrid event target that will upgrade normal runnables to control runnables.  And that's the event target we give to the ClientManager which it will direct all its MozPromise runnables at.  However, the holder is removed directly from a received IPC message which successfully ends up in the normal queue.)
- As observed, we currently end up hanging until the GC control runnable or something else wakes up the thread so that the termination can complete.  It's notable that only control runnables will be processed in such a case; we will not give the normal runnables a chance to run.

The possible solutions to such an issue are:
1. This patch.  Update the waiting loop invariant to be aware of termination criteria.  This assumes that mStatus is only updated on the worker thread.  And that is a correct assumption!  This also has the benefit of having the least change in behavior of workers.
2. Cause removing a WorkerHolder to generate a control runnable, triggering a wake-up.  This only makes sense in cross-thread situations where the thread is already blocking and needs to be signaled to wake up.  It makes little sense when we're already on the thread in question.  There's also another idiom where you'd schedule a runnable to maintain ordering, but the whole point of control runnables is that they are processed first, so that would be very silly.
3. Fix the hybrid event target usage so that once the worker's termination is initiated, IPC runnables are wrapped and made into control runnables.  This aligns well with the original intent but ends up constituting a potentially massive/risky change.

So this patch seems like a nice pragmatic change.  I do think medium/longer term we want to have the behavior for what happens with IPC and the various event targets during termination be more clear and better documented.

::: dom/workers/WorkerPrivate.cpp
@@ +2615,2 @@
>  
>        while (mControlQueue.IsEmpty() &&

I suggest we add a comment like the following on this block of code:

Wait for a runnable to arrive that we can execute, or for it to be okay to shut down this worker once all holders have been removed.  (Holders may be removed from inside normal runnables, but we don't check for that after processing normal runnables, so we need to let control flow to the shutdown logic without blocking.)
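A standalone sketch of the loop shape this comment describes (hypothetical names such as FixedWorkerModel; not the real WorkerPrivate code): folding the shutdown criteria into the wait predicate means the loop falls through as soon as the last holder is removed, with no external wake-up needed.

```cpp
// Toy model (NOT Gecko code) of the patched wait logic: the condition
// variable's predicate includes the shutdown criteria, so a holder released
// from a normal runnable lets the loop exit before it ever blocks again.
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

struct FixedWorkerModel {
  std::mutex lock;
  std::condition_variable cv;
  std::queue<std::function<void()>> queue_;
  bool canceling = false;  // stand-in for mStatus != Running
  int holders = 1;         // stand-in for HasActiveHolders()

  bool CanShutDown() const { return canceling && holders == 0; }

  void Dispatch(std::function<void()> r) {
    { std::lock_guard<std::mutex> g(lock); queue_.push(std::move(r)); }
    cv.notify_one();
  }

  void RunLoop() {
    for (;;) {
      std::function<void()> r;
      {
        std::unique_lock<std::mutex> g(lock);
        // Wait for a runnable to arrive, OR for it to be okay to shut down
        // this worker once all holders have been removed.
        cv.wait(g, [&] { return !queue_.empty() || CanShutDown(); });
        if (queue_.empty()) return;  // shutdown criteria met, no event needed
        r = std::move(queue_.front());
        queue_.pop();
      }
      r();  // may set canceling or release holders; the predicate is
            // re-evaluated before the loop ever blocks again
    }
  }
};
```

With this predicate, releasing the final holder from a normal runnable is enough: the next trip through the wait re-checks the shutdown criteria before blocking, so no GC timer or other control runnable is needed to unstick the thread.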
Attachment #9043457 - Flags: review?(bugmail) → review+
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/592350649c0f
Shutdown workers immediately after terminate()ing them, r=asuth.
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bf486c5f798b
Don't check for information from shared worker in test.

Hi Tarek, I had to disable a check in the browser_test_performance_metrics.js mochitest in revision bf486c5f798b. Workers in content processes shut down immediately now, instead of waiting for multiple seconds, and it seems like this happens fast enough that the shared worker in this test never reports any performance metrics. Does this test change look OK?

Flags: needinfo?(tarek)

Backed out changeset 592350649c0f (Bug 1522246) for browser_test_performance_metrics.js failures
Also backed out the fix: changeset bf486c5f798b (Bug 1522246) for ES Lint failure

Pushes with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=8fcd04c595a9b49eb779b978d4ebd4b7d2232761&searchStr=os%2Cx%2C10.10%2Copt%2Cmochitests%2Cwith%2Ce10s%2Ctest-macosx64%2Fopt-mochitest-browser-chrome-e10s-6%2Cm-e10s%28bc6%29&tochange=f07eec8b4f2c820d2f1f15c0976f77cd0f7d9321&selectedJob=230648258

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=bf486c5f798baff9c7f8fb09882159584917a0ef&selectedJob=230663013

Backout links: https://hg.mozilla.org/integration/mozilla-inbound/rev/f07eec8b4f2c820d2f1f15c0976f77cd0f7d9321

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=230648258&repo=mozilla-inbound&lineNumber=4951

23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Categories with an empty count are dropped - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | An entry should have a host or a windowId - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Worker duration should be positive - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Worker count should be positive - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Duration should be positive - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | Should get a positive count - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | parent process sent back some events - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | example.com as a top level window - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | about:memory - true == true -
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | got some memory value reported - true == true -
23:53:07 INFO - Buffered messages finished
23:53:07 INFO - TEST-UNEXPECTED-FAIL | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | We got some info from a shared worker - false == true - JS frame :: chrome://mochitests/content/browser/dom/tests/browser/perfmetrics/browser_test_performance_metrics.js :: test/< :: line 141
23:53:07 INFO - Stack trace:
23:53:07 INFO - chrome://mochitests/content/browser/dom/tests/browser/perfmetrics/browser_test_performance_metrics.js:test/<:141
23:53:07 INFO - resource://testing-common/BrowserTestUtils.jsm:withNewTab:111
23:53:07 INFO - chrome://mochitests/content/browser/dom/tests/browser/perfmetrics/browser_test_performance_metrics.js:test:63
23:53:07 INFO - chrome://mochikit/content/browser-test.js:Tester_execTest/<:1106
23:53:07 INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1134
23:53:07 INFO - chrome://mochikit/content/browser-test.js:nextTest/<:995
23:53:07 INFO - chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:803
23:53:07 INFO - TEST-PASS | dom/tests/browser/perfmetrics/browser_test_performance_metrics.js | This test generated at least 30 unique counters - true == true -

Flags: needinfo?(bhackett1024)
Backout by btara@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2ef9c49c8f27
Backed out changeset 592350649c0f for browser_test_performance_metrics.js failures CLOSED TREE
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/831da989e317
Shutdown workers immediately after terminate()ing them, r=asuth.

Hi Brian,

I am sorry, but I don't think commenting out the test is the proper fix. Could you make it work so we don't lose it?

I think the proper fix would be to wait until the worker has sent enough info before shutting everything down. This could be done by waiting a bit inside the worker itself before it's shut down.

Flags: needinfo?(tarek)
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 67
Attached patch "fix test" (Splinter Review)

This patch should fix browser_test_performance_metrics.js, per the try run below:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=763aed555f879d1ea01d56f20d6cd2dd4a84763f

Flags: needinfo?(bhackett1024)
Attachment #9047593 - Flags: superreview?(tarek)
Comment on attachment 9047593 [details] [diff] [review]
fix test

Review of attachment 9047593 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Brian. This is very confusing; it looks like the test was already broken.
Attachment #9047593 - Flags: superreview?(tarek) → superreview+
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6889cdacf015
Fix browser_test_performance_metrics.js test, r=tarek.