ServiceWorker event dispatch should not go through the main thread of the content process
Categories
(Core :: DOM: Service Workers, enhancement, P2)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox134 | --- | fixed |
People
(Reporter: asuth, Assigned: edenchuang)
References
(Blocks 2 open bugs)
Details
Attachments
(5 files, 2 obsolete files)
As expressed in bug 1672491, ServiceWorker "ops", including fetch ops, are bounced off the main thread of the content process from the RemoteWorkerService "Worker Launcher" thread due to invariants about the main thread needing to be the owner of any top-level worker. This is hypothesized to be bad for performance, is the most likely point of contention with other browser activity, and should be addressed.
Note that in theory this impacts SharedWorkers too, but in practice SharedWorkers' API surface is exposed via MessageChannel, so steady-state interactions with a SharedWorker do not contend with the main thread. (However, binding to an existing SharedWorker will involve the main thread.)
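For intuition about the extra hop being described, here is a tiny, generic C++ sketch; FakeEventTarget and its inline Dispatch are invented for illustration and stand in for real thread event targets rather than Gecko's dispatch machinery:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Stand-in for a thread's event target; a real target would queue the
// runnable for that thread instead of running it inline.
struct FakeEventTarget {
  std::string name;
  void Dispatch(std::function<void()> runnable) {
    std::cout << "dispatched on: " << name << "\n";
    runnable();
  }
};

int main() {
  FakeEventTarget launcher{"Worker Launcher"}, mainThread{"Main"}, worker{"Worker"};

  // Current path: the op arrives on the Worker Launcher thread and bounces
  // off the main thread before it finally reaches the worker thread.
  launcher.Dispatch([&] {
    mainThread.Dispatch([&] {
      worker.Dispatch([] { std::cout << "run fetch op\n"; });
    });
  });

  // Desired path: deliver the op directly to the worker thread.
  worker.Dispatch([] { std::cout << "run fetch op\n"; });
  return 0;
}
```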
Updated • 4 years ago • Reporter

Comment 1 • 4 years ago • Reporter
:edenchuang, :jesup, and I just had a quick talk about this, and we identified another potential way to take the main thread out of the latency path.
PRemoteWorker direct to Worker thread for non-lifecycle operations
- Currently PRemoteWorker is used to connect the RemoteWorkerManager on the IPDL Background thread to the RemoteWorkerService on the Worker Launcher thread in a content process. As operations are received on the Worker Launcher thread, they are dispatched by way of the main thread to the worker thread proper.
- This provides a homogeneous data path for the lifecycle operations of spawning and terminating the worker as well as for non-lifecycle operations like "fetch" events, "message" events, etc.
- Instead, the PRemoteWorker connection could be established directly to the worker thread for non-lifecycle events, allowing "fetch" and "message" events to be dispatched straight to the worker.
- This would mean that lifecycle events would need to be split into their own protocol. The existing PRemoteWorkerService could be used for this purpose. This would cover both spawning and termination.
- Arguably there are advantages to having an actor whose lifetime corresponds to that of the worker (versus passing identifiers back and forth). So it might make sense to create a PRemoteWorkerParent which could be managed by PRemoteWorkerService for additional clarity. (I'm not 100% clear on why PRemoteWorker wasn't already managed by PRemoteWorkerService at this moment.)
The main logistical complication: today, non-lifecycle operations can be sent over PRemoteWorker as soon as it is constructed as part of the lifecycle action of spawning the ServiceWorker. Under the new split-protocol scheme, those operations could not be sent until PRemoteWorker has registered itself with the parent process. That would potentially require a new rendezvous-via-identifier mechanism, unless IPC enhancements related to endpoints allow the worker-spawning path to pass an endpoint down over PRemoteWorkerService, providing a pre-named/pre-allocated RemoteWorkerParent that can be established from a different thread in the content process (and thereby via a different PBackground parent/child pair). A generic sketch of the "queue until registered" idea follows.
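To make the rendezvous complication concrete, here is a minimal, generic C++ sketch of queueing non-lifecycle operations on the parent side until the worker-bound actor has registered. The OpController and NonLifecycleOp names are invented for this sketch and are not Gecko APIs:

```cpp
#include <cstdio>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// A non-lifecycle operation, e.g. delivering a "fetch" or "message" event.
using NonLifecycleOp = std::function<void()>;

class OpController {
 public:
  // Called whenever the parent-process logic wants to run an op.
  void ExecOp(NonLifecycleOp op) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      if (!mWorkerActorReady) {
        mPendingOps.push_back(std::move(op));  // rendezvous not done yet: queue
        return;
      }
    }
    op();  // worker-bound actor exists: "send" immediately
  }

  // Called once the worker-side actor has registered with the parent process.
  void OnWorkerActorRegistered() {
    std::vector<NonLifecycleOp> pending;
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mWorkerActorReady = true;
      pending.swap(mPendingOps);
    }
    for (auto& op : pending) {
      op();  // flush everything that was queued before the rendezvous
    }
  }

 private:
  std::mutex mMutex;
  bool mWorkerActorReady = false;
  std::vector<NonLifecycleOp> mPendingOps;
};

int main() {
  OpController controller;
  controller.ExecOp([] { std::puts("fetch op"); });    // queued: actor not registered
  controller.OnWorkerActorRegistered();                // flushes the queued "fetch op"
  controller.ExecOp([] { std::puts("message op"); });  // sent immediately
  return 0;
}
```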
Benefits
- This would allow us to side-step bug 1672491 as a blocking dependency. Until bug 1672491 is fixed, there would of course potentially be some amount of main-thread contention experienced when spawning the ServiceWorker, but the steady state operation would not involve the main thread.
- Thread hopping elimination:
- Relative to the current state of the tree: this eliminates the hop from the worker launcher thread to the main thread and then to the worker.
- Relative to the proposal from comment 0: this eliminates the hop from the worker launcher thread to the worker. This is less of a performance win, since the worker launcher thread should never be contended, but it would be a meaningful simplification in implementation and conceptual complexity, as all non-lifecycle events would only involve a single thread.
Comment 2 • 2 years ago • Assignee

Updated • 2 years ago

Comment 3 • 2 years ago • Assignee
This patch introduces a new IPC protocol, PRemoteWorkerNonLifeCycleOpController, for the non-lifecycle operations of SharedWorker/ServiceWorker.
PRemoteWorkerNonLifeCycleOpController connects the content-process worker thread to the parent-process background thread,
building a direct communication channel instead of going through the worker launcher thread and the main thread.
When a worker enters the Running state, the PRemoteWorkerNonLifeCycleOpControllerChild/Parent pair is created from the content process,
and PRemoteWorkerNonLifeCycleOpControllerParent binds to the corresponding RemoteWorkerController by reusing RemoteWorkerManager to find the target RemoteWorkerServiceParent.
Once the connection is established, RemoteWorkerController sends non-lifecycle operations over PRemoteWorkerNonLifeCycleOpController and lifecycle operations over PRemoteWorker (see the routing sketch below).
Depends on D196946
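A simplified, hypothetical C++ sketch of the routing split described above; the Controller, LifecycleActor, and NonLifecycleActor names are invented for illustration and are not the actual Gecko classes:

```cpp
#include <cassert>
#include <cstdio>

enum class OpKind { Lifecycle, NonLifecycle };

// Stand-in for the lifecycle actor (conceptually: PRemoteWorker).
struct LifecycleActor {
  void Send(const char* op) { std::printf("lifecycle op over PRemoteWorker: %s\n", op); }
};

// Stand-in for the worker-bound actor
// (conceptually: PRemoteWorkerNonLifeCycleOpController).
struct NonLifecycleActor {
  void Send(const char* op) { std::printf("non-lifecycle op over controller: %s\n", op); }
};

class Controller {
 public:
  // Only valid once the worker has reached Running and the worker-bound
  // actor pair has been created and bound.
  void SetNonLifecycleActor(NonLifecycleActor* actor) { mNonLifecycle = actor; }

  void ExecOp(OpKind kind, const char* op) {
    if (kind == OpKind::Lifecycle) {
      mLifecycle.Send(op);      // spawn/terminate style operations
    } else {
      assert(mNonLifecycle);    // requires the Running-state rendezvous
      mNonLifecycle->Send(op);  // fetch/message style operations
    }
  }

 private:
  LifecycleActor mLifecycle;
  NonLifecycleActor* mNonLifecycle = nullptr;
};

int main() {
  Controller controller;
  controller.ExecOp(OpKind::Lifecycle, "spawn");
  NonLifecycleActor workerActor;
  controller.SetNonLifecycleActor(&workerActor);  // worker reached Running
  controller.ExecOp(OpKind::NonLifecycle, "fetch");
  controller.ExecOp(OpKind::Lifecycle, "terminate");
  return 0;
}
```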
Comment 4 • 2 years ago • Assignee
To reuse the RemoteWorker::State and RemoteWorker::Op code in RemoteWorkerNonLifeCycleOpControllerChild, this patch moves the shared code into a new file, RemoteWorkerOp.h, and also extracts RemoteWorker::SharedWorkerOp into SharedWorkerOp.h/.cpp.
Depends on D197563
Comment 5 • 2 years ago • Assignee
Depends on D198005
Comment 6 • 2 years ago • Assignee
Depends on D198022
Comment 7 • 2 years ago • Assignee
Depends on D198029
Comment 9 • 1 year ago
Backed out for causing service worker-related perma failures
Backout link: https://hg.mozilla.org/integration/autoland/rev/f75273742b59bde984e44be59057c880e946ddb3
Updated • 1 year ago
Comment 10 • 1 year ago • Reporter
As I dig into the revised patch stack, which gets rid of direct IPC delivery to the worker, I wanted to better understand the deadlock concern we're seeing.
The deadlock stack extracted from https://treeherder.mozilla.org/logviewer?job_id=464223247&repo=autoland&lineNumber=120346-120390 (the 3rd failure link from comment 9) is:
[task 2024-06-27T09:56:35.105Z] 09:56:35 INFO - GECKO(1531) | ###!!! Deadlock may happen NOW!
[task 2024-06-27T09:56:35.106Z] 09:56:35 INFO - GECKO(1531) | : 'Error', file /builds/worker/checkouts/gecko/xpcom/threads/BlockingResourceBase.cpp:245
[task 2024-06-27T09:56:35.110Z] 09:56:35 INFO - Initializing stack-fixing for the first stack frame, this may take a while...
[task 2024-06-27T09:56:48.493Z] 09:56:48 INFO - GECKO(1531) | #01: NS_DebugBreak [xpcom/base/nsDebugImpl.cpp:0]
[task 2024-06-27T09:56:48.494Z] 09:56:48 INFO - GECKO(1531) | #02: mozilla::BlockingResourceBase::CheckAcquire() [xpcom/threads/BlockingResourceBase.cpp:0]
[task 2024-06-27T09:56:48.495Z] 09:56:48 INFO - GECKO(1531) | #03: mozilla::OffTheBooksMutex::Lock() [xpcom/threads/BlockingResourceBase.cpp:311]
[task 2024-06-27T09:56:48.496Z] 09:56:48 INFO - GECKO(1531) | #04: mozilla::dom::WorkerThread::Dispatch(already_AddRefed<nsIRunnable>, unsigned int) [dom/workers/WorkerThread.cpp:286]
[task 2024-06-27T09:56:48.497Z] 09:56:48 INFO - GECKO(1531) | #05: mozilla::ipc::MessageChannel::MessageTask::Post() [ipc/glue/MessageChannel.cpp:1677]
[task 2024-06-27T09:56:48.497Z] 09:56:48 INFO - GECKO(1531) | #06: mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) [ipc/glue/MessageChannel.cpp:1135]
[task 2024-06-27T09:56:48.498Z] 09:56:48 INFO - GECKO(1531) | #07: mozilla::ipc::PortLink::OnPortStatusChanged() [ipc/glue/MessageLink.cpp:195]
[task 2024-06-27T09:56:48.499Z] 09:56:48 INFO - GECKO(1531) | #08: mozilla::ipc::PortLink::PortObserverThunk::OnPortStatusChanged() [ipc/glue/MessageLink.cpp:60]
[task 2024-06-27T09:56:48.499Z] 09:56:48 INFO - GECKO(1531) | #09: mozilla::ipc::NodeController::PortStatusChanged(mojo::core::ports::PortRef const&) [ipc/glue/NodeController.cpp:434]
[task 2024-06-27T09:56:48.500Z] 09:56:48 INFO - GECKO(1531) | #10: mojo::core::ports::Node::OnUserMessage(mojo::core::ports::PortRef const&, mojo::core::ports::NodeName const&, mozilla::UniquePtr<mojo::core::ports::UserMessageEvent, mozilla::DefaultDelete<mojo::core::ports::UserMessageEvent> >) [ipc/chromium/src/mojo/core/ports/node.cc:0]
[task 2024-06-27T09:56:48.501Z] 09:56:48 INFO - GECKO(1531) | #11: mojo::core::ports::Node::AcceptEventInternal(mojo::core::ports::PortRef const&, mojo::core::ports::NodeName const&, mozilla::UniquePtr<mojo::core::ports::Event, mozilla::DefaultDelete<mojo::core::ports::Event> >) [ipc/chromium/src/mojo/core/ports/node.cc:466]
[task 2024-06-27T09:56:48.502Z] 09:56:48 INFO - GECKO(1531) | #12: mojo::core::ports::Node::AcceptEvent(mojo::core::ports::NodeName const&, mozilla::UniquePtr<mojo::core::ports::Event, mozilla::DefaultDelete<mojo::core::ports::Event> >) [ipc/chromium/src/mojo/core/ports/node.cc:538]
[task 2024-06-27T09:56:48.503Z] 09:56:48 INFO - GECKO(1531) | #13: mojo::core::ports::Node::SendUserMessageInternal(mojo::core::ports::PortRef const&, mozilla::UniquePtr<mojo::core::ports::UserMessageEvent, mozilla::DefaultDelete<mojo::core::ports::UserMessageEvent> >*) [ipc/chromium/src/mojo/core/ports/node.cc:1276]
[task 2024-06-27T09:56:48.504Z] 09:56:48 INFO - GECKO(1531) | #14: mojo::core::ports::Node::SendUserMessage(mojo::core::ports::PortRef const&, mozilla::UniquePtr<mojo::core::ports::UserMessageEvent, mozilla::DefaultDelete<mojo::core::ports::UserMessageEvent> >) [ipc/chromium/src/mojo/core/ports/node.cc:0]
[task 2024-06-27T09:56:48.504Z] 09:56:48 INFO - GECKO(1531) | #15: mozilla::ipc::NodeController::SendUserMessage(mojo::core::ports::PortRef const&, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) [ipc/glue/NodeController.cpp:150]
[task 2024-06-27T09:56:48.505Z] 09:56:48 INFO - GECKO(1531) | #16: mozilla::ipc::PortLink::SendMessage(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) [ipc/glue/MessageLink.cpp:133]
[task 2024-06-27T09:56:48.506Z] 09:56:48 INFO - GECKO(1531) | #17: mozilla::ipc::MessageChannel::SendMessageToLink(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) [ipc/glue/MessageChannel.cpp:811]
[task 2024-06-27T09:56:48.506Z] 09:56:48 INFO - GECKO(1531) | #18: mozilla::ipc::MessageChannel::Send(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) [ipc/glue/MessageChannel.cpp:780]
[task 2024-06-27T09:56:48.507Z] 09:56:48 INFO - GECKO(1531) | #19: mozilla::ipc::MessageChannel::Send<mozilla::wr::MemoryReport>(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >, int, unsigned int, std::function<void (mozilla::wr::MemoryReport&&)>&&, std::function<void (mozilla::ipc::ResponseRejectReason)>&&) [ipc/glue/MessageChannel.h:249]
[task 2024-06-27T09:56:48.508Z] 09:56:48 INFO - GECKO(1531) | #20: mozilla::ipc::IProtocol::ChannelSend<nsTArray<mozilla::net::DNSCacheEntries> >(mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >, unsigned int, std::function<void (nsTArray<mozilla::net::DNSCacheEntries>&&)>&&, std::function<void (mozilla::ipc::ResponseRejectReason)>&&) [ipc/glue/ProtocolUtils.h:268]
[task 2024-06-27T09:56:48.509Z] 09:56:48 INFO - GECKO(1531) | #21: mozilla::dom::PRemoteWorkerNonLifeCycleOpControllerParent::SendExecServiceWorkerOp(mozilla::dom::ServiceWorkerOpArgs const&, std::function<void (mozilla::dom::ServiceWorkerOpResult&&)>&&, std::function<void (mozilla::ipc::ResponseRejectReason)>&&) [s3:gecko-generated-sources:8c53156864ddfe3eaa597c5b3c06d64c1c01de0c996a28d119ad93c4313ca1539812c171316b5ee019c170e12912ea4552e95ccf38149e5cc0da97bb1dca3e6c/ipc/ipdl/PRemoteWorkerNonLifeCycleOpControllerParent.cpp::131]
[task 2024-06-27T09:56:48.510Z] 09:56:48 INFO - GECKO(1531) | #22: mozilla::dom::PRemoteWorkerNonLifeCycleOpControllerParent::SendExecServiceWorkerOp(mozilla::dom::ServiceWorkerOpArgs const&) [s3:gecko-generated-sources:8c53156864ddfe3eaa597c5b3c06d64c1c01de0c996a28d119ad93c4313ca1539812c171316b5ee019c170e12912ea4552e95ccf38149e5cc0da97bb1dca3e6c/ipc/ipdl/PRemoteWorkerNonLifeCycleOpControllerParent.cpp::139]
[task 2024-06-27T09:56:48.511Z] 09:56:48 INFO - GECKO(1531) | #23: mozilla::dom::RemoteWorkerController::PendingServiceWorkerOp::MaybeStart(mozilla::dom::RemoteWorkerController*) [dom/workers/remoteworkers/RemoteWorkerController.cpp:525]
[task 2024-06-27T09:56:48.511Z] 09:56:48 INFO - GECKO(1531) | #24: mozilla::dom::RemoteWorkerController::ExecServiceWorkerOp(mozilla::dom::ServiceWorkerOpArgs&&) [dom/workers/remoteworkers/RemoteWorkerController.cpp:300]
[task 2024-06-27T09:56:48.512Z] 09:56:48 INFO - GECKO(1531) | #25: mozilla::dom::RemoteWorkerControllerParent::RecvExecServiceWorkerOp(mozilla::dom::ServiceWorkerOpArgs&&, std::function<void (mozilla::dom::ServiceWorkerOpResult const&)>&&) [dom/workers/remoteworkers/RemoteWorkerControllerParent.cpp:121]
[task 2024-06-27T09:56:48.513Z] 09:56:48 INFO - GECKO(1531) | #26: mozilla::dom::PRemoteWorkerControllerParent::OnMessageReceived(IPC::Message const&) [s3:gecko-generated-sources:d8a704b92fbe14572882e948f3c7ad5ef853373192386eedc8dfe55c7a3a6804af2d587d9ede77833cca76985c030c620beea37413303698c6e97b988ef58eea/ipc/ipdl/PRemoteWorkerControllerParent.cpp::0]
[task 2024-06-27T09:56:48.514Z] 09:56:48 INFO - GECKO(1531) | #27: mozilla::ipc::PBackgroundParent::OnMessageReceived(IPC::Message const&) [s3:gecko-generated-sources:00584c61a28555adbaa5dd60645091919a161a144888b9d4fa73f12f08e7e441ab853382517837f64ce76e781bb098d6f81d61b9ecfd844e97e1b6a73223c8b4/ipc/ipdl/PBackgroundParent.cpp::2131]
[task 2024-06-27T09:56:48.514Z] 09:56:48 INFO - GECKO(1531) | #28: mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) [ipc/glue/MessageChannel.cpp:1820]
[task 2024-06-27T09:56:48.515Z] 09:56:48 INFO - GECKO(1531) | #29: mozilla::ipc::MessageChannel::DispatchMessage(mozilla::ipc::ActorLifecycleProxy*, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message> >) [ipc/glue/MessageChannel.cpp:0]
[task 2024-06-27T09:56:48.516Z] 09:56:48 INFO - GECKO(1531) | #30: mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::ActorLifecycleProxy*, mozilla::ipc::MessageChannel::MessageTask&) [ipc/glue/MessageChannel.cpp:1530]
[task 2024-06-27T09:56:48.516Z] 09:56:48 INFO - GECKO(1531) | #31: mozilla::ipc::MessageChannel::MessageTask::Run() [ipc/glue/MessageChannel.cpp:1639]
[task 2024-06-27T09:56:48.517Z] 09:56:48 INFO - GECKO(1531) | #32: nsThread::ProcessNextEvent(bool, bool*) [xpcom/threads/nsThread.cpp:1199]
All 3 failures are from "1proc" variants, which we know explicitly disable e10s and Fission; that is consistent with what we're seeing in the stack. We're seeing same-process messaging because the remote worker has been placed in the same process as the RemoteWorkerController. We explicitly do this when e10s is disabled. However, we also do this for SharedWorkers that have a system principal, so it's not an entirely theoretical problem.
The specific lock acquisition that the deadlock detector is angry about is WorkerPrivate::mMutex, which is used to notify the condition variable that wakes up the worker when it is waiting for things to happen.
I haven't entirely dug into the specific locking-discipline failure here, but from a hygiene perspective I think a viable option would be to switch from the very broadly used WorkerPrivate::mMutex for wake-up purposes to a much more tightly scoped mWakeupMutex or something like that. (We could potentially switch to Monitor.h's Monitor/MonitorAutoLock if we really only want to protect the wakeup.) Or, I suppose, a question might be why we aren't already using the WorkerThread with its WorkerThread::mLock and the already-extant (confusingly named) mWorkerPrivateCondVar, which is initialized to use mLock, to let the worker thread block and be notified, since that is the lock used for WorkerThread::HasPendingEvents.
Where things get a little fuzzy is that, for example, we maintain the list of debug runnables behind WorkerPrivate::mMutex. But there already is an inherent split, since the normal runnables use the WorkerThread's (more tightly scoped) mLock. I think it works out okay if we acquire WorkerPrivate::mMutex, push a runnable onto the queue, release WorkerPrivate::mMutex, and then call into WorkerThread to tell it that a wakeup may be necessary. Using the same lock for manipulating the debugger queue as for notifying the condvar really only matters if we want to avoid spurious wakeups. Specifically, we can imagine dropping WorkerPrivate::mMutex on the main thread after queueing a debug runnable; the worker then eagerly processes the debug runnable, finds no further work to do, and goes to sleep before we actually get to asking the WorkerThread to notify the condvar. The worker then wakes up, finds there is nothing to do, and goes back to sleep. A generic sketch of this "narrow wakeup lock" pattern is below.
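A minimal, generic C++ sketch of that "narrow wakeup lock" idea, using std:: primitives rather than Gecko's Mutex/CondVar types; mQueueMutex plays the role of WorkerThread::mLock, and mWakeupMutex/mWakeupPending stand in for the hypothetical tightly scoped wake-up state. Spurious wakeups are harmless because the worker simply re-checks its queue:

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

class WorkerLoopSketch {
 public:
  // Producer side (e.g. the main thread queueing a debug runnable).
  void Dispatch(std::function<void()> runnable) {
    {
      std::lock_guard<std::mutex> lock(mQueueMutex);  // analogous to WorkerThread::mLock
      mQueue.push_back(std::move(runnable));
    }
    // Queue lock is released; request a wakeup under the narrow lock only.
    std::lock_guard<std::mutex> lock(mWakeupMutex);
    mWakeupPending = true;
    mWakeup.notify_one();
  }

  // Consumer side: one turn of the worker thread's event loop.
  void RunOneIteration() {
    std::function<void()> runnable;
    {
      std::lock_guard<std::mutex> lock(mQueueMutex);
      if (!mQueue.empty()) {
        runnable = std::move(mQueue.front());
        mQueue.pop_front();
      }
    }
    if (runnable) {
      runnable();
      return;
    }
    // Nothing to do: sleep until a wakeup is requested.  If the work was
    // already processed eagerly, the next iteration just finds the queue
    // empty again and goes back to sleep -- the spurious-wakeup case above.
    std::unique_lock<std::mutex> lock(mWakeupMutex);
    mWakeup.wait(lock, [this] { return mWakeupPending; });
    mWakeupPending = false;
  }

 private:
  std::mutex mQueueMutex;
  std::deque<std::function<void()>> mQueue;
  std::mutex mWakeupMutex;
  bool mWakeupPending = false;
  std::condition_variable mWakeup;
};

int main() {
  WorkerLoopSketch loop;
  std::atomic<int> ran{0};
  std::thread worker([&] { while (ran < 2) loop.RunOneIteration(); });
  loop.Dispatch([&] { ++ran; std::puts("first runnable ran on the worker"); });
  loop.Dispatch([&] { ++ran; std::puts("second runnable ran on the worker"); });
  worker.join();
  return 0;
}
```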
Eden, what do you think of trying something like the above so that we can go back to your initially attempted landing? Or is there some other locking-discipline problem you observed that would still be an issue?
Updated • 1 year ago
Comment 11 • 1 year ago • Assignee
Depends on D198029
Updated • 1 year ago
Comment 12 • 1 year ago
Comment 13 • 1 year ago
Backed out for causing frequent wpt failures at RemoteWorkerController.cpp
Updated • 1 year ago • Assignee
Comment 14 • 1 year ago
Comment 15 • 1 year ago
Backed out for causing process crashes @ mozilla::ThreadEventTarget::Dispatch
Backout link: https://hg.mozilla.org/integration/autoland/rev/108d038388ff70457b2ec3e74241f7769c281d34
Comment 16 • 1 year ago

Updated • 1 year ago • Assignee
Comment 17 • 1 year ago • bugherder
https://hg.mozilla.org/mozilla-central/rev/544e3946eff0
https://hg.mozilla.org/mozilla-central/rev/a8dcadbfd190
https://hg.mozilla.org/mozilla-central/rev/9e61d1f904d5
https://hg.mozilla.org/mozilla-central/rev/4b65c5dbd103
https://hg.mozilla.org/mozilla-central/rev/42d2a6f8fe60
Comment 18 • 1 year ago
Can this bug still depend on bug 1672491 if it is closed FIXED?
Comment 19 • 1 year ago • Assignee
No, the dependency could be removed.