Open Bug 1752287 Opened 3 years ago Updated 1 year ago

MessagePort.postMessage for received MessagePorts will fail to send messages if blocking APIs (Sync XHR, Atomics) are used prior to the Entangling state machine stabilizing; workaround is to wait for receipt of a message

Categories

(Core :: DOM: postMessage, defect, P2)

Firefox 96
defect

Tracking


ASSIGNED

People

(Reporter: roberto.vidal, Assigned: asuth)

Details

Steps to reproduce:

Working example: https://github.com/jrvidal/message-port-repro

I have a main document (main), an iframe and workers spawned by the latter.

We send a MessagePort from (main) to (iframe), and then from (iframe) to a fresh (worker).
(iframe) and (worker) share a SharedArrayBuffer.
(worker) posts to the port and Atomics.wait()s using the SAB.
(main) receives the message and notifies (iframe).
(iframe) calls Atomics.notify() and lets (worker) resume, as sketched below.
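For reference, a minimal sketch of the three pieces (simplified; the handler wiring here is illustrative, the linked repo has the real code):

// root.js (main): create the channel, ship port1 into the iframe.
const { port1, port2 } = new MessageChannel();
port2.onmessage = () => iframe.contentWindow.postMessage("awake", "*");
iframe.contentWindow.postMessage({ port: port1 }, "*", [port1]);

// iframe.js: re-ship the received port into a fresh worker, along with the SAB.
const sab = new SharedArrayBuffer(4);
window.onmessage = (e) => {
  if (e.data === "awake") Atomics.notify(new Int32Array(sab), 0);
  else new Worker("worker.js").postMessage({ port: e.data.port, sab }, [e.data.port]);
};

// worker.js: post on the twice-shipped port, then block on the SAB.
onmessage = (e) => {
  e.data.port.postMessage({ counter: 0 });
  Atomics.wait(new Int32Array(e.data.sab), 0, 0); // never wakes when the bug hits
};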

Actual results:

I see this in the console:

[iframe] start worker #0                      iframe.js:14:15
[iframe worker #0] posting and sleeping       worker.js:7:11
[iframe worker #0] posting and sleeping       worker.js:7:11
[iframe] start worker #1                      iframe.js:14:15
[iframe worker #1] posting and sleeping       worker.js:7:11
[iframe worker #1] posting and sleeping       worker.js:7:11
[iframe] start worker #2                      iframe.js:14:15
[iframe worker #2] posting and sleeping       worker.js:7:11
[iframe worker #2] posting and sleeping       worker.js:7:11

Ignoring the duplicated logs (their cause is unclear), this means that the receiving port on the main thread never gets a message, and the worker never unblocks.

There is a TIMEOUT constant in worker.js that can force the worker to wait for a bit before posting-and-blocking. With TIMEOUT=-1 (no timeout) the issue is quite persistent. With TIMEOUT=0 it is more intermittent. With TIMEOUT around 100ms I can't observe the issue.
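The gate in worker.js is roughly this (simplified; postAndWait is a stand-in name for the post-then-Atomics.wait step above):

// worker.js (simplified)
if (TIMEOUT === -1) {
  postAndWait();                    // post immediately: fails quite persistently
} else {
  setTimeout(postAndWait, TIMEOUT); // 0: intermittent; ~100ms: not observed
}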

Expected results:

Not sure if this is expected behavior, but in Chrome (Chromium 97.0.4692.99) the worker is always awakened:

iframe.js:14        [iframe] start worker #0
worker.js:7         [iframe worker #0] posting and sleeping
root.js:11          [main thread] port received a message {counter: 0}
iframe.js:23        [iframe] awake worker #0
worker.js:13        [iframe worker #0] done waiting: ok
iframe.js:14        [iframe] start worker #1
worker.js:7         [iframe worker #1] posting and sleeping
root.js:11          [main thread] port received a message {counter: 1}
iframe.js:23        [iframe] awake worker #1
worker.js:13        [iframe worker #1] done waiting: ok
iframe.js:14        [iframe] start worker #2
worker.js:7         [iframe worker #2] posting and sleeping
root.js:11          [main thread] port received a message {counter: 2}
iframe.js:23        [iframe] awake worker #2
worker.js:13        [iframe worker #2] done waiting: ok

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Workers' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → DOM: Workers
Product: Firefox → Core

Eden, can you help me with triage here? Thanks!

Flags: needinfo?(echuang)

Mind taking a quick look, Lars? StackBlitz indicated that this is blocking their work to support Firefox.

Flags: needinfo?(lhansen)

I'll take a look next week.

The symptoms are consistent with an implementation where postMessage performs its work asynchronously, as the implementation of Atomics.wait will block the thread and prevent anything else from happening. Indeed, looking in WorkerPrivate.cpp at the implementation of PostMessageToParent, it creates a runnable which it then dispatches. If the thread subsequently blocks in the wait, the runnable will not be processed until the wait ends.

(In a sense, the addition of Atomics.wait to JS made the timing of the processing of posted messages observable, as pre-Atomics.wait the delayed processing was probably acceptable since the worker would be expected to return to its event loop quickly. That said, even pre-Atomics.wait a worker could observe that posted messages would be delayed until it returned to the event loop. That might in itself be a webcompat concern (and a perf concern, about which I've filed bugs in the past).)
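To make the observable concrete, a hypothetical worker script, assuming the runnable-based implementation described above:

// worker.js (hypothetical): the outgoing message is only queued here;
// the runnable that transmits it cannot run while this thread is blocked.
postMessage("queued, but not yet transmitted");
const view = new Int32Array(new SharedArrayBuffer(4));
Atomics.wait(view, 0, 0, 5000); // blocks this thread for up to 5s
// The receiver only observes the message once the wait ends and the
// worker returns to its event loop.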

There's a callback mechanism in the Atomics.wait functionality now, that should really be used to handle this situation, https://searchfox.org/mozilla-central/source/js/public/WaitCallbacks.h#20. The BeforeWait callback installed in the engine by the browser should probably process postMessage runnables when invoked to avoid the situation reported in this bug. Implementing that is something the DOM team probably has to figure out, I have no expertise in that area.

Flags: needinfo?(lhansen)

That makes sense — thanks for diagnosing this, Lars. Strange that this hasn't come up before.

I don't know enough about how the worker event queues are designed to know whether it would be feasible to synchronously process control runnables in BeforeWait without breaking invariants of the calling script or synchronously reentering new script. Andrew, wdyt?

Flags: needinfo?(bugmail)

Aha, so I think the problem here is that MessagePort has a state machine related to the entangling and disentangling of ports as they are transferred, and that state machine has to deal with the asynchronous complications of MessagePorts being a point-to-point communication mechanism built on top of the non-point-to-point PBackground mechanism. Because the MessagePort in question here is getting shipped twice, the state machine gets involved.

I believe the state machine is built on IPC, which uses normal runnables, not control runnables. Control runnables will trigger the JS interrupt handler, and so if atomics support that (which they probably have to?), control runnables will do the right thing. However, we can't just naively tell the existing PMessagePort ipdl to use an event target that wraps them in control runnables, because the normal message transmission needs to use normal runnable scheduling.

So unless IPC advances let us use magic endpoint stuff that can avoid the state machine (hence the :nika needinfo, hi Nika! :) by letting us pass around a pipe-ish thing that can handle a MessagePort being re-shipped before the normal task containing the messages actually gets to run (which necessitates re-transmission under the current regime), someone who is probably :smaug may need to look at doing some IPC-related refactoring so that the entangling state machine can use control runnables. (Noting that there may be different IPC hacks we could do... for example, here's BackgroundChildImpl::OnChannelReceivedMessage doing things to give the IPC team nightmares and here's LocalStorage doing it in PContent.)

Note that I have not looked at the spec text recently about the entangling/disentangling, etc. I'm assuming that even if the spec text allows shipping a port to depend on a task running (instead of just hand-waving it away), we would still want this to work, despite the possibility of being technically correct in not doing what content would expect/want.

Flags: needinfo?(nika)
Flags: needinfo?(echuang)
Flags: needinfo?(bugs)
Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #7)

... Control runnables will trigger the JS interrupt handler, and so if atomics support that (which they probably have to?)

I believe so, yes.

Thanks so much for the detailed yet minimal reproduction of the problem!

In the reproduction, I believe the worker is the first global to actually try and send a message over the MessagePort it receives. Can you confirm if this is also true in your motivating real-world scenario? While it's clear we have some larger engineering work to do here, it is probably possible for us to introduce a "fast start" mechanism for our state machine that would allow the worker to assume that its request to entangle will succeed and thereby avoid deferring message transmission.

Specifically: the message port would ship a message sequence number, and in the case that we are entangling a port with a message sequence number of 0, our entangling request can be assumed to succeed for the purposes of then being able to send messages. This message sequence number would potentially be used in the future as well for other enhancements :nika and I have been discussing.
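To illustrate the proposed rule (an illustrative model only, not Gecko code; all names here are made up):

// Illustrative model of the fast-start check: a port shipped with a
// message sequence number of 0 has never carried a message, so the
// receiver can optimistically assume entangling will succeed.
function canSendWithoutWaiting(shippedPortState) {
  return shippedPortState.messageSeq === 0;
}

canSendWithoutWaiting({ messageSeq: 0 }); // true: fast start, send now
canSendWithoutWaiting({ messageSeq: 3 }); // false: defer until entangled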

Flags: needinfo?(vidal.roberto.j)

Thanks so much for the detailed yet minimal reproduction of the problem!

My pleasure!

In the reproduction, I believe the worker is the first global to actually try and send a message over the MessagePort it receives. Can you confirm if this is also true in your motivating real-world scenario?

Yes, the reproduction is quite similar to our actual use case. The port that crosses boundaries several times is the only side of the channel that sends messages, and it does so from the final worker where it lands. However, in our system there's one extra hop, since the channel is not created from (main), but from a worker spawned from (main).

FWIW, we have some leeway with this setup. We could, until this issue is resolved, consider the worker "unusable" until it receives an initialization message from the other side, which I suspect might alleviate the problem?

I'd also like to add that we rely heavily on this "postMessage-then-wait" pattern in many different parts of the system. Once we're able to add Firefox to our CI, I believe we'll be quite the stress-test for your implementation.

Flags: needinfo?(vidal.roberto.j)

(In reply to vidal.roberto.j from comment #10)

Yes, the reproduction is quite similar to our actual use case. The port that crosses boundaries several times is the only side of the channel that sends messages, and it does so from the final worker where it lands. However, in our system there's one extra hop, since the channel is not created from (main), but from a worker spawned from (main).

That's good to know for coverage, thank you. That shouldn't be a(n additional) problem.

FWIW, we have some leeway with this setup. We could, until this issue is resolved, consider the worker "unusable" until it receives an initialization message from the other side, which I suspect might alleviate the problem?

Yes. When the message is received in the worker, the entangling will definitely have been completed. (Because our state machine currently blocks both sending and receiving until the entangling process completes.)

And under our implementation it should be fine to put that message in immediately after initially shipping the port in the originating global. Ex:

const { port1, port2 } = new MessageChannel();
iframe.contentWindow.postMessage({ port: port1 }, iframe.src, [port1]);
port2.postMessage("yo yo yo!");

Intermediary globals shouldn't consume the messages unless their message ports are explicitly started via start() or by the implicit start of assigning an onmessage attribute, so re-transmission should work properly. At least as long as the global is not blocked by an atomic...
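And on the worker side, the workaround shape would be something like this (a sketch with assumed names, pairing with the snippet above):

// worker.js (sketch): treat the re-shipped port as unusable until its
// first message arrives; arrival guarantees entangling has completed.
onmessage = (e) => {
  const { port, sab } = e.data;
  port.onmessage = () => {
    // Safe now: post-then-block can no longer stall on entangling.
    port.postMessage({ counter: 0 });
    Atomics.wait(new Int32Array(sab), 0, 0);
  };
};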

I'd also like to add that we rely heavily on this "postMessage-then-wait" pattern in many different parts of the system. Once we're able to add Firefox to our CI, I believe we'll be quite the stress-test for your implementation.

Indeed! (And deeply appreciated!)

I'm going to clear the needinfos for now, since I think we have a short-term plan here (fast start, avoiding control runnable complications) and a longer-term plan (use raw ports for direct point-to-point communication), and now it's a question of finding an assignee for this bug for the short-term fix. We'll asynchronously get the longer-term plan on the books.

Flags: needinfo?(nika)
Flags: needinfo?(bugs)
Flags: needinfo?(bugmail)
Component: DOM: Workers → DOM: postMessage
Flags: needinfo?(bugmail)

Hi Andrew, as you have all the context: would you mind triaging this bug? Thanks!

Flags: needinfo?(bugmail)
Severity: -- → S3
Flags: needinfo?(bugmail)
Priority: -- → P2
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: postMessage does not send the message if worker thread is immediately put to sleep → MessagePort.postMessage for received MessagePorts will fail to send messages if blocking APIs (Sync XHR, Atomics) are used prior to the Entangling state machine stabilizing; workaround is to wait for receipt of a message

I encountered what I think is the same issue when working on an app that uses a MessagePort in a Web Worker to communicate progress updates for a slow operation to the main thread. The sequence of steps looks like this:

  1. An async function on the main thread posts a message to a Worker to begin an expensive operation, and transfers a MessagePort to communicate progress updates during the operation
  2. Worker calls synchronous expensive function in WebAssembly module which takes several seconds to complete
  3. A JS callback is provided to this expensive function (via Emscripten) which the WASM code invokes synchronously several times during the operation with a progress counter
  4. The worker thread posts the progress counter value back to the main thread via the MessagePort
  5. The result of the operation is posted back to main thread

When I tried this initially, progress updates were correctly reported during the operation in Chrome and Safari but not Firefox. In Firefox the progress updates were reported back to the main thread all at once after the actual result of the operation had been received.

I then tried reworking Step 1 of the above so that the progress update MessagePort was sent to the worker ahead of time, before the main thread => worker message that triggered the expensive operation, and this resolved the problem. The commit where I added the workaround is https://github.com/robertknight/tesseract-wasm/pull/21/commits/a26c0659889099611fa9dce9c54f7191630a77e8.
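In code, the workaround amounts to something like this (simplified; reportProgress and the message shapes are placeholders, not the actual code from the commit):

// Ship the progress port in its own, earlier message so entangling
// completes while the worker is still idle...
const { port1, port2 } = new MessageChannel();
port2.onmessage = (e) => reportProgress(e.data); // placeholder UI hook
worker.postMessage({ type: "init-progress", port: port1 }, [port1]);
// ...and only then send the message that starts the blocking WASM work.
worker.postMessage({ type: "start-recognition", image });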

Hi! I just ran into this bug trying to improve how Emscripten proxies work from one Worker to another. For providing a synchronous interface to async Web APIs (for example to implement a POSIX file system), we often have to proxy work to a dedicated Worker and synchronously wait for the work to complete. Currently that proxying involves sending a message to the dedicated Worker via the main thread, but in https://github.com/emscripten-core/emscripten/pull/18563 I was trying to reduce latency when the main thread is busy with other work by relaying messages via a central message relay Worker instead. That scheme works on Chrome and Safari, but not on Firefox because of this bug.

I've also come up with a small reproducer showing that relaying messages via the main thread works but that relaying messages via a Worker using MessageChannels does not work: https://gist.github.com/tlively/4216eecc8286381d9746a4c928c0b4c5
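For illustration, the relay worker looks roughly like this (a simplified sketch, not the exact code from the gist):

// relay.js (sketch): receives one MessagePort from each of the two
// workers and forwards traffic between them, bypassing the main thread.
let first = null;
onmessage = (e) => {
  const port = e.ports[0];
  if (!first) {
    first = port;
  } else {
    first.onmessage = (m) => port.postMessage(m.data);
    port.onmessage = (m) => first.postMessage(m.data);
  }
};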

Assignee: nobody → bugmail
Status: NEW → ASSIGNED