Open Bug 1752287 Opened 3 years ago Updated 3 months ago

MessagePort.postMessage for received MessagePorts will fail to send messages if blocking APIs (Sync XHR, Atomics) are used prior to the Entangling state machine stabilizing; workaround is to wait for receipt of a message

Categories

(Core :: DOM: postMessage, defect, P2)

Firefox 96
defect


ASSIGNED

People

(Reporter: _rvidal, Assigned: asuth)

References

(Blocks 1 open bug)

Details

(Keywords: webcompat:platform-bug)

User Story

platform-scheduled:2025-10-01
user-impact-score:200

Steps to reproduce:

Working example: https://github.com/jrvidal/message-port-repro

I have a main document (main), an iframe and workers spawned by the latter.

We send a MessagePort from (main) to (iframe), and then from (iframe) to a fresh (worker).
(iframe) and (worker) share a SharedArrayBuffer.
(worker) posts to the port and Atomics.wait()s using the SAB.
(main) receives the message and notifies (iframe).
(iframe) calls Atomics.notify() and lets (worker) resume.
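The worker's "post then block" step can be sketched as follows (a minimal sketch; `postAndSleep` and the timeout parameter are illustrative, not taken from the repro):

```javascript
// Minimal sketch of the worker's post-then-block step. Names and the
// timeout parameter are illustrative. postMessage only queues the
// transmission work; if the thread blocks immediately afterwards, any
// runnable the browser scheduled to complete the send cannot run.
function postAndSleep(port, view, timeoutMs = Infinity) {
  port.postMessage({ counter: 0 }); // queued asynchronously
  // Blocks this thread until another thread stores a non-zero value at
  // view[0] and calls Atomics.notify(view, 0), or the timeout elapses.
  return Atomics.wait(view, 0, 0, timeoutMs);
}
```

In the repro, (iframe) is the thread that eventually calls Atomics.notify; the bug is that the queued message never leaves the worker while it is blocked.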

Actual results:

I see this in the console:

[iframe] start worker #0                      iframe.js:14:15
[iframe worker #0] posting and sleeping       worker.js:7:11
[iframe worker #0] posting and sleeping       worker.js:7:11
[iframe] start worker #1                      iframe.js:14:15
[iframe worker #1] posting and sleeping       worker.js:7:11
[iframe worker #1] posting and sleeping       worker.js:7:11
[iframe] start worker #2                      iframe.js:14:15
[iframe worker #2] posting and sleeping       worker.js:7:11
[iframe worker #2] posting and sleeping       worker.js:7:11

Ignoring the duplicated logs (??), this means that the receiving port on the main thread never gets a message, and the worker never unblocks.

There is a TIMEOUT constant in worker.js that can force the worker to wait for a bit before posting-and-blocking. With TIMEOUT=-1 (no timeout) the issue is quite persistent. With TIMEOUT=0 it is more intermittent. With TIMEOUT around 100ms I can't observe the issue.

Expected results:

Not sure if this is expected behavior, but in Chrome (Chromium 97.0.4692.99) the worker is always awakened:

iframe.js:14        [iframe] start worker #0
worker.js:7         [iframe worker #0] posting and sleeping
root.js:11          [main thread] port received a message {counter: 0}
iframe.js:23        [iframe] awake worker #0
worker.js:13        [iframe worker #0] done waiting: ok
iframe.js:14        [iframe] start worker #1
worker.js:7         [iframe worker #1] posting and sleeping
root.js:11          [main thread] port received a message {counter: 1}
iframe.js:23        [iframe] awake worker #1
worker.js:13        [iframe worker #1] done waiting: ok
iframe.js:14        [iframe] start worker #2
worker.js:7         [iframe worker #2] posting and sleeping
root.js:11          [main thread] port received a message {counter: 2}
iframe.js:23        [iframe] awake worker #2
worker.js:13        [iframe worker #2] done waiting: ok

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Workers' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → DOM: Workers
Product: Firefox → Core

Eden, can you help me with triage here? Thanks!

Flags: needinfo?(echuang)

Mind taking a quick look Lars? StackBlitz indicated that this is blocking their work to support Firefox.

Flags: needinfo?(lhansen)

I'll take a look next week.

The symptoms are consistent with an implementation where postMessage performs its work asynchronously, as the implementation of Atomics.wait will block the thread and prevent anything else from happening. Indeed, looking in WorkerPrivate.cpp at the implementation of PostMessageToParent, it creates a runnable which it then dispatches. If the thread subsequently blocks in the wait, the runnable will not be processed until the wait ends.

(In a sense, the addition of Atomics.wait to JS made the timing of the processing of posted messages observable, as pre-Atomics.wait the delayed processing was probably acceptable since the worker would be expected to return to its event loop quickly. That said, even pre-Atomics.wait a worker could observe that posted messages would be delayed until it returned to the event loop. That might in itself be a webcompat concern (and a perf concern, about which I've filed bugs in the past).)

There's a callback mechanism in the Atomics.wait functionality now, that should really be used to handle this situation, https://searchfox.org/mozilla-central/source/js/public/WaitCallbacks.h#20. The BeforeWait callback installed in the engine by the browser should probably process postMessage runnables when invoked to avoid the situation reported in this bug. Implementing that is something the DOM team probably has to figure out, I have no expertise in that area.

Flags: needinfo?(lhansen)

That makes sense — thanks for diagnosing this Lars. Strange that this hasn't come up before.

I don't know enough about how the worker event queues are designed to know whether it would be feasible to synchronously process control runnables in BeforeWait without breaking invariants of the calling script or synchronously reentering new script. Andrew, wdyt?

Flags: needinfo?(bugmail)

Aha, so I think the problem here is that MessagePort has a state machine related to the entangling and disentangling of ports as they are transferred that has to deal with the asynchronous complications of MessagePorts being a point-to-point communication mechanism that is built on top of the non-point-to-point PBackground mechanism. Because the MessagePort in question here is getting shipped twice, the state machine gets involved.

I believe the state machine is built on IPC which is using normal runnables, not control runnables. Control runnables will trigger the JS interrupt handler, and so if atomics support that (which they probably have to?), control runnables will do the right thing. However, we can't just naively tell the existing PMessagePort ipdl to use an event target that wraps them in control runnables because the normal message transmission needs to use normal runnable scheduling.

So unless IPC advances let us use magic endpoint stuff that can avoid the state machine (hence the :nika needinfo, hi Nika! :) by letting us pass around a pipe-ish thing that can handle the situation where a MessagePort can be re-shipped before the normal task containing the messages actually gets to run (which necessitates re-transmission under the current regime), someone who is probably :smaug may need to look at doing some IPC-related refactoring so that the entangling state machine can use control runnables. (Noting that I think there may be different IPC hacks we could do... for example, here's BackgroundChildImpl::OnChannelReceivedMessage doing things to give the IPC team nightmares, and here's LocalStorage doing it in PContent.)

Note that I have not looked at the spec text recently about the entangling/disentangling, etc. I'm assuming that even if the spec text allows shipping a port to depend on a task running instead of just hand-waving it away, we would still want this to work despite the possibility of being technically correct in not doing what content would expect/want.

Flags: needinfo?(nika)
Flags: needinfo?(echuang)
Flags: needinfo?(bugs)
Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #7)

... Control runnables will trigger the JS interrupt handler, and so if atomics support that (which they probably have to?)

I believe so, yes.

Thanks so much for the detailed yet minimal reproduction of the problem!

In the reproduction, I believe the worker is the first global to actually try and send a message over the MessagePort it receives. Can you confirm if this is also true in your motivating real-world scenario? While it's clear we have some larger engineering work to do here, it is probably possible for us to introduce a "fast start" mechanism for our state machine that would allow the worker to assume that its request to entangle will succeed and thereby avoid deferring message transmission.

Specifically: the message port would ship a message sequence number, and in the case that we are entangling a port with a message sequence number of 0, our entangling request can be assumed to succeed for the purposes of then being able to send messages. This message sequence number would potentially be used in the future as well for other enhancements :nika and I have been discussing.
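As a toy model (not Gecko code; all names here are hypothetical), the proposal amounts to:

```javascript
// Toy model of the proposed "fast start"; names are hypothetical and
// this is not Gecko code. A port ships with its current message
// sequence number; a receiver entangling a port whose sequence number
// is 0 may assume the entangle request will succeed and send
// immediately instead of deferring transmission.
function shipPort(portState) {
  // When shipped, the current sequence number travels with the port.
  return { messageSeq: portState.messageSeq, entangled: false };
}

function maySendWithoutWaiting(portState) {
  // Fast start: nothing was ever sent on this port before shipping,
  // so sending need not wait for the entangle round trip.
  return portState.entangled || portState.messageSeq === 0;
}
```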

Flags: needinfo?(vidal.roberto.j)

Thanks so much for the detailed yet minimal reproduction of the problem!

My pleasure!

In the reproduction, I believe the worker is the first global to actually try and send a message over the MessagePort it receives. Can you confirm if this is also true in your motivating real-world scenario?

Yes, the reproduction is quite similar to our actual use case. The port that crosses boundaries several times is the only side of the channel that sends messages, and it does so from the final worker where it lands. However in our system there's one extra hop, since the channel is not created from (main), but from a worker spawned from (main).

FWIW, we have some leeway with this setup. We could, until this issue is resolved, consider the worker "unusable" until it receives an initialization message from the other side, which I suspect might alleviate the problem?

I'd also like to add that we rely heavily on this "postMessage-then-wait" pattern in many different parts of the system. Once we're able to add Firefox to our CI, I believe we'll be quite the stress-test for your implementation.

Flags: needinfo?(vidal.roberto.j)

(In reply to vidal.roberto.j from comment #10)

Yes, the reproduction is quite similar to our actual use case. The port that crosses boundaries several times is the only side of the channel that sends messages, and it does so from the final worker where it lands. However in our system there's one extra hop, since the channel is not created from (main), but from a worker spawned from (main).

That's good to know for coverage, thank you. That shouldn't be a(n additional) problem.

FWIW, we have some leeway with this setup. We could, until this issue is resolved, consider the worker "unusable" until it receives an initialization message from the other side, which I suspect might alleviate the problem?

Yes. When the message is received in the worker, the entangling will definitely have been completed. (Because our state machine currently blocks both sending and receiving until the entangling process completes.)

And under our implementation it should be fine to put that message in immediately after initially shipping the port in the originating global. Ex:

const { port1, port2 } = new MessageChannel();
iframe.contentWindow.postMessage({ port: port1 }, iframe.src, [port1]);
port2.postMessage("yo yo yo!");

Intermediary globals shouldn't consume the messages unless their message ports are explicitly started via start() or by the implicit start of assigning an onmessage attribute, so re-transmission should work properly. At least as long as the global is not blocked by an atomic...
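The mitigation discussed above (treat the port as unusable until an initialization message arrives) could look roughly like this; `waitForInit` and the "init" sentinel are illustrative, not a prescribed API:

```javascript
// Sketch of the mitigation: hold off on using the port until the other
// side's initialization message arrives, at which point entangling is
// known to be complete. The "init" sentinel is illustrative.
function waitForInit(port) {
  return new Promise((resolve) => {
    port.addEventListener("message", function handler(event) {
      if (event.data === "init") {
        port.removeEventListener("message", handler);
        resolve(port);
      }
    });
    port.start(); // addEventListener does not implicitly start the port
  });
}
```

The originating side would simply postMessage("init") on its end of the channel right after shipping the port.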

I'd also like to add that we rely heavily on this "postMessage-then-wait" pattern in many different parts of the system. Once we're able to add Firefox to our CI, I believe we'll be quite the stress-test for your implementation.

Indeed! (And deeply appreciated!)

I'm going to clear the needinfos for now since I think we have a short term plan here (fast start, avoiding control runnable complications) and a longer term plan (use raw ports for direct point-to-point communication) and now it's a question of finding an assignee for this bug as the short term fix. We'll asynchronously get the longer term plan on the books.

Flags: needinfo?(nika)
Flags: needinfo?(bugs)
Flags: needinfo?(bugmail)
Component: DOM: Workers → DOM: postMessage
Flags: needinfo?(bugmail)

Hi Andrew, as you have all the context: would you mind triaging this bug? Thanks!

Flags: needinfo?(bugmail)
Severity: -- → S3
Flags: needinfo?(bugmail)
Priority: -- → P2
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: postMessage does not send the message if worker thread is immediately put to sleep → MessagePort.postMessage for received MessagePorts will fail to send messages if blocking APIs (Sync XHR, Atomics) are used prior to the Entangling state machine stabilizing; workaround is to wait for receipt of a message

I encountered what I think is the same issue when working on an app that uses a MessagePort in a Web Worker to communicate progress updates for a slow operation to the main thread. The sequence of steps looks like this:

  1. An async function on the main thread posts a message to a Worker to begin an expensive operation, and transfers a MessagePort to communicate progress updates during the operation
  2. Worker calls synchronous expensive function in WebAssembly module which takes several seconds to complete
  3. A JS callback is provided to this expensive function (via Emscripten) which the WASM code invokes synchronously several times during the operation with a progress counter
  4. The worker thread posts the progress counter value back to the main thread via the MessagePort
  5. The result of the operation is posted back to main thread

When I tried this initially, progress updates were correctly reported during the operation in Chrome and Safari but not Firefox. In Firefox the progress updates were reported back to the main thread all at once after the actual result of the operation had been received.

I then tried reworking Step 1 of the above so that the progress update MessagePort was sent to the worker ahead of time, before the main thread => worker message that triggered the expensive operation, and this resolved the problem. The commit where I added the workaround is https://github.com/robertknight/tesseract-wasm/pull/21/commits/a26c0659889099611fa9dce9c54f7191630a77e8.
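The workaround boils down to transferring the progress port in a message of its own, before the message that kicks off the expensive operation. A sketch with hypothetical names (the function returns the receive port so the caller can close it when done):

```javascript
// Sketch of the workaround with hypothetical names: ship the progress
// port in a message of its own, then trigger the expensive operation in
// a later message, so the port transfer is already underway before the
// worker thread can block inside the synchronous WebAssembly call.
function startOperation(worker, onProgress) {
  const { port1, port2 } = new MessageChannel();
  port1.onmessage = (event) => onProgress(event.data);
  worker.postMessage({ type: "set-progress-port", port: port2 }, [port2]);
  worker.postMessage({ type: "run-expensive-operation" });
  return port1; // caller closes this when the operation completes
}
```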

Hi! I just ran into this bug trying to improve how Emscripten proxies work from one Worker to another. For providing a synchronous interface to async Web APIs (for example to implement a POSIX file system), we often have to proxy work to a dedicated Worker and synchronously wait for the work to complete. Currently that proxying involves sending a message to the dedicated Worker via the main thread, but in https://github.com/emscripten-core/emscripten/pull/18563 I was trying to reduce latency when the main thread is busy with other work by relaying messages via a central message relay Worker instead. That scheme works on Chrome and Safari, but not on Firefox because of this bug.

I've also come up with a small reproducer showing that relaying messages via the main thread works but that relaying messages via a Worker using MessageChannels does not work: https://gist.github.com/tlively/4216eecc8286381d9746a4c928c0b4c5

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Duplicate of this bug: 1956778

To whom it might concern, while I could workaround MessageChannel not forwarding postMessage, falling back to the regular postMessage dance, I've just discovered that Atomics.pause() in a while loop after either postMessage or MessageChannel has the same bug/effect: the worker is stuck forever because everything is blocked and nothing passes through.

This is not the case in other browsers so I hope it is all about the same bug but please keep in mind Atomics.pause() scenarios too when the fix lands or we cannot remove all the Firefox specific only workarounds.

The version was 138, but apparently somebody on Ubuntu with 136 had no issues, although I didn't witness that directly.

For history and completeness' sake, this bug also affects the BroadcastChannel API in exactly the same way it affects MessageChannel ... basically, all APIs meant to communicate that a worker is busy, before it's busy, are doomed in Firefox workers. This should (imho?) have both higher priority and severity, as it impacts everything non-trivial that happens in Workers. The reason Workers are being used is precisely the ability to sync-block via XHR calls and whatnot, and listening parties should be able to learn that the worker is busy.

I am falling everything back to an ugly, monkey-patched, global postMessage indirect fallback, but the amount of branching code I need to support Firefox starts "smelling" and feeling pretty ugly ... it's been 3 years since this bug was first mentioned/proved, and it touches the most modern primitives we have on the Web in the most unexpected and breaking ways. I'm not sure this comment is super welcome, but the time I've spent working around all this, in a way I hope to get rid of sooner rather than later, made me add it.

Thanks for reconsidering the priority and severity of something that screams for broken Worker related APIs all over.

Before responding, I want to emphasize that we are working on this bug and the related bug 1899507 about making worker script-loading not dependent on the main thread (and possibly parent thread). In many cases where people are experiencing problems with atomics, it may actually be bug 1899507.

(In reply to Andrea Giammarchi from comment #16)

To whom it might concern, while I could workaround MessageChannel not forwarding postMessage, falling back to the regular postMessage dance, I've just discovered that Atomics.pause() in a while loop after either postMessage or MessageChannel has the same bug/effect: the worker is stuck forever because everything is blocked and nothing passes through.

Atomics.pause should never change browser behavior on any browser other than timing. If you're seeing a difference, I think it suggests that the workers were dominating the machine sufficiently that other threads couldn't make forward progress.

This is not the case in other browsers so I hope it is all about the same bug but please keep in mind Atomics.pause() scenarios too when the fix lands or we cannot remove all the Firefox specific only workarounds.

Looking at the blink/chrome implementation and the webkit implementation, they definitely seem to be complying with spec and not doing anything that would allow other code to run on the worker during these times. (Note that interrupt mechanisms might still apply, but those presumably could already run.)

(In reply to Andrea Giammarchi from comment #17)

for history and completeness sake, this bug is affecting also BroadcastChannel API exactly the same way it affects MessageChannel ... basically all APIs meant to communicate a worker is busy, before it's busy, are doomed in Firefox workers and this should have (imho?) both higher priority and severity as it impacts everything less trivial that happens in Workers, for the reason Workers are being used, which is the ability to sync block via XHR calls and whatnot that could signal the worker is busy and listening parts should be aware of such situation.

If you're seeing problems with BroadcastChannel, I think that's something different from this bug. Our BroadcastChannel state machine is always able to send messages from the moment of creation, and the message will be sent over IPC without waiting for the task to complete. That said, there is currently a cleanup mechanism built on mozilla_dom::PBroadcastChannel::RefMessageDelivered that does depend on control being yielded, but that would manifest as a memory leak if the worker never returns to the top-level loop. (I should note this is something we want to fix, and the situation may be improved or completely addressed in this bug, but it's not the focus of this bug. There's just a common underlying subsystem.)

Can you briefly describe how you tried using BroadcastChannel and how it failed? Probably of most interest is the code that expected to be receiving and processing the message via BroadcastChannel was doing/expected to be doing. Also of interest is if there were any notable objects that are part of the sent message and whether you were listening for messageerror events on the BroadcastChannel (or MessagePort, or Worker, etc.). We will fire those in cases where we fail to successfully serialize the message, and there unfortunately are some Firefox specific situations like ImageData right now that you could potentially encounter and that would manifest in that way.

Atomics.pause should never change browser behavior on any browser other than timing. If you're seeing a difference, I think it suggests that the workers were dominating the machine sufficiently that other threads couldn't make forward progress.

The exact same code that works on WebKit and Chromium fails in Firefox ... the test is simple: take any example based on Atomics.wait(view, 0) and use const wait = (view, index) => { while (view[index] === 0) Atomics.pause(); } and postMessage that view elsewhere and change the value at index 0 ... goodbye worker.
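Expanded into a self-contained sketch of that spin-wait (the spin bound and the use of Atomics.load are my additions, so the sketch cannot hang forever; Atomics.pause is a scheduling hint only and is guarded because not all engines ship it yet):

```javascript
// The comment's spin-wait, expanded. Atomics.load makes the
// cross-thread read explicit; the spin bound is an addition so the
// sketch cannot hang forever, and Atomics.pause is guarded since
// older engines lack it.
function spinWait(view, index, maxSpins = 10_000_000) {
  let spins = 0;
  while (Atomics.load(view, index) === 0) {
    if (typeof Atomics.pause === "function") Atomics.pause();
    if (++spins > maxSpins) return "timed-out";
  }
  return "ok";
}
```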

Can you briefly describe how you tried using BroadcastChannel and how it failed?

I can create live examples that showcase the issue but in the bug I've previously filed there was already an example that could easily demo also the Atomics.pause() issue ... would that help? The test for the BroadcastChannel would be basically identical to the one for MessageChannel and admittedly I might have mixed up too many patches and reached the wrong conclusion around the BroadcastChannel but all I know is that if I workaround Firefox entirely via postMessage I have 100% successful code, like I have 100% successful code in all other browsers by using just MessageChannel, BroadcastChannel and Atomics.pause() (which made sync roundtrips way faster than just Atomics.wait() and for things that need to be as fast as possible).

(In reply to Andrea Giammarchi from comment #19)

Atomics.pause should never change browser behavior on any browser other than timing. If you're seeing a difference, I think it suggests that the workers were dominating the machine sufficiently that other threads couldn't make forward progress.

The exact same code that works on WebKit and Chromium fails in Firefox ... the test is simple: take any example based on Atomics.wait(view, 0) and use const wait = (view, index) => { while (view[index] === 0) Atomics.pause(); } and postMessage that view elsewhere and change the value at index 0 ... goodbye worker.

Who is mutating the view though? I'm suggesting the Firefox bug/limitation is as it relates to the global that mutates the view in response to a postMessage.

Can you briefly describe how you tried using BroadcastChannel and how it failed?

I can create live examples that showcase the issue but in the bug I've previously filed there was already an example that could easily demo also the Atomics.pause() issue ... would that help? The test for the BroadcastChannel would be basically identical to the one for MessageChannel and admittedly I might have mixed up too many patches and reached the wrong conclusion around the BroadcastChannel but all I know is that if I workaround Firefox entirely via postMessage I have 100% successful code, like I have 100% successful code in all other browsers by using just MessageChannel, BroadcastChannel and Atomics.pause() (which made sync roundtrips way faster than just Atomics.wait() and for things that need to be as fast as possible).

If you can create a live example without too much work like you did at https://github.com/WebReflection/issue/tree/main/sab/message-channel that would be appreciated to help understand if there's something weird happening with BroadcastChannel that we should address as part of this bug. But it's not essential since we know we need to fix this bug and address the worker scriptloader main-thread/parent-thread dependencies and obviously we can iterate once those fixes are landed.

For performance purposes, in general I think you would want to avoid BroadcastChannel since it seems the least likely for browsers to (be able to) optimize. The Worker.postMessage and DedicatedWorkerGlobalScope.postMessage are the easiest for browsers to optimize, with MessageChannel being something that can be optimized, but usually later (like we are doing now).

Who is mutating the view though?

main receives an Int32Array view over a SharedArrayBuffer, and it does i32v[0] = 1 on the main thread ... that's it. It signals at index 0 that something changed; index 0 is then reset back to 0 right after, to avoid waiting on a different value next time.
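The described main-thread side, sketched with the idiomatic Atomics calls (`wake` is an illustrative name; note as an aside that resetting the flag immediately can race with a spin-waiting reader that has not yet observed the 1):

```javascript
// Sketch of the described main-thread signal: set the flag, wake any
// Atomics.wait()ers (spin-waiters observe the store directly), then
// reset so the next wait starts from 0 again. Note: resetting
// immediately can race with a spin-waiting reader that has not yet
// observed the 1.
function wake(view, index) {
  Atomics.store(view, index, 1);
  Atomics.notify(view, index);
  Atomics.store(view, index, 0);
}
```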

But it's not essential since we know we need to fix this bug and address the worker scriptloader main-thread/parent-thread dependencies and obviously we can iterate once those fixes are landed.

I am, like you, working on my things ... if you tell me "it might be useful but we don't need it" I hear "don't bother", which is my preferred option as time isn't free these days.

For performance purposes, in general I think you would want to avoid BroadcastChannel since ...

I am sorry to hear patronizing from vendors always hit first ... you have no idea why I am using that API and all I am saying is that API is supposed to work.

I understand you believing you know best, but honestly, the BroadcastChannel variant I have implemented, falling back to a ServiceWorker synchronous XHR request to reach all tabs involved, resolve a Promise in the ether, and validate its result on the right main, is something that took years to master, and I am not filing bugs to learn what "I am doing wrong" when all I am saying is that an API meant to work is not working.

So, back to the topic, I will come back with at least the BroadcastChannel failing demo or I will apologize for reaching the wrong conclusion about that bit, no need to tell me what I am developing is not good, thanks for your understanding.

P.S. I also understand the "hint" came from good intentions, telling me that BroadcastChannel is not the best choice or whatever, but honestly folks ... this bug is about some of the most hidden features of the Web that mostly nobody else out there knows ... can we keep "decisions around needs" out of bug filing? Somebody with a specific, battle-tested situation (also fixed via workarounds) is trying to help ... can you see that doesn't require a "btw, you're doing it wrong" in the equation? Thank you!

For performance purposes, in general I think you would want to avoid BroadcastChannel since ...

OK, in advance, apologies that this hit a nerve (days if not weeks spent trying to find fixes around this issue) ... no excuse for coming back badly, but I really hope this stuff, all the bugs around this stuff, will get solved instead of hinting to users that the BC API is slow. New APIs such as OPFS are being used, and the leading patterns being discussed to provide reliable and persistent SQLite, among tons of other use cases, need all of these APIs; nobody stating otherwise is really welcome, because we have no other options on the Web to orchestrate this stuff.

So, once again, I am sorry for my first reply/reaction but please be assured if anyone in this space talks about BroadcastChannel API and Atomics, it's not because they didn't know what to do that day, it's because there are tons of relevant conversations behind the Web possibilities scene.

I will provide that demo though or demonstrate BroadcastChannel is fine, or eventually give you a free test to integrate in your stack, let's try to move forward together though 🙏

(In reply to Andrea Giammarchi from comment #21)

Who is mutating the view though?

main receives Int32Array view with SharedArrayBuffer in it, and it does i32v[0] = 1 on the main thread ... that's it, it notifies at index 0 something changes, that index 0 is then reset to have 0 as value right after, to avoid waiting next time on a different value.

A-ha! I hadn't appreciated that the message might be sending a SharedArrayBuffer. We have checks about whether that's okay to deserialize, which depend on both an agent cluster check (agent cluster spec link) and a check on the global for whether we're allowed to use shared memory (which hinges on COOP/COEP stuff; it looks like crossOriginIsolated may be how we surface that on globals to content), and where we'll fail on deserialization and then fire a "messageerror" event. I think browsers like Chrome are potentially doing things about the COEP requirements that I am not fully up on, and I think we may be more strict than other browsers in some cases, so it's possible we are deviating from other browsers on this.

This is something we definitely should have coverage for, but I'm not finding coverage in our mochitests for BroadcastChannel (added with the RefMessageBodyService) that would have helped with this, or immediately in the web-platform-tests. If you can easily confirm whether you were experiencing a "messageerror" or receiving a SharedArrayBuffer that was broken (like we failed to actually share the memory), that's appreciated, but it will definitely be an action item for my test stack to make sure we have in-tree test coverage for the SharedArrayBuffer case for BroadcastChannel.

(In reply to Andrea Giammarchi from comment #23)

For performance purposes, in general I think you would want to avoid BroadcastChannel since ...

OK, in advance, apologies that did hit a nerve (days if not weeks behind trying to find fixes around this issue) ...

No worries, thank you for apologizing and my apologies that while I was aiming for terse in my phrasing (I tend to be overly verbose), upon re-reading what I wrote I think I came off brusque. I very much appreciate how frustrating Firefox's technical debt can be and how when trying to workaround the technical debt, one can run into other technical debt that makes the workaround even harder! I also very, very much appreciate that in a world where Firefox's market share is small, it's a major effort to try and support Firefox and that it takes a lot of effort to report these bugs and provide reproducible test cases as you've done in https://bugzilla.mozilla.org/show_bug.cgi?id=1956778#c6. Thank you for caring about supporting Firefox and helping keep the web more than just two browser engines!

Expanding on what I made too terse and going into a little more detail on the Firefox implementation details for BroadcastChannel: all our messages end up being sent from whichever thread they're on in the content process via IPC to our parent process "IPDL Background" thread which is shared across all content processes and origins, so it can be subject to a fair amount of contention (all our storage APIs get routed through there too, for example). That currently happens for our MessageChannel implementation too, but I will be fixing that in this bug to use our IPC mechanism to directly bind the source and target threads in the same process so "IPDL Background" will no longer be involved and I think this will bring us into line with Chrome/Blink. For Worker.postMessage, we always just directly dispatch a runnable to the target thread.

I think other browsers will similarly have a need to perform rendezvous for BroadcastChannel in some central location, although other browsers may already do a better job of using different threads for different origins for that. Which is to say, all things being equal, I think Worker.postMessage and transferred MessageChannel ports are going to have the most reliably low latency and I don't know that there's any documentation out there that helps make that clear. BroadcastChannel should definitely work and it is a very useful API! It's a real lifesaver for tests in particular where plumbing all the MessagePorts around can be a logistical nightmare.

Thanks for your kind reply, appreciated ... so I went ahead and created another example that:

  • tests that BroadcastChannel is not stuck like MessagePort is (so it was my mistake in testing, but that's because there is another issue affecting other browsers, so my previous tests were weird and off regardless; I'll explain now ...)
  • uses Atomics.pause() instead of Atomics.wait(view, index) ... please note that if I use Atomics.wait everything seems to be fine, but as soon as I switch to Atomics.pause, which is almost instantaneous in Chromium, Firefox might reply in seconds ... if ever ...
  • the postMessage is used anyway to verify that it delivers; the view is sent through it to ensure that browsers incapable of dealing with SharedArrayBuffer-based views via broadcastChannel.postMessage(thatView) can still satisfy and solve the requirements

The repo/folder is here: https://github.com/WebReflection/issue/tree/main/sab/broadcast-channel

It can be tested live here: https://webreflection.github.io/issue/sab/broadcast-channel/?sw

Chrome/ium & Edge will produce this outcome in console:

sending message
post is fine
data received via postMessage <-- ⚠️
roundtrip for 123

Firefox will show this instead ...

sending message
post is fine
data received directly <-- 🥳
... tick, tock ...

After seconds or minutes or ... who knows what ... it will eventually complete with roundtrip for 123

In summary:

  • yes, BroadcastChannel always delivers and it's capable of passing along views with a SharedArrayBuffer attached
  • if these views are notified and Firefox uses Atomics.pause() in a while (view[0] === 0) ... loop, the worker takes "forever" to unblock. This is what initially made me assume BroadcastChannel was broken, but that wasn't the case: the worker becomes unresponsive for an indefinite amount of time, so I thought Atomics.notify was never reached, when the real issue was Atomics.pause()

My apologies for not testing this better beforehand, and I was wrong about BroadcastChannel in the end ... but I have just discovered even more discrepancies across browsers, which is super annoying because these primitives are wonderful and impractical at the same time ... I understand the stack behind them is complex, but we have:

  • MessagePort blocked in Firefox
  • BroadcastChannel incapable of passing SharedArrayBuffer via postMessage in Chromium
  • Atomics.pause() unpredictable in Firefox
  • postMessage at least seems to be the only one that always delivers, but it's a pain to deal with, because any library out there can attach listeners to a worker; what MessageChannel or BroadcastChannel could have solved with relative ease becomes an orchestration of fixes, workarounds, and inconsistencies that makes developers "cry" whenever they hear "write once, works everywhere" about these standards and Web development in general.

I hope this closes the circle of issues around all these topics, but I also wonder if I should file a bug in Chromium about BroadcastChannel now ... and whether you are using this bug to also fix Atomics.pause, or you need a separate issue about it; just let me know if that's the case.

This is a wrap from me, you all enjoy the rest of the day and have a lovely weekend 👋

(In reply to Andrea Giammarchi from comment #19)

The exact same code that works on WebKit and Chromium fails in Firefox ... the test is simple: take any example based on Atomics.wait(view, 0) and use const wait = (view, index) => { while (view[index] === 0) Atomics.pause(); }

The loop in the view arrow function compiles to the following x86 assembly:

;; Load from Int32Array
movl       0x0(%rcx,%rdx,4), %eax
;; Loop start
.set .L1
;; Loop condition
testl      %eax, %eax
jne        .L2
;; Loop body
pause
jmp        .L1
;; After loop
.set .L2

Calling Atomics.pause() can't modify view, so the JIT compiler moves view[index] before the loop. IOW the function is compiled as if it had been written as const wait = (view, index) => { const value = view[index]; while (value === 0) Atomics.pause(); }.

Why does it apparently work in Chrome/Safari? Neither browser yet has JIT inlining support for Atomics.pause, so in their implementations Atomics.pause isn't compiled to a single x86 pause instruction but instead to a generic call into the VM. Because it's a generic call, their compilers can't assume Atomics.pause doesn't modify view, so view[index] is re-evaluated on each iteration.

Replacing view[index] with Atomics.load(view, index) should give you consistent behaviour across browsers:
const wait = (view, index) => { while (Atomics.load(view, index) === 0) Atomics.pause(); }.

Spec links:

  • view[index] is TypedArrayGetElement, which calls GetValueFromBuffer with UNORDERED, allowing implementations to reorder the load operation.
  • Whereas Atomics.load(view, index) calls GetValueFromBuffer with SEQ-CST, which prevents reordering.

Thanks André Bargull, but it looks to me like the JIT is changing the semantics of my code ... with a SharedArrayBuffer, checking or assuming that the value at view[x] is always the same seems hazardous ... right? Atomics.pause() cannot change view[index], but everything else could while that pause happens, if it's a shared array buffer ... indeed at some point in time the loop resolves, but "when" that happens is unpredictable. I might use Atomics.load as suggested, but I need to verify it doesn't degrade performance too much in both Chrome/ium and Safari; still, I think something is off with the current JIT assumptions around Atomics.pause() and SharedArrayBuffer reads. Happy to learn otherwise.


On a second thought ...

Calling Atomics.pause() can't modify view, so the JIT compiler moves view[index] before the loop. IOW the function is compiled as if it had been written as const wait = (view, index) => { const value = view[index]; while (value === 0) Atomics.pause(); }.

... this means that the JIT creates an infinite loop out of the box, and the more I think about this, the more I believe it's a bug in Firefox.

If that view were a proxy or something, or Atomics.pause were polyfilled somehow (which is what I do), I wonder if the JIT would destroy the intent there too ... I understand that for a synchronous, non-observable Atomics.pause() the behavior could be optimized that way, but if the view wraps a SharedArrayBuffer that can be modified at any time elsewhere, the current JIT optimization causes an infinite loop the programmer never meant, precisely when the shared buffer has already landed elsewhere exactly so that some agent can wait for it to be modified at any index.

This is also inconsistent across browsers in a subtle way that is hard to debug in userland, so imho it should either be forbidden or never inlined, like other browsers do, or surprises like this will happen all over the place.

(In reply to Andrea Giammarchi from comment #27)

Atomics.pause() cannot change view[index] but everything else while that pause happens could, if it's a shared array buffer ...

Right. It could, but it does not have to. As in, if something else changes that value, then the change could be visible after the pause, but it does not have to be, because [[Order]]==UNORDERED. The change could be local to the agent doing the writing and only visible to the reading agent after a synchronizing event.

indeed at some point in time the loop resolves but it's unpredictable "when" that happens so I might use Atomics.load as suggested but I need to verify that doesn't degrade too much performance in both Chrome/ium and Safari

The performance degradation seems less important than the possibility that Chrom* or Safari could optimize this similarly in the future and start exhibiting the same behavior as Firefox, since it is explicitly allowed by the spec (UNORDERED).

(And I won't say any more to avoid saying something incorrect, because I don't even understand why SEQ-CST would be good enough. It seems to me like reading the stale value forever would be sequentially consistent, even linearizable. But I may not understand the scenario correctly, and I definitely don't understand memory models very well.)

if something else changes that value, then the change could be visible after the pause, but it does not have to be, because [[Order]]==UNORDERED. The change could be local to the agent doing the writing and only visible to the reading agent after a synchronizing event.

I don't deeply understand memory models either, but here we're talking about a view over a shared array buffer, which either is the same memory (shared) or it isn't ... if it isn't, I no longer understand anything related to the Spectre and Meltdown concerns, because the state is not guaranteed to be predictable; if I understand that concern, which I might not. In any case, here there is an agent explicitly notifying that view that index X has changed, and I expect that internally this would trigger something at the logic level that the current code cannot predict.

In short, inlining that operation as always 0 === 0 makes no sense because:

  • the shared array buffer can be mutated at any time
  • the other side of the world can do anything it wants and it expects that notify would actually notify anything waiting for something to happen in that SAB

As it is now, it's like fetching a URL and expecting it to always return a 200 because the first access did, indeed, return a 200 (here it's the first access to that index, but the concept is similar: SABs are used to distribute work across realms, so realms should not assume).

If the answer is "always use Atomics.load", then fine, but before this I was using Atomics.wait(view, 0) and performance was 2.5x slower (in Chrome/ium) than with the while loop and Atomics.pause(). We're synchronizing WASM-based runtimes in workers that directly access the DOM and whatnot on the main thread, so these kinds of slowdowns matter a lot; way more than a "maybe in the future this will break", since we can act quickly when that happens. What I am questioning here is that the JIT is destroying developers' intent ... and if not the JIT, then the spec around pause(), because as it stands it's a footgun baked in, if it's allowed to be written without any warnings whatsoever.

Views are "proxies" to buffers, even more so to shared buffers, and this behavior in Firefox and nowhere else feels off ... if it's a "won't fix" then I'll just keep the logic as it is, which means keeping Firefox slower than other browsers via Atomics.wait instead of a while/pause loop.

Sorry for the late reply. For some reason I wasn't CC-ed to this bug, so I didn't get any notifications.

Maybe it's better to ignore Atomics.pause for the moment and only concentrate on Atomics.load and why it's different from bracketed loads (view[index]).

For example this program can loop indefinitely:

let sab = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT)
let i32 = new Int32Array(sab);

executeInWorker(function(sab) {
  let i32 = new Int32Array(sab);

  // Wait a bit to give main thread enough time to
  // enter highest JIT compiler level.
  console.log("waiting...");
  Atomics.wait(i32, 0, 0, 1000);

  console.log("store...");
  Atomics.store(i32, 1, 1);
}, sab);

console.log("start loop");

// Can loop indefinitely, because it's not guaranteed
// that stores in a different thread are observed.
while (i32[1] === 0);

// Guaranteed to eventually see the updated value.
// while (Atomics.load(i32, 1) === 0);

console.log("end loop");

The ECMAScript memory model is defined in https://tc39.es/ecma262/#sec-memory-model, but it's probably more useful to just skip directly to the guidelines in https://tc39.es/ecma262/#sec-shared-memory-guidelines.

Note 1 applies to ECMAScript programmers and recommends keeping programs free of data races, specifically:

  • No concurrent non-atomic operations on the same memory location.
    • Non-atomic operations are for example view[index].
  • Ensure that different memory cells are used by atomic and non-atomic operations.
    • That means it's best to avoid mixing non-atomic reads/writes using view1[index1] with atomic reads/writes using Atomics.{load,store}(view2, index2) when view1 and view2 operate on the same shared memory and index1 is equal to index2.

Note 2 lists some valid optimisations for ECMAScript implementers. Among others:

Examples of transformations that remain valid are: [...] hoisting non-atomic reads out of loops even if that affects termination.

This is exactly the transformation happening here: implementations are allowed to transform the loop while (i32[1] === 0); into const value = i32[1]; while (value === 0); by hoisting the non-atomic read i32[1] out of the loop.

I've measured a ~5% slowdown in Chrome/ium after changing the while loop to use load(view, 0), but my thinking is that these are footguns that shouldn't require users' attention.

The fact that both Safari and Chrome never JIT-inline that loop makes the API easier to reason about, and, imho, I don't see a different use case for views of shared array buffers ... it's already super cumbersome to get SAB enabled due to the Meltdown/Spectre paranoia; it's unclear what these APIs bring, or why merely accessing a view of a shared array buffer should not account for the fact that the buffer might have been modified somewhere else.

Guards / hints for understanding that are all over the place; for example, the fact that such a buffer travels via postMessage right before these loops or checks happen looks like an easy way to infer that the JIT should mind its own business around that view, but hey ... I guess this will remain a won't-fix, developers will remain confused by the inconsistency, and I'll remain skeptical of a JIT compiler that transforms a perfectly valid loop (considering the context, and considering other browsers don't make it infinite either) into an infinite loop without even a warning.

You can close this bug if you feel no action can or will be taken, but imho somebody should document this unexpected hara-kiri decided by the JIT on MDN.
