2008103 - WebGPU scheduling issues with `onSubmittedWorkDone` and crash on window resize

Reporter

Description

6 months ago

Steps to reproduce:

Hello Mozilla!

We are the developers of RSX Engine, a new 3D Engine that focuses on realtime collaboration and high-end rendering with workflows similar to Unity and UE.

Our browser focus was Chrome for many years, but now that we're stable and fast on Chrome, we're extending to Firefox. The runtime and the editor run almost perfectly on Firefox; however, there are some critical issues that prevent us from being stable. RSX Engine is multi-threaded: one worker ("main") runs the main loop with simulation etc., and another thread ("render") does the rendering.

Scheduling issues via `onSubmittedWorkDone`

To keep "main" from submitting too many frames to "render", and to reduce latency, the "render" thread signals a condition variable when it has finished rendering a frame. On "render", this works by using the queue's `onSubmittedWorkDone` callback to ensure that the command buffers from the previous frame have finished executing.

The "render" thread is command-based and "yields" as soon as the command queue is exhausted. This should allow the browser to process microtasks and dispatch completion events.
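A minimal sketch of the pacing scheme described above, not RSX Engine's actual code. It swaps the condition variable for `Atomics.wait`/`Atomics.notify` on a shared counter. `FramePacer`, `PacedQueue`, and `submitFrame` are hypothetical names, `PacedQueue` is only a stand-in for the two `GPUQueue` methods used here, and the sketch assumes a single "main" producer. In a browser, `SharedArrayBuffer` additionally requires a cross-origin isolated page.

```typescript
// Stand-in for the two GPUQueue methods this sketch relies on (hypothetical type).
interface PacedQueue {
  submit(commandBuffers: readonly unknown[]): void;
  onSubmittedWorkDone(): Promise<unknown>;
}

// Hypothetical pacer: slot 0 of the shared Int32Array counts frames in flight.
class FramePacer {
  private readonly state: Int32Array;
  private readonly maxInFlight: number;

  constructor(buffer: SharedArrayBuffer, maxInFlight: number) {
    this.state = new Int32Array(buffer);
    this.maxInFlight = maxInFlight;
  }

  // "main" side: claim a frame slot if fewer than maxInFlight frames are pending.
  // Load-then-add is only safe because a single producer is assumed.
  tryBeginFrame(): boolean {
    if (Atomics.load(this.state, 0) >= this.maxInFlight) return false;
    Atomics.add(this.state, 0, 1);
    return true;
  }

  // "main" side, worker threads only: sleep until "render" changes the counter
  // or the timeout expires.
  waitForSlot(timeoutMs: number): void {
    const current = Atomics.load(this.state, 0);
    if (current >= this.maxInFlight) {
      Atomics.wait(this.state, 0, current, timeoutMs);
    }
  }

  // "render" side: release a slot and wake any waiter.
  frameDone(): void {
    Atomics.sub(this.state, 0, 1);
    Atomics.notify(this.state, 0);
  }

  inFlight(): number {
    return Atomics.load(this.state, 0);
  }
}

// "render" side: submit a frame and release its slot once the GPU work is done.
function submitFrame(
  queue: PacedQueue,
  commandBuffers: readonly unknown[],
  pacer: FramePacer,
): Promise<void> {
  queue.submit(commandBuffers);
  return queue.onSubmittedWorkDone().then(() => pacer.frameDone());
}
```

With this shape, the latency of the `onSubmittedWorkDone` promise directly bounds how quickly "main" is allowed to produce the next frame, which is why slow completion callbacks hurt so much.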

This works great on Safari and Chrome, with rendering performance identical to native. However, on Firefox, as soon as the framerate drops below a specific threshold, the whole system falls apart and framerates tank even further. If the engine keeps rendering light workloads at e.g. 150fps, it's fine.

Looking at the Firefox profiler (which is awesome) shows that nothing is happening at all. Firefox's WebGPU renderer is idle and so are our worker threads.

We have spent a lot of time investigating this and changing our approaches, but we can find no workaround.

Crash on window resize

Resizing the window while pumping frames can crash the page. The entire browser, not just the tab, hangs; it eventually recovers, and the page is black. There are no logs in the console.

Demo to reproduce

You can find a demo here: https://runtime.rsxengine.com/109a0b8f-4361-4de8-9d16-07de25c326bb/
Note: this demo allocates a wasm32 heap of 3.9 GB; this is intentional for the tests we currently run. The actual heap memory used is a fraction of the total heap size.

Actual results:

Scheduling is incorrect, framerate drops.
Window resize causes a crash or hang, potentially recovering to a black page.

Expected results:

Scheduling should be high-performance and low-latency.
Window resize should work gracefully.

BugBot [:suhaib / :marco/ :calixte]

Comment 1

6 months ago

The Bugbug bot thinks this bug should belong to the 'Core::Graphics: WebGPU' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Graphics: WebGPU

Product: Firefox → Core

Erich Gubler [:ErichDonGubler] (he/him)

Comment 2

6 months ago

I am willing to bet that the timing-related issues are, at the very least, related to bug 1870699, where we have a thread polling for GPU work completion every 100 milliseconds. 😅 If you're feeling adventurous, you could try compiling Firefox with a smaller POLL_TIME_MS (see that bug's OP for more details), and see if that changes things for you. Alternatively, we might be able to kick off a CI build you could test out locally. Just let us know!

For the crashes, what crashes do you see in about:crashes? I'm betting you'll find crash instances there that should help us move forward on that front.

Severity: -- → S3

Depends on: 1870699

Erich Gubler [:ErichDonGubler] (he/him)

Comment 3

6 months ago

Trying to reproduce on macOS, I don't get any crashes, though we do get many fewer frames than Chrome. Is this specific to the platform you're testing on? Presumably Windows? Can you confirm that, please?

Flags: needinfo?(dutty83)

Erich Gubler [:ErichDonGubler] (he/him)

Updated

6 months ago

URL: https://runtime.rsxengine.com/109a0b8...

cryptocroc

Reporter

Comment 4

6 months ago

Flags: needinfo?(dutty83)

cryptocroc

Reporter

Comment 5

6 months ago

NOTE: This is a repost as the markdown formatter broke my replies. Please remove the previous post.

Thanks for the swift feedback! It's very much appreciated and shows great commitment to the project.

I am willing to bet that the timing-related issues are, at the very least, related to bug 1870699, where we have a thread polling for GPU work completion every 100 milliseconds. 😅

Oh boy, that is slow! We really need asap completion of the callbacks to achieve a high-performance loop.

If you're feeling adventurous, you could try compiling Firefox with a smaller POLL_TIME_MS (see that bug's OP for more details), and see if that changes things for you. Alternatively, we might be able to kick off a CI build you could test out locally. Just let us know!

I think it'll be difficult for me to build, but if you can procure a build with POLL_TIME_MS set to 1, I'd be happy to try it. Ideally it would be for macOS so I could test on this machine.

Trying to reproduce on macOS, I don't get any crashes, though we do get many fewer frames than Chrome. Is this specific to the platform you're testing on? Presumably Windows? Can you confirm that, please?

We can reproduce the crash/hang on both macOS and Windows. Try to aggressively resize the window; it reproduces 100% of the time for us. We can reproduce it even when running inside an iframe: https://hello.polyverse.cloud/Assets/da667966-c951-42bd-852e-44fbc5f2b17d/dab4e85f-37de-4c77-bc04-e58a8c574c79

On my macOS machine (Tahoe 26.2, M4 Pro), the hang always recovered with the black screen after a 5-10s timeout; the crash was reported by a colleague on Windows. That said, I do have crash reports from the hang; I've submitted the most recent one here: https://crash-stats.mozilla.org/report/index/3e4f772e-994b-414d-a6ed-c228f0251231

cryptocroc

Reporter

Comment 6

6 months ago

Also, just saw this:

Polling at 100ms is ridiculous: that's an average 50ms of latency, unless you're submitting continuously and taking advantage of the implicit poll that submission does.
https://bugzilla.mozilla.org/show_bug.cgi?id=1870699#c6

This explains why we're getting high fps when rendering at high fps: we continuously submit work and benefit from the implicit poll. As soon as the GPU gets busier, we miss it, the whole "poll on submit" luck falls apart, and we're stuck at 100ms. At least that's a workaround 😅: keep submitting empty CBs.
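To put numbers on the quoted claim, here is a toy model (an illustration only, not a description of Firefox's actual scheduler). It assumes completion is observed at whichever comes first: the next fixed poll tick or the next queue submission, and that GPU work finishes at uniformly spread times. `observedLatencyMs` and `meanLatencyMs` are hypothetical helper names.

```typescript
// Delay between GPU work finishing and the browser noticing it, under the toy
// model: noticed at the next poll tick or the next submit, whichever is sooner.
function observedLatencyMs(
  doneAtMs: number,
  pollIntervalMs: number,
  submitIntervalMs?: number,
): number {
  const nextPoll = Math.ceil(doneAtMs / pollIntervalMs) * pollIntervalMs;
  const nextSubmit =
    submitIntervalMs === undefined
      ? Infinity
      : Math.ceil(doneAtMs / submitIntervalMs) * submitIntervalMs;
  return Math.min(nextPoll, nextSubmit) - doneAtMs;
}

// Average the latency over completion times spread evenly across one poll
// interval (0.5 ms, 1.5 ms, ...). Assumes a whole-millisecond poll interval.
function meanLatencyMs(pollIntervalMs: number, submitIntervalMs?: number): number {
  let total = 0;
  let samples = 0;
  for (let doneAtMs = 0.5; doneAtMs < pollIntervalMs; doneAtMs += 1) {
    total += observedLatencyMs(doneAtMs, pollIntervalMs, submitIntervalMs);
    samples += 1;
  }
  return total / samples;
}
```

Under this model, `meanLatencyMs(100)` comes out at 50 ms, while `meanLatencyMs(100, 10)` (a submission every 10 ms) comes out at 5 ms, which matches the observation that things only hold together while work is submitted continuously.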

cryptocroc

Reporter

Comment 7

6 months ago

Also, to provide some ISV feedback on this comment:

That is probably true for most rendering scenarios where you have requestAnimationFrame doing submits for every frame
https://bugzilla.mozilla.org/show_bug.cgi?id=1870699#c7

Multi-threaded engines cannot submit at every RAF. Otherwise, if the scene takes longer to render than the RAF interval, rendering will fall further and further behind, latency will grow, and the app will quickly become unusable.
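The backlog effect can be illustrated with a small calculation. This is a simplified model that assumes one frame is submitted on every RAF tick and that the GPU processes frames strictly one after another; `frameLatenciesMs` is a hypothetical helper.

```typescript
// Latency (submission to completion) of each frame when one frame is submitted
// per RAF tick and the GPU renders frames sequentially.
function frameLatenciesMs(
  frames: number,
  rafIntervalMs: number,
  renderMs: number,
): number[] {
  const latencies: number[] = [];
  let gpuFreeAtMs = 0;
  for (let i = 0; i < frames; i++) {
    const submittedAtMs = i * rafIntervalMs;
    // A frame starts once it has been submitted and the GPU is free.
    const startedAtMs = Math.max(gpuFreeAtMs, submittedAtMs);
    gpuFreeAtMs = startedAtMs + renderMs;
    latencies.push(gpuFreeAtMs - submittedAtMs);
  }
  return latencies;
}
```

With a 10 ms tick and 15 ms of render time per frame, `frameLatenciesMs(5, 10, 15)` yields `[15, 20, 25, 30, 35]`: every frame adds another 5 ms of latency. With 5 ms of render time, the latency stays flat at 5 ms.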

I highly recommend dropping the poll time to 0 (or polling on each browser frame) whenever at least one in-flight CB has a completion callback attached. Otherwise it undermines the performance of hardcore multi-threaded renderers. Hopefully that's a quick fix. For real-time 3D, every millisecond matters, and developers profile draw calls in nanoseconds.

Erich Gubler [:ErichDonGubler] (he/him)

Comment 8

6 months ago

Edited

(In reply to cryptocroc from comment #6)

This explains why we're getting high fps when rendering at high fps: we continuously submit work and benefit from the implicit poll. As soon as the GPU gets busier, we miss it, the whole "poll on submit" luck falls apart, and we're stuck at 100ms. At least that's a workaround 😅: keep submitting empty CBs.

Just please be careful about shipping any workarounds in places that are difficult to remember to undo; many workarounds like this get forgotten, and Firefox isn't forced to do the right thing.

Erich Gubler [:ErichDonGubler] (he/him)

Comment 9

6 months ago

Okay, I actually was able to reproduce this. Earlier, I did the very silly thing where I didn't follow the written steps to reproduce and actually resize the window. I can reproduce this on my M1 MacBook Pro. 😅 For example, in this crash report:

|[0][GFX1-]: Killing GPU process due to IPC reply timeout (t=4334.09)

Next step: to figure out what IPC message(s) is/are causing the GPU process to choke.

cryptocroc

Reporter

Comment 10

6 months ago

Glad you managed to reproduce this!

Just please be careful about shipping any workarounds in places that are difficult to remember to undo; many workarounds like this get forgotten, and Firefox isn't forced to do the right thing

Indeed, this is best fixed in Firefox. The workaround of polling while the "render" thread is paused and waiting might fix the FPS instability, but it would not fix other cases, such as `mapAsync` callbacks, where the "render" thread is not explicitly paused and yielding.
As I mentioned, I'm happy to test a build if you can procure a macOS build with the poll interval set to a low value.
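For reference, the "keep submitting empty CBs" workaround discussed in this thread could look roughly like the sketch below. It relies on the implicit poll on submission described in bug 1870699, and, per comment 8, should be kept behind an easy-to-remove switch. `SubmitOnlyQueue`, `nudgeOnce`, and `startSubmitNudge` are hypothetical names; `SubmitOnlyQueue` is a stand-in for the `submit` method of `GPUQueue`.

```typescript
// Stand-in for the one GPUQueue method this sketch uses (hypothetical type).
interface SubmitOnlyQueue {
  submit(commandBuffers: readonly unknown[]): void;
}

// Submit an empty command-buffer list, but only while the "render" thread is
// waiting on a completion callback. Returns whether a nudge was sent.
function nudgeOnce(queue: SubmitOnlyQueue, isWaiting: () => boolean): boolean {
  if (!isWaiting()) return false;
  queue.submit([]);
  return true;
}

// Repeat the nudge on a timer; the returned function stops it again.
function startSubmitNudge(
  queue: SubmitOnlyQueue,
  isWaiting: () => boolean,
  intervalMs: number = 1,
): () => void {
  const id = setInterval(() => nudgeOnce(queue, isWaiting), intervalMs);
  return () => clearInterval(id);
}
```

As noted above, this only helps while the thread is explicitly waiting; it does nothing for callbacks such as `mapAsync` that complete while the thread is busy elsewhere.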

Erich Gubler [:ErichDonGubler] (he/him)

Comment 11

6 months ago

I was wondering if this might be because of a similar root cause as bug 1971452, but if this affects Windows, that seems unlikely.

Erich Gubler [:ErichDonGubler] (he/him)

Comment 12

6 months ago

Made some builds for you that set POLL_TIME_MS to 1. You can find them in this set of CI jobs I pushed (often called a Try push): https://treeherder.mozilla.org/jobs?repo=try&revision=e5663b9b3c641aa76ed33f15db440b3f021f6f29&selectedTaskRun=eHSnj1gaTfWfT3T9BTB-IA.0

☝🏻This link applies some filters to the set of jobs visible, so that builds are the only thing presented. In any of them, you can click on a job chip and do the following to get to a build:

1. The lower half of the page will present some controls when a job is selected.
   This step is already done for you by the link; in this case, the link selects the optimized macOS build for you.
2. The top of the lower controls is a tab strip. One of its tabs is titled Artifacts and Debugging Tools. Click this to show the artifacts produced by the job.
3. In the list of artifacts, find one that installs Firefox. I handle installs on macOS infrequently enough to not feel 100% confident which one you should pick, but I suspect target.dmg will be what you want.

LMK how this goes.

Flags: needinfo?(dutty83)

cryptocroc

Reporter

Comment 13

6 months ago

Thanks for providing the build, Erich! I've used "macOS opt":

✅ With poll set to 1, performance now seems similar to Chrome in a static scene. Way better than the setTimeout command-buffer submission.
❌ However, this build seems to have issues with the swapchain. Every other frame presents the oldest image in the swapchain. I recorded a video, but did not find a way to attach it to this reply.
❌ As expected, the crash on resize is still present.

Personally, I'd say that this fix is an OK hotfix for mainline FF until the proper fix proposed in https://bugzilla.mozilla.org/show_bug.cgi?id=1870699#c11 lands. I am aware that the following comment has limited value in the dev process, given my unfamiliarity with FF's architecture, but: why not keep the current thread that is spawned? Simply have it wait on a condition variable that is signaled whenever a CB is submitted or mapAsync is invoked. Increment an atomic, and keep the thread polling until the atomic hits 0 again; then have it wait again.
Or, if FF has a low-overhead thread pool, maybe it's worth just spawning a worker whenever the atomic is incremented from 0. Then the worker can just poll, without the need for the condition variable.
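The counting logic of this proposal can be sketched as follows. This is a single-threaded TypeScript illustration, not Firefox code: in a real implementation the counter would be an atomic and `start`/`stop` would wake or park the polling thread. `OnDemandPoller` and its callbacks are hypothetical, and the sketch assumes the counter is decremented once per completed CB or `mapAsync`.

```typescript
// Hypothetical sketch: poll only while at least one piece of GPU work with a
// completion callback is pending.
class OnDemandPoller {
  private pending = 0;
  private readonly start: () => void;
  private readonly stop: () => void;

  constructor(start: () => void, stop: () => void) {
    this.start = start;
    this.stop = stop;
  }

  // Called when a CB is submitted or mapAsync is invoked.
  workSubmitted(): void {
    if (this.pending === 0) this.start(); // 0 -> 1: wake the poller
    this.pending += 1;
  }

  // Called when a pending piece of work has completed.
  workCompleted(): void {
    if (this.pending === 0) return; // ignore unmatched completions
    this.pending -= 1;
    if (this.pending === 0) this.stop(); // back to 0: let the poller sleep
  }

  pendingCount(): number {
    return this.pending;
  }
}
```

The poller is only started on the 0 to 1 transition and stopped on the transition back to 0, so an idle page pays nothing for polling.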

Flags: needinfo?(dutty83)

cryptocroc

Reporter

Comment 14

6 months ago

Attached video FireFoxSwapchainIssue_720p.mov — Details

Attached a video of the swapchain issue that the custom build produces (poll=1ms).

Jim Blandy :jimb

Comment 15

5 months ago

I highly recommend dropping the poll time to 0 (or polling on each browser frame) whenever at least one in-flight CB has a completion callback attached. Otherwise it undermines the performance of hardcore multi-threaded renderers. Hopefully that's a quick fix. For real-time 3D, every millisecond matters, and developers profile draw calls in nanoseconds.

The best place to discuss polling strategy is in bug 1870699. I think we're going to do pretty much what you're suggesting there, but keep in mind that a browser is a constrained environment, since the DOM has so many other features and requirements that we need to satisfy simultaneously.

Erich Gubler [:ErichDonGubler] (he/him)

Updated

5 months ago

Priority: -- → P1

Comment hidden (obsolete)

cryptocroc

Reporter

Comment 18

5 months ago

I have provided input to https://bugzilla.mozilla.org/show_bug.cgi?id=1870699

Re the swapchain issue: we were able to narrow this down to the case where we render faster than 60fps. Note that we are not using RAF but manual scheduling.

Andy Leiserson [:aleiserson]

Updated

8 days ago

WebGPU scheduling issues with `onSubmittedWorkDone` and crash on window resize
