Open Bug 1870699 Opened 2 years ago Updated 2 months ago

Don't poll WebGPU from a timer

Categories

(Core :: Graphics: WebGPU, defect, P3)

defect

Tracking

()

ASSIGNED

People

(Reporter: jimb, Assigned: jimb)

References

(Blocks 7 open bugs)

Details

(Keywords: leave-open)

Attachments

(6 files)

Right now, when Firefox creates a PWebGPU IPDL parent, the WebGPUParent constructor starts a timer thread calling wgpu_core::global::device::Global::poll_all_devices, to respond to the completion of work on the GPU. This polling runs ten times a second, and probably has an unacceptable impact on battery life.

Instead, creating a WebGPUParent should start a separate thread that polls the wgpu devices only as necessary. Submitting a command buffer to a queue should dispatch a runnable to that thread (in classic Gecko style, using nsIEventTarget::Dispatch or something like it) that calls Global::device_poll, passing Maintain::WaitForSubmissionIndex with the submission index returned by the submission. This will cause the thread to block until the GPU has completed the submitted work.

Various other WebGPU operations require calling device_poll as well, and these can be handled by sending the thread appropriate runnables.

This approach should ensure that WebGPU consumes the CPU's attention only when we are actually waiting for work.

Blocks: webgpu-v1
Severity: -- → S3
Priority: -- → P3

Linking this other report with an attached test case and some discussion: https://bugzilla.mozilla.org/show_bug.cgi?id=1900273 and the issue on wgpu github with a link to a solution to this problem on the servo browser: https://github.com/gfx-rs/wgpu/issues/6660 https://github.com/servo/servo/pull/32266

In terms of severity, this makes compute shaders so slow as to be unusuable on firefox.

Just pinging here to see if there is any chance of fixing this sometime soon? :) Thanks!

This is definitely a priority for before the release we're hoping for by June. However, we haven't gotten more fine-grained with our planning than that right now, sorry!

Here's your periodic ping to see if there is any update on when this might be fixed :)

Unfortunately, I think this task is going to wait until closer to EOY. It's not clear, given our current backlog and decision to ship (which will, in turn, impose more backlog that we haven't found yet).

NI'ing :jimb for his thoughts as the team's tech lead.

Flags: needinfo?(jimb)

Actually, I think we may want to raise the priority here.

I just tried lowering POLL_MS_TIME in WebGPUParent.cpp from 100ms to 5ms, and it lowers the time needed to run the WebGPU CTS test webgpu:shader,execution,expression,call,builtin,abs from 12s to 1.2s. So it seems plausible that our current polling approach is behind some of our CTS timeouts.

Polling at 100ms is ridiculous: that's an average 50s of latency, unless you're submitting continuously and taking advantage of the implicit poll that submission does.

Flags: needinfo?(jimb)

We did this in servo last year and let me tell you it was worth it. We slashed CTS run times in half.

Polling at 100ms is ridiculous: that's an average 50s of latency, unless you're submitting continuously and taking advantage of the implicit poll that submission does.

That is probably true for most rendering scenarios where you have requestAnimationFrame doing submits for every frame.

So is the plan to just change this POLL_MS_TIME constant or do the solution outlined above? Any timeframe for this happening? :)

Randall: I know that :jimb has been doing some implementation design research, and has converged on a design for wait-polling in a separate thread when Firefox knows that some GPU work is in flight. I don't know when we'll be getting to implementing it (minus what I said in comment 5 for before EOY). The changes seem straightforwardโ€”we just need to figure out when it will happen compared to the other priorities we have.

I was looking at this a bit today. Things are a bit hairy around WebGPUParent::GetFrontBufferSnapshot because it calls ffi::wgpu_server_poll_all_devices itself.

We talked about the best way to fit this into the solution proposed above. JimB proposed having the callback set a condition variable and have GetFrontBufferSnapshot wait on that condition variable and then do the readback. This will keep all the polling (expect maybe for the shutdown poll) on the polling thread.

Blocks: 1993333

I think we can use mozilla::LazyIdleThread for this. LazyIdleThread is an XPCOM utility that takes care of spinning up a thread when necessary, shutting it down when it's idle, managing its work queue, and so on.

The wgpu_bindings code needs to ensure that any wgpu_core operation that hopes for a callback to be invoked (basically just map_async and on_submitted_work_done) also submits a runnable to the LazyIdleThread that polls the given device for the submission index. All wgpu_core operations that accept a callback also return a submission index by which the callback will be invoked.

For blocking synchronously in WebGPUParent::GetFrontBufferSnapshot, we can:

  • create a mozilla::Monitor,
  • submit a runnable that notifies that monitor after polling the device for all work to complete,
  • and then wait on that monitor.

I had hoped we could do all this very simply entirely in the Rust code. But I think it's important that the polling thread should exit if it's idle. Using std::sync::mspc::channel to communicate with such a thread is racy: the sender could send exactly as the thread decides to exit, such that the send succeeds, but then the message is lost when the thread drops the receiver. You'd end up needing a Mutex protecting a work queue, and a CondVar for the thread to wait on with a timeout. It's not rocket science, but it's not as trivial as I had hoped.

On the other hand, if we did have our own Mutex/CondVar gadget, we could implement the synchronous blocking behavior directly with that, instead of bringing a Monitor into the story as well.

Ultimately I think it's a close call between the benefits of keeping things in Rust versus the benefits of using off-the-shelf XPCOM stuff that's already widely used in the rest of Firefox.

In case it is relevant, I'm seeing very slow behavior reading back from GPU via map_async in wgpu-native. I'm assuming this is due to the same underlying issue, but maybe not? In any case, if it IS the same issue, then that would definitely weigh in favor of doing it in rust!

In any case, getting this fixed one way or another would be the best xmas present :)

(In reply to Randall O'Reilly from comment #12)

In case it is relevant, I'm seeing very slow behavior reading back from GPU via map_async in wgpu-native. I'm assuming this is due to the same underlying issue, but maybe not? In any case, if it IS the same issue, then that would definitely weigh in favor of doing it in rust!

This madness is specific to Firefox. So, it depends:

  • If you're compiling wgpu-native to WebAssembly and running it in Firefox, then this bug affects you - but it will affect you whether we fix it in the C++ code or the Rust code.

  • If you're running wgpu-native directly on the desktop, then Firefox isn't in the picture and this bug doesn't affect you. It's up to your app to figure out how to poll the device effectively.

This is running directly on the desktop: I guess I need to dig into it more. It is nowhere near as slow as on the web under Firefox, but it is notably asymmetric in that copying up to the GPU is much faster than getting back from it. Thanks!

Depends on: 2007600

Relevant feedback from a user for this bug in bug 2008103, comment 6:

Multi-threaded engines cannot submit at every RAF. Otherwise, if the scene takes longer to render than the RAF interval, rendering will fall further and further behind, latency will grow and the app will become unusable quick.

I highly recommend dropping poll-time to 0/each browser frame, when at least one CB is midflight that has a completion callback attached. Otherwise it undermines performance of hardcore multi-threaded renderers. Hopefully that's a quick fix. For real-time 3D, every millisecond matters and developers profile draw-calls in ns time.

Although it is helpful to point to workarounds, this bug is not the right place to discuss general WebGPU worker architecture. Please don't post more comments on that topic here.

Depends on: 1972635
Blocks: 1972635
No longer depends on: 1972635
Keywords: leave-open

Don't implement Send for wgpu_bindings::server::BufferMapClosure.

Closures passed to wgpu_core::global::Global::buffer_map_async must implement
Send, since one can call buffer_map_async on one thread, and then poll the
device on another. However, Firefox's map closures need to call methods on
WebGPUParent, which should only be used on the GPU process's IPDL thread for
the PWebGPU actor.

The old code addresses this by calling both Global::buffer_map_async and
Global::poll_all_devices only from WebGPUParent's thread. The
WebGPUParentPtr captured by the map closure is not Send, but since it will
never actually be sent to another thread, the old code can justify an unsafe
implementation of Send for BufferMapClosure.

This patch uses the moz_task crate to solve the problem without any unsafe
Send implementations. The closure passed to Global::buffer_map_async, rather
than interacting with WebGPUParent itself, merely sends the mapping result to
an async task spawned on the WebGPUParent's thread via a one-shot channel.
That task then reports the result to the WebGPUParent. This way, the
WebGPUParentPtr pointer never has to leave its thread, so no unsafe Send
implementation is necessary. The synchronization which buffer_map_async
conservatively assumes might be taking place is handled by the one-shot channel,
which we get from the futures-channel crate.

Assignee: nobody → jimb
Status: NEW → ASSIGNED

try push for the above patch

Remove the unsafe impl Send for DeviceLostClosure, and instead use a
futures_channel::oneshot channel to report device loss to the
WebGPUParent's thread.

See the parent commit's message for a detailed explanation.

Remove the unsafe impl Send for SubmittedWorkDoneClosure, and
instead use a futures_channel::oneshot channel to report the
completion of the work to the WebGPUParent's thread.

See the grandparent commit's message for a detailed explanation.

try push for all three patches

The first patch in the stack causes failures because the callers of wgpu_server_buffer_map assume that the closure they pass will be invoked from the next poll call they perform, whereas the patch causes it to run in the next event.

Or at least, WebGPUParent::GetFrontBufferSnapshot assumes this. I can't be sure, but it seems like WebGPUParent::SwapChainPresent doesn't care.

That seems tricky, it's unfortunate that PCanvasManager's GetSnapshot is sync forcing a blocking wait in the GPU process. It would be ideal if it was a blocking call only in the content process.

(I was recently looking at GetFrontBufferSnapshot & SwapChainPresent and wondered why they used different code but executing almost the same steps)

I think I can just introduce a wgpu_server_buffer_map_blocking function, and it'll all work.

Define a new function, wgpu_server_buffer_map_blocking that maps a
wgpu buffer and blocks the calling thread until its contents are ready
for access by the CPU.

Use this function in WebGPUParent::GetFrontBufferSnapshot, rather
than calling wgpu_server_poll_all_devices explicitly to wait for the
buffer to be ready.

try push for D285397 alone

try push for the first two patches combined:

  • Use new wgpu_server_buffer_map_blocking for GetFrontBufferSnapshot.
  • Use moz_task for the wgpu_server_buffer_map (async) callback.

In wgpu_bindings::Message::QueueOnSubmittedWorkDone, include the id
of the device to which the queue belongs. Pass this value in
mozilla::webgpu::Queue::OnSubmittedWorkDone. Ignore it for now in
wgpu_bindings::server::process_message; later commits will use it to
request wgpu device polling.

Rather than polling all devices in each
mozilla::webgpu::WebGPUParent's Global` every 100ms, use the XPCOM
background I/O thread pool to make a blocking polling call for each
operation submitted to the GPU, on its device.

Delete WebGPUParent::mTimer and associated code.

In wgpu_bindings::server:

  • Add Global::poll_queue, an XPCOM task queue to which we can submit polling
    requests. This queue targets the XPCOM background I/O thread pool, and ensures
    that only one poll request will run at a time.

  • For each operation that submits work to the GPU, also submit a task to
    poll_queue to poll the wgpu_core device for the task's completion, ensuring
    that resources are cleaned up and callbacks are invoked promptly.

  • In code that submits a callback, do a non-blocking poll to cover the case
    where the callback is ready immediately.

  • In wgpu_server_back_blocking, rather than calling global.device_poll
    ourselves, use poll_queue as the other code does, and use a
    std::sync::mpsc channel to block the calling thread until the mapping is
    done.

try push for the whole stack

I reordered the stack a bit to put D285434 first, since it includes a clobber, which I would like to not have distracting me as I work on later bits of the stack.

I'm not sure, but I think the failures in the last try push are due to tests hanging here:

    if (o instanceof GPUDevice) {
      this.objectsToCleanUp.push({
        async destroyAsync() {
          o.destroy();
          await o.lost;
        }
      });
    } else {

I think this is because the patch doesn't arrange for a final poll of a device when it's destroyed.

I am not delighted with the way this patch stack depends on our catching every operation that might need a final poll. For example, see my comment on the queue write bug, pointing out another place we'll need to stay on top of. The timer had a nice "set and forget" character to it.

But I would argue that understanding which operations require follow-up is a reasonable ask of people hoping to maintain WebGPU in Firefox; this stuff is pretty central to the architecture, and if you can't handle that, you're going to have many other problems anyway. Obviously, introducing new critical background knowledge of this kind should be avoided whenever possible, but I don't see a way around it here.

And if my guess is correct that the problem is device loss promises, then that's something that I should reasonably have anticipated: buffer mapping, submitted work completion, and device loss are the only three kinds of callbacks in the wgpu-core API, and the current version of the patch only addresses the first two.

Confirmed that device loss promises no longer fire.

One more to-do item: there is a remaining call to wgpu_server_device_poll in WebGPUParent::GetFrontBufferSnapshot that needs to be moved to the poll queue. I don't really understand what this poll is doing, as it doesn't seem to be followed by anything that would map a buffer.

I think I fixed device loss; try push

Another to-do item: the background thread refers to the device it's to poll by id, but the device could be dropped or destroyed at any time by the main thread, removing its id from the registry and causing the polling task to panic. The polling tasks should probably have an Arc<wgpu_core::device::Device>, not an id.

try push for the initial five patches, which should be okay to land separately.

Pushed by jblandy@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/56a8c9a001f2 https://hg.mozilla.org/integration/autoland/rev/7d39dee43ac9 WebGPU: Supply device in QueueOnSubmittedWorkDone. r=webgpu-reviewers,teoxoy https://github.com/mozilla-firefox/firefox/commit/a675f00c72d9 https://hg.mozilla.org/integration/autoland/rev/809b9d75bd06 Use new `wgpu_server_buffer_map_blocking` for `GetFrontBufferSnapshot`. r=webgpu-reviewers,teoxoy https://github.com/mozilla-firefox/firefox/commit/8fb05567ea11 https://hg.mozilla.org/integration/autoland/rev/cc7788c6b2df Remove `unsafe impl Send for BufferMapClosure` r=webgpu-reviewers,aleiserson https://github.com/mozilla-firefox/firefox/commit/feeac864dd26 https://hg.mozilla.org/integration/autoland/rev/7866eadfb032 Remove `unsafe impl Send for DeviceLostClosure` r=webgpu-reviewers,teoxoy https://github.com/mozilla-firefox/firefox/commit/33d59f4c8971 https://hg.mozilla.org/integration/autoland/rev/68ae236565fe Remove `unsafe impl Send for SubmittedWorkDoneClosure` r=webgpu-reviewers,teoxoy

One of the test failures in comment 41 (log) is in webgpu:api,operation,buffers,map:mapAsync,mapState:*. The crash is here: WebGPUParent::MapCallback, the callback for GPUBuffer.mapAsync completion, is not finding an entry for the buffer in WebGPUParent::mSharedMemoryMap.

We create mSharedMemoryMap entries when we create a buffer, and we remove them when the buffer is destroyed. I haven't looked at the test yet, but perhaps it calls mapAsync and then destroys the buffer before the map request can come back. But if that's the case, then polling from a timer should hit this crash as well.

In any case, I can reproduce the crash locally against the CTS web site. It doesn't happen every time, but it happened the second time I tried.

mach run -- 'https://gpuweb.github.io/cts/standalone/?runnow=1&q=webgpu:api,operation,buffers,map:mapAsync,mapState:*'

Hit refresh a few times.

I've added some logging, and here's the behavior before the crash:

[Child 2032140: Main Thread]: D/WebGPU Buffer::Create sent message to create 4294967433
[Child 2032140: Main Thread]: D/WebGPU Buffer::MapAsync 4294967433 
[Child 2032140: Main Thread]: D/WebGPU Buffer::Destroy 4294967433 calling wgpu_client_destroy_buffer
[Parent 2031787: CanvasRenderer]: D/WebGPU wgpu_server_set_buffer_map_data buffer 4294967433
[Child 2032140: Main Thread]: D/WebGPU Buffer::Destroy 4294967433 calling wgpu_client_destroy_buffer
[Parent 2031787: CanvasRenderer]: D/WebGPU WebGPUParent::DeallocBufferShmem buffer 4294967433
[Parent 2031787: CanvasRenderer]: D/WebGPU WebGPUParent::MapCallback 4294967433 mapData 0
[2031787] Assertion failure: mapData, at /home/jimb/moz/central/dom/webgpu/ipc/WebGPUParent.cpp:724

The 4294967433 is the buffer id.

We create mSharedMemoryMap entries when we create a buffer, and we remove them when the buffer is destroyed. I haven't looked at the test yet, but perhaps it calls mapAsync and then destroys the buffer before the map request can come back. But if that's the case, then polling from a timer should hit this crash as well.

wgpu-core calls the callback with BufferAccessError::MapAborted here so MapCallback should take the failure path that doesn't have the release assert. I'm not sure why it does with the changes in the patch stack.

Oh it's possible that the poll task enqueued by map_buffer_and_poll runs after the call to wgpu_server_dealloc_buffer_shmem but before the call to buffer_destroy right under it; since they run in parallel - it's a race.

Oh, so the determination of whether to pass Err(MapAborted) or Ok(()) to the map callback is now made on the background thread, instead of on the WebGPUParent thread, as it was when we polled from a timer, so now it can race with other operations on the buffer.

Here's what happens:

  • polling on the background I/O thread sends map success
  • WebGPUParent thread destroys the buffer
  • WebGPUParent thread then receives the map success

With a little more logging:

[DEBUG wgpu_bindings::server] process_buffer_map: requesting map
[INFO  wgpu_core::device::global] Buffer::map_async Id(40,1) offset 0 size Some(8) op: BufferMapOperation { host: Read, callback: Some("?") }
[INFO  wgpu_core::device::global] Device::poll Poll
[DEBUG wgpu_bindings::server] server.rs: parent received Message::DestroyBuffer(Id(40,1))
    
[Parent 2094185: CanvasRenderer]: D/WebGPU WebGPUParent::DeallocBufferShmem buffer 4294967336
[DEBUG wgpu_bindings::server] process_buffer_map: wgpu-core invoked callback on background I/O thread, sending map status
[INFO  wgpu_core::device::global] Buffer::destroy Id(40,1)
[DEBUG wgpu_bindings::server] process_buffer_map: WebGPUParent thread received map status
[INFO  wgpu_core::device::global] Device::poll Poll
[Parent 2094185: CanvasRenderer]: D/WebGPU WebGPUParent::MapCallback 4294967336 mapData 0
[2094185] Assertion failure: mapData, at /home/jimb/moz/central/dom/webgpu/ipc/WebGPUParent.cpp:724

Step by step:

  • CanvasRenderer thread receives Message::BufferMap:
    • requests mapping
    • enqueues a poll call on the background I/O thread
  • These two steps may happen in either order:
    • CanvasRenderer thread receives Message::DestroyBuffer
      • calls DeallocBufferShmem
    • wgpu-core invokes the map callback on the background I/O thread
      • sends map success status to the main thread
  • CanvasRenderer thread continues processing Message::DestroyBuffer
    • calls Global::buffer_destroy
  • CanvasRenderer thread receives map success status from background I/O thread
    • calls MapCallback
  • MapCallback is surprised to see that DeallocBufferShmem has
    destroyed the shmem of an apparently successfully mapped buffer.

According to the WebGPU spec, if the GPUBuffer.destroy message to the device timeline and the map success message to the content timeline cross in transit, the race has no effect because the content timeline re-checks [[pending_map]], which GPUBuffer.destroy cleared (by calling unmap), and processes the map as cancelled.

In our implementation, there's no recheck in the content process. The recheck occurs on the CanvasRenderer thread. So I think we have always had a problem with resolving map request promises successfully after destroying buffers.

Here's a summary of the overall state of this bug:

The current patch stack generally works and improves performance as expected.

However, it's not ready to land: there are inherent races in the WebGPU API between content and the GPU, and the spec explains how they need to be resolved, but wgpu makes it difficult to follow the spec while still being responsive. The difference between the two is causing assertion failures with the current patch.

Teo had some good review comments, but I don't anticipate any troubles addressing those.

I've been on PTO for the last week, but I'm back now, and this is my top priority.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: