Closed Bug 1860570 Opened 9 months ago Closed 8 months ago

Hit MOZ_CRASH(PipelineLayout[119] does not exist) at /third_party/rust/wgpu-core/src/storage.rs:125

Categories

(Core :: Graphics: WebGPU, defect, P2)

x86_64
Linux
defect

VERIFIED FIXED
121 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox119 --- unaffected
firefox120 --- disabled
firefox121 --- verified

People

(Reporter: jkratzer, Assigned: bradwerth)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression, testcase, Whiteboard: [bugmon:bisected,confirmed][fuzzblocker])

Attachments

(2 files)

Testcase found while fuzzing mozilla-central rev ffe93e4e0835 (built with: --enable-debug --enable-fuzzing).

Testcase can be reproduced using the following commands:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch --build ffe93e4e0835 --debug --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html
Hit MOZ_CRASH(PipelineLayout[119] does not exist) at /third_party/rust/wgpu-core/src/storage.rs:125

    ==232244==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f35c89e1b55 bp 0x7f35a069ae00 sp 0x7f35a069adf0 T232368)
    ==232244==The signal is caused by a WRITE memory access.
    ==232244==Hint: address points to the zero page.
        #0 0x7f35c89e1b55 in MOZ_Crash /builds/worker/workspace/obj-build/dist/include/mozilla/Assertions.h:281:3
        #1 0x7f35c89e1b55 in RustMozCrash /mozglue/static/rust/wrappers.cpp:18:3
        #2 0x7f35c89e1aea in mozglue_static::panic_hook::habfbf582d66d5c86 /mozglue/static/rust/lib.rs:96:9
        #3 0x7f35c89e14eb in core::ops::function::Fn::call::h081d0c2d4ea076dc /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/ops/function.rs:79:5
        #4 0x7f35c9a4d1fd in _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..Fn$LT$Args$GT$$GT$::call::hb3a915ffd78277c6 /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:2007:9
        #5 0x7f35c9a4d1fd in std::panicking::rust_panic_with_hook::h75cd912a39a34e8a /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:709:13
        #6 0x7f35c9a4cf86 in std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h1498b46f7849e167 /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:597:13
        #7 0x7f35c9a4a245 in std::sys_common::backtrace::__rust_end_short_backtrace::hd36a39b27b98086b /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
        #8 0x7f35c9a4ccd1 in rust_begin_unwind /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
        #9 0x7f35c9aac9b2 in core::panicking::panic_fmt::h98ef273141454c23 /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
        #10 0x7f35c7bb7cec in wgpu_server_pipeline_layout_drop /gfx/wgpu_bindings/src/server.rs:982:5
        #11 0x7f35c1d7f3f9 in mozilla::webgpu::WebGPUParent::RecvImplicitLayoutDestroy(unsigned long, nsTArray<unsigned long> const&) /dom/webgpu/ipc/WebGPUParent.cpp:834:3
        #12 0x7f35c1d90dd0 in mozilla::webgpu::PWebGPUParent::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PWebGPUParent.cpp:2002:80
        #13 0x7f35bfe0bc0d in mozilla::gfx::PCanvasManagerParent::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PCanvasManagerParent.cpp:269:32
        #14 0x7f35bf37fc1f in mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) /ipc/glue/MessageChannel.cpp:1800:25
        #15 0x7f35bf37c972 in mozilla::ipc::MessageChannel::DispatchMessage(mozilla::ipc::ActorLifecycleProxy*, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message>>) /ipc/glue/MessageChannel.cpp:1725:9
        #16 0x7f35bf37d5f2 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::ActorLifecycleProxy*, mozilla::ipc::MessageChannel::MessageTask&) /ipc/glue/MessageChannel.cpp:1525:3
        #17 0x7f35bf37e73f in mozilla::ipc::MessageChannel::MessageTask::Run() /ipc/glue/MessageChannel.cpp:1623:14
        #18 0x7f35be6c7c4d in nsThread::ProcessNextEvent(bool, bool*) /xpcom/threads/nsThread.cpp:1192:16
        #19 0x7f35be6cebdd in NS_ProcessNextEvent(nsIThread*, bool) /xpcom/threads/nsThreadUtils.cpp:480:10
        #20 0x7f35bf386e4e in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /ipc/glue/MessagePump.cpp:300:20
        #21 0x7f35bf29fc41 in RunHandler /ipc/chromium/src/base/message_loop.cc:363:3
        #22 0x7f35bf29fc41 in MessageLoop::Run() /ipc/chromium/src/base/message_loop.cc:345:3
        #23 0x7f35be6c2f33 in nsThread::ThreadFunc(void*) /xpcom/threads/nsThread.cpp:370:10
        #24 0x7f35d38d4d0f in _pt_root /nsprpub/pr/src/pthreads/ptthread.c:201:5
        #25 0x7f35d4175ac2 in start_thread nptl/pthread_create.c:442:8
        #26 0x7f35d4207a3f  misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
    
    UndefinedBehaviorSanitizer can not provide additional info.
    SUMMARY: UndefinedBehaviorSanitizer: SEGV /builds/worker/workspace/obj-build/dist/include/mozilla/Assertions.h:281:3 in MOZ_Crash
    ==232244==ABORTING
Attached file Testcase

Verified bug as reproducible on mozilla-central 20231023141548-0dce3814f2ad.
The bug appears to have been introduced in the following build range:

Start: e0dd0b10e8fd0ea751f11fb0a6548ad9b6780e16 (20231016153418)
End: fa12efd7ca249d06b27ea86690ae0d0478f5dcce (20231016182434)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e0dd0b10e8fd0ea751f11fb0a6548ad9b6780e16&tochange=fa12efd7ca249d06b27ea86690ae0d0478f5dcce

Keywords: regression
Whiteboard: [bugmon:confirm][fuzzblocker] → [bugmon:bisected,confirmed][fuzzblocker]
Blocks: webgpu-v1
Severity: -- → S3
Priority: -- → P2

This bug has been marked as a regression. Setting status flag for Nightly to affected.

This bug prevents fuzzing from making progress; however, it has low severity. It is important for fuzz blocker bugs to be addressed in a timely manner.
:jimb, could you consider increasing the severity?

For more information, please visit BugBot documentation.

Flags: needinfo?(jimb)
Assignee: nobody → bwerth

This is happening because when we create a pipeline on an invalid device, we don't respond adequately to the error and we throw away the error-generated id. What we need to do is one or both of:

  1. Respond to the error by invalidating the newly-created pipeline.
  2. Take in the error-generated id and change the newly-created pipeline to that id.

First and foremost, we need to respond better to the error, so I'll try to build a patch that pursues strategy #1, above.

Flags: needinfo?(jimb)

Ugh, this is tricky. Since creation of a render pipeline gets encoded as a device action and sent in a way that won't get an immediate failure on a lost device, it's not easy to invalidate the content-side pipeline object. There may need to be an invalidation message sent from parent to child when the error has been generated. Alternatively, maybe the parent can first check if the id maps to an error in the registry, before asking wgpu to pipeline_layout_drop it through the gfx_select! macro.
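
A minimal sketch of the "check before dropping" alternative mentioned above, using a toy registry rather than the real wgpu-core storage. `ToyRegistry`, `drop_strict` and `drop_if_known` are hypothetical names invented for illustration; the real code goes through `gfx_select!` and the hub, and whether wgpu exposes a suitable lookup is an assumption here.

```rust
use std::collections::HashSet;

/// Toy stand-in for the parent-side view of registered pipeline layout ids.
/// This is NOT the wgpu-core registry type.
#[derive(Default)]
struct ToyRegistry {
    layouts: HashSet<u64>,
}

impl ToyRegistry {
    /// Models a drop that insists the id exists and panics otherwise,
    /// which is the behavior that produces the crash in this bug.
    fn drop_strict(&mut self, id: u64) {
        if !self.layouts.remove(&id) {
            panic!("PipelineLayout[{}] does not exist", id);
        }
    }

    /// Hypothetical guard: only forward the drop when the id is known.
    /// Returns true if a drop was actually performed.
    fn drop_if_known(&mut self, id: u64) -> bool {
        if self.layouts.contains(&id) {
            self.drop_strict(id);
            true
        } else {
            false
        }
    }
}
```

With this shape, a drop request for an id that was never registered becomes a no-op in the parent instead of a panic, at the cost of silently tolerating ids the parent has never seen.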

Actually, our panic is that there is no entry for the id, so that implies that the error setting in device_create_render_pipeline is not getting executed.

Okay, the id is getting correctly set to an error in hub.render_pipelines but it is being retrieved from hub.pipeline_layouts, where it doesn't exist. Perhaps we are calling the wrong function in response to the pipeline drop?

Okay, wgpu_client_create_compute_pipeline sets an implicit pipeline layout id in the child, without knowing that the pipeline creation itself will eventually fail. When that fails in device_create_render_pipeline, the pipeline layout id is never inserted by create_render_pipeline, so it doesn't exist when the pipeline is eventually dropped and tries to also drop its implicit pipeline layout.
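
To make the mismatch concrete, here is a toy model of the two tables involved. It is only a sketch: `ToyHub`, `Slot` and the method names are made up for illustration and do not match the wgpu-core types, and the strict drop returns an `Err` where the real code panics.

```rust
use std::collections::HashMap;

/// Toy stand-in for a storage slot; not the real wgpu-core type.
#[derive(Debug, PartialEq)]
enum Slot {
    Valid,
    Error,
}

/// Hypothetical, simplified hub with one table per resource kind.
#[derive(Default)]
struct ToyHub {
    render_pipelines: HashMap<u64, Slot>,
    pipeline_layouts: HashMap<u64, Slot>,
}

impl ToyHub {
    /// Happy path: both the pipeline id and its implicit layout id are stored.
    fn create_on_valid_device(&mut self, pipeline_id: u64, implicit_layout_id: u64) {
        self.render_pipelines.insert(pipeline_id, Slot::Valid);
        self.pipeline_layouts.insert(implicit_layout_id, Slot::Valid);
    }

    /// Failing path: only the pipeline id is recorded (as an error);
    /// the implicit layout id chosen by the child is never stored.
    fn create_on_invalid_device(&mut self, pipeline_id: u64, _implicit_layout_id: u64) {
        self.render_pipelines.insert(pipeline_id, Slot::Error);
    }

    /// Models a drop that requires the id to be present.
    fn pipeline_layout_drop(&mut self, id: u64) -> Result<Slot, String> {
        self.pipeline_layouts
            .remove(&id)
            .ok_or_else(|| format!("PipelineLayout[{}] does not exist", id))
    }
}
```

In this model, creating pipeline 7 with implicit layout 119 on an invalid device leaves `render_pipelines[7]` as an error entry, while a later `pipeline_layout_drop(119)` fails because 119 was never inserted into `pipeline_layouts`.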

Not sure what would be the best solution here. wgpu won't tolerate the retrieval of a non-existent id (that's the panic that motivates this Bug). That's not going to change in wgpu -- it's part of the design choice. Our child view of the render pipeline assumes all is well and is never notified that its creation failed. It also assumes that the render pipeline will have an implicit pipeline layout with the same id.

Possible fixes:

  1. wgpu could be made to supply an invalid pipeline layout id when failing to create a pipeline. But why should it?
  2. WebGPUParent::RecvDeviceAction could notify the child when something in the SendDeviceAction fails. But that would involve reparsing the byte buffer to see what actions were being attempted and then unwinding them. This would be nasty.

I'm going to think about this for a while and see if I can come up with a more palatable fix, because both of these options are bad.

Alright, I think the fix will need to be in wgpu, but we'll need to build a temporary remediation in Firefox until wgpu is re-vendored with the fix. I've confirmed that moving the check of device.valid in device_create_render_pipeline further down, past the call to device.create_render_pipeline, is sufficient to fix the Bug. That's because device.create_render_pipeline sets the error value for the implicit pipeline layout id, ensuring that when the pipeline is later dropped and the client assumes that layout id exists, it will be found in wgpu.
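
The effect of moving the validity check can be sketched with a toy model. Everything here (`ToyDevice`, `ToyHub`, `inner_create`, `create_check_first`, `create_check_after`) is a hypothetical simplification; in particular, `inner_create` merely assumes that the inner creation step records the implicit layout id as an error on an invalid device, as described above.

```rust
use std::collections::HashMap;

/// Toy stand-in for a storage slot; not the real wgpu-core type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Slot {
    Valid,
    Error,
}

/// Hypothetical device with only the validity flag that matters here.
struct ToyDevice {
    valid: bool,
}

#[derive(Default)]
struct ToyHub {
    render_pipelines: HashMap<u64, Slot>,
    pipeline_layouts: HashMap<u64, Slot>,
}

impl ToyHub {
    /// Assumed inner step: always records the implicit layout id,
    /// as an error entry when the device is invalid.
    fn inner_create(&mut self, device: &ToyDevice, layout_id: u64) -> Slot {
        let slot = if device.valid { Slot::Valid } else { Slot::Error };
        self.pipeline_layouts.insert(layout_id, slot);
        slot
    }

    /// Check first: bails out before the layout id is ever recorded.
    fn create_check_first(&mut self, device: &ToyDevice, pipeline_id: u64, layout_id: u64) {
        if !device.valid {
            self.render_pipelines.insert(pipeline_id, Slot::Error);
            return;
        }
        let slot = self.inner_create(device, layout_id);
        self.render_pipelines.insert(pipeline_id, slot);
    }

    /// Check after: the inner step has already recorded the layout id.
    fn create_check_after(&mut self, device: &ToyDevice, pipeline_id: u64, layout_id: u64) {
        let slot = self.inner_create(device, layout_id);
        let slot = if device.valid { slot } else { Slot::Error };
        self.render_pipelines.insert(pipeline_id, slot);
    }
}
```

With the check-first ordering, an invalid device leaves the layout id absent from `pipeline_layouts`; with the check-after ordering, the same call leaves an error entry there, so a later drop of that id finds something to remove.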

So, a four-stage fix, the first part of which will be done in this Bug:

  1. Stop destroying the implicit pipeline layout id, which will leak memory. Add a comment explaining why, referencing a new Bug that will revert this behavior.
  2. Build a fix in wgpu and get it accepted.
  3. Re-vendor wgpu, tracked in Bug 1851881.
  4. In a to-be-filed Bug, revert the changes in Step 1.

Pushed by bwerth@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b361ef38ca02
Stop destroying implicit pipeline layouts and implicit bind group layouts. r=webgpu-reviewers,ErichDonGubler
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → 121 Branch

Verified bug as fixed on rev mozilla-central 20231103051812-8be76292bf3f.
Removing bugmon keyword as no further action is possible. Please review the bug and re-add the keyword for further analysis.

Status: RESOLVED → VERIFIED
Keywords: bugmon (removed)