Closed Bug 1877461 Opened 2 years ago Closed 1 year ago

Categories

(Core :: Graphics: WebGPU, defect, P2)

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox-esr115 --- unaffected
firefox122 --- unaffected
firefox123 --- disabled
firefox124 --- disabled
firefox125 --- disabled

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Attachments

(3 files)

Attached file about:support
Keywords: regression
Regressed by: 1873164
Flags: needinfo?(nical.bugzilla)
Severity: -- → S3

Set release status flags based on info from the regressing bug 1873164

Flags: needinfo?(nical.bugzilla) → needinfo?(egubler)
Blocks: webgpu-apps
Flags: needinfo?(egubler)
Flags: needinfo?(egubler)
Flags: needinfo?(egubler)

This is an internal issue: wgpu-hal implementations make assumptions about the order of bind group entries and bind group layout entries in the Vecs passed to them, and wgpu-core does not enforce that order. In particular, DX12 and Metal (I've yet to confirm Vulkan) appear to assume that the shader-declared order of bindings matches the order of the API-bound resources provided in a call to GPUDevice.createBindGroup.
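
To illustrate the ordering assumption described above, here is a minimal sketch of aligning caller-supplied entries to the layout's order before a backend consumes them. The types LayoutEntry and GroupEntry and the function align_to_layout are hypothetical stand-ins, not wgpu-core or wgpu-hal APIs, and this is not necessarily how wgpu#5421 addresses the problem.

```rust
/// Hypothetical stand-in for a bind group layout entry (not wgpu's real type).
#[derive(Debug, Clone, PartialEq)]
struct LayoutEntry {
    binding: u32,
}

/// Hypothetical stand-in for an entry passed to createBindGroup.
#[derive(Debug, Clone, PartialEq)]
struct GroupEntry {
    binding: u32,
    resource: &'static str,
}

/// Reorders `entries` so that position `i` holds the resource for
/// `layout[i].binding`. Returns `None` if the counts differ or if some
/// layout binding has no matching entry.
fn align_to_layout(layout: &[LayoutEntry], entries: &[GroupEntry]) -> Option<Vec<GroupEntry>> {
    if layout.len() != entries.len() {
        return None;
    }
    layout
        .iter()
        // Look each layout binding up by number instead of trusting position.
        .map(|l| entries.iter().find(|e| e.binding == l.binding).cloned())
        .collect()
}
```

With entries aligned this way, a backend that walks the layout and the entries in lockstep sees matching bindings even when the caller listed them in a different order.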

Marking this issue as P1. This is an ugly and confusing bug that prevents valid WebGPU programs from working, and demos are obviously already running into it. This will need to be resolved upstream first (see wgpu#5421), and then consumed in a subsequent iteration of webgpu-update-wgpu.


This issue also seems to apply to the GLES backend in wgpu upstream, but that backend doesn't apply to Firefox.

No longer blocks: webgpu-triage
Flags: needinfo?(egubler)
Priority: -- → P1
Assignee: nobody → egubler
Status: NEW → ASSIGNED

wgpu#5421 is awaiting a review from somebody on the WebGPU team.

wgpu#5421 is now merged upstream, and awaiting webgpu-update-wgpu.

Depends on: 1887909
No longer depends on: webgpu-update-wgpu

WGPU has been re-vendored on mozilla-central, and we should now be able to consume the fix.

Pushed by egubler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2314c4e494ba chore(webgpu): enable panicking on OOMs, device loss, and other internal errors r=webgpu-reviewers,nical

After testing on some of our more commodity-tier CI hardware, it's unclear what the spread of this issue is. Gonna demote to P2 for now, but it's entirely possible that this will get bumped to P1 as we discover this issue to be more widespread.

Priority: P1 → P2
Attached file fx-teapot.zip

An attachment that I'll explain momentarily.

At least on my own machine on which I could reproduce this (a top-of-the-line-ish laptop from 7 years ago, running the latest Windows 10 and using an Intel HD Graphics 530 driver with its iGPU), this workload appears to provoke a timeout in the DX12 runtime, which subsequently disconnected the DX12 device from wgpu with error code DXGI_ERROR_DEVICE_HUNG. I was able to capture a trace (see attached fx-teapot.zip) that consistently reproduces on that machine using wgpu's player binary, but not on my current daily driver for Windows.

We may or may not be able to work around this; we are fundamentally limited by what the WebGPU back ends give us. We obviously aren't handling this very well, though.

For the next assignee (likely myself): it's clear from my own reproductions of this issue that there are at least two problems at play:

  1. wgpu's DX12 backend (and likely others) may not be cutting off access to the DX12 backend fast enough. It appears that a subsequent texture allocation after receiving the failure succeeds in terms of the HRESULT (error code) returned by the texture initialization, but in fact the API (ID3D12Device::CreatePlacedResource) returns a null pointer.

    We need to ensure that we are invalidating all device-related WGPU resources once we detect that the underlying DX12 device has disconnected. We may already do this, but we should take this opportunity to double-check.

  2. Later, this null pointer causes an access violation exception when wgpu tries to call IUnknown::AddRef on it.

    This isn't surprising, given problem (1), but it does indicate that there are cases where Windows APIs may not report failure yet still return a null pointer for COM resources. We need to handle this case by explicitly checking that the returned COM pointer of each resource we're attempting to initialize is not null before we accept it into tracked WGPU resources. I have some previous WIP work against wgpu to make null checks earlier and more stringent, but nothing that I've filed upstream yet.
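
The two checks described above can be sketched roughly as follows. This is an assumption-laden illustration, not wgpu's code: Device, CreateError, and accept_resource are hypothetical names, the raw pointer stands in for a COM interface pointer, and only the HRESULT failure convention (negative values) and the DXGI_ERROR_DEVICE_HUNG value are taken from the Windows headers.

```rust
use std::ptr::NonNull;

/// DXGI_ERROR_DEVICE_HUNG as defined in the Windows headers (0x887A0006).
const DXGI_ERROR_DEVICE_HUNG: i32 = 0x887A_0006_u32 as i32;

/// Hypothetical error type; wgpu's real error enums differ.
#[derive(Debug, PartialEq)]
enum CreateError {
    /// The call returned a failing HRESULT.
    Failed(i32),
    /// The call reported success but handed back a null pointer.
    NullOnSuccess,
    /// The device was already marked lost, or this call revealed the loss.
    DeviceLost,
}

/// Hypothetical device wrapper that remembers whether the device was lost.
struct Device {
    lost: bool,
}

impl Device {
    /// Validates the result of a resource-creation call before the
    /// resource is accepted into tracking.
    fn accept_resource<T>(&mut self, hr: i32, ptr: *mut T) -> Result<NonNull<T>, CreateError> {
        // Problem (1): once loss is observed, refuse all further resources.
        if self.lost {
            return Err(CreateError::DeviceLost);
        }
        if hr == DXGI_ERROR_DEVICE_HUNG {
            self.lost = true;
            return Err(CreateError::DeviceLost);
        }
        // HRESULTs signal failure with a negative value.
        if hr < 0 {
            return Err(CreateError::Failed(hr));
        }
        // Problem (2): a successful HRESULT does not guarantee a non-null pointer.
        NonNull::new(ptr).ok_or(CreateError::NullOnSuccess)
    }
}
```

Returning NonNull<T> rather than a raw pointer means a later AddRef-style call cannot be reached with a null pointer, which is the failure mode described in problem (2).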

Unassigning from myself, since there are yet higher priorities for the WebGPU team, ATM.

Assignee: egubler → nobody
Status: ASSIGNED → NEW

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME