Closed Bug 1877461 Opened 2 years ago Closed 1 year ago

index.html)


Status:

RESOLVED WORKSFORME

Tracking Flags:

Flag             Tracking   Status
firefox-esr115   ---        unaffected
firefox122       ---        unaffected
firefox123       ---        disabled
firefox124       ---        disabled
firefox125       ---        disabled

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug, Regression, URL)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(3 files)

about:support 2 years ago Mayank Bansal 44.54 KB, text/plain		Details
Bug 1877461 - chore(webgpu): enable panicking on OOMs, device loss, and other internal errors r=#webgpu-reviewers! 1 year ago Erich Gubler [:ErichDonGubler] (he/him) 48 bytes, text/x-phabricator-request		Details | Review
fx-teapot.zip 1 year ago Erich Gubler [:ErichDonGubler] (he/him) 539.77 KB, application/x-zip-compressed		Details

Mayank Bansal

Reporter

Description

•

2 years ago

Go to https://cx20.github.io/webgpu-test/examples/webgpu_glsl/teapot/index.html

AR: Crash
ER: Not so

Mayank Bansal

Reporter

Comment 1

•

2 years ago

Attached file about:support — Details

Mayank Bansal

Reporter

Updated

•

2 years ago

Keywords: regression

Regressed by: 1873164

Mayank Bansal

Reporter

Updated

•

2 years ago

Flags: needinfo?(nical.bugzilla)

Dianna Smith [:diannaS]

Updated

•

2 years ago

status-firefox122: --- → unaffected

status-firefox123: --- → disabled

status-firefox124: --- → affected

status-firefox-esr115: --- → unaffected

Lee Salzman [:lsalzman]

Updated

•

2 years ago

Severity: -- → S3

Bob Hood [:bhood]

Updated

•

2 years ago

status-firefox124: affected → disabled

BugBot [:suhaib / :marco/ :calixte]

Comment 2

•

1 year ago

Set release status flags based on info from the regressing bug 1873164

status-firefox125: --- → affected

Nicolas Silva [:nical]

Updated

•

1 year ago

Flags: needinfo?(nical.bugzilla) → needinfo?(egubler)

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Blocks: webgpu-apps

Flags: needinfo?(egubler)

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Flags: needinfo?(egubler)

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

status-firefox125: affected → disabled

Flags: needinfo?(egubler)

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Flags: needinfo?(egubler)

Jim Blandy :jimb

Updated

•

1 year ago

Blocks: webgpu-triage

Erich Gubler [:ErichDonGubler] (he/him)

Comment 3

•

1 year ago

•

Edited

This is an internal issue where wgpu-hal implementations make assumptions about the order of bind group and bind group layout entries in the Vecs passed to them, and wgpu-core does not ensure that those assumptions hold. In particular, DX12 and Metal (I've yet to confirm Vulkan) appear to assume that the shader-declared order of bindings will match the order of the API-bound resources provided to a call to GPUDevice.createBindGroup.

Marking this issue as P1. This is an ugly and confusing bug that prevents valid WebGPU programs from working, and it's obviously already being run into in demos. This will need to be resolved upstream first (see wgpu#5421), and then consumed in a subsequent iteration of webgpu-update-wgpu.

This issue also seems to apply to the GLES backend in WGPU upstream, but that doesn't apply to Firefox.
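
To illustrate the ordering assumption described in this comment, here is a minimal Rust sketch. It does not use wgpu's real types: LayoutEntry, GroupEntry, and order_by_layout are hypothetical names invented for this example, and this is not necessarily how wgpu#5421 resolves the problem. It only shows one way a layer above the backend could put caller-supplied entries into layout order before a backend that indexes them positionally sees them.

```rust
/// Hypothetical stand-in for one entry of a bind group layout.
#[derive(Debug, Clone, PartialEq)]
struct LayoutEntry {
    binding: u32,
}

/// Hypothetical stand-in for one entry passed to `createBindGroup`.
#[derive(Debug, Clone, PartialEq)]
struct GroupEntry {
    binding: u32,
    resource: &'static str,
}

/// Reorders `entries` so that position `i` holds the entry whose binding
/// number equals `layout[i].binding`. Returns `None` if the lengths differ
/// or if some layout binding has no matching entry.
fn order_by_layout(layout: &[LayoutEntry], entries: &[GroupEntry]) -> Option<Vec<GroupEntry>> {
    if layout.len() != entries.len() {
        return None;
    }
    layout
        .iter()
        .map(|l| entries.iter().find(|e| e.binding == l.binding).cloned())
        .collect()
}

fn main() {
    let layout = vec![LayoutEntry { binding: 0 }, LayoutEntry { binding: 1 }];
    // The caller lists binding 1 first; entries are matched by binding
    // number, so a positional backend would pick the wrong resource.
    let entries = vec![
        GroupEntry { binding: 1, resource: "sampler" },
        GroupEntry { binding: 0, resource: "uniforms" },
    ];
    let ordered = order_by_layout(&layout, &entries).expect("entries match layout");
    let names: Vec<_> = ordered.iter().map(|e| e.resource).collect();
    println!("{names:?}"); // prints ["uniforms", "sampler"]
}
```

A backend that reads the result by position then sees the resource for binding 0 first, regardless of the order in which the caller listed the entries.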

No longer blocks: webgpu-triage

Flags: needinfo?(egubler)

Priority: -- → P1

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Assignee: nobody → egubler

Status: NEW → ASSIGNED

Erich Gubler [:ErichDonGubler] (he/him)

Comment 4

•

1 year ago

wgpu#5421 is awaiting a review from somebody on the WebGPU team.

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Depends on: webgpu-update-wgpu

Erich Gubler [:ErichDonGubler] (he/him)

Comment 5

•

1 year ago

wgpu#5421 is now merged upstream, and awaiting webgpu-update-wgpu.

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Depends on: 1887909
No longer depends on: webgpu-update-wgpu

Erich Gubler [:ErichDonGubler] (he/him)

Comment 6

•

1 year ago

WGPU has been re-vendored on mozilla-central, and we should now be able to consume the fix.

Mayank Bansal

Reporter

Comment 7

•

1 year ago

Crash signature has changed a bit:
https://crash-stats.mozilla.org/report/index/1fd349bd-ceb1-4b8e-b076-799e20240405#tab-bugzilla
https://crash-stats.mozilla.org/report/index/557802fa-8129-4be9-a794-7b8ee0240405#tab-bugzilla

Erich Gubler [:ErichDonGubler] (he/him)

Comment 8

•

1 year ago

Attached file Bug 1877461 - chore(webgpu): enable panicking on OOMs, device loss, and other internal errors r=#webgpu-reviewers! — Details

Pulsebot

Comment 9

•

1 year ago

Pushed by egubler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2314c4e494ba chore(webgpu): enable panicking on OOMs, device loss, and other internal errors r=webgpu-reviewers,nical

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Keywords: leave-open

Erich Gubler [:ErichDonGubler] (he/him)

Comment 10

•

1 year ago

After testing on some of our more commodity-tier CI hardware, it's unclear what the spread of this issue is. Gonna demote to P2 for now, but it's entirely possible that this will get bumped to P1 as we discover this issue to be more widespread.

Priority: P1 → P2

Erich Gubler [:ErichDonGubler] (he/him)

Comment 11

•

1 year ago

Attached file fx-teapot.zip — Details

An attachment that I'll explain momentarily.

Erich Gubler [:ErichDonGubler] (he/him)

Comment 12

•

1 year ago

At least in the case of my own machine on which I could reproduce this (a top-of-the-line-ish laptop 7 years ago with the latest Windows 10 currently on it, using an Intel HD Graphics 530 driver with its iGPU), this workload appears to provoke a timeout in the DX12 runtime, which subsequently disconnected the DX12 device from wgpu with error code DXGI_ERROR_DEVICE_HUNG. I was able to get a trace from the machine that consistently reproduces using wgpu's player binary, but not on my current daily driver for Windows (see attached fx-teapot.zip). We may or may not be able to work around this; we fundamentally have limitations in what WebGPU back ends give us. We obviously aren't handling this very well, though.

For the next assignee (likely myself): It's clear that there are at least two problems at play from my own reproductions of this issue:

1. wgpu's DX12 backend (and likely others) may not be cutting off access to the DX12 backend fast enough. It appears that a subsequent allocation of a texture after receiving the failure succeeds in terms of the HRESULT (error code) returned by the texture initialization, but in fact, the API (ID3D12Device::CreatePlacedResource) returns a null pointer.

   - We need to ensure that we are invalidating all device-related WGPU resources once we detect that the underlying DX12 device has disconnected. We may already do this, but we should take this opportunity to double-check.

2. Later, this null pointer causes an access violation exception when wgpu tries to call IUnknown::AddRef on it.

   - This isn't surprising, given problem (1), but it is indicative of the fact that there are cases where Windows APIs may not indicate failure but still return a null pointer for COM resources. We need to handle this case by explicitly checking that the returned COM pointer of resources we're attempting to initialize is not null before we accept them into tracked WGPU resources. I have some previous WIP work against wgpu to make null checks earlier and more stringent, but nothing that I've filed upstream yet.
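
The two problems described in this comment can be sketched in plain Rust as follows. This is a simplified illustration under stated assumptions, not wgpu's actual code: Device, CreateError, and create_resource are hypothetical names, the closure stands in for a call such as ID3D12Device::CreatePlacedResource, and treating a null pointer on success as device loss is an assumption made for this example. The only value taken from the Windows SDK is the numeric code for DXGI_ERROR_DEVICE_HUNG.

```rust
use std::ffi::c_void;
use std::ptr::NonNull;
use std::sync::atomic::{AtomicBool, Ordering};

/// Windows-style result code; negative values indicate failure.
type Hresult = i32;
const S_OK: Hresult = 0;
/// Numeric value of DXGI_ERROR_DEVICE_HUNG in the Windows SDK headers.
const DXGI_ERROR_DEVICE_HUNG: Hresult = 0x887A0006_u32 as i32;

#[derive(Debug, PartialEq)]
enum CreateError {
    /// The call reported failure through its HRESULT.
    Failed(Hresult),
    /// The call reported success but handed back a null pointer.
    NullOnSuccess,
    /// The device was already marked lost, so the call was not attempted.
    DeviceLost,
}

/// Hypothetical device wrapper that remembers whether the device was lost.
struct Device {
    lost: AtomicBool,
}

impl Device {
    fn new() -> Self {
        Device { lost: AtomicBool::new(false) }
    }

    fn is_lost(&self) -> bool {
        self.lost.load(Ordering::SeqCst)
    }

    /// Runs `create` (a stand-in for a resource-creation call) and accepts
    /// its pointer only if the HRESULT is a success AND the pointer is non-null.
    fn create_resource<F>(&self, create: F) -> Result<NonNull<c_void>, CreateError>
    where
        F: FnOnce() -> (Hresult, *mut c_void),
    {
        // Problem 1: stop using the device as soon as it is known to be lost.
        if self.is_lost() {
            return Err(CreateError::DeviceLost);
        }
        let (hr, ptr) = create();
        if hr < 0 {
            if hr == DXGI_ERROR_DEVICE_HUNG {
                self.lost.store(true, Ordering::SeqCst);
            }
            return Err(CreateError::Failed(hr));
        }
        // Problem 2: a success code does not guarantee a usable pointer.
        match NonNull::new(ptr) {
            Some(valid) => Ok(valid),
            None => {
                // Assumption for this sketch: treat this as device loss.
                self.lost.store(true, Ordering::SeqCst);
                Err(CreateError::NullOnSuccess)
            }
        }
    }
}

fn main() {
    let device = Device::new();
    let mut backing = 0u8;
    let ptr = &mut backing as *mut u8 as *mut c_void;

    let good = device.create_resource(|| (S_OK, ptr));
    println!("valid pointer accepted: {}", good.is_ok());

    let bad = device.create_resource(|| (S_OK, std::ptr::null_mut()));
    println!("null on success: {bad:?}");

    let after = device.create_resource(|| (S_OK, ptr));
    println!("after loss: {after:?}");
}
```

Returning NonNull rather than a raw pointer means the rest of the code cannot receive a null resource pointer by construction, so a later AddRef-style call never sees one.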

Unassigning from myself, since there are yet higher priorities for the WebGPU team, ATM.

Assignee: egubler → nobody

Status: ASSIGNED → NEW

Cosmin Sabou [:CosminS]

Comment 13

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/2314c4e494ba

Erich Gubler [:ErichDonGubler] (he/him)

Updated

•

1 year ago

Keywords: leave-open

BugBot [:suhaib / :marco/ :calixte]

Comment 14

•

1 year ago

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED

Closed: 1 year ago

Resolution: --- → WORKSFORME
