Open Bug 1921193 Opened 1 year ago Updated 1 year ago

storageBarrier() in wgsl shader causes later firefox crash under specific conditions

Categories

(Core :: Graphics: WebGPU, defect, P2)

Firefox 132
Desktop
Linux
defect

Tracking

()

UNCONFIRMED

People

(Reporter: huscar, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0

Steps to reproduce:

I'm creating a WebGPU application. It crashes consistently on my laptop (Intel Core i7-7700HQ × 8, no GPU) but not on my desktop (Intel Core i5-6600K × 4, GeForce GTX 1660 SUPER). Both run Firefox Nightly 132.0a1 (2024-09-25), and the behavior has been consistent across a few auto-updates.

I've not been able to create a minimal application that reproduces the bug yet. The full application, at the crashing commit, is at https://gitlab.com/oscar-leijendekker/weird-engine/-/tree/663690d775228fc543387b78fafd80be6844ab5d (no build required; run a server at the project root and navigate to /js).

What I've managed to discover:
The application has a compute pass and a render pass, submitted to the device queue separately. The crash occurs at the render pass, when invoking device.queue.submit([commandBuffer]);. The cause of the crash, however, appears to be in the compute pass: calling pass.dispatchWorkgroups(n, 1, n); with n > 32 causes the crash, while lower values do not. The shader in question (/js/Rendering/wgsl/heightmap/compute-erosion.js) contains a storageBarrier() call; removing it prevents the crash. Furthermore, setting a certain variable in the private scope to vec2<f32>(0.0, 0.0) also prevents the crash:

@compute
@workgroup_size(8,1,8)
fn main(
  @builtin(global_invocation_id) id : vec3<u32>
) {
  raindrop.p = hash22(vec2<f32>(id.xz));
  raindrop.v = vec2<f32>(0.0,0.0);
  raindrop.s = 0.0;
  raindrop.w = 1.0;

  var halted = false;
  for (var ii : u32 = 0; ii < 40 && !halted; ii = ii + 1) {
    halted = !advance();
    storageBarrier(); // <-- offending line
  }
}

fn advance() -> bool {
  let stepSize = 1.0 / f32(sqrt_point_count);
  let p0 = raindrop.p;
  let dh0 = slope(p0);
  raindrop.v = raindrop.v * 0.8 + dh0*0.01;
  let d = normalize(raindrop.v);
  let v_ = length(raindrop.v);
  let p1 = p0 + d * stepSize;
  
  if (p1.x < 0.0 || p1.y < 0.0 || p1.x > 1.0 || p1.y > 1.0 ) {
    nearest_deposit(p0, raindrop.s);
    return false;
  }

  let h0 = get_height(p0);
  let h1 = get_height(p1);

  if (h1 >= h0) {

    let ds = (h1-h0) * cell_volume;

    if (ds >= raindrop.s) {
      nearest_deposit(p0, raindrop.s);
      return false;
    } else {
      nearest_deposit(p0, ds);
      raindrop.s -= ds;
      raindrop.v = vec2<f32>(0.0, 0.0);
      return true;
    }
  }

  let slope0 = length(dh0);
  raindrop.w = max(raindrop.w - Kw, 0.0);
  if (raindrop.w == 0) {
    return false;
  }
  let q = max(slope0, min_slope)*v_*raindrop.w*Kq;

  var ds : f32 = raindrop.s-q;

  if (ds >= 0.0) {
    ds *= Kd;
    nearest_deposit(p0, ds);
  } else {
    let ds_max = (h0-h1)*cell_volume;
    ds = min(ds*Kr, ds_max);
    nearest_erode(p0, -ds);
  }
  raindrop.s -= ds;
  // raindrop.v = vec2<f32>(0.0, 0.0); <-- adding this prevents the crash
  raindrop.p = p1;
  
  return true;
}
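For a sense of scale, the dispatch sizes mentioned above can be worked out from the shader's @workgroup_size(8,1,8). The following is only a sketch: dispatchStats is a hypothetical helper that is not part of the application, and nothing here explains why the threshold sits at n = 32. It just shows how many invocations run at and just above that value.

```javascript
// Workgroup size taken from the shader's @workgroup_size(8,1,8) attribute.
const WORKGROUP_SIZE = { x: 8, y: 1, z: 8 };

// Hypothetical helper (an assumption for illustration, not application code):
// totals for a call to pass.dispatchWorkgroups(n, 1, n).
function dispatchStats(n) {
  const workgroups = n * 1 * n;
  const invocationsPerGroup =
    WORKGROUP_SIZE.x * WORKGROUP_SIZE.y * WORKGROUP_SIZE.z;
  return {
    workgroups,
    invocations: workgroups * invocationsPerGroup,
    // Extent of global_invocation_id along x and z.
    gridX: n * WORKGROUP_SIZE.x,
    gridZ: n * WORKGROUP_SIZE.z,
  };
}

const atThreshold = dispatchStats(32);    // largest n reported not to crash
const aboveThreshold = dispatchStats(33); // smallest n reported to crash
console.log(atThreshold, aboveThreshold);
```

With n = 32 this is 1024 workgroups and 65536 invocations over a 256 by 256 grid; with n = 33 it is 1089 workgroups and 69696 invocations over a 264 by 264 grid. Each invocation runs up to 40 iterations of the loop in main.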

The crash report content contains the line "MozCrashReason: Device lost".

The offending storageBarrier() no longer serves a purpose in this shader. I'm reporting the issue anyway because I figured Firefox should not crash like that in any case.

Actual results:

Crash

Expected results:

Either context loss or continued execution.

OS: Unspecified → Linux
Hardware: Unspecified → Desktop

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core
Component: Widget: Gtk → Graphics: WebGPU

Hi, I tried your application on my Windows machine and it did not crash, so this may be a Linux-specific issue.

  1. In your browser, type "about:support" and copy-paste its contents here.
  2. Reproduce the crash in your browser. Then type "about:crashes". You will see some links. Take the latest link and click on it. It will take you to https://crash-stats.mozilla.org/. Copy-paste the URL of the page. (For some extra info, please see the help page at https://support.mozilla.org/en-US/kb/mozillacrashreporter)

Thanks!

Flags: needinfo?(huscar)

@egubler/@jimb :
I see the following warnings in the console when I run the application on my Win11 x64 + AMD iGPU machine:

15:57:18.566 Uncaptured WebGPU error: Internal error: FXC D3DCompile error (Unspecified error (0x80004005)): C:\Program Files\Firefox Nightly\Shader@0x000001794EA63000(133,5): warning X4000: use of potentially uninitialized variable (advance)
C:\Program Files\Firefox Nightly\Shader@0x000001794EA63000(236,13-46): error X3663: thread sync operation found in varying flow control, consider reformulating your algorithm so all threads will hit the sync simultaneously
C:\Program Files\Firefox Nightly\Shader@0x000001794EA63000(133,5): error X3663: pot
15:57:18.851 Uncaptured WebGPU error: In a set_pipeline command, caused by: ComputePipeline with 'Compute Pipeline - hydraulic erosion' label is invalid
15:57:18.851 Uncaptured WebGPU error: Command encoder is locked by a previously created render/compute pass. Before recording any new commands, the pass must be ended.
15:57:18.851 Uncaptured WebGPU error: Command encoder is invalid
15:57:18.857 Uncaptured WebGPU error: Texture with '' label has been destroyed

Severity: -- → S3

WebGPU's validation rules should eliminate workloads that are not safe to submit. Therefore, I strongly suspect that there is some validation of one or more of the WebGPU objects being created that internals are missing. My current mental model of this problem is that Firefox's implementation is permitting something that should fail validation, and then:

  • DX12 rejects the bad shader it's being told to run, and the internal error causes WebGPU internals to mark objects as invalid (which sidesteps the issue).
  • Vulkan on Linux does not return an internal error, and therefore gives no signal that we are doing Bad Things™ that eventually cause a crash.

@huscar: Do you have any crash reports you can link to?

Chrome Beta rejects the render pipeline with the following validation errors reported by WebGPU in the console:

The shader uses more bytes of the buffer (80) than the layout's minBindingSize (64).
 - While validating that the entry-point's declaration for @group(0) @binding(0) matches [BindGroupLayout "HeightmapPipeline::cameraBindGroupLayout"]
 - While validating the entry-point's compatibility for group 0 with [BindGroupLayout "HeightmapPipeline::cameraBindGroupLayout"]
 - While validating vertex stage ([ShaderModule (unlabeled)], entryPoint: main_vertex).
 - While validating vertex state.
 - While calling [Device].CreateRenderPipeline([RenderPipelineDescriptor]).

This gives me confidence in my hypothesis in comment 4: Firefox should also be rejecting the creation of this pipeline, but it currently is not. Tentatively marking as P2, where we're currently putting missing core validation for WebGPU.

Priority: -- → P2
Attached file about:support content
Flags: needinfo?(huscar)

To clear up some confusion:

Both laptop (crashes) and desktop (does not crash) are Linux machines. The desktop runs the program correctly without errors.

The shader provided is incomplete; it's a snippet showing the lines that, as far as I know, cause or prevent the crash.

On the matter of DX12 rejecting the pipeline due to branches: the crash persists when (a) the halted condition is removed (i.e. loop can be unrolled) and even (b) when the loop is commented out altogether, provided the workgroup count is sufficiently increased.

Chrome correctly rejects the binding for the camera. The shader code is:

export const shadersrc = (/** @type {number} */ group, /** @type {number} */ binding) => /* wgsl */ `
struct CameraUniforms {
  MVP : mat4x4<f32>,
  position : vec3<f32>
}

@group(${group}) @binding(${binding}) var<uniform> camera : CameraUniforms;

So 1 mat4x4<f32> and 1 vec3<f32>. With alignment taken into consideration that should be 20 floats. I added the position later and forgot to edit the minBindingSize. This seems to be unrelated as the crash still occurs when increasing minBindingSize to 20. The actual buffer does have the right size.
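The 20-float figure can be checked with a small layout calculation. This is a sketch only: structLayout and the WGSL_TYPES table are hypothetical names introduced here, and the table covers just the two member types in CameraUniforms, using the alignment and size values from the WGSL specification (mat4x4<f32>: align 16, size 64; vec3<f32>: align 16, size 12). It ignores the extra uniform address space rules for nested structs and arrays, which do not apply to this struct.

```javascript
// Alignment and size in bytes for the two member types of CameraUniforms.
// Values follow the WGSL alignment and size table; other types are omitted.
const WGSL_TYPES = {
  'mat4x4<f32>': { align: 16, size: 64 },
  'vec3<f32>': { align: 16, size: 12 },
};

// Round n up to the next multiple of k.
const roundUp = (k, n) => Math.ceil(n / k) * k;

// Hypothetical helper: computes member offsets and the total struct size.
function structLayout(memberTypes) {
  let offset = 0;
  let structAlign = 1;
  const offsets = [];
  for (const type of memberTypes) {
    const { align, size } = WGSL_TYPES[type];
    offset = roundUp(align, offset); // place member at its required alignment
    offsets.push(offset);
    offset += size;
    structAlign = Math.max(structAlign, align);
  }
  // The struct size is rounded up to the struct's own alignment.
  return { offsets, size: roundUp(structAlign, offset) };
}

const cameraLayout = structLayout(['mat4x4<f32>', 'vec3<f32>']);
console.log(cameraLayout.offsets, cameraLayout.size, cameraLayout.size / 4);
```

MVP lands at offset 0 and position at offset 64; the 76 bytes used are rounded up to 80 bytes, i.e. 20 floats, which matches the 80 bytes Chrome reports against the layout's minBindingSize of 64.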

I suspect what's going on here is that you have a synchronization barrier in non-uniform control flow. The WGSL language specification describes validation that should reject such programs, but Naga doesn't implement that algorithm yet. We'd thought Naga was stricter, but perhaps you have found a case where Naga is too liberal.

Could you try passing your shader through the Tint WGSL compiler to see if it complains?
