Closed Bug 1631331 Opened 3 years ago Closed 2 years ago

Crash in [@ core::result::unwrap_failed | webrender::renderer::Renderer::render_impl]

Categories

(Core :: Graphics: WebRender, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla78
Tracking Status
firefox-esr68 --- unaffected
firefox75 --- unaffected
firefox76 --- unaffected
firefox77 --- fixed
firefox78 --- fixed

People

(Reporter: gsvelto, Assigned: kvark)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-88ab7b43-bba5-4f43-859d-63f4d0200419.

Top 10 frames of crashing thread:

0 xul.dll RustMozCrash mozglue/static/rust/wrappers.cpp:16
1 xul.dll mozglue_static::panic_hook mozglue/static/rust/lib.rs:89
2 xul.dll core::ops::function::Fn::call<fn ../f3e1a954d2ead4e2fc197c7da7d71e6c61bad196/src/libcore/ops/function.rs:232
3 xul.dll std::panicking::rust_panic_with_hook ../f3e1a954d2ead4e2fc197c7da7d71e6c61bad196//src/libstd/panicking.rs:475
4 xul.dll std::panicking::begin_panic_handler ../f3e1a954d2ead4e2fc197c7da7d71e6c61bad196//src/libstd/panicking.rs:375
5 xul.dll core::panicking::panic_fmt ../f3e1a954d2ead4e2fc197c7da7d71e6c61bad196//src/libcore/panicking.rs:84
6 xul.dll core::result::unwrap_failed ../f3e1a954d2ead4e2fc197c7da7d71e6c61bad196//src/libcore/result.rs:1188
7 xul.dll webrender::renderer::Renderer::render_impl gfx/wr/webrender/src/renderer.rs:3407
8 xul.dll webrender::renderer::Renderer::render gfx/wr/webrender/src/renderer.rs:3185
9 xul.dll webrender_bindings::bindings::wr_renderer_render gfx/webrender_bindings/src/bindings.rs:599

The oldest buildid with this crash is 20200414094457. The raw crash reason is:

called `Result::unwrap()` on an `Err` value: Shader(Compilation("gpu_cache_update", ""))

From what I can tell we should be failing around here.

Hey Glenn,

Could this have something to do with Bug 1629693?

Flags: needinfo?(gwatson)
Priority: -- → P1

It does seem like that is the bug that regressed it.

But it's hard to imagine why that would fail to compile, it's a really simple shader.

Dzmitry, Sotaro, Jeff, do you have any ideas what may cause that shader to fail compilation on the machine(s) in this crash report? I didn't see any shader compiler errors in the log, but I might have missed them.

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(gwatson)
Flags: needinfo?(dmalyshau)

I see that "System Memory Use Percentage" is at 98% - could this somehow be an OOM on startup!?

It's the only shader that assigns to gl_PointSize and is non-polygonal from Angle's point of view. It should work, of course. I'm not seeing anything inherently suggesting that it wouldn't. The good thing is - if it's Angle failing to compile the shader, we should be able to reproduce it. Just need to find a similar "UHD Graphics 620 (Whiskey Lake) (0x3ea0)" machine with Windows 10 and check what HLSL source gets generated, if any.

Side note - why are we even compiling it at render time? We should add it to the non-lazy list of shaders on the platforms that need it.

Flags: needinfo?(dmalyshau)

Sotaro was unable to reproduce this on the same GPU. Looking closer at the crash report, it seems to be low on both physical memory and also available page file memory. So I suspect the system in this case was heavily loaded and we're seeing an OOM crash in D3D / ANGLE.

Let's see if this continues to have reported crashes in the next few days?

Looking at all the crashes ~60% of them have very low available page file numbers which indicate they must be very close to an OOM condition. The rest however have plenty of memory available. Could they be running out of another resource? One other interesting tidbit is that almost all the crashes have the following error in the Graphics critical error annotation:

Failed to load a program object with a program binary: gpu_cache_update renderer ANGLE (...)

IIUC the shader had already been compiled and was available in binary form but it failed to load, so the code tried to recompile it and failed, does that sound correct?

Blocks: wr-77
Flags: needinfo?(jmuizelaar)

Maybe checking for a glError after calling glProgramBinary would be a good idea. It should also be possible to get the debug messages out of ANGLE in this case.

(In reply to Gabriele Svelto [:gsvelto] from comment #7)

Looking at all the crashes ~60% of them have very low available page file numbers which indicate they must be very close to an OOM condition. The rest however have plenty of memory available. Could they be running out of another resource? One other interesting tidbit is that almost all the crashes have the following error in the Graphics critical error annotation:

Failed to load a program object with a program binary: gpu_cache_update renderer ANGLE (...)

IIUC the shader had already been compiled and was available in binary form but it failed to load, so the code tried to recompile it and failed, does that sound correct?

Yes, from the error log, it seemed to fail to load program binary at first.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #8)

Maybe checking for a glError after calling glProgramBinary would be a good idea. It should also be possible to get the debug messages out of ANGLE in this case.

If seems possible to forward ANGLE error to gfx critical note. WebRender already uses GL_KHR_debug for logging.

And rust log Error message is already forwarded to gfx critical note.

Depends on: 1632096

(In reply to Sotaro Ikeda [:sotaro] from comment #10)

If seems possible to forward ANGLE error to gfx critical note. WebRender already uses GL_KHR_debug for logging.

Created Bug 1632096 for it.

Since Bug 1632096 fix, crash reports had following ANGLE errors. It seems that ANGLE failed to allocate GPU resources.

  • Internal D3D11 error: HRESULT: 0x8007000E: Error allocating Texture2D
  • Internal D3D11 error: HRESULT: 0x887A0005: Error allocating VertexShader

I just ran into this locally on my gen6 machine.

Flags: needinfo?(jmuizelaar)
Severity: -- → critical
See Also: → 1538719

It looks like this is still an issue we should worry about. Glenn, should we disable 1629693 on Beta?

Flags: needinfo?(gwatson)

There were 5-10 crashes per day, but the last 5 Nightlies do not seem to be affected anymore:
Last crashing build: 20200504205324
First seemingly good: 20200505094621
Do you see a commit that could have fixed this crash?
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=34d71b4a00863e7615463592662dfe362c4a993e&tochange=2027edd3c62a65a57708984407f9f2aa0174f78e

Keywords: regression
Regressed by: 1629693
Has Regression Range: --- → yes

Oh, it seems it only changed its signature.
bp-af37f4f7-0c5d-4818-bc26-a86f10200506 [@ core::option::expect_none_failed ]

called Result::unwrap() on an Err value: Shader(Compilation("gpu_cache_update", ""))

Crash Signature: [@ core::result::unwrap_failed | webrender::renderer::Renderer::render_impl] → [@ core::result::unwrap_failed | webrender::renderer::Renderer::render_impl] [@ core::option::expect_none_failed ]

So it looks like this is only happening after we have a device lost.

Flags: needinfo?(jmuizelaar)

It's probably a good idea to disable on beta, I think.

It's a significant performance win on gen7.5 and below. It's probably / possibly a measurable performance win on all Windows/ANGLE devices, but would be less noticeable on higher end devices.

We really need to find a way to repro this, since we definitely want to enable this path everywhere, if possible.

Jeff, you said you saw it on your gen6 machine - was it a one off? Did you manage to repro again at all?

Do we know how many unique users have seen that crash? Initially we thought it might be an OOM, but maybe this hits a weird driver case sometime causing an OOM?

Flags: needinfo?(jmuizelaar)
Flags: needinfo?(jbonisteel)
Flags: needinfo?(gwatson)
Depends on: 1636307
Flags: needinfo?(jmuizelaar)

Ok, so I looked at this some more and it looks like it's just us being sloppy with our error handling. i.e.:
https://searchfox.org/mozilla-central/source/gfx/wr/webrender/src/renderer.rs#1480 can return an error if the shader fails to compile because the device has been lost.
We then panic on that error here: https://searchfox.org/mozilla-central/source/gfx/wr/webrender/src/renderer.rs#3717. Instead we should handle it like we do for other shader compilation.

Dzmitry, can you take a look at fixing this?

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jbonisteel)
Flags: needinfo?(dmalyshau)
Assignee: nobody → dmalyshau
Status: NEW → ASSIGNED
Flags: needinfo?(dmalyshau)
Pushed by dmalyshau@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/669ddccd1c34
Handle a case where WebRender can't initialize GPU cache r=jrmuizel
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla78

I a not seeing any crashes in 77 beta so marking 77 as wontfix.

The regressing bug landed in 77 so I think this should be still be uplifted.

Flags: needinfo?(jmuizelaar)

Comment on attachment 9146892 [details]
Handle a case where WebRender can't initialize GPU cache

Beta/Release Uplift Approval Request

  • User impact if declined: GPU crash in some cases where we are out of memory or failing to load a specific shader in WebRender
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is just turning a panic into something that reports an error. No functional changes otherwise.
  • String changes made/needed:
Attachment #9146892 - Flags: approval-mozilla-beta?

Comment on attachment 9146892 [details]
Handle a case where WebRender can't initialize GPU cache

Low risk patch for a crash, was on nightly for 2 weeks, approved for uplift on beta before we build our RC build tomorrow.

Attachment #9146892 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(jmuizelaar)
You need to log in before you can comment on or make changes to this bug.