Closed Bug 1632005 Opened 1 year ago Closed 1 year ago

Visiting Quora with Webrender takes down entire system (Intel Linux)

Categories

(Core :: Graphics: WebRender, defect, P2)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox80 --- fixed

People

(Reporter: TD-Linux, Assigned: gw)

Details

Attachments

(2 files)

When going to the URL, the Firefox UI locks up, followed by the entire desktop locking up. I also see the following in the console:

[GFX1-]: Failed to map PBO of size 138432 bytes
[2020-04-22T01:22:03Z ERROR webrender::device::gl] Failed to map PBO of size 138432 bytes

implying this is some sort of resource exhaustion.

OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 20.0.4

OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Version: unspecified → Trunk

Not able to reproduce on Linux Mint 19.3 with a Radeon Pro WX 7100.
This might be Intel-specific.

@gw: Any idea what would be going on here?

Flags: needinfo?(gwatson)

I'm also unable to reproduce here on Kubuntu 19.10 with the same GPU (HD4600) but an older Mesa (19.2.8), so I'm not sure what's going on. Possibly an issue with newer Mesa?

Could you run Gecko with the environment variable MESA_DEBUG=1 set? In theory, that should print out a descriptive error message when that occurs, which might help in diagnosing what's happening here.

Flags: needinfo?(gwatson)

It is only mildly informative:

MESA_DEBUG=1 ~/firefox/firefox
Mesa: User error: GL_CONTEXT_LOST in context lost
[GFX1-]: Failed to map PBO of size 65536 bytes
[2020-04-23T16:31:44Z ERROR webrender::device::gl] Failed to map PBO of size 65536 bytes
[GFX1-]: Failed to map PBO of size 1088 bytes
[2020-04-23T16:31:44Z ERROR webrender::device::gl] Failed to map PBO of size 1088 bytes
Flags: needinfo?(gwatson)
Blocks: gfx-triage
Priority: -- → P2

Huh, so GL_CONTEXT_LOST means that the GPU itself was reset, I think, which seems bad!

I'm not really sure what would cause this on a desktop GL Linux system. It possibly suggests that we're doing something bad, like reading from or writing to a mapped memory location that isn't valid. That said, I haven't seen this on any other system, so I wonder if it's actually a Mesa bug.

Jeff, any thoughts or suggestions for this?

Flags: needinfo?(gwatson) → needinfo?(jgilbert)
  • Run it with KHR_debug
  • Run it with abort/break-on-gl-error to figure out what command is failing.
  • If WR uses KHR_no_error (which it should in production?), GL could respond to malformed commands by entering CONTEXT_LOST (or crashing), so we should also check with KHR_no_error disabled.
Flags: needinfo?(jgilbert)
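
The second suggestion above, breaking on the first GL error to find the failing command, could be sketched roughly as follows. This is only an illustration and not Gecko's or WebRender's actual mechanism: the `ErrorSource` trait, the `FakeGl` test double, and the `checked` helper are hypothetical names, and only the numeric error codes come from the OpenGL specification.

```rust
// Error codes as defined by the OpenGL specification.
const GL_NO_ERROR: u32 = 0;
const GL_INVALID_ENUM: u32 = 0x0500;
const GL_INVALID_VALUE: u32 = 0x0501;
const GL_INVALID_OPERATION: u32 = 0x0502;
const GL_OUT_OF_MEMORY: u32 = 0x0505;
const GL_CONTEXT_LOST: u32 = 0x0507;

/// Hypothetical stand-in for the single GL entry point this sketch needs.
trait ErrorSource {
    /// Mirrors glGetError: returns one pending error code, or GL_NO_ERROR.
    fn get_error(&mut self) -> u32;
}

/// Hypothetical test double that replays a scripted list of error codes.
struct FakeGl {
    pending: Vec<u32>,
}

impl ErrorSource for FakeGl {
    fn get_error(&mut self) -> u32 {
        if self.pending.is_empty() {
            GL_NO_ERROR
        } else {
            self.pending.remove(0)
        }
    }
}

/// Maps an error code to its symbolic name for log output.
fn error_name(code: u32) -> &'static str {
    match code {
        GL_NO_ERROR => "GL_NO_ERROR",
        GL_INVALID_ENUM => "GL_INVALID_ENUM",
        GL_INVALID_VALUE => "GL_INVALID_VALUE",
        GL_INVALID_OPERATION => "GL_INVALID_OPERATION",
        GL_OUT_OF_MEMORY => "GL_OUT_OF_MEMORY",
        GL_CONTEXT_LOST => "GL_CONTEXT_LOST",
        _ => "unrecognized GL error",
    }
}

/// Runs `command`, then reads one error code. On failure, the returned
/// message names the command so the offending call can be identified.
fn checked<G: ErrorSource, R>(
    gl: &mut G,
    label: &str,
    command: impl FnOnce(&mut G) -> R,
) -> Result<R, String> {
    let result = command(&mut *gl);
    let code = gl.get_error();
    if code == GL_NO_ERROR {
        Ok(result)
    } else {
        Err(format!("{} failed with {}", label, error_name(code)))
    }
}
```

A caller that wants abort-on-error behaviour could `unwrap()` the result of each `checked` call, so the panic message names the command that failed.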

Interestingly, this time I looked in dmesg and saw I got a GPU hang:

[172004.466510] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [130471]
[172004.466658] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172004.568093] i915 0000:00:02.0: Renderer[130471] context reset due to GPU hang
[172018.222820] Renderer[130471]: segfault at 0 ip 00007f83dabd688e sp 00007f83cd7f0b90 error 6 in libxul.so[7f83d5750000+54a6000]
[172018.222826] Code: 8b 4d c0 e9 94 fe ff ff 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 55 48 89 e5 48 8b 05 c5 9f 3e 02 48 89 10 <89> 34 25 00 00 00 00 e8 56 3a b8 fa 66 0f 1f 44 00 00 85 ff 74 19
[172025.508687] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [144628]
[172025.508987] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172025.611886] i915 0000:00:02.0: Renderer[144628] context reset due to GPU hang
[172026.330617] Renderer[144628]: segfault at 0 ip 00007f35598aa88e sp 00007f354c4afc10 error 6 in libxul.so[7f3554424000+54a6000]
[172026.330630] Code: 8b 4d c0 e9 94 fe ff ff 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 55 48 89 e5 48 8b 05 c5 9f 3e 02 48 89 10 <89> 34 25 00 00 00 00 e8 56 3a b8 fa 66 0f 1f 44 00 00 85 ff 74 19
[172033.508821] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [144699]
[172033.510523] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172033.612054] i915 0000:00:02.0: Renderer[144699] context reset due to GPU hang

And also a crash report: https://crash-stats.mozilla.org/report/index/85d22bd9-415e-4b18-b56e-01ab00200424

I'm not sure which caused which, though.

The crash report has:

assertion failed: (left != right)
left: 0,
right: 0: glCopyImageSubData's behaviour is undefined if src and dst images are identical and the rectangles overlap.
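
The rule quoted in the assertion message can be written down as a small predicate. The sketch below assumes that the two logged values (both 0) are texture IDs, which the report does not state explicitly. The `Rect` type and the `copy_is_well_defined` helper are hypothetical names for illustration, not WebRender's real types.

```rust
/// Hypothetical axis-aligned rectangle in texel coordinates.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

impl Rect {
    /// True if the rectangle covers no texels.
    fn is_empty(&self) -> bool {
        self.width <= 0 || self.height <= 0
    }

    /// True if the two rectangles share at least one texel.
    fn intersects(&self, other: &Rect) -> bool {
        !self.is_empty()
            && !other.is_empty()
            && self.x < other.x + other.width
            && other.x < self.x + self.width
            && self.y < other.y + other.height
            && other.y < self.y + self.height
    }
}

/// Returns false only for the case the assertion message describes:
/// source and destination are the same image and the rectangles overlap.
fn copy_is_well_defined(src_id: u32, src: &Rect, dst_id: u32, dst: &Rect) -> bool {
    src_id != dst_id || !src.intersects(dst)
}
```

Note that this predicate is narrower than the logged assertion, which fails whenever the two values are equal, whether or not the rectangles overlap.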

Glenn, do you have feedback on comment 7?

Flags: needinfo?(gwatson)
Blocks: wr-78
No longer blocks: gfx-triage

It's certainly possible that the assertion failure above is exposing a GPU driver bug, causing the hang (also possible that it's unrelated). I'll see if I can repro that assertion with some more testing today.

Flags: needinfo?(gwatson)

I can't get that assert to fire on my system (which is also an HD4600) with that URL. We've seen that assertion fire once before in a bug, and have never been able to repro it.

Thomas, is that crash a reliable repro for you? If so, I could provide a patch / try build with artifacts you could run on your machine with some extra logging information to try and help narrow down what code path is causing that?

Flags: needinfo?(tdaede)

Yes, it still reproduces every time for me. I can test a patch. I also see that the KHR_debug patch has landed, but I don't know how to enable it.

Flags: needinfo?(tdaede)

If you apply this patch and report the output, it might help to diagnose what's occurring. It may involve a bit of back and forth, since I can't repro locally, sorry.

In addition to logging some information, it also early exits from the function when this case occurs.

If that stops the overall GPU hang / crash, we know it's related to this assertion failure. I would expect to see visual corruption in this case, but hopefully it stops the GPU reset (otherwise it's probably unrelated to this assertion failure).

Thanks!

Assignee: nobody → gwatson
Flags: needinfo?(tdaede)
Blocks: wr-79
No longer blocks: wr-78
Blocks: wr-linux-mvp
No longer blocks: wr-79

Alright, this is really frustrating. After building in the patch and testing, it no longer reproduces, not even with mozregression. I thought maybe a recent system update could have done it, so I rolled back to an earlier Mesa (definitely installed at the time, though I think I reproduced it on a more recently installed version) and still couldn't. I'm going to continue to try rolling back more things, but it's also possible Quora changed their website...

Flags: needinfo?(tdaede)
Severity: -- → S3

I have been unable to reproduce with the same GPU and Mesa 20.0.8. I've been looking at the crash reports for the related assert, and it is mostly Intel, but there are a couple of reports for NVIDIA. Doesn't appear to be tied to any particular hardware. All of the crashes report the same thing in the gfx critical log, e.g.:

(t=29158.1) |[G55854][GFX1-]: Failed to map PBO of size 10108 bytes
(t=29158.1) |[G55840][GFX1-]: Failed to map PBO of size 26712 bytes
(t=29157.7) |[G55841][GFX1-]: Failed to map PBO of size 10792 bytes
(t=29157.7) |[G55842][GFX1-]: Failed to map PBO of size 163840 bytes
(t=29157.7) |[G55843][GFX1-]: Failed to map PBO of size 4192 bytes
(t=29157.7) |[G55844][GFX1-]: Failed to map PBO of size 4192 bytes
(t=29157.7) |[G55845][GFX1-]: Failed to map PBO of size 1152 bytes

I would say the particular URLs are not super pertinent unless one can reproduce. The volume is low, but in the last 6 months, it crashed on about:newtab, YouTube, DDG, Quora, all very popular sites.

All that said, I don't think the assertions are important unless there is other information we can glean from the reports (no aggregations stand out to me). We lose the GPU context, and then we see the assertions, suggesting perhaps we didn't handle the reset properly and tried to continue on our merry way, and we got garbage out?

Similar to ANGLE and WebGL, we should be checking if there is a device
reset after a render pass via the glGetGraphicsResetStatus API.

Additionally, we should allow for simulating a device reset on platforms
other than Windows when using WebRender.
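
As a rough illustration of the approach described above, and not the code that actually landed, a post-render-pass reset check with a simulated reset might look like the following. `glGetGraphicsResetStatus` and the status values are the real robustness API. The `Gl` trait, the `SimulatedGl` type, and `end_render_pass` are hypothetical names chosen for this sketch.

```rust
// Values returned by glGetGraphicsResetStatus (robustness extension / GL 4.5).
const GL_NO_ERROR: u32 = 0;
const GL_GUILTY_CONTEXT_RESET: u32 = 0x8253;
const GL_INNOCENT_CONTEXT_RESET: u32 = 0x8254;
const GL_UNKNOWN_CONTEXT_RESET: u32 = 0x8255;

#[derive(Clone, Copy, Debug, PartialEq)]
enum ResetStatus {
    NoReset,
    Guilty,
    Innocent,
    Unknown,
}

/// Translates the raw GL value; anything unexpected is treated as Unknown.
fn classify_reset(raw: u32) -> ResetStatus {
    match raw {
        GL_NO_ERROR => ResetStatus::NoReset,
        GL_GUILTY_CONTEXT_RESET => ResetStatus::Guilty,
        GL_INNOCENT_CONTEXT_RESET => ResetStatus::Innocent,
        _ => ResetStatus::Unknown,
    }
}

/// Hypothetical abstraction over the one GL call the check needs.
trait Gl {
    fn get_graphics_reset_status(&mut self) -> u32;
}

/// Hypothetical driver stand-in that lets a test force a reset on any platform.
struct SimulatedGl {
    forced: Option<u32>,
}

impl SimulatedGl {
    fn simulate_reset(&mut self, status: u32) {
        self.forced = Some(status);
    }
}

impl Gl for SimulatedGl {
    fn get_graphics_reset_status(&mut self) -> u32 {
        self.forced.unwrap_or(GL_NO_ERROR)
    }
}

/// Called once after each render pass. An Err means the context is gone
/// and the caller should tear down and recreate its GPU resources rather
/// than keep issuing commands.
fn end_render_pass<G: Gl>(gl: &mut G) -> Result<(), ResetStatus> {
    match classify_reset(gl.get_graphics_reset_status()) {
        ResetStatus::NoReset => Ok(()),
        status => Err(status),
    }
}
```

In this sketch, calling `simulate_reset(GL_INNOCENT_CONTEXT_RESET)` on a `SimulatedGl` makes the next `end_render_pass` return `Err(ResetStatus::Innocent)`, which is how a test could exercise the recovery path without a real GPU hang.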

Attachment #9164317 - Attachment description: Bug 1632005 - Check for context loss with WebRender with GL and not ANGLE. → Bug 1632005 - Check for context loss with WebRender with native GL.
Pushed by aosmond@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1f950e392b06
Check for context loss with WebRender with native GL. r=nical
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80