Visiting Quora with Webrender takes down entire system (Intel Linux)
Categories: Core :: Graphics: WebRender, defect, P2
Tracking: firefox80 (tracking: ---, status: fixed)
People: Reporter: TD-Linux, Assigned: gw
Attachments (2 files): patch, 3.01 KB; text/x-phabricator-request, 47 bytes
When going to the URL, the Firefox UI locks up, and then the entire desktop locks up. I also see the following in the console:
[GFX1-]: Failed to map PBO of size 138432 bytes
[2020-04-22T01:22:03Z ERROR webrender::device::gl] Failed to map PBO of size 138432 bytes
implying this is some sort of resource exhaustion.
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 20.0.4
Comment 1•5 years ago
|
||
Not able to reproduce on Linux Mint 19.3 with Radeon Pro WX 7100.
This might be Intel-specific.
@gw: Any idea what would be going on here?
Assignee | ||
Comment 2•5 years ago
|
||
I'm also unable to reproduce here on Kubuntu 19.10 with the same GPU (HD4600) but an older Mesa (19.2.8), so I'm not sure what's going on. Possibly an issue with newer Mesa?
Could you run Gecko with the environment variable MESA_DEBUG=1 set? In theory, that should print out a descriptive error message when that occurs, which might help in diagnosing what's happening here.
Reporter | ||
Comment 3•5 years ago
|
||
It is only mildly informative:
MESA_DEBUG=1 ~/firefox/firefox
Mesa: User error: GL_CONTEXT_LOST in context lost
[GFX1-]: Failed to map PBO of size 65536 bytes
[2020-04-23T16:31:44Z ERROR webrender::device::gl] Failed to map PBO of size 65536 bytes
[GFX1-]: Failed to map PBO of size 1088 bytes
[2020-04-23T16:31:44Z ERROR webrender::device::gl] Failed to map PBO of size 1088 bytes
Assignee | ||
Comment 4•5 years ago
|
||
Huh, so GL_CONTEXT_LOST means that the GPU itself was reset, I think, which seems bad!
I'm not really sure what would cause this on a desktop GL Linux system. It possibly suggests that we're doing something bad, like reading or writing to a mapped memory location that isn't valid. However, I haven't seen this on any other system, so I wonder if it's actually a Mesa bug.
Jeff, any thoughts or suggestions for this?
Comment 5•5 years ago
|
||
- Run it with KHR_debug
- Run it with abort/break-on-gl-error to figure out what command is failing.
- If WR uses KHR_no_error (which we should be in production?), then any malformed commands we provide to GL could make it enter CONTEXT_LOST (or crash), so we should check with KHR_no_error disabled.
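The break-on-GL-error suggestion above can be sketched roughly as follows. This is a minimal illustration in Rust, not WebRender's actual debugging code: the `GlErrorSource` trait, `checked` helper, and `ScriptedGl` test double are hypothetical names invented here, and only the numeric error codes are the standard OpenGL enum values. The idea is to poll for an error after every command so the failing command can be named.

```rust
// Hypothetical sketch: these names do not come from WebRender or Gecko.
// The numeric values are the standard OpenGL error enums.
const GL_NO_ERROR: u32 = 0;
const GL_INVALID_ENUM: u32 = 0x0500;
const GL_INVALID_VALUE: u32 = 0x0501;
const GL_INVALID_OPERATION: u32 = 0x0502;
const GL_OUT_OF_MEMORY: u32 = 0x0505;
const GL_CONTEXT_LOST: u32 = 0x0507;

/// Minimal stand-in for the part of a GL binding this sketch needs.
trait GlErrorSource {
    /// Mirrors glGetError: returns the next pending error code.
    fn get_error(&mut self) -> u32;
}

/// Maps an error code to a readable name for log output.
fn error_name(code: u32) -> &'static str {
    match code {
        GL_NO_ERROR => "GL_NO_ERROR",
        GL_INVALID_ENUM => "GL_INVALID_ENUM",
        GL_INVALID_VALUE => "GL_INVALID_VALUE",
        GL_INVALID_OPERATION => "GL_INVALID_OPERATION",
        GL_OUT_OF_MEMORY => "GL_OUT_OF_MEMORY",
        GL_CONTEXT_LOST => "GL_CONTEXT_LOST",
        _ => "unknown GL error",
    }
}

/// Runs one GL command, then polls for an error and reports which
/// command it followed. A caller could abort or break on `Err`.
fn checked<G: GlErrorSource, T>(
    gl: &mut G,
    command: &str,
    f: impl FnOnce(&mut G) -> T,
) -> Result<T, String> {
    let value = f(&mut *gl);
    match gl.get_error() {
        GL_NO_ERROR => Ok(value),
        code => Err(format!(
            "{} failed with {} (0x{:04X})",
            command,
            error_name(code),
            code
        )),
    }
}

/// Test double that returns a scripted sequence of error codes.
struct ScriptedGl {
    errors: Vec<u32>,
}

impl GlErrorSource for ScriptedGl {
    fn get_error(&mut self) -> u32 {
        if self.errors.is_empty() {
            GL_NO_ERROR
        } else {
            self.errors.remove(0)
        }
    }
}
```

A check like this is only expected to help when the context was created without KHR_no_error, since that extension allows the driver to skip error generation.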
Reporter | ||
Comment 6•5 years ago
|
||
Interestingly, this time I looked in dmesg and saw I got a GPU hang:
[172004.466510] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [130471]
[172004.466658] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172004.568093] i915 0000:00:02.0: Renderer[130471] context reset due to GPU hang
[172018.222820] Renderer[130471]: segfault at 0 ip 00007f83dabd688e sp 00007f83cd7f0b90 error 6 in libxul.so[7f83d5750000+54a6000]
[172018.222826] Code: 8b 4d c0 e9 94 fe ff ff 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 55 48 89 e5 48 8b 05 c5 9f 3e 02 48 89 10 <89> 34 25 00 00 00 00 e8 56 3a b8 fa 66 0f 1f 44 00 00 85 ff 74 19
[172025.508687] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [144628]
[172025.508987] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172025.611886] i915 0000:00:02.0: Renderer[144628] context reset due to GPU hang
[172026.330617] Renderer[144628]: segfault at 0 ip 00007f35598aa88e sp 00007f354c4afc10 error 6 in libxul.so[7f3554424000+54a6000]
[172026.330630] Code: 8b 4d c0 e9 94 fe ff ff 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 55 48 89 e5 48 8b 05 c5 9f 3e 02 48 89 10 <89> 34 25 00 00 00 00 e8 56 3a b8 fa 66 0f 1f 44 00 00 85 ff 74 19
[172033.508821] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [144699]
[172033.510523] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172033.612054] i915 0000:00:02.0: Renderer[144699] context reset due to GPU hang
And also a crash report: https://crash-stats.mozilla.org/report/index/85d22bd9-415e-4b18-b56e-01ab00200424
I'm not sure which caused which, though.
Comment 7•5 years ago
|
||
The crash report has:
assertion failed: (left != right) left: 0, right: 0: glCopyImageSubData's behaviour is undefined if src and dst images are identical and the rectangles overlap.
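The precondition named in that assertion message can be illustrated with a small Rust sketch. The `Rect` type and `copy_is_undefined` function are hypothetical and are not WebRender's own types. The sketch also ignores mip levels and array layers, which a real check would likely need to consider.

```rust
/// Hypothetical rectangle type for this sketch (not WebRender's own).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

impl Rect {
    /// True when the rectangle covers no texels.
    fn is_empty(&self) -> bool {
        self.width <= 0 || self.height <= 0
    }

    /// True when the two rectangles share at least one texel.
    fn overlaps(&self, other: &Rect) -> bool {
        !self.is_empty()
            && !other.is_empty()
            && self.x < other.x + other.width
            && other.x < self.x + self.width
            && self.y < other.y + other.height
            && other.y < self.y + self.height
    }
}

/// True when a copy falls into the case the assertion message calls
/// undefined: same image on both sides and overlapping regions.
fn copy_is_undefined(src_texture: u32, src: Rect, dst_texture: u32, dst: Rect) -> bool {
    src_texture == dst_texture && src.overlaps(&dst)
}
```

Rectangles that merely touch along an edge share no texels, so the sketch does not treat them as overlapping.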
Assignee | ||
Comment 9•5 years ago
|
||
It's certainly possible that the assertion failure above is exposing a GPU driver bug, causing the hang (also possible that it's unrelated). I'll see if I can repro that assertion with some more testing today.
Assignee | ||
Comment 10•5 years ago
|
||
I can't get that assert to fire on my system (which is also an HD4600) with that URL. We've seen that assertion fire once before in a bug, and have never been able to repro it.
Thomas, is that crash a reliable repro for you? If so, I could provide a patch / try build with artifacts that you could run on your machine, with some extra logging to help narrow down which code path is causing it.
Reporter | ||
Comment 11•5 years ago
|
||
Yes, it still reproduces every time for me. I can test a patch. I also see that the KHR_debug patch has landed, but I don't know how to enable it.
Assignee | ||
Comment 12•5 years ago
|
||
If you apply this patch and report the output, it might help to diagnose what's occurring. It may involve a bit of back and forth, since I can't repro locally, sorry.
In addition to logging some information, it also early exits from the function when this case occurs.
If that stops the overall GPU hang / crash, we know it's related to this assertion failure. I would expect to see visual corruption in this case, but hopefully it stops the GPU reset (otherwise it's probably unrelated to this assertion failure).
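The log-and-early-exit approach described above might look something like the following Rust sketch. The attached patch itself is not reproduced here, so everything below is an assumption: the `guarded_copy` function, the `CopyOutcome` enum, and the idea that the asserted values are texture ids are all hypothetical.

```rust
/// Outcome of the hypothetical guarded copy.
#[derive(Debug, PartialEq)]
enum CopyOutcome {
    Copied,
    SkippedSameTexture,
}

/// Instead of asserting that the two ids differ, record the offending
/// id and return before issuing the copy. `do_copy` stands in for the
/// real GL call; `log` stands in for whatever logging is available.
fn guarded_copy(
    src_id: u32,
    dst_id: u32,
    log: &mut Vec<String>,
    do_copy: impl FnOnce(),
) -> CopyOutcome {
    if src_id == dst_id {
        log.push(format!(
            "copy skipped: src and dst are both texture {}",
            src_id
        ));
        return CopyOutcome::SkippedSameTexture;
    }
    do_copy();
    CopyOutcome::Copied
}
```

Skipping the copy leaves the destination stale, which matches the expectation of visual corruption rather than a crash.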
Thanks!
Reporter | ||
Comment 13•5 years ago
|
||
Alright, this is really frustrating. After building with the patch and testing, it no longer reproduces, not even with mozregression. I thought maybe a recent system update could have done it, so I rolled back to an earlier Mesa (definitely installed at the time, though I think I reproduced it on a more recently installed version) and still couldn't. I'm going to continue to try rolling back more things, but it's also possible Quora changed their website...
Comment 14•4 years ago
|
||
I have been unable to reproduce with the same GPU and Mesa 20.0.8. I've been looking at the crash reports for the related assert, and it is mostly Intel, but there are a couple of reports for NVIDIA. Doesn't appear to be tied to any particular hardware. All of the crashes report the same thing in the gfx critical log, e.g.:
(t=29158.1) |[G55854][GFX1-]: Failed to map PBO of size 10108 bytes
(t=29158.1) |[G55840][GFX1-]: Failed to map PBO of size 26712 bytes
(t=29157.7) |[G55841][GFX1-]: Failed to map PBO of size 10792 bytes
(t=29157.7) |[G55842][GFX1-]: Failed to map PBO of size 163840 bytes
(t=29157.7) |[G55843][GFX1-]: Failed to map PBO of size 4192 bytes
(t=29157.7) |[G55844][GFX1-]: Failed to map PBO of size 4192 bytes
(t=29157.7) |[G55845][GFX1-]: Failed to map PBO of size 1152 bytes
I would say the particular URLs are not super pertinent unless one can reproduce. The volume is low, but in the last 6 months, it crashed on about:newtab, YouTube, DDG, Quora, all very popular sites.
All that said, I don't think the assertions are important unless there is other information we can glean from the reports (no aggregations stand out to me). We lose the GPU context, and then we see the assertions, suggesting perhaps we didn't handle the reset properly and tried to continue on our merry way, and we got garbage out?
Comment 15•4 years ago
|
||
We log if we get a device reset:
And that isn't appearing in the gfx critical log.
Comment 16•4 years ago
|
||
Ah. That would do it. TODO, handle device resets :).
Comment 17•4 years ago
|
||
Similar to ANGLE and WebGL, we should be checking if there is a device
reset after a render pass via the glGetGraphicsResetStatus API.
Additionally, we should allow for simulating a device reset on platforms
other than Windows when using WebRender.
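As a rough illustration of checking for a device reset after a render pass, here is a minimal Rust sketch. It is not the landed patch: the `ResetQuery` trait, `render_pass_and_check` function, and `FakeGl` type are hypothetical names. Only the numeric constants are the standard values that glGetGraphicsResetStatus can return. `FakeGl` also hints at how a reset could be simulated on any platform by forcing the reported status.

```rust
// Standard OpenGL values returned by glGetGraphicsResetStatus.
const GL_NO_ERROR: u32 = 0;
const GL_GUILTY_CONTEXT_RESET: u32 = 0x8253;
const GL_INNOCENT_CONTEXT_RESET: u32 = 0x8254;
const GL_UNKNOWN_CONTEXT_RESET: u32 = 0x8255;

/// Hypothetical classification of the reset status.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ResetStatus {
    NoReset,
    Guilty,
    Innocent,
    Unknown,
}

/// Converts the raw status code into the enum above. Unrecognized
/// values are treated conservatively as an unknown reset.
fn classify(status: u32) -> ResetStatus {
    match status {
        GL_NO_ERROR => ResetStatus::NoReset,
        GL_GUILTY_CONTEXT_RESET => ResetStatus::Guilty,
        GL_INNOCENT_CONTEXT_RESET => ResetStatus::Innocent,
        GL_UNKNOWN_CONTEXT_RESET => ResetStatus::Unknown,
        _ => ResetStatus::Unknown,
    }
}

/// Minimal stand-in for a GL binding that exposes the reset query.
trait ResetQuery {
    fn get_graphics_reset_status(&mut self) -> u32;
}

/// Runs a render pass, then asks whether the context was reset so the
/// caller can rebuild the device instead of continuing with a lost one.
fn render_pass_and_check<G: ResetQuery>(
    gl: &mut G,
    draw: impl FnOnce(&mut G),
) -> Result<(), ResetStatus> {
    draw(&mut *gl);
    match classify(gl.get_graphics_reset_status()) {
        ResetStatus::NoReset => Ok(()),
        lost => Err(lost),
    }
}

/// Test double whose status can be forced, simulating a device reset.
struct FakeGl {
    status: u32,
}

impl ResetQuery for FakeGl {
    fn get_graphics_reset_status(&mut self) -> u32 {
        self.status
    }
}
```

Returning the status as an error value leaves the recovery policy (recreating the context, falling back to another renderer, and so on) to the caller.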