Closed Bug 1632005 Opened 5 years ago Closed 4 years ago

Visiting Quora with Webrender takes down entire system (Intel Linux)

Tracking

Status:

RESOLVED FIXED

Milestone:

mozilla80

Tracking Flags:

firefox80: tracking ---, status fixed

People

(Reporter: TD-Linux, Assigned: gw)

References

(URL)

Attachments

(2 files)

0001-Log-extra-information-when-bad-texture-blit-occurs.patch (5 years ago, Glenn Watson [:gw], 3.01 KB, patch)
Bug 1632005 - Check for context loss with WebRender with native GL. (4 years ago, Andrew Osmond [:aosmond] (he/him), 47 bytes, text/x-phabricator-request)

Thomas Daede [:TD-Linux]

Reporter

Description

•

5 years ago

When going to the URL, the Firefox UI locks up, followed by the entire desktop locking up. I also see the following in the console:

[GFX1-]: Failed to map PBO of size 138432 bytes
[2020-04-22T01:22:03Z ERROR webrender::device::gl] Failed to map PBO of size 138432 bytes

implying this is some sort of resource exhaustion.

OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 20.0.4

Thomas Daede [:TD-Linux]

Reporter

Updated

•

5 years ago

OS: Unspecified → Linux

Hardware: Unspecified → x86_64

Version: unspecified → Trunk

Kris Taeleman (:ktaeleman)

Comment 1

•

5 years ago

Not able to reproduce on Linux Mint 19.3 with a Radeon Pro WX 7100.
This might be Intel-specific.

@gw: Any idea what would be going on here?

Flags: needinfo?(gwatson)

Glenn Watson [:gw]

Assignee

Comment 2

•

5 years ago

I'm also unable to reproduce here on Kubuntu 19.10 with the same GPU (HD4600) but an older Mesa (19.2.8), so I'm not sure what's going on. Possibly an issue with newer Mesa?

Could you run Gecko with the environment variable MESA_DEBUG=1 set? In theory, that should print out a descriptive error message when that occurs, which might help in diagnosing what's happening here.

Flags: needinfo?(gwatson)

Thomas Daede [:TD-Linux]

Reporter

Comment 3

•

5 years ago

It is only mildly informative:

MESA_DEBUG=1 ~/firefox/firefox
Mesa: User error: GL_CONTEXT_LOST in context lost
[GFX1-]: Failed to map PBO of size 65536 bytes
[2020-04-23T16:31:44Z ERROR webrender::device::gl] Failed to map PBO of size 65536 bytes
[GFX1-]: Failed to map PBO of size 1088 bytes
[2020-04-23T16:31:44Z ERROR webrender::device::gl] Failed to map PBO of size 1088 bytes

Kris Taeleman (:ktaeleman)

Updated

•

5 years ago

Flags: needinfo?(gwatson)

Kris Taeleman (:ktaeleman)

Updated

•

5 years ago

Blocks: gfx-triage

Priority: -- → P2

Glenn Watson [:gw]

Assignee

Comment 4

•

5 years ago

Huh, so GL_CONTEXT_LOST means that the GPU itself was reset, I think, which seems bad!

I'm not really sure what would cause this on a desktop GL Linux system. It possibly suggests that we're doing something bad like reading or writing to a mapped memory location that isn't valid. Although, I haven't seen this on any other system, so I wonder if it's actually a Mesa bug.

Jeff, any thoughts or suggestions for this?

Flags: needinfo?(gwatson) → needinfo?(jgilbert)

Kelsey Gilbert [:jgilbert]

Comment 5

•

5 years ago

Run it with KHR_debug enabled.
Run it with abort/break-on-gl-error to figure out which command is failing.
If WR uses KHR_no_error (which we should be doing in production?), then any malformed commands we send to GL could make it enter CONTEXT_LOST (or crash), so we should also check with KHR_no_error disabled.
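
Editorial sketch of the break-on-GL-error idea suggested above. This is not WebRender's or Gecko's implementation. It polls an error flag after each command instead of using a KHR_debug callback, and the `GlErrors` trait, `FakeDriver` type and `checked` helper are hypothetical names introduced only for illustration. Only the two GL error codes are real values from the GL headers.

```rust
use std::cell::Cell;

// Real GL enum values; everything else in this block is hypothetical.
pub const GL_NO_ERROR: u32 = 0;
pub const GL_CONTEXT_LOST: u32 = 0x0507;

/// Hypothetical stand-in for a GL binding; only the error query is modelled.
pub trait GlErrors {
    fn get_error(&self) -> u32;
}

/// Runs one GL command, then polls the error flag so that the first failing
/// command can be reported by name.
pub fn checked<G: GlErrors, T>(
    gl: &G,
    name: &str,
    cmd: impl FnOnce(&G) -> T,
) -> Result<T, String> {
    let out = cmd(gl);
    match gl.get_error() {
        GL_NO_ERROR => Ok(out),
        code => Err(format!("{} failed with GL error 0x{:04X}", name, code)),
    }
}

/// Fake driver for illustration: tests set `pending_error` by hand.
pub struct FakeDriver {
    pub pending_error: Cell<u32>,
}

impl GlErrors for FakeDriver {
    fn get_error(&self) -> u32 {
        // Reading the flag clears it, mirroring how glGetError behaves for
        // ordinary errors.
        self.pending_error.replace(GL_NO_ERROR)
    }
}
```

A caller would wrap each command, for example `checked(&gl, "glMapBufferRange", |gl| ...)`, and abort or break into a debugger on the first `Err`. A check like this can only observe errors if the context was created without KHR_no_error.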

Flags: needinfo?(jgilbert)

Thomas Daede [:TD-Linux]

Reporter

Comment 6

•

5 years ago

Interestingly, this time I looked in dmesg and saw I got a GPU hang:

[172004.466510] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [130471]
[172004.466658] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172004.568093] i915 0000:00:02.0: Renderer[130471] context reset due to GPU hang
[172018.222820] Renderer[130471]: segfault at 0 ip 00007f83dabd688e sp 00007f83cd7f0b90 error 6 in libxul.so[7f83d5750000+54a6000]
[172018.222826] Code: 8b 4d c0 e9 94 fe ff ff 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 55 48 89 e5 48 8b 05 c5 9f 3e 02 48 89 10 <89> 34 25 00 00 00 00 e8 56 3a b8 fa 66 0f 1f 44 00 00 85 ff 74 19
[172025.508687] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [144628]
[172025.508987] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172025.611886] i915 0000:00:02.0: Renderer[144628] context reset due to GPU hang
[172026.330617] Renderer[144628]: segfault at 0 ip 00007f35598aa88e sp 00007f354c4afc10 error 6 in libxul.so[7f3554424000+54a6000]
[172026.330630] Code: 8b 4d c0 e9 94 fe ff ff 48 83 c4 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 55 48 89 e5 48 8b 05 c5 9f 3e 02 48 89 10 <89> 34 25 00 00 00 00 e8 56 3a b8 fa 66 0f 1f 44 00 00 85 ff 74 19
[172033.508821] i915 0000:00:02.0: GPU HANG: ecode 7:1:85dffffe, in Renderer [144699]
[172033.510523] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[172033.612054] i915 0000:00:02.0: Renderer[144699] context reset due to GPU hang

And also a crash report: https://crash-stats.mozilla.org/report/index/85d22bd9-415e-4b18-b56e-01ab00200424

I'm not sure which caused which, though.
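
Editorial sketch: one way to approach the "which caused which" question is to order the kernel log events per process. In the excerpt above, the GPU hang for PID 130471 is logged at 172004.466510 and the segfault for the same PID at 172018.222820, so the hang was logged first. That is consistent with, but does not prove, the hang coming first. The parser below handles only the two line shapes quoted in this comment; `Event`, `Kind`, `parse_line` and `hang_precedes_segfault` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Kind {
    GpuHang,
    Segfault,
}

#[derive(Debug, PartialEq)]
pub struct Event {
    pub time: f64,
    pub pid: u32,
    pub kind: Kind,
}

/// Parses one dmesg line; returns None for lines that are neither a
/// "GPU HANG" report nor a segfault report.
pub fn parse_line(line: &str) -> Option<Event> {
    // The timestamp sits between the leading '[' and the first ']'.
    let rest = line.strip_prefix('[')?;
    let (stamp, msg) = rest.split_once(']')?;
    let time: f64 = stamp.trim().parse().ok()?;
    let kind = if msg.contains("GPU HANG") {
        Kind::GpuHang
    } else if msg.contains("segfault") {
        Kind::Segfault
    } else {
        return None;
    };
    // The PID is the bracketed number after "Renderer", with or without a
    // space in between.
    let after = msg.split_once("Renderer")?.1.trim_start();
    let pid = after.strip_prefix('[')?.split_once(']')?.0.parse().ok()?;
    Some(Event { time, pid, kind })
}

/// Some(true) if the first hang logged for `pid` precedes its first
/// segfault; None if either event is missing for that PID.
pub fn hang_precedes_segfault(events: &[Event], pid: u32) -> Option<bool> {
    let first = |kind: Kind| {
        events
            .iter()
            .find(|e| e.pid == pid && e.kind == kind)
            .map(|e| e.time)
    };
    Some(first(Kind::GpuHang)? < first(Kind::Segfault)?)
}
```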

Kelsey Gilbert [:jgilbert]

Comment 7

•

5 years ago

The crash report has:

assertion failed: (left != right)
left: 0,
right: 0: glCopyImageSubData's behaviour is undefined if src and dst images are identical and the rectangles overlap.
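
Editorial sketch of the condition the assertion message describes. The reported `left` and `right` values are both 0, so the two compared values were equal; the message adds that the copy is only undefined when the rectangles also overlap. The code below is a simplified 2D model (the real glCopyImageSubData call also takes levels and depth), and `Rect`, `TextureId` and `copy_is_well_defined` are hypothetical names, not WebRender types.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: i32,
    pub y: i32,
    pub w: i32,
    pub h: i32,
}

impl Rect {
    /// Half-open overlap test: rectangles that only share an edge, or that
    /// are empty, do not overlap.
    pub fn overlaps(&self, other: &Rect) -> bool {
        self.w > 0
            && self.h > 0
            && other.w > 0
            && other.h > 0
            && self.x < other.x + other.w
            && other.x < self.x + self.w
            && self.y < other.y + other.h
            && other.y < self.y + self.h
    }
}

/// Hypothetical texture name type. GL reserves name 0; glGenTextures never
/// returns it.
pub type TextureId = u32;

/// True when a copy is well defined under the rule quoted in the assertion:
/// either the images differ, or the regions do not overlap.
pub fn copy_is_well_defined(
    src: TextureId,
    src_rect: Rect,
    dst: TextureId,
    dst_rect: Rect,
) -> bool {
    src != dst || !src_rect.overlaps(&dst_rect)
}
```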

Jessie [:jbonisteel] pls NI

Comment 8

•

5 years ago

Glenn, do you have feedback on comment 7?

Flags: needinfo?(gwatson)

Jessie [:jbonisteel] pls NI

Updated

•

5 years ago

Blocks: wr-78
No longer blocks: gfx-triage

Glenn Watson [:gw]

Assignee

Comment 9

•

5 years ago

It's certainly possible that the assertion failure above is exposing a GPU driver bug, causing the hang (also possible that it's unrelated). I'll see if I can repro that assertion with some more testing today.

Flags: needinfo?(gwatson)

Glenn Watson [:gw]

Assignee

Comment 10

•

5 years ago

I can't get that assert to fire on my system (which is also an HD4600) with that URL. We've seen that assertion fire once before in a bug, and never been able to repro.

Thomas, is that crash a reliable repro for you? If so, I could provide a patch / try build with artifacts you could run on your machine with some extra logging information to try and help narrow down what code path is causing that?

Flags: needinfo?(tdaede)

Thomas Daede [:TD-Linux]

Reporter

Comment 11

•

5 years ago

Yes, it still reproduces every time for me. I can test a patch. I also see that the KHR_debug patch has landed, but I don't know how to enable it.

Flags: needinfo?(tdaede)

Glenn Watson [:gw]

Assignee

Comment 12

•

5 years ago

Attached patch 0001-Log-extra-information-when-bad-texture-blit-occurs.patch

If you apply this patch and report the output, it might help to diagnose what's occurring. It may involve a bit of back and forth, since I can't repro locally, sorry.

In addition to logging some information, it also early exits from the function when this case occurs.

If that stops the overall GPU hang / crash, we know it's related to this assertion failure. I would expect to see visual corruption in this case, but hopefully it stops the GPU reset (otherwise it's probably unrelated to this assertion failure).
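
Editorial sketch of the "log and exit early" approach described in this comment, as opposed to a hard assertion. The attached patch itself is not reproduced here; `Texture`, `BlitOutcome` and `blit` are hypothetical names, and the real copy is replaced by a caller-supplied closure so the block stays self-contained.

```rust
#[derive(Debug, PartialEq)]
pub enum BlitOutcome {
    Copied,
    SkippedSameTexture,
}

/// Hypothetical texture handle holding only the GL name.
pub struct Texture {
    pub id: u32,
}

/// Logs and returns early when source and destination are the same texture,
/// instead of asserting. Skipping the copy may leave stale pixels on screen,
/// which is the visual corruption the comment anticipates.
pub fn blit(
    src: &Texture,
    dst: &Texture,
    log: &mut Vec<String>,
    do_copy: impl FnOnce(),
) -> BlitOutcome {
    if src.id == dst.id {
        log.push(format!(
            "bad texture blit: src id {} == dst id {}",
            src.id, dst.id
        ));
        return BlitOutcome::SkippedSameTexture;
    }
    do_copy();
    BlitOutcome::Copied
}
```

If the hang disappears with the copy skipped, that points at the bad blit; if it persists, the assertion is more likely a symptom than the cause.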

Thanks!

Assignee: nobody → gwatson

Jeff Muizelaar [:jrmuizel]

Updated

•

5 years ago

Flags: needinfo?(tdaede)

Jeff Muizelaar [:jrmuizel]

Updated

•

5 years ago

Blocks: wr-79
No longer blocks: wr-78

Jeff Muizelaar [:jrmuizel]

Updated

•

5 years ago

Blocks: wr-linux-mvp
No longer blocks: wr-79

Thomas Daede [:TD-Linux]

Reporter

Comment 13

•

5 years ago

Alright, this is really frustrating. After building in the patch and testing, it no longer reproduces, not even with mozregression. I thought maybe a recent system update could have done it, so I rolled back to an earlier Mesa (definitely installed at the time, though I think I reproduced it on a more recently installed version) and still couldn't. I'm going to continue to try rolling back more things, but it's also possible Quora changed their website...

Flags: needinfo?(tdaede)

Jessie [:jbonisteel] pls NI

Updated

•

5 years ago

Severity: -- → S3

Andrew Osmond [:aosmond] (he/him)

Comment 14

•

4 years ago

I have been unable to reproduce with the same GPU and Mesa 20.0.8. I've been looking at the crash reports for the related assert, and it is mostly Intel, but there are a couple of reports for NVIDIA. Doesn't appear to be tied to any particular hardware. All of the crashes report the same thing in the gfx critical log, e.g.:

(t=29158.1) |[G55854][GFX1-]: Failed to map PBO of size 10108 bytes
(t=29158.1) |[G55840][GFX1-]: Failed to map PBO of size 26712 bytes
(t=29157.7) |[G55841][GFX1-]: Failed to map PBO of size 10792 bytes
(t=29157.7) |[G55842][GFX1-]: Failed to map PBO of size 163840 bytes
(t=29157.7) |[G55843][GFX1-]: Failed to map PBO of size 4192 bytes
(t=29157.7) |[G55844][GFX1-]: Failed to map PBO of size 4192 bytes
(t=29157.7) |[G55845][GFX1-]: Failed to map PBO of size 1152 bytes

I would say the particular URLs are not super pertinent unless one can reproduce. The volume is low, but in the last 6 months, it crashed on about:newtab, YouTube, DDG, Quora, all very popular sites.

All that said, I don't think the assertions are important unless there is other information we can glean from the reports (no aggregations stand out to me). We lose the GPU context, and then we see the assertions, suggesting perhaps we didn't handle the reset properly and tried to continue on our merry way, and we got garbage out?

Andrew Osmond [:aosmond] (he/him)

Comment 15

•

4 years ago

We log if we get a device reset:

https://searchfox.org/mozilla-central/rev/1b95a0179507a4dc7d4b0c94c2df420dc1a72885/gfx/webrender_bindings/RenderThread.cpp#802

And that isn't appearing in the gfx critical log.

Andrew Osmond [:aosmond] (he/him)

Comment 16

•

4 years ago

https://searchfox.org/mozilla-central/rev/1b95a0179507a4dc7d4b0c94c2df420dc1a72885/gfx/webrender_bindings/RenderCompositor.cpp#160

Ah. That would do it. TODO, handle device resets :).

Andrew Osmond [:aosmond] (he/him)

Comment 17

•

4 years ago

Attached file Bug 1632005 - Check for context loss with WebRender with native GL.

Similar to ANGLE and WebGL, we should be checking if there is a device
reset after a render pass via the glGetGraphicsResetStatus API.

Additionally, we should allow for simulating a device reset on platforms
other than Windows when using WebRender.
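
Editorial sketch of the approach this patch description outlines: query the reset status after each render pass and stop treating the frame as presented if the context was lost. The landed change lives in Gecko's C++ compositor code and is not reproduced here. The `ResetQuery` trait, `Compositor`, `FakeResetGl` and the `simulate_reset` flag are hypothetical names; only the status values are taken from the GL headers.

```rust
/// Values glGetGraphicsResetStatus can return (from the GL headers).
pub mod reset_status {
    pub const NO_ERROR: u32 = 0;
    pub const GUILTY_CONTEXT_RESET: u32 = 0x8253;
    pub const INNOCENT_CONTEXT_RESET: u32 = 0x8254;
    pub const UNKNOWN_CONTEXT_RESET: u32 = 0x8255;
}

/// Hypothetical minimal GL interface; a real binding would forward this to
/// glGetGraphicsResetStatus.
pub trait ResetQuery {
    fn get_graphics_reset_status(&self) -> u32;
}

#[derive(Debug, PartialEq)]
pub enum FrameResult {
    Presented,
    DeviceReset(u32),
}

pub struct Compositor<G: ResetQuery> {
    pub gl: G,
    /// Test hook: pretend the next frame hit a device reset.
    pub simulate_reset: bool,
}

impl<G: ResetQuery> Compositor<G> {
    /// Draws one frame, then checks whether the context survived it.
    pub fn render_frame(&mut self, draw: impl FnOnce(&G)) -> FrameResult {
        draw(&self.gl);
        let status = if self.simulate_reset {
            // One-shot: the simulated reset applies to a single frame.
            self.simulate_reset = false;
            reset_status::UNKNOWN_CONTEXT_RESET
        } else {
            self.gl.get_graphics_reset_status()
        };
        if status == reset_status::NO_ERROR {
            FrameResult::Presented
        } else {
            FrameResult::DeviceReset(status)
        }
    }
}

/// Fake GL for illustration; tests set `status` by hand.
pub struct FakeResetGl {
    pub status: u32,
}

impl ResetQuery for FakeResetGl {
    fn get_graphics_reset_status(&self) -> u32 {
        self.status
    }
}
```

On `DeviceReset`, a caller would be expected to tear down and recreate the context and its resources rather than continue issuing commands against the lost one.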

Phabricator Automation

Updated

•

4 years ago

Attachment #9164317 - Attachment description: Bug 1632005 - Check for context loss with WebRender with GL and not ANGLE. → Bug 1632005 - Check for context loss with WebRender with native GL.

Pulsebot

Comment 18

•

4 years ago

Pushed by aosmond@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1f950e392b06 Check for context loss with WebRender with native GL. r=nical

Atila Butkovits

Comment 19

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1f950e392b06

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox80: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla80
