Infinite loop in GLContext::RawGetErrorAndClear from endless GL_CONTEXT_LOST with Nvidia Linux drivers and suspend/resume

VERIFIED FIXED in Firefox 64

Status

()

P1
normal
VERIFIED FIXED
6 months ago
2 months ago

People

(Reporter: jld, Assigned: jgilbert)

Tracking

(Depends on: 1 bug, {hang, regression})

unspecified
mozilla65
x86_64
Linux
hang, regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox62 unaffected, firefox63 wontfix, firefox64+ verified, firefox65+ verified)

Details

(Whiteboard: [gfx-noted])

Attachments

(3 attachments, 1 obsolete attachment)

STR #1:
1. Load a WebGL page like Google Maps
2. Suspend/resume ("echo mem > /sys/power/state" as root, then hit a key)

STR #2:
1. Load a WebGL page
2. Navigate to a non-WebGL page (e.g., rfc-editor.org)
3. Suspend/resume
4. Navigate to a WebGL page

This causes the content process to lock up in the loop in GLContext::RawGetErrorAndClear, and I can reproduce it on official builds with a clean profile, using driver versions 390.77 and 390.87.

Debugging, I can see that GetError is endlessly returning GL_CONTEXT_LOST (0x0507); there's also a function inside the closed-source library returning 0x92bb which happens to be GL_PURGED_CONTEXT_RESET_NV.  I immediately suspected bug 1484782, and verified that that's when this regressed.  (Also, if it matters, the backtrace goes into WebGLRenderingContext_Binding::getError and then JS.)

I don't know OpenGL very well, but from reading documentation on glGetError, I don't think it's supposed to behave like that.
Component: Graphics → Canvas: WebGL
Has Regression Range: --- → yes
Keywords: hang
Priority: -- → P3
Whiteboard: [gfx-noted]
See also:
https://phabricator.kde.org/R108:aefb5f4dd9d4
https://gitlab.gnome.org/GNOME/mutter/commit/d4d2bf0f

I tried checking for GL_CONTEXT_LOST in RawGetErrorAndClear, but I hit a lot of assertions afterwards and I'm not familiar enough with the code to know the proper way of fixing those.
status-firefox62: --- → unaffected
status-firefox63: --- → affected
status-firefox64: --- → affected
If there isn't a simple fix for this, should we back out bug 1484782 from m-c and beta?  That one affected only WebRender, which isn't enabled by default on any Linux systems yet, but this seems to affect any use of WebGL.
Flags: needinfo?(jgilbert)
(Assignee)

Comment 3

6 months ago
I'll take a look tomorrow.
Assignee: nobody → jgilbert
(In reply to Jeff Gilbert [:jgilbert] from comment #3)
> I'll take a look tomorrow.

Jeff, can we get a status update on this one? Thanks
See Also: → bug 1489354
(Assignee)

Comment 5

5 months ago
Here's the extension:
https://www.khronos.org/registry/OpenGL/extensions/NV/NV_robustness_video_memory_purge.txt
Flags: needinfo?(jgilbert)
Priority: P3 → P1
(Assignee)

Comment 6

5 months ago
I have a theory. Will test tomorrow.
status-firefox63: affected → wontfix
(Assignee)

Comment 7

5 months ago
So the docs are weird here. From some parts of the docs, it looks like it is allowed to keep returning CONTEXT_LOST from glGetError until you check GetGraphicsResetStatus(), and that's what it sounds like it's doing.

I'm setting up a test machine now. I usually just use Mesa's drivers, even for NV these days. (I would encourage you to try them!)

Comment 8

5 months ago
Sorry to barge in. I have probably same issue in https://bugzilla.mozilla.org/show_bug.cgi?id=1489354 - hangs on linux on switching users, not surviving suspend\resume, etc. 
From 64.0a1 2018-10-17 build its got much severe - now hangs occurs randomly during normal browsing session, without any special action mentioned above. Is something changed? My best guess is https://bugzilla.mozilla.org/show_bug.cgi?id=1498070 but only because its only WebGL related fix.
(Assignee)

Comment 9

5 months ago
I can't seem to get Linux to resume from suspend on NV binary drivers.
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> I can't seem to get Linux to resume from suspend on NV binary drivers.

If you have patches I can test them, but I understand that that's not a great way to develop.

Incidentally, I tried amdgpu a few months ago and it reliably crashed the X server on resume; I'd be surprised if nouveau did any better, but I haven't actually tried it.
Jeff, given that

1) This is breaking users on release.
2) WebRender won't be turned on for Linux for the foreseeable future.

Would you reconsider comment 2, i.e. backing out that fix so WebRender (which isn't used) breaks but WebGL (which *is* used) works again?
Flags: needinfo?(jgilbert)

Comment 13

5 months ago
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> I can't seem to get Linux to resume from suspend on NV binary drivers.

As Ubuntu user with Nvidia binary drivers i can say, from my experience, that not being able to resume from suspend is abnormal situation (i dont have such problem). Yet, its not uncommon, especially on laptops.
I tried poking around in GLContext, a while ago, to see what I could do.  This patch seems to avoid crashing or locking up the process, and instead winds up throwing the error into content.  Google Maps puts up error UI saying to reload the page; a simpler WebGL demo I tried just stopped animating.

Note that, without any of these patches, Google Maps will continue trying to work but its textures will have been replaced with garbage, so this may not be much of a regression.  However, I have no idea what side effects this patch might cause.

It seems to me that a better solution would be to turn on the Nvidia magic just for WebRender's GL context and not contexts used for WebGL, which should be possible given that they're in different processes at the moment, but I don't know what the situation will be after WebGL remoting lands.
(Assignee)

Comment 15

5 months ago
I have a potential fix, but I can't repro, so I'll be needinfo-ing people with builds to try tomorrow.
Flags: needinfo?(jgilbert)
[Tracking Requested - why for this release]:
Linux webgl regression shipped in current release. Fix should be considered for a uplift.
tracking-firefox64: --- → ?
status-firefox65: --- → affected
tracking-firefox64: ? → +
(Assignee)

Comment 17

5 months ago
Ok, I split the fix into 'hack in a loop limit', and 'improve how we handle context loss'.
(Assignee)

Updated

5 months ago
Attachment #9023112 - Attachment is obsolete: true
(Assignee)

Comment 19

5 months ago
Landing is queued. Once it's in Nightly, I'll ask repro people to try to verify.

Comment 20

5 months ago
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f314c7e261c6
Limit glGetError flush loop and handle CONTEXT_LOST. r=lsalzman
I tried D11270 with a debug build:

[gl:0x7faff66e7000] void mozilla::gl::GLContext::raw_fBindFramebuffer(GLenum, GLuint): Generated unexpected  error. (0x0507)
Hit MOZ_CRASH(Unexpected error with MOZ_GL_DEBUG_ABORT_ON_ERROR. (Run with MOZ_GL_DEBUG_ABORT_ON_ERROR=0 to disable)) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:3043
#01: mozilla::gl::GLScreenBuffer::BindFB(unsigned int) (/home/jld/src/gecko-dev/gfx/gl/GLScreenBuffer.cpp:214)
#02: mozilla::WebGLContext::BindFramebuffer(unsigned int, mozilla::WebGLFramebuffer*) (/home/jld/src/gecko-dev/dom/canvas/WebGLContextGL.cpp:153)
#03: mozilla::dom::WebGLRenderingContext_Binding::bindFramebuffer(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, JSJitMethodCallArgs const&) (/home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dom/bindings/WebGLRenderingContextBinding.cpp:14045)

Okay, so let's set that env var to not abort on error:

Assertion failure: false (Unreachable.), at /home/jld/src/gecko-dev/dom/canvas/WebGLContext.cpp:1573
#01: mozilla::WebGLContextLossHandler::TimerCallback() (/home/jld/src/gecko-dev/dom/canvas/WebGLContextLossHandler.cpp:100)
#02: mozilla::WatchdogTimerEvent::Notify(nsITimer*) (/home/jld/src/gecko-dev/dom/canvas/WebGLContextLossHandler.cpp:42)
#03: nsTimerImpl::Fire(int) (/home/jld/src/gecko-dev/xpcom/threads/nsTimerImpl.cpp:687)

Finally, a non-debug build:

WebGL(0x7face8e56000)::ForceLoseContext

…followed by a black page and in-content error message, but no crash or hang.  So that's promising.



But, trying the other repro — navigate to a non-WebGL page, suspend+resume, navigate back to a WebGL page — gives me an infinite loop:

625                 while (mGL.fGetError()) {}
(gdb) fin
Run till exit from #0  mozilla::gl::GLContext::LocalErrorScope::GetError (this=0x7ffdccc16c20) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:625

That's the while loop that I deleted in my patch, because I ran into this when I was experimenting earlier, and the comment on fGetError says “you never have to loop on this”.  A little more context from the stack (I can post the whole thing if it would help):

#10 0x00007faceea3850a in mozilla::gl::GLContext::LocalErrorScope::GetError() (this=0x7ffdccc16c20) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:625
#11 0x00007faceea21e95 in mozilla::gl::GLContext::GetPotentialInteger(unsigned int, int*) (this=0x7faccf7f4000, pname=33309, param=<optimized out>) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:647
#12 0x00007faceea21e95 in mozilla::gl::GLContext::InitExtensions()::$_4::operator()() const (this=<optimized out>) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:1633
#13 0x00007faceea21e95 in mozilla::gl::GLContext::InitExtensions() (this=0x7faccf7f4000) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:1630
...
#23 0x00007facef877094 in mozilla::WebGLContext::SetDimensions(int, int) (this=0x7facd0150800, signedWidth=<optimized out>, signedHeight=<optimized out>) at /home/jld/src/gecko-dev/dom/canvas/WebGLContext.cpp:874
(Assignee)

Comment 22

5 months ago
I hate it when I shoot from the hip and miss.

Please try these builds, once ready:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=49f8757f861433ae2920614424063d155fc1238b
Flags: needinfo?(jld)

Comment 24

5 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/f314c7e261c6
Status: NEW → RESOLVED
Last Resolved: 5 months ago
status-firefox65: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
(In reply to Jeff Gilbert [:jgilbert] from comment #23)
> Just kidding, please try these builds instead:
>  https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=9bde8e92d35906fbace0f07a92a701cb7117bf8d

The debug builds there are broken because MOZ_ASSERT(!err && err == LOCAL_GL_CONTEXT_LOST) should be an ||, which interestingly is the same mistake I made the first time I tried to fix the loop condition in RawGetErrorAndClear.

Using the PGO build[*], the second STR from comment #0 crashes thusly (and I saved the minidump in case it's useful):

Thread 0 (crashed)
 0  libxul.so!mozilla::gl::GLContext::InitWithPrefixImpl(char const*, bool) [GLContext.cpp:9bde8e92d35906fbace0f07a92a701cb7117bf8d : 853 + 0x0]
...
11  libxul.so!mozilla::dom::HTMLCanvasElement_Binding::getContext(JSContext*, JS::Handle<JSObject*>, mozilla::dom::HTMLCanvasElement*, JSJitMethodCallArgs const&) [HTMLCanvasElementBinding.cpp: : 277 + 0xb]

Which is https://hg.mozilla.org/try/file/d2238accb71d4ca6093fa7562350109bc36d7bc1/gfx/gl/GLContext.cpp#l853

> 852    const auto err = mSymbols.fGetError();
> 853    MOZ_RELEASE_ASSERT(!err); 


Maybe we should back out bug 1484782 from mozilla-beta while this gets ironed out?  These patches look unlikely to be accepted for uplift at this point, but a backout probably would, since the only thing it would break is WebRender on Linux, which has to be manually turned on in about:config.


[*] https://tools.taskcluster.net/groups/Pt_TxpssSn6unB93_bbTjQ/tasks/czbwBhoIRbqA60y0Sx7TjQ/runs/0/artifacts
Status: RESOLVED → REOPENED
Flags: needinfo?(jld)
Resolution: FIXED → ---
(Assignee)

Comment 26

5 months ago
Man, we check !err all over.
No difference with STR #1.  With STR #2, it doesn't crash, but WebGL no longer works in the affected content process, and there's nothing logged to stderr.  (http://vill.ee/eye shows an error in this case; Google Maps appears to quietly fall back to a 2D backend which uses Mercator projection instead of a 3D sphere.)

I pulled the patches from Try and made a debug build with the assertion from comment #21 removed, but it's not much more helpful:

[Child 8419, Main Thread] WARNING: GLContext::InitWithPrefix failed!: file /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp, line 330
[Child 8419, Main Thread] WARNING: Failed to create GLXContext!: file /home/jld/src/gecko-dev/gfx/gl/GLContextProviderGLX.cpp, line 572
Flags: needinfo?(jld)
status-firefox65: fixed → affected
tracking-firefox65: --- → +
Target Milestone: mozilla65 → ---
(Assignee)

Comment 29

4 months ago
I have a repro machine now. There's a bunch of little issues here.
(Assignee)

Comment 30

4 months ago
Yeah, context lost handling needs a big refresh.
I'll try to make a minimal patch for Beta tomorrow.
(Assignee)

Comment 31

4 months ago
I added context loss/restore handling to one of my simple demos:
https://jdashg.github.io/misc/webgl/ccw-point.html

I have local patches that fix this. A minimal patch for beta is TBD.
(Assignee)

Comment 32

4 months ago
Simplify error handling in GLContext.
Modernize context loss handling in GLContext.
Remove various unused parts.
Fix WebGLContext's context loss/restoration.

MozReview-Commit-ID: Lu2hi5HnP8x
(Assignee)

Comment 33

4 months ago
I'm split, because it's hard to feel safe uplifting anything less than this whole patch.
We can't remove the purge-handling patch, because that might expose undefined data to WebGL.
I'll play with some hacks.
(Assignee)

Updated

4 months ago
Blocks: 1301803

Comment 34

4 months ago
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ff40e8ca1e42
Repair CONTEXT_LOST handling. r=lsalzman

Comment 35

4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ff40e8ca1e42
Status: REOPENED → RESOLVED
Last Resolved: 5 months ago4 months ago
status-firefox65: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
status-firefox-esr60: --- → unaffected
(Assignee)

Comment 36

4 months ago
[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: Linux proprietary nvidia gpu drivers will freeze/crash firefox on system suspend/resume from sleep.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: Yes

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Medium

Why is the change risky/not risky? (and alternatives if risky): It's risky in that it's a bunch of relatively delicate code. This patch makes it more robust, but, well, it's just tricky, and I'm recommending that we take only half of the patch we have on Nightly to reduce the risk of touching more of the codebase than we have to.
I have tested this myself though, and we really don't want this issue on Release, *and* we really really shouldn't revert the bug that caused this, so I think we should go for it.

String changes made/needed: none
Attachment #9028425 - Flags: approval-mozilla-beta?

Comment 37

4 months ago
Hello, i previously reported bug 1489354 which seems to be directly related (or dupe of) to this one. After this bug patches was landed behavior i see in firefox with user switching is changed but its still far from being "As good as it was". For example, Firefox no longer hangs completely, instead it just paint black all tabs content untill firefox restart. Good thing is that UI is no longer blocked so now i can at least close firefox gracefully without using xkill or similar tools.
I running latest Nightly 65.0a1 (2018-11-30) build on Ubuntu 18.10 (Xorg) with "stock" Nvidia drivers version 390.87 with layers.acceleration.force-enabled=true and layers.offmainthreadcomposition.enabled=true, Webrender=off
Depends on: 1511508
(In reply to V. Korn from comment #37)
> Hello, i previously reported bug 1489354 which seems to be directly related
> (or dupe of) to this one. After this bug patches was landed behavior i see
> in firefox with user switching is changed but its still far from being "As
> good as it was". For example, Firefox no longer hangs completely, instead it
> just paint black all tabs content untill firefox restart.

I'm seeing the same problem.  I filed a new bug 1511508.
Duplicate of this bug: 1489354
Jeff, should we uplift despite 1511508?  We're building 64.0 rc1 today.
Flags: needinfo?(jgilbert)
(Assignee)

Comment 41

4 months ago
(In reply to Julien Cristau [:jcristau] from comment #40)
> Jeff, should we uplift despite 1511508?  We're building 64.0 rc1 today.

We want this. 1511508 is P3, since accelerated layers on Linux is not officially supported.
Flags: needinfo?(jgilbert)
(Assignee)

Comment 42

4 months ago
Also for time-sensitive things, please email me directly.
Comment on attachment 9028425 [details] [diff] [review]
0001-Bug-1492580-Restore-CONTEXT_LOST-handling.-minimal-r.patch

linux webgl fix, approved for 64.0 rc2.
Attachment #9028425 - Flags: approval-mozilla-beta? → approval-mozilla-release+
Flags: qe-verify+

Comment 44

4 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-release/rev/2360b266d799
status-firefox64: affected → fixed
I tested this issue using the STR from the description on Ubuntu 16.04 x32 with FF 65.0a1 (2018-11-20) and after resuming the OS and focus the WebGL tab (Google maps) I see a black screen and the tab is blocked. I also tested with FF 65.0a1(2018-12-06) and after resuming the OS and visit the WebGL page I see a black screen for ~1 sec and then everything is back to normal, the tab works, I also have this output in the terminal: 
WebGL(0xa7acac00)::ForceLoseContext
WebGL(0xa7acac00)::ForceRestoreContext
WebGL(0xa7acac00)::ForceLoseContext
WebGL(0xa7acac00)::ForceRestoreContext
Please let me know if this is the expected result. Thanks
Flags: needinfo?(jgilbert)
(Assignee)

Comment 46

4 months ago
That's correct, thanks for verifying.
Flags: needinfo?(jgilbert)
(Assignee)

Updated

4 months ago
Status: RESOLVED → VERIFIED
status-firefox65: fixed → verified
I also tested on RC build 3 and I can confirm the fix.
status-firefox64: fixed → verified

Updated

3 months ago
Flags: qe-verify+
Depends on: 1514985
(Assignee)

Updated

2 months ago
No longer depends on: 1514985
(Assignee)

Updated

2 months ago
Blocks: 1514985
No longer blocks: 1514985
Depends on: 1514985
You need to log in before you can comment on or make changes to this bug.