Closed Bug 1492580 Opened 6 years ago Closed 6 years ago

Infinite loop in GLContext::RawGetErrorAndClear from endless GL_CONTEXT_LOST with Nvidia Linux drivers and suspend/resume

Categories

(Core :: Graphics: CanvasWebGL, defect, P1)

x86_64
Linux
defect

Tracking

()

VERIFIED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox62 --- unaffected
firefox63 --- wontfix
firefox64 + verified
firefox65 + verified

People

(Reporter: jld, Assigned: jgilbert)

References

Details

(Keywords: hang, regression, Whiteboard: [gfx-noted])

Attachments

(3 files, 1 obsolete file)

STR #1:
1. Load a WebGL page like Google Maps
2. Suspend/resume ("echo mem > /sys/power/state" as root, then hit a key)

STR #2:
1. Load a WebGL page
2. Navigate to a non-WebGL page (e.g., rfc-editor.org)
3. Suspend/resume
4. Navigate to a WebGL page

This causes the content process to lock up in the loop in GLContext::RawGetErrorAndClear, and I can reproduce it on official builds with a clean profile, using driver versions 390.77 and 390.87.

Debugging, I can see that GetError is endlessly returning GL_CONTEXT_LOST (0x0507); there's also a function inside the closed-source library returning 0x92bb which happens to be GL_PURGED_CONTEXT_RESET_NV.  I immediately suspected bug 1484782, and verified that that's when this regressed.  (Also, if it matters, the backtrace goes into WebGLRenderingContext_Binding::getError and then JS.)

I don't know OpenGL very well, but from reading documentation on glGetError, I don't think it's supposed to behave like that.
Component: Graphics → Canvas: WebGL
Has Regression Range: --- → yes
Keywords: hang
Priority: -- → P3
Whiteboard: [gfx-noted]
See also:
https://phabricator.kde.org/R108:aefb5f4dd9d4
https://gitlab.gnome.org/GNOME/mutter/commit/d4d2bf0f

I tried checking for GL_CONTEXT_LOST in RawGetErrorAndClear, but I hit a lot of assertions afterwards and I'm not familiar enough with the code to know the proper way of fixing those.
If there isn't a simple fix for this, should we back out bug 1484782 from m-c and beta?  That one affected only WebRender, which isn't enabled by default on any Linux systems yet, but this seems to affect any use of WebGL.
Flags: needinfo?(jgilbert)
I'll take a look tomorrow.
Assignee: nobody → jgilbert
(In reply to Jeff Gilbert [:jgilbert] from comment #3)
> I'll take a look tomorrow.

Jeff, can we get a status update on this one? Thanks
See Also: → 1489354
Here's the extension:
https://www.khronos.org/registry/OpenGL/extensions/NV/NV_robustness_video_memory_purge.txt
Flags: needinfo?(jgilbert)
Priority: P3 → P1
I have a theory. Will test tomorrow.
So the docs are weird here. From some parts of the docs, it looks like it is allowed to keep returning CONTEXT_LOST from glGetError until you check GetGraphicsResetStatus(), and that's what it sounds like it's doing.

I'm setting up a test machine now. I usually just use Mesa's drivers, even for NV these days. (I would encourage you to try them!)
Sorry to barge in. I have probably same issue in https://bugzilla.mozilla.org/show_bug.cgi?id=1489354 - hangs on linux on switching users, not surviving suspend\resume, etc. 
From 64.0a1 2018-10-17 build its got much severe - now hangs occurs randomly during normal browsing session, without any special action mentioned above. Is something changed? My best guess is https://bugzilla.mozilla.org/show_bug.cgi?id=1498070 but only because its only WebGL related fix.
I can't seem to get Linux to resume from suspend on NV binary drivers.
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> I can't seem to get Linux to resume from suspend on NV binary drivers.

If you have patches I can test them, but I understand that that's not a great way to develop.

Incidentally, I tried amdgpu a few months ago and it reliably crashed the X server on resume; I'd be surprised if nouveau did any better, but I haven't actually tried it.
Jeff, given that

1) This is breaking users on release.
2) WebRender won't be turned on for Linux for the foreseeable future.

Would you reconsider comment 2, i.e. backing out that fix so WebRender (which isn't used) breaks but WebGL (which *is* used) works again?
Flags: needinfo?(jgilbert)
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> I can't seem to get Linux to resume from suspend on NV binary drivers.

As Ubuntu user with Nvidia binary drivers i can say, from my experience, that not being able to resume from suspend is abnormal situation (i dont have such problem). Yet, its not uncommon, especially on laptops.
Attached patch bug1492580-workaround-hg0.diff (obsolete) — Splinter Review
I tried poking around in GLContext, a while ago, to see what I could do.  This patch seems to avoid crashing or locking up the process, and instead winds up throwing the error into content.  Google Maps puts up error UI saying to reload the page; a simpler WebGL demo I tried just stopped animating.

Note that, without any of these patches, Google Maps will continue trying to work but its textures will have been replaced with garbage, so this may not be much of a regression.  However, I have no idea what side effects this patch might cause.

It seems to me that a better solution would be to turn on the Nvidia magic just for WebRender's GL context and not contexts used for WebGL, which should be possible given that they're in different processes at the moment, but I don't know what the situation will be after WebGL remoting lands.
I have a potential fix, but I can't repro, so I'll be needinfo-ing people with builds to try tomorrow.
Flags: needinfo?(jgilbert)
[Tracking Requested - why for this release]:
Linux webgl regression shipped in current release. Fix should be considered for a uplift.
Ok, I split the fix into 'hack in a loop limit', and 'improve how we handle context loss'.
Attachment #9023112 - Attachment is obsolete: true
Landing is queued. Once it's in Nightly, I'll ask repro people to try to verify.
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f314c7e261c6
Limit glGetError flush loop and handle CONTEXT_LOST. r=lsalzman
I tried D11270 with a debug build:

[gl:0x7faff66e7000] void mozilla::gl::GLContext::raw_fBindFramebuffer(GLenum, GLuint): Generated unexpected  error. (0x0507)
Hit MOZ_CRASH(Unexpected error with MOZ_GL_DEBUG_ABORT_ON_ERROR. (Run with MOZ_GL_DEBUG_ABORT_ON_ERROR=0 to disable)) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:3043
#01: mozilla::gl::GLScreenBuffer::BindFB(unsigned int) (/home/jld/src/gecko-dev/gfx/gl/GLScreenBuffer.cpp:214)
#02: mozilla::WebGLContext::BindFramebuffer(unsigned int, mozilla::WebGLFramebuffer*) (/home/jld/src/gecko-dev/dom/canvas/WebGLContextGL.cpp:153)
#03: mozilla::dom::WebGLRenderingContext_Binding::bindFramebuffer(JSContext*, JS::Handle<JSObject*>, mozilla::WebGLContext*, JSJitMethodCallArgs const&) (/home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dom/bindings/WebGLRenderingContextBinding.cpp:14045)

Okay, so let's set that env var to not abort on error:

Assertion failure: false (Unreachable.), at /home/jld/src/gecko-dev/dom/canvas/WebGLContext.cpp:1573
#01: mozilla::WebGLContextLossHandler::TimerCallback() (/home/jld/src/gecko-dev/dom/canvas/WebGLContextLossHandler.cpp:100)
#02: mozilla::WatchdogTimerEvent::Notify(nsITimer*) (/home/jld/src/gecko-dev/dom/canvas/WebGLContextLossHandler.cpp:42)
#03: nsTimerImpl::Fire(int) (/home/jld/src/gecko-dev/xpcom/threads/nsTimerImpl.cpp:687)

Finally, a non-debug build:

WebGL(0x7face8e56000)::ForceLoseContext

…followed by a black page and in-content error message, but no crash or hang.  So that's promising.



But, trying the other repro — navigate to a non-WebGL page, suspend+resume, navigate back to a WebGL page — gives me an infinite loop:

625                 while (mGL.fGetError()) {}
(gdb) fin
Run till exit from #0  mozilla::gl::GLContext::LocalErrorScope::GetError (this=0x7ffdccc16c20) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:625

That's the while loop that I deleted in my patch, because I ran into this when I was experimenting earlier, and the comment on fGetError says “you never have to loop on this”.  A little more context from the stack (I can post the whole thing if it would help):

#10 0x00007faceea3850a in mozilla::gl::GLContext::LocalErrorScope::GetError() (this=0x7ffdccc16c20) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:625
#11 0x00007faceea21e95 in mozilla::gl::GLContext::GetPotentialInteger(unsigned int, int*) (this=0x7faccf7f4000, pname=33309, param=<optimized out>) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/GLContext.h:647
#12 0x00007faceea21e95 in mozilla::gl::GLContext::InitExtensions()::$_4::operator()() const (this=<optimized out>) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:1633
#13 0x00007faceea21e95 in mozilla::gl::GLContext::InitExtensions() (this=0x7faccf7f4000) at /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp:1630
...
#23 0x00007facef877094 in mozilla::WebGLContext::SetDimensions(int, int) (this=0x7facd0150800, signedWidth=<optimized out>, signedHeight=<optimized out>) at /home/jld/src/gecko-dev/dom/canvas/WebGLContext.cpp:874
I hate it when I shoot from the hip and miss.

Please try these builds, once ready:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=49f8757f861433ae2920614424063d155fc1238b
Flags: needinfo?(jld)
https://hg.mozilla.org/mozilla-central/rev/f314c7e261c6
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
(In reply to Jeff Gilbert [:jgilbert] from comment #23)
> Just kidding, please try these builds instead:
>  https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=9bde8e92d35906fbace0f07a92a701cb7117bf8d

The debug builds there are broken because MOZ_ASSERT(!err && err == LOCAL_GL_CONTEXT_LOST) should be an ||, which interestingly is the same mistake I made the first time I tried to fix the loop condition in RawGetErrorAndClear.

Using the PGO build[*], the second STR from comment #0 crashes thusly (and I saved the minidump in case it's useful):

Thread 0 (crashed)
 0  libxul.so!mozilla::gl::GLContext::InitWithPrefixImpl(char const*, bool) [GLContext.cpp:9bde8e92d35906fbace0f07a92a701cb7117bf8d : 853 + 0x0]
...
11  libxul.so!mozilla::dom::HTMLCanvasElement_Binding::getContext(JSContext*, JS::Handle<JSObject*>, mozilla::dom::HTMLCanvasElement*, JSJitMethodCallArgs const&) [HTMLCanvasElementBinding.cpp: : 277 + 0xb]

Which is https://hg.mozilla.org/try/file/d2238accb71d4ca6093fa7562350109bc36d7bc1/gfx/gl/GLContext.cpp#l853

> 852    const auto err = mSymbols.fGetError();
> 853    MOZ_RELEASE_ASSERT(!err); 


Maybe we should back out bug 1484782 from mozilla-beta while this gets ironed out?  These patches look unlikely to be accepted for uplift at this point, but a backout probably would, since the only thing it would break is WebRender on Linux, which has to be manually turned on in about:config.


[*] https://tools.taskcluster.net/groups/Pt_TxpssSn6unB93_bbTjQ/tasks/czbwBhoIRbqA60y0Sx7TjQ/runs/0/artifacts
Status: RESOLVED → REOPENED
Flags: needinfo?(jld)
Resolution: FIXED → ---
Man, we check !err all over.
No difference with STR #1.  With STR #2, it doesn't crash, but WebGL no longer works in the affected content process, and there's nothing logged to stderr.  (http://vill.ee/eye shows an error in this case; Google Maps appears to quietly fall back to a 2D backend which uses Mercator projection instead of a 3D sphere.)

I pulled the patches from Try and made a debug build with the assertion from comment #21 removed, but it's not much more helpful:

[Child 8419, Main Thread] WARNING: GLContext::InitWithPrefix failed!: file /home/jld/src/gecko-dev/gfx/gl/GLContext.cpp, line 330
[Child 8419, Main Thread] WARNING: Failed to create GLXContext!: file /home/jld/src/gecko-dev/gfx/gl/GLContextProviderGLX.cpp, line 572
Flags: needinfo?(jld)
Target Milestone: mozilla65 → ---
I have a repro machine now. There's a bunch of little issues here.
Yeah, context lost handling needs a big refresh.
I'll try to make a minimal patch for Beta tomorrow.
I added context loss/restore handling to one of my simple demos:
https://jdashg.github.io/misc/webgl/ccw-point.html

I have local patches that fix this. A minimal patch for beta is TBD.
Simplify error handling in GLContext.
Modernize context loss handling in GLContext.
Remove various unused parts.
Fix WebGLContext's context loss/restoration.

MozReview-Commit-ID: Lu2hi5HnP8x
I'm split, because it's hard to feel safe uplifting anything less than this whole patch.
We can't remove the purge-handling patch, because that might expose undefined data to WebGL.
I'll play with some hacks.
Blocks: 1301803
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ff40e8ca1e42
Repair CONTEXT_LOST handling. r=lsalzman
https://hg.mozilla.org/mozilla-central/rev/ff40e8ca1e42
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: Linux proprietary nvidia gpu drivers will freeze/crash firefox on system suspend/resume from sleep.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: Yes

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Medium

Why is the change risky/not risky? (and alternatives if risky): It's risky in that it's a bunch of relatively delicate code. This patch makes it more robust, but, well, it's just tricky, and I'm recommending that we take only half of the patch we have on Nightly to reduce the risk of touching more of the codebase than we have to.
I have tested this myself though, and we really don't want this issue on Release, *and* we really really shouldn't revert the bug that caused this, so I think we should go for it.

String changes made/needed: none
Attachment #9028425 - Flags: approval-mozilla-beta?
Hello, i previously reported bug 1489354 which seems to be directly related (or dupe of) to this one. After this bug patches was landed behavior i see in firefox with user switching is changed but its still far from being "As good as it was". For example, Firefox no longer hangs completely, instead it just paint black all tabs content untill firefox restart. Good thing is that UI is no longer blocked so now i can at least close firefox gracefully without using xkill or similar tools.
I running latest Nightly 65.0a1 (2018-11-30) build on Ubuntu 18.10 (Xorg) with "stock" Nvidia drivers version 390.87 with layers.acceleration.force-enabled=true and layers.offmainthreadcomposition.enabled=true, Webrender=off
Depends on: 1511508
(In reply to V. Korn from comment #37)
> Hello, i previously reported bug 1489354 which seems to be directly related
> (or dupe of) to this one. After this bug patches was landed behavior i see
> in firefox with user switching is changed but its still far from being "As
> good as it was". For example, Firefox no longer hangs completely, instead it
> just paint black all tabs content untill firefox restart.

I'm seeing the same problem.  I filed a new bug 1511508.
Jeff, should we uplift despite 1511508?  We're building 64.0 rc1 today.
Flags: needinfo?(jgilbert)
(In reply to Julien Cristau [:jcristau] from comment #40)
> Jeff, should we uplift despite 1511508?  We're building 64.0 rc1 today.

We want this. 1511508 is P3, since accelerated layers on Linux is not officially supported.
Flags: needinfo?(jgilbert)
Also for time-sensitive things, please email me directly.
Comment on attachment 9028425 [details] [diff] [review]
0001-Bug-1492580-Restore-CONTEXT_LOST-handling.-minimal-r.patch

linux webgl fix, approved for 64.0 rc2.
Attachment #9028425 - Flags: approval-mozilla-beta? → approval-mozilla-release+
Flags: qe-verify+
I tested this issue using the STR from the description on Ubuntu 16.04 x32 with FF 65.0a1 (2018-11-20) and after resuming the OS and focus the WebGL tab (Google maps) I see a black screen and the tab is blocked. I also tested with FF 65.0a1(2018-12-06) and after resuming the OS and visit the WebGL page I see a black screen for ~1 sec and then everything is back to normal, the tab works, I also have this output in the terminal: 
WebGL(0xa7acac00)::ForceLoseContext
WebGL(0xa7acac00)::ForceRestoreContext
WebGL(0xa7acac00)::ForceLoseContext
WebGL(0xa7acac00)::ForceRestoreContext
Please let me know if this is the expected result. Thanks
Flags: needinfo?(jgilbert)
That's correct, thanks for verifying.
Flags: needinfo?(jgilbert)
Status: RESOLVED → VERIFIED
I also tested on RC build 3 and I can confirm the fix.
Flags: qe-verify+
Depends on: 1514985
See Also: → 1515253
No longer depends on: 1514985
Blocks: 1514985
No longer blocks: 1514985
Depends on: 1514985
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: