Closed Bug 624935 Opened 14 years ago Closed 13 years ago

SIGSEGV in dri2FlushFrontBuffer/MakeContextCurrent

Categories

(Core :: Graphics, defect)

x86_64
Linux
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: karlt, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [fixed in bug 659842])

Attachments

(1 file)

STR:

1. Put MOZ_GLX_IGNORE_BLACKLIST=1 in the environment.
2. Start firefox -no-remote
3. load http://people.mozilla.com/~sicking/webgl/ray.html
4. Quit

SIGSEGV

#4  <signal handler called>
#5  0x00007f74fc606d27 in dri2FlushFrontBuffer (driDrawable=0x7f74d12179c0, loaderPrivate=0x7f74bf0fdbc0) at dri2_glx.c:460
#6  0x00007f74a5b8de60 in dri_st_framebuffer_flush_front (stfbi=<value optimized out>, statt=-1089479744) at dri_drawable.c:104
#7  0x00007f74a5b8cff4 in dri_unbind_context (cPriv=<value optimized out>) at dri_context.c:148
#8  0x00007f74a5b89aa6 in driUnbindContext (pcp=0x7f74df3e7a00) at ../common/dri_util.c:117
#9  0x00007f74fc60627b in dri2_unbind_context (context=0x7f74d20f4b20, new=0x7f74bf0fdbc0) at dri2_glx.c:172
#10 0x00007f74fc5e0449 in MakeContextCurrent (dpy=0x7f74fa93c000, draw=50332454, read=50332454, gc_user=<value optimized out>) at glxcurrent.c:250
#11 0x00007f7506c2897c in mozilla::gl::GLContextGLX::MakeCurrentImpl (this=0x7f74b29cd800, aForce=0) at /home/karl/moz/dev/gfx/thebes/GLContextProviderGLX.cpp:333
#12 0x00007f7505a124f7 in mozilla::gl::GLContext::MakeCurrent (this=0x7f74b29cd800, aForce=0) at ../../../dist/include/GLContext.h:454
#13 0x00007f7506c0aaee in mozilla::gl::GLContext::MarkDestroyed (this=0x7f74b29cd800) at /home/karl/moz/dev/gfx/thebes/GLContext.cpp:962
#14 0x00007f7506c287ee in mozilla::gl::GLContextGLX::~GLContextGLX (this=0x7f74b29cd800, __in_chrg=<value optimized out>) at /home/karl/moz/dev/gfx/thebes/GLContextProviderGLX.cpp:298
#15 0x00007f7505a124b2 in mozilla::gl::GLContext::Release (this=0x7f74b29cd800) at ../../../dist/include/GLContext.h:393
#16 0x00007f7505a1563b in nsRefPtr<mozilla::gl::GLContext>::assign_assuming_AddRef (this=0x7f75083602c8, newPtr=0x0) at ../../../dist/include/nsAutoPtr.h:957
#17 0x00007f7505a15544 in nsRefPtr<mozilla::gl::GLContext>::assign_with_AddRef (this=0x7f75083602c8, rawPtr=0x0) at ../../../dist/include/nsAutoPtr.h:941
#18 0x00007f7505a1446d in nsRefPtr<mozilla::gl::GLContext>::operator= (this=0x7f75083602c8, rhs=0x0) at ../../../dist/include/nsAutoPtr.h:1025
#19 0x00007f7506c283a9 in mozilla::gl::GLContextProviderGLX::Shutdown () at /home/karl/moz/dev/gfx/thebes/GLContextProviderGLX.cpp:761
#20 0x00007f7506bf20c6 in gfxPlatform::Shutdown () at /home/karl/moz/dev/gfx/thebes/gfxPlatform.cpp:340
#21 0x00007f750676f240 in nsThebesGfxModuleDtor () at /home/karl/moz/dev/gfx/src/thebes/nsThebesGfxFactory.cpp:136
#22 0x00007f7506acd2ed in nsComponentManagerImpl::KnownModule::~KnownModule (this=0x7f74f183e520, __in_chrg=<value optimized out>) at /home/karl/moz/dev/xpcom/components/nsComponentManager.h:204
#23 0x00007f7506ad1121 in nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr (this=0x7f74fa90b488, __in_chrg=<value optimized out>) at ../../dist/include/nsAutoPtr.h:104
#24 0x00007f7506ad0fa1 in nsTArrayElementTraits<nsAutoPtr<nsComponentManagerImpl::KnownModule> >::Destruct (e=0x7f74fa90b488) at ../../dist/include/nsTArray.h:279
#25 0x00007f7506ad09a5 in nsTArray<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayDefaultAllocator>::DestructRange (this=0x7f74f1805280, start=0, count=52) at ../../dist/include/nsTArray.h:1106
#26 0x00007f7506acf9ba in nsTArray<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayDefaultAllocator>::RemoveElementsAt (this=0x7f74f1805280, start=0, count=52) at ../../dist/include/nsTArray.h:834
#27 0x00007f7506ace321 in nsTArray<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayDefaultAllocator>::Clear (this=0x7f74f1805280) at ../../dist/include/nsTArray.h:845
#28 0x00007f7506ac9d5a in nsComponentManagerImpl::Shutdown (this=0x7f74f1805170) at /home/karl/moz/dev/xpcom/components/nsComponentManager.cpp:1018
#29 0x00007f7506a6769c in mozilla::ShutdownXPCOM (servMgr=0x0) at /home/karl/moz/dev/xpcom/build/nsXPComInit.cpp:726
#30 0x00007f7506a670ca in NS_ShutdownXPCOM_P (servMgr=0x7f74f1805178) at /home/karl/moz/dev/xpcom/build/nsXPComInit.cpp:594
#31 0x00007f75052a524c in ScopedXPCOMStartup::~ScopedXPCOMStartup (this=0x7fff6d3a9e30, __in_chrg=<value optimized out>) at /home/karl/moz/dev/toolkit/xre/nsAppRunner.cpp:1117
#32 0x00007f75052adcea in XRE_main (argc=3, argv=0x7fff6d3aa528, aAppData=0x7f74fa9250f0) at /home/karl/moz/dev/toolkit/xre/nsAppRunner.cpp:3725
#33 0x0000000000401f2f in main (argc=3, argv=0x7fff6d3aa528) at /home/karl/moz/dev/browser/app/nsBrowserApp.cpp:158

OpenGL vendor string: X.Org R300 Project
OpenGL renderer string: Gallium 0.4 on RV515
OpenGL version string: 2.1 Mesa 7.9.1
OpenGL shading language version string: 1.20

These messages are output on load of the webgl page:
--- WebGL context created: 0x7f3b82b15000
r300 FP: Compiler Error:
r500_fragprog_emit.c::emit_paired(): emit_alu: Too many instructions
Using a dummy shader instead.

Bug discovered in 0474f6b72e6e (1 changeset before MOZ_GLX_IGNORE_BLACKLIST was added).
Well, there's a reason for MOZ_GLX_IGNORE_BLACKLIST -- this bug should get filed upstream.  The GL impl using a dummy shader is busted; it should just fail to compile.  Even when it uses a dummy shader, it should not crash as a result.
Might be related to
https://bugs.freedesktop.org/review?bug=31940&attachment=40689
(Yet to check with that patch.)
No, that patch doesn't help.
Don't need to shut down to see this.
Opening some tabs using webgl and closing some reproduces.
Summary: SIGSEGV on shutdown in dri2FlushFrontBuffer/MakeContextCurrent → SIGSEGV in dri2FlushFrontBuffer/MakeContextCurrent
Similar stack with

OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD JUNIPER
OpenGL version string: 2.1 Mesa 7.10.2
OpenGL shading language version string: 1.20

#5  0x00007fdd62852787 in dri2FlushFrontBuffer (driDrawable=0x7fdcff161b60, 
    loaderPrivate=0x7fdd213e2400) at dri2_glx.c:460
#6  0x00007fdcfe3e10b0 in ?? () from /usr/lib64/dri/r600_dri.so
#7  0x00007fdcfe3e03f4 in ?? () from /usr/lib64/dri/r600_dri.so
#8  0x00007fdcfe3c4a96 in ?? () from /usr/lib64/dri/r600_dri.so
#9  0x00007fdd62851cdb in dri2_unbind_context (context=0x7fdd297bfa60, new=0x7fdd213e2400)
    at dri2_glx.c:172
#10 0x00007fdd6282be89 in MakeContextCurrent (dpy=0x7fdd6023a000, draw=50337079, read=50337079, 
    gc_user=<value optimized out>) at glxcurrent.c:250
#11 0x00007fdd6c403442 in mozilla::gl::GLContextGLX::MakeCurrentImpl (this=0x7fdcfc1c2000, 
    aForce=0) at /home/karl/moz/dev/gfx/thebes/GLContextProviderGLX.cpp:382
#12 0x00007fdd6b2804fb in mozilla::gl::GLContext::MakeCurrent (this=0x7fdcfc1c2000, aForce=0)
    at ../../../dist/include/GLContext.h:462
#13 0x00007fdd6b27b6f6 in mozilla::WebGLContext::DestroyResourcesAndContext (this=0x7fdd2138f000)
    at /home/karl/moz/dev/content/canvas/src/WebGLContext.cpp:209
#14 0x00007fdd6b27b20a in mozilla::WebGLContext::~WebGLContext (this=0x7fdd2138f000, 
    __in_chrg=<value optimized out>) at /home/karl/moz/dev/content/canvas/src/WebGLContext.cpp:134
#15 0x00007fdd6b27d890 in mozilla::WebGLContext::Release (this=0x7fdd2138f000)
    at /home/karl/moz/dev/content/canvas/src/WebGLContext.cpp:769
IRC log with MostAwesomeDude (radeon/Gallium developer)

[Wednesday 20 April 2011] [16:44:31] <bjacob> MostAwesomeDude: i'm asking because we are already tolerating a few crashes in very complex shaders on Mac, so we could conceivably whitelist r600/gallium if its guaranteed to only happen in really big shaders
[Wednesday 20 April 2011] [16:44:46] Quit smontagu has left this server (Ping timeout).
[Wednesday 20 April 2011] [16:44:52] <bjacob> MostAwesomeDude: also, do you have a link about that bug?
[Wednesday 20 April 2011] [16:45:38] <ehsan> cool
[Wednesday 20 April 2011] [16:46:13] <MostAwesomeDude> bjacob: Nope! It's been a constant discussion (read: I bring it up, nobody replies) which basically revolves around making pipe_context::create_?s_state() return boolean instead of void.
[Wednesday 20 April 2011] [16:46:31] Join smontagu has joined this channel (chatzilla@moz-AE192F20.red.bezeqint.net).
[Wednesday 20 April 2011] [16:46:45] <MostAwesomeDude> Actually, I guess it's changed since then; always returns void*. So it'd be about returning NULL to indicate a failed shader.
[Wednesday 20 April 2011] [16:46:48] Join cdes_ has joined this channel (chatzilla@moz-FC98E910.cpe.net.cable.rogers.com).
[Wednesday 20 April 2011] [16:48:56] <bjacob> MostAwesomeDude: OK. can you reply to my other question about r600? By default I'm going to blacklist all of Gallium altogether because of that.
[Wednesday 20 April 2011] [16:49:24] <bjacob> MostAwesomeDude: for lack of a bug, can you give me a link to an archived ml discussion?
[Wednesday 20 April 2011] [16:49:37] <bjacob> i like to provide links for my blacklist entries
[Wednesday 20 April 2011] [16:52:30] <MostAwesomeDude> http://marc.info/?l=mesa3d-dev&m=127680073328903&w=2 isn't it, but it mentions it. I'm sure I can find a better one.
[Wednesday 20 April 2011] [16:52:38] <bjacob> thanks
[Wednesday 20 April 2011] [16:53:50] <MostAwesomeDude> Ah, hm.
[Wednesday 20 April 2011] [16:54:03] <MostAwesomeDude> Jakob just pointed out that the traceback doesn't even go down that path -- it's in the common DRI code.
[Wednesday 20 April 2011] [16:54:33] <bjacob> what does that mean in non-xorg-developer terms?
[Wednesday 20 April 2011] [16:54:47] Join ccliffe has joined this channel (~ccliffe@moz-CF430726.home3.cgocable.net).
[Wednesday 20 April 2011] [16:55:27] <MostAwesomeDude> It's definitely unrelated. Probably some issue with how the X server and Mesa are disagreeing on buffers somehow.
[Wednesday 20 April 2011] [16:55:50] <MostAwesomeDude> Also, http://marc.info/?l=mesa3d-dev&m=126525088903956&w=2 is the original ML thread; boils down to "we can do this, we probably should do this, but we haven't done it yet."
[Wednesday 20 April 2011] [16:56:12] <bjacob> "It's definitely unrelated." <-- do you mean unrelated to Gallium?
[Wednesday 20 April 2011] [16:57:02] <MostAwesomeDude> It's a problem in the part of Gallium that does DRI stuff, not in the r300 or r600 part.
[Wednesday 20 April 2011] [16:57:15] <bjacob> Ah OK, i get it now
[Wednesday 20 April 2011] [16:57:25] <MostAwesomeDude> Also, it's a known bug, but one that should have been fixed already. Lemme see if I can find it.
[Wednesday 20 April 2011] [16:57:55] <bjacob> so the conclusion is that I really want to blacklist Gallium altogether, not just on r300/r600
In bug 645407 we're now blacklisting Gallium altogether, which fixes this.
Depends on: 645407
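To illustrate what "blacklisting Gallium altogether" amounts to, here is a minimal sketch, not the actual Firefox code. The function name `is_glx_blocked` is hypothetical, and it assumes the override is active whenever MOZ_GLX_IGNORE_BLACKLIST is set to a non-empty value and that a Gallium driver can be recognised by the word "Gallium" in the renderer string, as in the glxinfo output quoted above.

```python
def is_glx_blocked(renderer: str, environ: dict) -> bool:
    """Toy blacklist check: block any renderer that reports Gallium.

    Hypothetical helper for illustration only; the real check in
    Firefox may look at other strings and conditions.
    """
    # Assumption: any non-empty value of the variable disables the blacklist.
    if environ.get("MOZ_GLX_IGNORE_BLACKLIST"):
        return False
    # Matches e.g. "Gallium 0.4 on RV515" and "Gallium 0.4 on AMD JUNIPER".
    return "gallium" in renderer.lower()
```

With this sketch, both the r300g and r600g renderer strings from this bug are blocked unless the environment override from the STR is present.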
I tried it on my RV530 with Ubuntu 11.04, Firefox 4.0.1 and mesa from git and it no longer crashes:

$ MOZ_GLX_IGNORE_BLACKLIST=1 firefox -no-remote
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory
Failed to open VDPAU backend libvdpau_nvidia.so: cannot open shared object file: No such file or directory
r300: DRM version: 2.8.0, Name: ATI RV530, ID: 0x71c5, GB: 1, Z: 2
r300: GART size: 509 MB, VRAM size: 256 MB
r300: AA compression: NO, Z compression: NO, HiZ: NO
r300: DRM version: 2.8.0, Name: ATI RV530, ID: 0x71c5, GB: 1, Z: 2
r300: GART size: 509 MB, VRAM size: 256 MB
r300: AA compression: NO, Z compression: NO, HiZ: NO
Mesa: User error: GL_INVALID_ENUM in glGetIntegerv(pname=GL_MAX_VERTEX_OUTPUT_COMPONENTS)
r300 FP: Compiler Error:
r500_fragprog_emit.c::emit_paired(): emit_alu: Too many instructions
Using a dummy shader instead.
NOTE: child process received `Goodbye', closing down

This (you have to compile Mesa with the --enable-debug flag to show warnings like this) looks like a Firefox bug:
Mesa: User error: GL_INVALID_ENUM in glGetIntegerv(pname=GL_MAX_VERTEX_OUTPUT_COMPONENTS)

My OpenGL info:
OpenGL vendor string: X.Org R300 Project
OpenGL renderer string: Gallium 0.4 on ATI RV530
OpenGL version string: 2.1 Mesa 7.11-devel (git-32a95cb)
OpenGL shading language version string: 1.20

Anyway, if https://bugs.freedesktop.org/show_bug.cgi?id=31940 is related, the fix was merged in mesa master branch (which will be 7.11):
http://cgit.freedesktop.org/mesa/mesa/commit/?id=94ccc31ba4f64ac480137fd90f1ded44d2072f6e

and backported to 7.10 branch (included since 7.10.1):
http://cgit.freedesktop.org/mesa/mesa/commit/?h=7.10&id=e7d1b5489e3ec8e1e63120218efe5fcb72e879d6
Fabio: thanks for checking this, but in comment 5, Karl said he got the crash with Mesa 7.10.2. Also, our understanding (comment 6) was that this was a Gallium bug, not a Mesa bug.
Tried again with Mesa 7.10.3 and webgl.force-enabled true.

I could actually see the demo working correctly, and reloading the page was fine until the previous context was destroyed (during GC I assume).

--- WebGL context created: 0x7fde6732b800
--- WebGL context destroyed: 0x7fde68cfa000
== GLContext 0x7fde69424800 ==
Outstanding Textures:
  [0x7fde6732b800 - live] 2 
Outstanding Buffers:
  [0x7fde6732b800 - live] 3 4 
Outstanding Programs:
  [0x7fde6732b800 - live] 6 
Outstanding Shaders:
  [0x7fde6732b800 - live] 4 5 
Outstanding Framebuffers:
  [0x7fde6732b800 - live] 2 
Outstanding Renderbuffers:
  [0x7fde6732b800 - live] 2
(In reply to comment #11)
> Tried again with Mesa 7.10.3 and webgl.force-enabled true.
> 
> I could actually see the demo working correctly, and reloading the page was
> fine until the previous context was destroyed (during GC I assume).

...and then?

Are you saying that the bug seems to be fixed in Mesa 7.10.3? Did it fail with 7.10.2 on the same machine?
and then same crash as reported here.
Behaviour is probably no different between 7.10.2 and .3.  I just hadn't noticed the relationship to "WebGL context destroyed" before.

I'll try again with the fix in bug 659842.

FWIW, with classic r600 (not disabled by default), the context destruction after reload causes drawing to move to the root window and then an apparently infinite loop of soft GPU lockups even after the firefox process is killed.
linux-2.6.38.7 xorg-server-1.10.2
This (the Gallium issue) is fixed by attachment 539555.
It seems the only thing left to do here is unblacklist Gallium.
Depends on: 659842
Whiteboard: [fixed in bug 659842]
(In reply to comment #14)
> This (the Gallium issue) is fixed by attachment 539555.
> It seems the only thing left to do here is unblacklist Gallium.

...what ?!?!??!

I don't get it. I thought that the Gallium bug was about http://marc.info/?l=mesa3d-dev&m=126525088903956&w=2 i.e. failure to handle correctly shaders that use more texture indirections than supported.

How can it be fixed by a patch that's purely about context destruction?
I expect "Using a dummy shader instead" has little to do with this bug, a crash while unbinding an old context.
Here's the patch unblacklisting Gallium, but can you please explain this in some more detail? I thought, after the above-quoted conversation with MostAwesomeDude, that we had a serious reason to blacklist Gallium, having to do with failure to correctly handle shaders that use too many texture indirections. I still don't understand the connection with the release-context-before-destroy patch.
Attachment #540962 - Flags: review?(karlt)
Yes, I don't think there's a connection.

The crash here was while unbinding a destroyed context during MakeCurrent.
Since the release-context-before-destroy patch, we don't do that.
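
The ordering issue described here can be sketched with a toy model. This is only an illustration of the sequence "destroy while current, then MakeCurrent to something else" versus "release first, then destroy"; `FakeGLX` and `destroy_context` are hypothetical names, and the model does not claim to reproduce real GLX or Mesa semantics.

```python
class FakeGLX:
    """Stand-in for the GLX client library; tracks the current context only."""

    def __init__(self):
        self.current = None
        self.destroyed = set()

    def make_current(self, ctx):
        # Binding a new context first unbinds the old one. In this model,
        # unbinding a context that was already destroyed is the failure.
        if self.current is not None and self.current in self.destroyed:
            raise RuntimeError("unbinding a destroyed context")
        self.current = ctx

    def destroy(self, ctx):
        self.destroyed.add(ctx)


def destroy_context(glx, ctx, release_first):
    """Destroy ctx, optionally releasing it from 'current' beforehand."""
    if release_first and glx.current == ctx:
        glx.make_current(None)  # nothing is current after this call
    glx.destroy(ctx)
```

In this model, calling `destroy_context(glx, ctx, release_first=False)` on the current context makes the next `make_current` raise, while `release_first=True` leaves nothing to unbind and the next `make_current` succeeds.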

https://cvs.khronos.org/svn/repos/registry/trunk/public/webgl/sdk/tests/webgl-conformance-tests.html
completed with

Test Summary (5872 total tests):
Tests PASSED: 5749
Tests FAILED: 123
Tests TIMED OUT: 3

OpenGL renderer string: Gallium 0.4 on AMD JUNIPER
OpenGL version string: 2.1 Mesa 7.10.3

i.e. r600g.

I don't have my RV515 with me to test now, but this looks like a common-code problem, which has now been resolved.  It is no longer a reason to blacklist Gallium drivers in general.
Attachment #540962 - Flags: review?(karlt) → review+
Interesting. Just one last question: you got a successful run with Mesa 7.10.3, but in https://bugs.freedesktop.org/show_bug.cgi?id=37253#c6 Stephane refers to a very recent commit that probably isn't part of Mesa 7.10.3:

http://cgit.freedesktop.org/mesa/mesa/commit/?id=bf69ce37f0dcbb479078ee676d5100ac63e20750

Do you confirm that we don't need this commit?

Also, currently we only require Mesa >= 7.10. Do we need to require specifically >= 7.10.3 ?
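
For reference, a minimum-version requirement such as "Mesa >= 7.10" versus "Mesa >= 7.10.3" could be checked against the "OpenGL version string" roughly as follows. This is a sketch under assumptions: the helper names `mesa_version` and `meets_minimum` are made up, and the string format is taken from the glxinfo output quoted in this bug (e.g. "2.1 Mesa 7.10.3").

```python
import re


def mesa_version(gl_version: str):
    """Extract the Mesa version from a GL version string as a tuple of ints.

    Returns None when the string does not mention Mesa.
    """
    match = re.search(r"Mesa (\d+(?:\.\d+)*)", gl_version)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(gl_version: str, minimum: tuple) -> bool:
    """True if the reported Mesa version is at least `minimum`."""
    found = mesa_version(gl_version)
    # Tuples of ints compare numerically, so (7, 10) sorts after (7, 9, 1).
    return found is not None and found >= minimum
```

Comparing integer tuples rather than raw strings matters here: as text, "7.9.1" would sort after "7.10", which is the wrong answer for this check.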
Yes, that commit is not in 7.10.3, so we don't need that commit to avoid the crash.  Perhaps that commit might mean that we wouldn't crash even without attachment 539555 (but we might get an error - I don't know).  I don't know how to check for leaking graphics memory, so I don't know whether or not the leak affects us.
With
OpenGL renderer string: Gallium 0.4 on ATI RV515
OpenGL version string: 2.1 Mesa 7.10.1

I still get
r300 FP: Compiler Error:
r500_fragprog_emit.c::emit_paired(): emit_alu: Too many instructions
Using a dummy shader instead.

on http://people.mozilla.com/~sicking/webgl/ray.html
and it doesn't render but there is no crash.

I tried the WebGL conformance tests.
One time it completed with

Test Summary (5876 total tests):
Tests PASSED: 5747
Tests FAILED: 129
Tests TIMED OUT: 1

The other time I got
bp-06ab982b-1554-4a61-8083-63f4c2110622

That's not this bug.  Unfortunately I don't know which test caused it, so I
haven't found out how often it happens and I don't know whether that means
this driver is more crashy than others.

So there's a risk enabling Gallium, but there are also crashes in other
drivers and they are still enabled.
I got a couple of similar shader compiling hangs with 7.10.1 r300g (manual ABRT) and crash (SEGV).  They are not reliably reproducible:

hangs
bp-f29a24b1-960f-4696-9804-b6b6e2110622
bp-13a5c692-29e2-4316-a5fb-941a32110622

crash
bp-43fcb8e5-c324-4d0c-970c-0846d2110622

% addr2line -if -e /usr/lib/debug/lib64/libc-2.11.3.so.debug 0x7d8e2
strlen
/var/tmp/portage/sys-libs/glibc-2.11.3/work/glibc-2.11.3/string/../sysdeps/x86_64/strlen.S:31

% addr2line -if -e /usr/lib/debug/lib64/libc-2.11.3.so.debug 0x7d8fb
strlen
/var/tmp/portage/sys-libs/glibc-2.11.3/work/glibc-2.11.3/string/../sysdeps/x86_64/strlen.S:40

% addr2line -if -e /usr/lib/debug/usr/lib64/mesa/r300g_dri.so.debug 0x11734a
ralloc_vasprintf_append
/var/tmp/portage/media-libs/mesa-7.10.1/work/Mesa-7.10.1/src/glsl/ralloc.c:432

Looks like https://bugs.freedesktop.org/show_bug.cgi?id=35603, which AIUI was introduced in 7.10.1 and should be resolved in 7.10.3.
(In reply to comment #21)
> So there's a risk enabling Gallium, but there are also crashes in other
> drivers and they are still enabled.

OK, so at this point there is no definite reason to keep Gallium blacklisted; let's unblacklist it so at least we receive data.
Landed on central:
http://hg.mozilla.org/mozilla-central/rev/8e5753fe4939
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment on attachment 540962
unblacklist gallium

Please give mozilla-aurora approval to this.

The risk seems well-understood as Karl has looked closely at it on multiple systems. While there are some crashes with Gallium, there also are crashes with non-Gallium so if anything, landing this will get us better / more neutral crash data. If we want to use blacklisting to avoid these crashes, then Gallium was the wrong thing to blacklist, and a much more powerful blacklisting change could be to require Mesa 7.11 instead of Mesa 7.10.

If you're wondering why we suddenly make this change, part of the answer is that the recent fixing of bug 659842 removed a lot of noise around GLX crashes and we now have much better data; another part is the recent investigation done by Karl, see above.
Attachment #540962 - Flags: approval-mozilla-aurora?
To put it more briefly: what a blacklisting change like this needs is _Beta_ testing, i.e. where we have millions of testers.
Comment on attachment 540962
unblacklist gallium

this can go through the normal cycle.
Attachment #540962 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-