Closed Bug 632969 Opened 12 years ago Closed 12 years ago

X error handling fixes

Categories

(Core :: Graphics, defect)

All
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: bjacob, Assigned: bjacob)

References

Details

(Whiteboard: [hardblocker])

Attachments

(7 files)

This patch does 2 things:
 * it improves ScopedXErrorHandler
 * it improves the way it's used in GLContextProviderGLX.cpp

This comes mostly from bug 616416 which Karl already r-'d with comments, so I tried to apply his comments here; the other new change is that I'm now putting the glXQueryServerString call inside of the ScopedXErrorHandler scope, in an attempt to fix the very large number of crashes on X errors that we're seeing in beta 11 on linux. Indeed, since we're blacklisting most people, and I doubt that many people define MOZ_GLX_IGNORE_BLACKLIST, my guess is that the glXQueryServerString call is what is causing the X errors.
Attachment #511180 - Flags: review?(karlt)
Also, Karl: to be clear, I am the opposite of a X11 expert. I don't feel competent to precisely map X error crash reports to patches that fix them. So if you see a different thing to do here given the reports we're getting --- please provide the improved patch.
Attachment #511202 - Flags: review?(bjacob) → review+
I think we can make progress here by breaking the patch up into each separate change and identify what each part is fixing.

Whenever a workaround is added, it is important to know what the workaround is for so that we can know when the workaround is no-longer necessary.  I don't want this to get into a situation where people are afraid to change the code because they don't know why it is doing something a particular way.

(In reply to comment #0)
> the other new change is that I'm now putting
> the glXQueryServerString call inside of the ScopedXErrorHandler scope, in an
> attempt to fix the very large number of crashes on X errors that we're seeing
> in beta 11 on linux. Indeed, since we're blacklisting most people, and I doubt
> that many people define MOZ_GLX_IGNORE_BLACKLIST, my guess is that the
> glXQueryServerString call is what is causing the X errors.

Can you point me at some of the reports, please?
Comment on attachment 511202 [details] [diff] [review]
reset error variable before trying again with no sharing

>-          printf("[GLX] currently only allowing the NVIDIA proprietary driver, as other drivers are giving too many crashes. "
>-                 "To bypass this, define the MOZ_GLX_IGNORE_BLACKLIST environment variable.\n");
>+          printf("[GLX] your GL driver is currently blocked, define the MOZ_GLX_IGNORE_BLACKLIST "
>+                 "environment variable to bypass this.\n");

The "GL driver" wording change looks good.
Either use a full stop '.' and start a new sentence at "define", or use a semi-colon ";" after blocked.
Comment on attachment 511180 [details] [diff] [review]
fix X error handling

I like the merging of shareContext and !shareContext error checking.

>         context = sGLXLibrary.xCreateNewContext(display,
>                                                 cfg,
>                                                 GLX_RGBA_TYPE,
>                                                 shareContext ? shareContext->mContext : NULL,
>                                                 True);
> 
>-        if (context) {
>+        if (!error) {

There is a logic error here, though.  I assume you still need to check context.

>+            error |= !glContext->Init() || xErrorHandler.SyncAndGetError(display);

It's a bit unconventional to use C++ logical operators to control function execution, but I think I'm OK with that.
Attachment #511180 - Flags: review?(karlt) → review-
(In reply to comment #4)
> I think we can make progress here by breaking the patch up into each separate
> change and identify what each part is fixing.
> 
> Whenever a workaround is added, it is important to know what the workaround is
> for so that we can know when the workaround is no-longer necessary.  I don't
> want this to get into a situation where people are afraid to change the code
> because they don't know why it is doing something a particular way.

OK, I agree.

So, there are two things that we need to justify here:

  1. why we use this X error handler here and not elsewhere (justify the precise scope of this ScopedXErrorHandler)
  2. why we do this SyncAndGetError call here and not elsewhere

For 1., the code in GLContextGLX::CreateGLContext is known to be prone to generate X errors, and our default X error handler crashes the application upon X errors, so we need a ScopedXErrorHandler here. With today's patches, the scope of this ScopedXErrorHandler is just the scope of this function, which is natural.

For 2., there is only one SyncAndGetError left, at the end of one try. My rationalization for that is that, since we're doing two tries, we need to really know whether the first one failed before trying again. I am no longer doing this to work around driver crashes, since driver crashes are now worked around by the blacklist. If you think that this SyncAndGetError is unnecessary, feel free to remove it, but that then makes me uncomfortable as we'd then completely ignore X errors --- would that be OK? Note that there's another SyncAndGetError call in CreateOffscreenPixmapContext.

> 
> (In reply to comment #0)
> > the other new change is that I'm now putting
> > the glXQueryServerString call inside of the ScopedXErrorHandler scope, in an
> > attempt to fix the very large number of crashes on X errors that we're seeing
> > in beta 11 on linux. Indeed, since we're blacklisting most people, and I doubt
> > that many people define MOZ_GLX_IGNORE_BLACKLIST, my guess is that the
> > glXQueryServerString call is what is causing the X errors.
> 
> Can you point me at some of the reports, please?

Sure, here's a query for top linux browser crashes in beta 11:

https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A4.0b11&platform=linux&range_value=1&range_unit=weeks&date=02%2F09%2F2011+14%3A49%3A55&query_search=signature&query_type=exact&query=&build_id=&process_type=browser&hang_type=crash&do_query=1

The linux-gate.so crashes are the ones I'm talking about.

For example:

https://crash-stats.mozilla.com/report/index/c61d1ba5-106d-43be-81cb-409e82110209
https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209
(In reply to comment #7)
> With today's patches, the
> scope of this ScopedXErrorHandler is just the scope of this function, which is
> natural.

In general, I'd prefer to only ignore errors in the functions that are
expected to generate errors, or at least keep the scope as small as practical.
Otherwise it makes it hard to find the initial cause of subsequent errors.

> For 2., there is only one SyncAndGetError left, at the end of one try. My
> rationalization for that is that, since we're doing two tries, we need to
> really know whether the first one failed before trying again. I am no longer
> doing this to work around driver crashes, since driver crashes are now worked
> around by the blacklist. If you think that this SyncAndGetError is unnecessary,
> feel free to remove it, but that then makes me uncomfortable as we'd then
> completely ignore X errors --- would that be OK?

X errors are usually, but not always, programming errors.

BadAlloc errors are a specific kind error that are often not programming
errors, and these functions (glXCreateNewContext and glXMakeCurrent) may
generate BadAlloc errors.  Bug 613079 describes a reason why BadAlloc might be
likely to happen (though I don't see BadAlloc mentioned in the bug), so
keeping the check for errors here makes sense.

I don't really know enough about whether there are differences in resources
available for shared and non-shared contexts, so I can't really comment on
whether a SyncAndGetErrors for each try is worthwhile.

> https://crash-stats.mozilla.com/report/index/c61d1ba5-106d-43be-81cb-409e82110209

This is bug 576933, which is not relevant here, but we may want to silence it
so that we can find the more serious errors.

> https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209

> X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error)xpcom_runtime_abort(###!!! ABORT: X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error): file /build/buildd/firefox-4.0~b11+build3+nobinonly/build-tree/mozilla/toolkit/xre/nsX11ErrorHandler.cpp, line 190)

> 0 	linux-gate.so 	linux-gate.so@0x416 	
> 1 	libc-2.12.2.so 	libc-2.12.2.so@0x2e18d 	
> 2 	libmozalloc.so 	mozalloc_abort 	mozalloc_abort.cpp:75
> 3 		@0x1a0bf8f 	
> 4 	libpthread-2.12.2.so 	libpthread-2.12.2.so@0x8336 	
> 5 	libpthread-2.12.2.so 	libpthread-2.12.2.so@0x9778 	

This looks like an error from a call on a non-main thread (which our handler handles badly).  If CreateGLContext is called on the main thread then this is unlikely to be related.

I don't see libGLcore.so.* or libnvidia-* in the modules list, which I assume
implies that this is not happening with NVIDIA drivers.

I see libEGL.so.1.0 loaded.  I don't know what that implies.
Blocks: 589546
First of all, thanks for the great explanation!

(In reply to comment #8)
> (In reply to comment #7)
> > With today's patches, the
> > scope of this ScopedXErrorHandler is just the scope of this function, which is
> > natural.
> 
> In general, I'd prefer to only ignore errors in the functions that are
> expected to generate errors, or at least keep the scope as small as practical.
> Otherwise it makes it hard to find the initial cause of subsequent errors.

OK, I understand. But here, understanding precisely which functions are causing which errors would be a little complicated as far as I can see. The best way that I could see, you be to temporarily (in nightlies) introduce more SyncAndGetError calls and annotate crash reports with these errors. Then the crash-stats reports would start giving us the info we seem to need. Meanwhile, I would like the current solution (ScopedXErrorHandler on the whole function scope) to be pushed quickly so we make crashes go away for the firefox 4.0 release.

> 
> > For 2., there is only one SyncAndGetError left, at the end of one try. My
> > rationalization for that is that, since we're doing two tries, we need to
> > really know whether the first one failed before trying again. I am no longer
> > doing this to work around driver crashes, since driver crashes are now worked
> > around by the blacklist. If you think that this SyncAndGetError is unnecessary,
> > feel free to remove it, but that then makes me uncomfortable as we'd then
> > completely ignore X errors --- would that be OK?
> 
> X errors are usually, but not always, programming errors.
> 
> BadAlloc errors are a specific kind error that are often not programming
> errors, and these functions (glXCreateNewContext and glXMakeCurrent) may
> generate BadAlloc errors.  Bug 613079 describes a reason why BadAlloc might be
> likely to happen (though I don't see BadAlloc mentioned in the bug), so
> keeping the check for errors here makes sense.
> 
> I don't really know enough about whether there are differences in resources
> available for shared and non-shared contexts, so I can't really comment on
> whether a SyncAndGetErrors for each try is worthwhile.

Hm.. me neither, actually. I think that the shared context only has textures. Vlad knows that stuff.

> 
> > https://crash-stats.mozilla.com/report/index/c61d1ba5-106d-43be-81cb-409e82110209
> 
> This is bug 576933, which is not relevant here, but we may want to silence it
> so that we can find the more serious errors.
> 
> > https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209
> 
> > X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error)xpcom_runtime_abort(###!!! ABORT: X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error): file /build/buildd/firefox-4.0~b11+build3+nobinonly/build-tree/mozilla/toolkit/xre/nsX11ErrorHandler.cpp, line 190)
> 
> > 0 	linux-gate.so 	linux-gate.so@0x416 	
> > 1 	libc-2.12.2.so 	libc-2.12.2.so@0x2e18d 	
> > 2 	libmozalloc.so 	mozalloc_abort 	mozalloc_abort.cpp:75
> > 3 		@0x1a0bf8f 	
> > 4 	libpthread-2.12.2.so 	libpthread-2.12.2.so@0x8336 	
> > 5 	libpthread-2.12.2.so 	libpthread-2.12.2.so@0x9778 	
> 
> This looks like an error from a call on a non-main thread (which our handler
> handles badly).  If CreateGLContext is called on the main thread then this is
> unlikely to be related.

We don't do any GL or GLX calls from non-main-thread. The GL debug mode (MOZ_GL_DEBUG) even checks for that, not that that would be inherently wrong, but at least that allows to guarantee that that's currently true.

> I don't see libGLcore.so.* or libnvidia-* in the modules list, which I assume
> implies that this is not happening with NVIDIA drivers.

Indeed. In Firefox 5 we'll implement GfxInfo on X11 which will make it easier to check that (we'll annotate crash reports like we do on windows and mac).

> 
> I see libEGL.so.1.0 loaded.  I don't know what that implies.

Me neither.
Attachment #511466 - Flags: review?(karlt)
Attached patch scope changeSplinter Review
This makes the ScopedXErrorHandler have the whole function scope, as discussed above ideally we'd want something more fine-grained, but for the present purpose of quickly fixing this crasher, this has the highest probability of working.
Attachment #511469 - Flags: review?(karlt)
Attached patch logic changeSplinter Review
This is the only part that's not directly taken from earlier patch. I think that this makes the logic simpler.
Attachment #511470 - Flags: review?(karlt)
Blocks: 622127
Attachment #511466 - Flags: review?(karlt) → review+
Blocks: 632867
> > > https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209
> > 
> > > X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error)xpcom_runtime_abort(###!!! ABORT: X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error): file /build/buildd/firefox-4.0~b11+build3+nobinonly/build-tree/mozilla/toolkit/xre/nsX11ErrorHandler.cpp, line 190)

> > This looks like an error from a call on a non-main thread (which our handler
> > handles badly).  If CreateGLContext is called on the main thread then this is
> > unlikely to be related.
> 
> We don't do any GL or GLX calls from non-main-thread. The GL debug mode
> (MOZ_GL_DEBUG) even checks for that, not that that would be inherently wrong,
> but at least that allows to guarantee that that's currently true.

Thanks.  I must have misinterpreted the bad stack and jumped to conclusions too early.

There's a good stack in bug 632867 comment 3 indicating that happens in 
glXQueryVersion << mozilla::gl::GLXLibrary::EnsureInitialized.
This patch applies on top of the previous ones. It fixes GLXLibrary::EnsureInitialized by using a ScopedXErrorHandler there just around the GLX calls, to fix bug 632867. Also, since we were already querying the vendor string there, I moved the blacklisting code there.
Attachment #511502 - Flags: review?(karlt)
blocking2.0: --- → ?
Attachment #511470 - Flags: review?(karlt) → review+
Comment on attachment 511502 [details] [diff] [review]
fix GLXLibrary::EnsureInitialized

Can you make sure the change has a commit message some like
"Catch unexplained X errors in Mesa during the first GLX call", please?

I would have attached this to bug 632867 because that is what this is
fixing and I expect that will be a blocking bug.
It doesn't make so much sense to give blocking status to a bug containing
several patches fixing different issues.

>+    const char *vendor = nsnull;
>+    const char *serverVersionStr = nsnull;
>+    const char *extensionsStr = nsnull;

Don't initialize these here.  They are initialized below.

>+        if (xErrorHandler.SyncAndGetError(display))

It's not necessary to add another round trip with the XSync here because the
requests are synchronous requests.

How about making ScopedXErrorHandler::mXError public and looking at that,
or adding a GetError method?

>+        // GLX calls are prone to generating X errors

These comments are too vague to be helpful.
Attachment #511502 - Flags: review?(karlt)
Attachment #511502 - Flags: review-
Attachment #511502 - Flags: feedback+
Comment on attachment 511469 [details] [diff] [review]
scope change

In the end, all this is doing is extending the scope around the glXGetFBConfigAttrib call, but we've already managed to init the glx library and select the config.

glXGetFBConfigAttrib does not make any server requests, so there can be no reason to include it in the scope.
Attachment #511469 - Flags: review?(karlt) → review-
(In reply to comment #16)
> Comment on attachment 511469 [details] [diff] [review]
> scope change
> 
> In the end, all this is doing is extending the scope around the
> glXGetFBConfigAttrib call, but we've already managed to init the glx library
> and select the config.
> 
> glXGetFBConfigAttrib does not make any server requests, so there can be no
> reason to include it in the scope.

Ah ok. Thanks for the explanation!
blocking2.0: ? → betaN+
Whiteboard: [hardblocker]
Backed out due to test failures

http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80211f053c46
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: nobody → bjacob
(In reply to comment #19)
> Backed out due to test failures
> 
> http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80211f053c46

Ah this seems to have been attached to the wrong bug.  The backout changeset has nothing to do with check-ins for this bug.
(In reply to comment #20)
> (In reply to comment #19)
> > Backed out due to test failures
> > 
> > http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80211f053c46
> 
> Ah this seems to have been attached to the wrong bug.  The backout changeset
> has nothing to do with check-ins for this bug.

Hmm. Something seems wrong here with the backout or hg or something.  The individual changesets, if you click on expand in the pushlog, seem to be for this bug, but the overall merge link seems to be for an entirely different, totally unrelated, check-in.
Re-landed:
http://hg.mozilla.org/mozilla-central/rev/895e4d18639a
http://hg.mozilla.org/mozilla-central/rev/0c4dc6d59c54
http://hg.mozilla.org/mozilla-central/rev/f3869889789c
http://hg.mozilla.org/mozilla-central/rev/65ea4af6833a

This had indeed nothing to do with the failure, it just was part of the same
push.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
(In reply to comment #21)
> Hmm. Something seems wrong here with the backout or hg or something.  The
> individual changesets, if you click on expand in the pushlog, seem to be for
> this bug, but the overall merge link seems to be for an entirely different,
> totally unrelated, check-in.

I wonder whether default was merged into the backout instead of the backout being merged into default.
You need to log in before you can comment on or make changes to this bug.