Closed Bug 632969 Opened 15 years ago Closed 15 years ago

X error handling fixes

Tracking

()

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

blocking2.0

---

betaN+

People

(Reporter: bjacob, Assigned: bjacob)

References

Details

(Whiteboard: [hardblocker])

Attachments

(7 files)

fix X error handling 15 years ago Benoit Jacob [:bjacob] (mostly away) 6.25 KB, patch	karlt : review-	Details \| Diff \| Splinter Review
make SyncAndGetError return the first error 15 years ago Karl Tomlinson (:karlt) 2.06 KB, patch	karlt : review+	Details \| Diff \| Splinter Review
reset error variable before trying again with no sharing 15 years ago Karl Tomlinson (:karlt) 1.14 KB, patch	bjacob : review+	Details \| Diff \| Splinter Review
message wording change 15 years ago Benoit Jacob [:bjacob] (mostly away) 1.29 KB, patch	karlt : review+	Details \| Diff \| Splinter Review
scope change 15 years ago Benoit Jacob [:bjacob] (mostly away) 1.65 KB, patch	karlt : review-	Details \| Diff \| Splinter Review
logic change 15 years ago Benoit Jacob [:bjacob] (mostly away) 1.75 KB, patch	karlt : review+	Details \| Diff \| Splinter Review
fix GLXLibrary::EnsureInitialized 15 years ago Benoit Jacob [:bjacob] (mostly away) 4.45 KB, patch	karlt : review- karlt : feedback+	Details \| Diff \| Splinter Review

Benoit Jacob [:bjacob] (mostly away)

Assignee

Description

•

15 years ago

Attached patch fix X error handling — Details — Splinter Review

This patch does 2 things: * it improves ScopedXErrorHandler * it improves the way it's used in GLContextProviderGLX.cpp This comes mostly from bug 616416 which Karl already r-'d with comments, so I tried to apply his comments here; the other new change is that I'm now putting the glXQueryServerString call inside of the ScopedXErrorHandler scope, in an attempt to fix the very large number of crashes on X errors that we're seeing in beta 11 on linux. Indeed, since we're blacklisting most people, and I doubt that many people define MOZ_GLX_IGNORE_BLACKLIST, my guess is that the glXQueryServerString call is what is causing the X errors.

Attachment #511180 - Flags: review?(karlt)

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 1

•

15 years ago

Also, Karl: to be clear, I am the opposite of a X11 expert. I don't feel competent to precisely map X error crash reports to patches that fix them. So if you see a different thing to do here given the reports we're getting --- please provide the improved patch.

Karl Tomlinson (:karlt)

Comment 2

•

15 years ago

Attached patch make SyncAndGetError return the first error — Details — Splinter Review

Attachment #511196 - Flags: review+

Karl Tomlinson (:karlt)

Comment 3

•

15 years ago

Attached patch reset error variable before trying again with no sharing — Details — Splinter Review

Attachment #511202 - Flags: review?(bjacob)

Benoit Jacob [:bjacob] (mostly away)

Assignee

Updated

•

15 years ago

Attachment #511202 - Flags: review?(bjacob) → review+

Karl Tomlinson (:karlt)

Comment 4

•

15 years ago

I think we can make progress here by breaking the patch up into each separate change and identify what each part is fixing. Whenever a workaround is added, it is important to know what the workaround is for so that we can know when the workaround is no-longer necessary. I don't want this to get into a situation where people are afraid to change the code because they don't know why it is doing something a particular way. (In reply to comment #0) > the other new change is that I'm now putting > the glXQueryServerString call inside of the ScopedXErrorHandler scope, in an > attempt to fix the very large number of crashes on X errors that we're seeing > in beta 11 on linux. Indeed, since we're blacklisting most people, and I doubt > that many people define MOZ_GLX_IGNORE_BLACKLIST, my guess is that the > glXQueryServerString call is what is causing the X errors. Can you point me at some of the reports, please?

Karl Tomlinson (:karlt)

Comment 5

•

15 years ago

Comment on attachment 511202 [details] [diff] [review] reset error variable before trying again with no sharing >- printf("[GLX] currently only allowing the NVIDIA proprietary driver, as other drivers are giving too many crashes. " >- "To bypass this, define the MOZ_GLX_IGNORE_BLACKLIST environment variable.\n"); >+ printf("[GLX] your GL driver is currently blocked, define the MOZ_GLX_IGNORE_BLACKLIST " >+ "environment variable to bypass this.\n"); The "GL driver" wording change looks good. Either use a full stop '.' and start a new sentence at "define", or use a semi-colon ";" after blocked.

Karl Tomlinson (:karlt)

Comment 6

•

15 years ago

Comment on attachment 511180 [details] [diff] [review] fix X error handling I like the merging of shareContext and !shareContext error checking. > context = sGLXLibrary.xCreateNewContext(display, > cfg, > GLX_RGBA_TYPE, > shareContext ? shareContext->mContext : NULL, > True); > >- if (context) { >+ if (!error) { There is a logic error here, though. I assume you still need to check context. >+ error |= !glContext->Init() || xErrorHandler.SyncAndGetError(display); It's a bit unconventional to use C++ logical operators to control function execution, but I think I'm OK with that.

Attachment #511180 - Flags: review?(karlt) → review-

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 7

•

15 years ago

(In reply to comment #4) > I think we can make progress here by breaking the patch up into each separate > change and identify what each part is fixing. > > Whenever a workaround is added, it is important to know what the workaround is > for so that we can know when the workaround is no-longer necessary. I don't > want this to get into a situation where people are afraid to change the code > because they don't know why it is doing something a particular way. OK, I agree. So, there are two things that we need to justify here: 1. why we use this X error handler here and not elsewhere (justify the precise scope of this ScopedXErrorHandler) 2. why we do this SyncAndGetError call here and not elsewhere For 1., the code in GLContextGLX::CreateGLContext is known to be prone to generate X errors, and our default X error handler crashes the application upon X errors, so we need a ScopedXErrorHandler here. With today's patches, the scope of this ScopedXErrorHandler is just the scope of this function, which is natural. For 2., there is only one SyncAndGetError left, at the end of one try. My rationalization for that is that, since we're doing two tries, we need to really know whether the first one failed before trying again. I am no longer doing this to work around driver crashes, since driver crashes are now worked around by the blacklist. If you think that this SyncAndGetError is unnecessary, feel free to remove it, but that then makes me uncomfortable as we'd then completely ignore X errors --- would that be OK? Note that there's another SyncAndGetError call in CreateOffscreenPixmapContext. > > (In reply to comment #0) > > the other new change is that I'm now putting > > the glXQueryServerString call inside of the ScopedXErrorHandler scope, in an > > attempt to fix the very large number of crashes on X errors that we're seeing > > in beta 11 on linux. Indeed, since we're blacklisting most people, and I doubt > > that many people define MOZ_GLX_IGNORE_BLACKLIST, my guess is that the > > glXQueryServerString call is what is causing the X errors. > > Can you point me at some of the reports, please? Sure, here's a query for top linux browser crashes in beta 11: https://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A4.0b11&platform=linux&range_value=1&range_unit=weeks&date=02%2F09%2F2011+14%3A49%3A55&query_search=signature&query_type=exact&query=&build_id=&process_type=browser&hang_type=crash&do_query=1 The linux-gate.so crashes are the ones I'm talking about. For example: https://crash-stats.mozilla.com/report/index/c61d1ba5-106d-43be-81cb-409e82110209 https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209

Karl Tomlinson (:karlt)

Comment 8

•

15 years ago

(In reply to comment #7) > With today's patches, the > scope of this ScopedXErrorHandler is just the scope of this function, which is > natural. In general, I'd prefer to only ignore errors in the functions that are expected to generate errors, or at least keep the scope as small as practical. Otherwise it makes it hard to find the initial cause of subsequent errors. > For 2., there is only one SyncAndGetError left, at the end of one try. My > rationalization for that is that, since we're doing two tries, we need to > really know whether the first one failed before trying again. I am no longer > doing this to work around driver crashes, since driver crashes are now worked > around by the blacklist. If you think that this SyncAndGetError is unnecessary, > feel free to remove it, but that then makes me uncomfortable as we'd then > completely ignore X errors --- would that be OK? X errors are usually, but not always, programming errors. BadAlloc errors are a specific kind error that are often not programming errors, and these functions (glXCreateNewContext and glXMakeCurrent) may generate BadAlloc errors. Bug 613079 describes a reason why BadAlloc might be likely to happen (though I don't see BadAlloc mentioned in the bug), so keeping the check for errors here makes sense. I don't really know enough about whether there are differences in resources available for shared and non-shared contexts, so I can't really comment on whether a SyncAndGetErrors for each try is worthwhile. > https://crash-stats.mozilla.com/report/index/c61d1ba5-106d-43be-81cb-409e82110209 This is bug 576933, which is not relevant here, but we may want to silence it so that we can find the more serious errors. > https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209 > X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error)xpcom_runtime_abort(###!!! ABORT: X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error): file /build/buildd/firefox-4.0~b11+build3+nobinonly/build-tree/mozilla/toolkit/xre/nsX11ErrorHandler.cpp, line 190) > 0 linux-gate.so linux-gate.so@0x416 > 1 libc-2.12.2.so libc-2.12.2.so@0x2e18d > 2 libmozalloc.so mozalloc_abort mozalloc_abort.cpp:75 > 3 @0x1a0bf8f > 4 libpthread-2.12.2.so libpthread-2.12.2.so@0x8336 > 5 libpthread-2.12.2.so libpthread-2.12.2.so@0x9778 This looks like an error from a call on a non-main thread (which our handler handles badly). If CreateGLContext is called on the main thread then this is unlikely to be related. I don't see libGLcore.so.* or libnvidia-* in the modules list, which I assume implies that this is not happening with NVIDIA drivers. I see libEGL.so.1.0 loaded. I don't know what that implies.

Benoit Jacob [:bjacob] (mostly away)

Assignee

Updated

•

15 years ago

Blocks: 589546

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 9

•

15 years ago

First of all, thanks for the great explanation! (In reply to comment #8) > (In reply to comment #7) > > With today's patches, the > > scope of this ScopedXErrorHandler is just the scope of this function, which is > > natural. > > In general, I'd prefer to only ignore errors in the functions that are > expected to generate errors, or at least keep the scope as small as practical. > Otherwise it makes it hard to find the initial cause of subsequent errors. OK, I understand. But here, understanding precisely which functions are causing which errors would be a little complicated as far as I can see. The best way that I could see, you be to temporarily (in nightlies) introduce more SyncAndGetError calls and annotate crash reports with these errors. Then the crash-stats reports would start giving us the info we seem to need. Meanwhile, I would like the current solution (ScopedXErrorHandler on the whole function scope) to be pushed quickly so we make crashes go away for the firefox 4.0 release. > > > For 2., there is only one SyncAndGetError left, at the end of one try. My > > rationalization for that is that, since we're doing two tries, we need to > > really know whether the first one failed before trying again. I am no longer > > doing this to work around driver crashes, since driver crashes are now worked > > around by the blacklist. If you think that this SyncAndGetError is unnecessary, > > feel free to remove it, but that then makes me uncomfortable as we'd then > > completely ignore X errors --- would that be OK? > > X errors are usually, but not always, programming errors. > > BadAlloc errors are a specific kind error that are often not programming > errors, and these functions (glXCreateNewContext and glXMakeCurrent) may > generate BadAlloc errors. Bug 613079 describes a reason why BadAlloc might be > likely to happen (though I don't see BadAlloc mentioned in the bug), so > keeping the check for errors here makes sense. > > I don't really know enough about whether there are differences in resources > available for shared and non-shared contexts, so I can't really comment on > whether a SyncAndGetErrors for each try is worthwhile. Hm.. me neither, actually. I think that the shared context only has textures. Vlad knows that stuff. > > > https://crash-stats.mozilla.com/report/index/c61d1ba5-106d-43be-81cb-409e82110209 > > This is bug 576933, which is not relevant here, but we may want to silence it > so that we can find the more serious errors. > > > https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209 > > > X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error)xpcom_runtime_abort(###!!! ABORT: X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error): file /build/buildd/firefox-4.0~b11+build3+nobinonly/build-tree/mozilla/toolkit/xre/nsX11ErrorHandler.cpp, line 190) > > > 0 linux-gate.so linux-gate.so@0x416 > > 1 libc-2.12.2.so libc-2.12.2.so@0x2e18d > > 2 libmozalloc.so mozalloc_abort mozalloc_abort.cpp:75 > > 3 @0x1a0bf8f > > 4 libpthread-2.12.2.so libpthread-2.12.2.so@0x8336 > > 5 libpthread-2.12.2.so libpthread-2.12.2.so@0x9778 > > This looks like an error from a call on a non-main thread (which our handler > handles badly). If CreateGLContext is called on the main thread then this is > unlikely to be related. We don't do any GL or GLX calls from non-main-thread. The GL debug mode (MOZ_GL_DEBUG) even checks for that, not that that would be inherently wrong, but at least that allows to guarantee that that's currently true. > I don't see libGLcore.so.* or libnvidia-* in the modules list, which I assume > implies that this is not happening with NVIDIA drivers. Indeed. In Firefox 5 we'll implement GfxInfo on X11 which will make it easier to check that (we'll annotate crash reports like we do on windows and mac). > > I see libEGL.so.1.0 loaded. I don't know what that implies. Me neither.

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 10

•

15 years ago

Attached patch message wording change — Details — Splinter Review

Attachment #511466 - Flags: review?(karlt)

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 11

•

15 years ago

Attached patch scope change — Details — Splinter Review

This makes the ScopedXErrorHandler have the whole function scope, as discussed above ideally we'd want something more fine-grained, but for the present purpose of quickly fixing this crasher, this has the highest probability of working.

Attachment #511469 - Flags: review?(karlt)

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 12

•

15 years ago

Attached patch logic change — Details — Splinter Review

This is the only part that's not directly taken from earlier patch. I think that this makes the logic simpler.

Attachment #511470 - Flags: review?(karlt)

Benoit Jacob [:bjacob] (mostly away)

Assignee

Updated

•

15 years ago

Blocks: 622127

Karl Tomlinson (:karlt)

Updated

•

15 years ago

Attachment #511466 - Flags: review?(karlt) → review+

Benoit Jacob [:bjacob] (mostly away)

Assignee

Updated

•

15 years ago

Blocks: 632867

Karl Tomlinson (:karlt)

Comment 13

•

15 years ago

> > > https://crash-stats.mozilla.com/report/index/5535d190-b899-4d6e-8c06-9c53b2110209 > > > > > X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error)xpcom_runtime_abort(###!!! ABORT: X_GLXVendorPrivateWithReply: BadLength (poly request too large or internal Xlib length error): file /build/buildd/firefox-4.0~b11+build3+nobinonly/build-tree/mozilla/toolkit/xre/nsX11ErrorHandler.cpp, line 190) > > This looks like an error from a call on a non-main thread (which our handler > > handles badly). If CreateGLContext is called on the main thread then this is > > unlikely to be related. > > We don't do any GL or GLX calls from non-main-thread. The GL debug mode > (MOZ_GL_DEBUG) even checks for that, not that that would be inherently wrong, > but at least that allows to guarantee that that's currently true. Thanks. I must have misinterpreted the bad stack and jumped to conclusions too early. There's a good stack in bug 632867 comment 3 indicating that happens in glXQueryVersion << mozilla::gl::GLXLibrary::EnsureInitialized.

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 14

•

15 years ago

Attached patch fix GLXLibrary::EnsureInitialized — Details — Splinter Review

This patch applies on top of the previous ones. It fixes GLXLibrary::EnsureInitialized by using a ScopedXErrorHandler there just around the GLX calls, to fix bug 632867. Also, since we were already querying the vendor string there, I moved the blacklisting code there.

Attachment #511502 - Flags: review?(karlt)

Benoit Jacob [:bjacob] (mostly away)

Assignee

Updated

•

15 years ago

blocking2.0: --- → ?

Karl Tomlinson (:karlt)

Updated

•

15 years ago

Attachment #511470 - Flags: review?(karlt) → review+

Karl Tomlinson (:karlt)

Comment 15

•

15 years ago

Comment on attachment 511502 [details] [diff] [review] fix GLXLibrary::EnsureInitialized Can you make sure the change has a commit message some like "Catch unexplained X errors in Mesa during the first GLX call", please? I would have attached this to bug 632867 because that is what this is fixing and I expect that will be a blocking bug. It doesn't make so much sense to give blocking status to a bug containing several patches fixing different issues. >+ const char *vendor = nsnull; >+ const char *serverVersionStr = nsnull; >+ const char *extensionsStr = nsnull; Don't initialize these here. They are initialized below. >+ if (xErrorHandler.SyncAndGetError(display)) It's not necessary to add another round trip with the XSync here because the requests are synchronous requests. How about making ScopedXErrorHandler::mXError public and looking at that, or adding a GetError method? >+ // GLX calls are prone to generating X errors These comments are too vague to be helpful.

Attachment #511502 - Flags: review?(karlt)

Attachment #511502 - Flags: review-

Attachment #511502 - Flags: feedback+

Karl Tomlinson (:karlt)

Comment 16

•

15 years ago

Comment on attachment 511469 [details] [diff] [review] scope change In the end, all this is doing is extending the scope around the glXGetFBConfigAttrib call, but we've already managed to init the glx library and select the config. glXGetFBConfigAttrib does not make any server requests, so there can be no reason to include it in the scope.

Attachment #511469 - Flags: review?(karlt) → review-

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 17

•

15 years ago

(In reply to comment #16) > Comment on attachment 511469 [details] [diff] [review] > scope change > > In the end, all this is doing is extending the scope around the > glXGetFBConfigAttrib call, but we've already managed to init the glx library > and select the config. > > glXGetFBConfigAttrib does not make any server requests, so there can be no > reason to include it in the scope. Ah ok. Thanks for the explanation!

(no longer active)

Updated

•

15 years ago

blocking2.0: ? → betaN+

Whiteboard: [hardblocker]

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 18

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/d7a1bf23400d http://hg.mozilla.org/mozilla-central/rev/9c38a5b18482 http://hg.mozilla.org/mozilla-central/rev/89e569d4b7e2 http://hg.mozilla.org/mozilla-central/rev/450b6354b0fa

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 19

•

15 years ago

Backed out due to test failures http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80211f053c46

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

:Ms2ger (he/him; ⌚ UTC+1/+2)

Updated

•

15 years ago

Assignee: nobody → bjacob

Bill Gianopoulos [:WG9s]

Comment 20

•

15 years ago

(In reply to comment #19) > Backed out due to test failures > > http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80211f053c46 Ah this seems to have been attached to the wrong bug. The backout changeset has nothing to do with check-ins for this bug.

Bill Gianopoulos [:WG9s]

Comment 21

•

15 years ago

(In reply to comment #20) > (In reply to comment #19) > > Backed out due to test failures > > > > http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80211f053c46 > > Ah this seems to have been attached to the wrong bug. The backout changeset > has nothing to do with check-ins for this bug. Hmm. Something seems wrong here with the backout or hg or something. The individual changesets, if you click on expand in the pushlog, seem to be for this bug, but the overall merge link seems to be for an entirely different, totally unrelated, check-in.

Benoit Jacob [:bjacob] (mostly away)

Assignee

Comment 22

•

15 years ago

Re-landed: http://hg.mozilla.org/mozilla-central/rev/895e4d18639a http://hg.mozilla.org/mozilla-central/rev/0c4dc6d59c54 http://hg.mozilla.org/mozilla-central/rev/f3869889789c http://hg.mozilla.org/mozilla-central/rev/65ea4af6833a This had indeed nothing to do with the failure, it just was part of the same push.

Status: REOPENED → RESOLVED

Closed: 15 years ago → 15 years ago

Resolution: --- → FIXED

Karl Tomlinson (:karlt)

Comment 23

•

15 years ago

(In reply to comment #21) > Hmm. Something seems wrong here with the backout or hg or something. The > individual changesets, if you click on expand in the pushlog, seem to be for > this bug, but the overall merge link seems to be for an entirely different, > totally unrelated, check-in. I wonder whether default was merged into the backout instead of the backout being merged into default.

You need to log in before you can comment on or make changes to this bug.