Closed Bug 1004134 Opened 10 years ago Closed 8 years ago
Crash [@ GLEngine@0x3bfff ] on startup when hardware acceleration is enabled
A friend installed Firefox 29 and it crashed reliably on update as long as "Use hardware acceleration when available" was checked in preferences. Stack here: https://crash-stats.mozilla.com/report/index/0e2e3ee0-d82c-4397-8a0e-d81c52140430
Markus, can you have a look at this? :-)
FWIW, https://crash-stats.mozilla.com/report/list?product=Firefox&signature=GLEngine%400x3bfff#tab-reports says this happened to a single install of 29 only so far (and once to a 28 installation).
(In reply to :Gijs Kruitbosch from comment #1)
> Markus, can you have a look at this? :-)

The crashing line is a call to fBufferData: http://hg.mozilla.org/releases/mozilla-release/annotate/f60bc49e6bd5/gfx/gl/GLContext.h#l809

The GLEngine binary on my machine has the same debug identifier as the one in the crash report (51D58F76B9B33B4FB65AF6D213C2EED70), and `xcrun atos -o /System/Library/Frameworks/OpenGL.framework/Versions/A/Resources/GLEngine.bundle/GLEngine 0x3bfff` gives me "glBufferData_Exec (in GLEngine) + 489" - no surprises there.

I don't really know what to do from here on. The GPU information says "AdapterVendorID: 0x8086, AdapterDeviceID: 0x a2e", and there are plenty of crash reports with that same information that *don't* crash in GLEngine, so we can't just blacklist this combination.
Hrm, there's only one call to fBufferData from CompositorOGL::Initialize, and it's here: http://hg.mozilla.org/releases/mozilla-release/annotate/f60bc49e6bd5/gfx/layers/opengl/CompositorOGL.cpp#l522

We're passing fBufferData a stack pointer, 'vertices', and a compile-time fixed sizeof(vertices)==64. But the crash is a bad access at 0x8801.

...oh, look at this. The complete bufferData call on the above-linked line is:

mGLContext->fBufferData(LOCAL_GL_ARRAY_BUFFER, sizeof(vertices), vertices, LOCAL_GL_STATIC_DRAW);

The numeric value of LOCAL_GL_ARRAY_BUFFER is 0x8892 and the numeric value of LOCAL_GL_STATIC_DRAW is 0x88E4. Both share the high byte 0x88 with the crash address. So... this might just be Friday night numerology, but it looks as if one of these enum parameters was misinterpreted by the driver's implementation of glBufferData as the address parameter. If that's the case then we actually have grounds to blacklist here.
(In reply to Markus Stange [:mstange] from comment #3)
> The GLEngine binary on my machine has the same debug identifier as the one
> in the crash report (51D58F76B9B33B4FB65AF6D213C2EED70), and `xcrun atos -o
> /System/Library/Frameworks/OpenGL.framework/Versions/A/Resources/GLEngine.bundle/GLEngine
> 0x3bfff` gives me "glBufferData_Exec (in GLEngine) + 489" -
> no surprises there.
>
> I don't really know what to do from here on.

I would find it very interesting if you could disassemble this function and see if it looks like it might be interpreting either its first parameter (the target enum) or last parameter (the usage enum) as a pointer or as an offset... basically, any kind of pointer arithmetic involving these enum values would be wrong and would fit very well with the crash address we're seeing here. Understanding that might help us find a work-around....
Thanks for the disassembly; here is a copy with annotations.

The bad news is that it actually doesn't match the crash: the crash instruction pointer is 0x3bfff, but the disassembly there is:

GLEngine[0x3bffb]: movzbl %al, %r8d
GLEngine[0x3bfff]: movq %rbx, %rdi
GLEngine[0x3c002]: movq -0x38(%rbp), %rsi

The instruction at 0x3bfff only touches registers, like the one before it, and the one after touches the stack, but the bad access address 0x8801 is not a stack address.

So it seems to be the case that Apple silently updated the OpenGL library without regenerating the "debug identifier"... this might be good news, as it might mean that the problem was fixed already, which would explain why, according to crash-stats, this is not happening much at all.

Note: I also queried crash-stats for crashes at nearby addresses in GLEngine and there aren't any. 0x3bfff is the only one.

Some more remarks on the crashing function, glBufferData_Exec. Above, I was surprised that the stack showed our code calling directly into it, whereas the function that we wanted to call was glBufferData. The disassembly shows that glBufferData_Exec takes 5 parameters: the first is a pointer to the actual OpenGL context object, and the 4 remaining ones are the 4 parameters to glBufferData. It would be very bad if we accidentally called glBufferData_Exec instead of glBufferData, as the parameters would then be all wrong. But I don't think that that's what's happening. The crash we got suggests that we took a GLenum we were passed (0x88**), overwrote its low byte with the value 1, obtaining 0x8801, and dereferenced that as a pointer. Looking at the code, the places where we do things that (modulo bugs) could end up doing that are not hit if we pass a totally wrong value for the buffer target enum parameter (as would be the case if the argument sequence were shifted and we passed our sizeof(vertices) for it).
Instead, I suppose that the OpenGL library just has a fast way to call _Exec internal functions by overwriting the current stack frame instead of pushing a new one.
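The arithmetic behind the "overwrote its low byte with the value 1" hypothesis can be sketched as follows. This is only an illustration, not code from GLEngine: the helper names (OverwriteLowByte, CouldExplainFault, CheckNumerology) are made up for this comment, and the constants are the values quoted earlier in this bug.

```cpp
#include <cassert>
#include <cstdint>

// Values quoted earlier in this bug.
constexpr std::uint32_t kGLArrayBuffer = 0x8892;  // LOCAL_GL_ARRAY_BUFFER
constexpr std::uint32_t kGLStaticDraw  = 0x88E4;  // LOCAL_GL_STATIC_DRAW
constexpr std::uint32_t kSizeArgument  = 64;      // sizeof(vertices)
constexpr std::uint64_t kFaultAddress  = 0x8801;  // bad access address in the crash report

// Hypothetical helper: replace the low byte of a value, keeping all other bits.
constexpr std::uint64_t OverwriteLowByte(std::uint64_t value, std::uint8_t newLowByte) {
  return (value & ~std::uint64_t{0xFF}) | newLowByte;
}

// Hypothetical helper: would overwriting this argument's low byte with 1
// produce the faulting address?
constexpr bool CouldExplainFault(std::uint32_t argument) {
  return OverwriteLowByte(argument, 0x01) == kFaultAddress;
}

// Either enum fits the hypothesis; a shifted argument list that put
// sizeof(vertices) in the enum's place would not.
inline void CheckNumerology() {
  assert(CouldExplainFault(kGLArrayBuffer));
  assert(CouldExplainFault(kGLStaticDraw));
  assert(!CouldExplainFault(kSizeArgument));
}
```

Note that this cannot tell the target enum from the usage enum, since both collapse to 0x8801 once the low byte is replaced.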
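Rereading this bug, comment 2 is worth noting: according to crash-stats, this has happened to a single install of 29 (and once to a 28 installation). As explained in my previous comment, I also checked other nearby addresses in GLEngine, and this is the only one. So I wouldn't worry more about this bug.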
One thing that we probably should do in general is avoid passing stack pointers to the GL. It makes it much harder to reason about bugs like this, and the lower alignment (compared to heap) means we can more easily trigger bugs. Here, this disassembly is calling memcpy, and I trust memcpy, but maybe the version of GLEngine that crashed at 0x3bfff was doing a custom loop instead of memcpy... who knows.
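A minimal sketch of the "avoid passing stack pointers to the GL" idea: copy the stack array into a heap allocation and hand the GL the heap pointer instead. UploadFn and UploadFromHeap are hypothetical names standing in for whatever wrapper ends up calling fBufferData, and treating 'vertices' as 16 floats is an assumption that merely agrees with sizeof(vertices)==64. This is not meant as the actual fix, just an illustration of the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the real upload call (for example, a wrapper
// around fBufferData that already knows the target and usage enums).
using UploadFn = std::function<void(const void* data, std::size_t sizeInBytes)>;

// Copies `count` floats into a heap-allocated buffer and uploads from there,
// so the callee never sees a pointer into the caller's stack frame.
inline void UploadFromHeap(const float* src, std::size_t count, const UploadFn& upload) {
  assert(src != nullptr && count > 0);
  std::vector<float> heapCopy(src, src + count);  // heap storage owned by this call
  upload(heapCopy.data(), heapCopy.size() * sizeof(float));
  // heapCopy is freed on return. That is fine for glBufferData, which copies
  // the client data into the buffer's data store before it returns.
}
```

The cost is one extra 64-byte copy at compositor initialization, which should be negligible next to the GL call itself.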
(In reply to Benoit Jacob [:bjacob] from comment #7)
> So it seems to be the case that Apple silently updated the OpenGL library

We may want to confirm this. Mike, does your friend have Xcode installed, and if so, can you ask him or her to get the disassembly of the function in question, so that we can confirm it matches the one I attached? Here's the command again:

$ lldb /System/Library/Frameworks/OpenGL.framework/Versions/A/Resources/GLEngine.bundle/GLEngine
(lldb) disassemble -n glBufferData_Exec
the requested disassembly of glBufferData_Exec
Thanks. 0x3bfff is still that same instruction, which shouldn't cause this crash, so this is still very mysterious to me.

Is the crash still reproducing on your machine? If yes, and if you can grant us some more of your time, could you please test a current Nightly build, http://nightly.mozilla.org/ , as I landed a patch (bug 1005658) that is intended as a tentative fix for what might be the cause of your crash here? (Your crash is on one of the few GL calls that we made where we passed a stack pointer to the GL; the tentative fix consists in avoiding that; see that bug for a rationale of why stack pointers might be especially prone to triggering driver bugs.)
should we close this?
Severity: normal → critical
I don't have any new information since comment 13, where this was a complete mystery to me.
Chris, turns out this is for you ...

(In reply to Benoit Jacob [:bjacob] (mostly away) from comment #13)
> ... Is the crash still
> reproducing on your machine? If yes, and if you can grant us some more of
> your time, could you please test a current Nightly build,
> http://nightly.mozilla.org/ , as I landed a patch (bug 1005658) that is
> intended as a tentative fix for what might be the cause of your crash here?
> (Your crash is on one of the few GL calls that we made, where we passed a
> stack pointer to the GL; the tentative fix consists in avoiding that; see
> that bug for a rationale of why stack pointers might be especially prone to
> triggering driver bugs).
Mass resolving WFM: signature(s) hasn't (/haven't) been reported in the past 28 days. If this is still happening, feel free to reopen.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME