(spike in 67 of) Crash in mozilla::layers::CompositorOGL::Initialize
Categories
(Core :: Graphics, defect, P1)
Tracking
()
Release | Tracking | Status |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | unaffected |
firefox59 | --- | wontfix |
firefox60 | --- | wontfix |
firefox61 | --- | wontfix |
firefox62 | --- | wontfix |
firefox63 | --- | wontfix |
firefox64 | --- | wontfix |
firefox65 | --- | wontfix |
firefox66 | --- | unaffected |
firefox67 | + | fixed |
People
(Reporter: jseward, Assigned: jgilbert)
References
Details
(Keywords: crash, regression, Whiteboard: [priority:low])
Crash Data
Attachments
(1 file)
Reporter
Updated•8 years ago
Comment 1•8 years ago
Comment 2•8 years ago
Comment 3•7 years ago
Comment 4•7 years ago
Comment 6•7 years ago
Comment 8•7 years ago
Comment 10•7 years ago
Comment 11•7 years ago
Comment 12•7 years ago
Comment 13•7 years ago
Comment 14•7 years ago
Comment 15•7 years ago
Comment 16•6 years ago
Adding 65/66 as affected. On 66 Nightly this is currently in the top 6 crashes.
Updated•6 years ago
|
Comment 17•6 years ago
|
||
Tracking for 67 as it spiked on Nightly over the last few days. James, could we have somebody investigate what worsened the situation since Feb 22? Thanks.
Recent crashes appear to be a MOZ_CRASH() where we fail to create a GLContext for the compositor. I found the following messages from several logcats:
02-26 10:26:27.200 17406 18095 I Gecko : Attempting load of libEGL.so
02-26 10:26:27.220 17406 18095 I Gecko : [GFX1]: Flushing glGetError still 0x40514048 after 100 calls.
...
02-26 10:26:27.510 17406 18173 I Gecko : [GFX1]: Flushing glGetError still 0x40514048 after 100 calls.
02-26 10:26:27.510 17406 18173 I Gecko : [GFX1-]: Failed to create EGLContext!
That error value makes no sense to me and looks a lot like a pointer address, so not sure what's going on.
Moving this to GFX since it's clearly GLContext/Compositor stuff.
Assignee | ||
Comment 20•6 years ago
|
||
It's not a pointer:
https://searchfox.org/mozilla-central/rev/dbddac86aadf1d4871fb350bbe66db43728a9f81/gfx/gl/GLContext.cpp#2766
(E)GL library loading changed recently, so this might be fallout from that: Bug 1528396
Snorp, were these local logcats? If so, on what device(s), and is there an STR?
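For readers following the log lines above: the "Flushing glGetError still ... after 100 calls" message suggests a bounded loop that drains the GL error queue and gives up if the driver never reports GL_NO_ERROR. The following is a minimal sketch of that idea only, not the Gecko code at the linked revision; `FlushResult`, `FlushErrors`, and the injected `getError` callback are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical illustration; names are not taken from Gecko.
constexpr uint32_t kNoError = 0;  // numeric value of GL_NO_ERROR

struct FlushResult {
  bool flushed;        // true if the error queue drained to GL_NO_ERROR
  uint32_t lastError;  // last value returned (0 when flushed)
  int calls;           // how many times getError was called
};

// Calls getError until it reports no error, giving up after maxCalls.
// getError stands in for glGetError so the sketch runs without a GL context.
inline FlushResult FlushErrors(const std::function<uint32_t()>& getError,
                               int maxCalls = 100) {
  assert(maxCalls > 0);
  uint32_t err = kNoError;
  for (int i = 1; i <= maxCalls; ++i) {
    err = getError();
    if (err == kNoError) {
      return {true, kNoError, i};
    }
  }
  // The queue never drained: the caller can log err and treat the
  // context as unusable.
  return {false, err, maxCalls};
}
```

With a callback that always returns the same nonzero value, as in the logcats, the loop exhausts its budget and reports that value back, which matches the shape of the logged message.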
Comment 21•6 years ago
|
||
A significant number of the devices are emulators; see comment 12.
It doesn't happen locally for me, I got the logs from crash-stats: https://crash-stats.mozilla.com/report/index/adbf502b-212c-4d54-ac20-b9f520190226#tab-metadata
(In reply to Jeff Gilbert [:jgilbert] from comment #20)
> It's not a pointer:
> https://searchfox.org/mozilla-central/rev/dbddac86aadf1d4871fb350bbe66db43728a9f81/gfx/gl/GLContext.cpp#2766
Yeah, but wtf is 0x40514048? That's not a known error AFAICT? And the value differs in reports.
Assignee | ||
Comment 24•6 years ago
|
||
Huh, nooo idea. None of those numbers are GLenums.
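To make "none of those numbers are GLenums" concrete, here is a small sketch that checks a value against the error codes glGetError is defined to return. The numeric values are the standard OpenGL / OpenGL ES constants; the function name `IsKnownGLError` is hypothetical and not part of Gecko.

```cpp
#include <cstdint>

// Hypothetical helper: true if value is one of the standard error codes
// that glGetError is specified to return.
inline bool IsKnownGLError(uint32_t value) {
  switch (value) {
    case 0x0000:  // GL_NO_ERROR
    case 0x0500:  // GL_INVALID_ENUM
    case 0x0501:  // GL_INVALID_VALUE
    case 0x0502:  // GL_INVALID_OPERATION
    case 0x0503:  // GL_STACK_OVERFLOW (desktop GL)
    case 0x0504:  // GL_STACK_UNDERFLOW (desktop GL)
    case 0x0505:  // GL_OUT_OF_MEMORY
    case 0x0506:  // GL_INVALID_FRAMEBUFFER_OPERATION
    case 0x0507:  // GL_CONTEXT_LOST
      return true;
    default:
      return false;
  }
}
```

The logged value 0x40514048 falls outside this set, which is why it reads as garbage rather than as a real GL error.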
Assignee | ||
Comment 25•6 years ago
|
||
Here are the crashes isolated to the spiking build IDs:
https://crash-stats.mozilla.com/signature/?product=FennecAndroid&build_id=%3E%3D20190225102402&signature=mozilla%3A%3Alayers%3A%3ACompositorOGL%3A%3AInitialize&date=%3E%3D2019-02-19T23%3A01%3A00.000Z&date=%3C2019-02-26T23%3A01%3A00.000Z#aggregations
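Rank | Device | Count | % |
---|---|---|---|
1 | GT-I8552B | 25 | 6.25 % |
2 | SO-04E | 18 | 4.50 % |
3 | SHV-E300K | 17 | 4.25 % |
4 | Vodafone Smart Tab III 10 | 16 | 4.00 % |
5 | GT-I9100 | 15 | 3.75 % |
6 | AOSP on ARM Emulator | 12 | 3.00 % |
7 | M4 SS4040 | 12 | 3.00 % |
8 | GT-P5110 | 10 | 2.50 % |
9 | PSP5307DUO | 10 | 2.50 % |
10 | V865M | 10 | 2.50 % |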
There are multiple runs of users trying to restart the browser and accumulating crashes. (eesh, sorry all :( )
Interestingly that pointer-looking value changes across users, but is consistent across runs per-user.
Assignee | ||
Comment 26•6 years ago
|
||
It's crazy bizarre for glGetError to return any value like that. It's like we're hitting some shim and it's immediately returning to us, giving us a pointer-like value that happened to already be on the stack.
I've re-vetted our EGL loading code, and the only thing I can think of is that we try eglGetProcAddress before we try to load from the library, which is the opposite order of what we used to do. (and we no longer try to load from the process)
EGL <= 1.4:
eglGetProcAddress may not be queried for core (non-extension) functions in EGL or client APIs.
EGL >= 1.5:
eglGetProcAddress may be queried for all EGL and client API functions supported by the implementation (whether those functions are extensions or not, and whether they are supported by the current client API context or not).
If EGL 1.4's eglGetProcAddress is, on some hardware, giving us a pfn to a dummy thunk that logs an error somewhere and returns void, we might get this behavior.
The top three devices (all I checked) are all circa 2013, which does predate EGL 1.5 (2014).
Thing is, I think we want to prefer to use (egl/wgl)GetProcAddress first before trying to dlsym from the library, but maybe we can try to reverse this?
All told, this is a relatively small number of long-tail crashes.
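The ordering question raised in this comment (library lookup first versus eglGetProcAddress first) can be sketched as an ordered list of symbol sources where the first non-null answer wins. This is an illustrative sketch under stated assumptions, not the loader in Gecko and not necessarily what the eventual patch does; `SymbolSource` and `ResolveSymbol` are hypothetical names, and the sources are plain callbacks so the example runs without EGL.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical: a source maps a symbol name to an address, or nullptr.
// On a device, one source might wrap dlsym on the loaded library and
// another might wrap eglGetProcAddress.
using SymbolSource = std::function<void*(const std::string&)>;

// Tries each source in order and returns the first non-null address.
inline void* ResolveSymbol(const std::string& name,
                           const std::vector<SymbolSource>& sources) {
  assert(!name.empty());
  for (const auto& source : sources) {
    if (void* sym = source(name)) {
      return sym;
    }
  }
  return nullptr;  // no source knows this symbol
}
```

Under the hypothesis above, a GetProcAddress-style source on an EGL 1.4 driver could hand back a stub for a core function such as glGetError. Listing the library source first means that stub would only be reached for names the library itself does not export, at the cost of no longer preferring GetProcAddress.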
Assignee | ||
Comment 27•6 years ago
|
||
Comment 28•6 years ago
|
||
Comment 29•6 years ago
|
||
bugherder |
Assignee | ||
Comment 30•6 years ago
|
||
I'll check back tomorrow, but 11 hours in, 1 crash from 1 reporter is encouraging!
Comment 31•6 years ago
|
||
Is this something we should consider backporting to Beta for Fennec 66?
Assignee | ||
Comment 32•6 years ago
|
||
So we sort of commandeered this bug for this crash spike, which is actually a different bug that is 67-only.
I'll duplicate this bug so we keep tracking the low-volume crash bug.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 33•6 years ago
|
||
We're back to baseline crash rate, which we are now tracking in bug 1532456.
This spike in crash rate was successfully fixed by the patch.
Updated•6 years ago
|