Closed Bug 1420745 Opened 2 years ago Closed 9 months ago

(spike in 67 of) Crash in mozilla::layers::CompositorOGL::Initialize

Categories

(Core :: Graphics, defect, P1, major)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- unaffected
firefox67 + fixed

People

(Reporter: jseward, Assigned: jgilbert)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [priority:low])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is
report bp-252c8d15-38fc-4b9d-a25e-620fe0171124.
=============================================================

This is topcrash #3 in the android nightly 20171123100056.

Top 10 frames of crashing thread:

0 libxul.so mozilla::layers::CompositorOGL::Initialize gfx/layers/opengl/CompositorOGL.cpp:238
1 libxul.so mozilla::layers::CompositorBridgeParent::NewCompositor gfx/layers/ipc/CompositorBridgeParent.cpp:1481
2 libxul.so mozilla::layers::CompositorBridgeParent::InitializeLayerManager gfx/layers/ipc/CompositorBridgeParent.cpp:1412
3 libxul.so mozilla::layers::CompositorBridgeParent::AllocPLayerTransactionParent gfx/layers/ipc/CompositorBridgeParent.cpp:1523
4 libxul.so mozilla::layers::PCompositorBridgeParent::OnMessageReceived ipc/ipdl/PCompositorBridgeParent.cpp:840
5 libxul.so mozilla::ipc::MessageChannel::DispatchAsyncMessage ipc/glue/MessageChannel.cpp:2114
6 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2044
7 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1923
8 libxul.so MessageLoop::RunTask ipc/chromium/src/base/message_loop.cc:452
9 libxul.so MessageLoop::DeferOrRunPendingTask ipc/chromium/src/base/message_loop.cc:460

=============================================================
Flags: needinfo?(bugmail)
Jim, do you know if recent changes to GeckoLayerClient and friends would have caused this?
Component: Graphics: Layers → Widget: Android
Flags: needinfo?(bugmail) → needinfo?(nchen)
I don't think they did. We seem to get these crashes for 57 and 58 as well.
Flags: needinfo?(nchen)
This crash is currently the #6 overall crash on Fennec Nightly.
Almost 1500 crashes in the last week on 59.0.2. Is it possible for someone to take another look at this?
Flags: needinfo?(sdaswani)
Petru, can one of you look at this ASAP? I think it may land in 59 or 60 if a fix is found.
Flags: needinfo?(sdaswani) → needinfo?(petru.lingurar)
Whiteboard: [Leanplum][61]
Things I found:
- the crashes started appearing on December 26th, 2017, at [1], although no changes seem to have been made to that file shortly before the crashes began.
- before the line where the crash occurs, the CreateContext()[2] method is executed, but that method hadn't been modified in more than a year prior to the crashes
- from the gathered crash reports there doesn't seem to be a clear scenario in which this crash occurs, although a few situations appeared in 2-3 reports:
    - try to do a search/tap the address bar
    - open the app in multi-window mode


[1] https://hg.mozilla.org/releases/mozilla-release/annotate/d2e449c73dac/gfx/layers/opengl/CompositorOGL.cpp#l238
[2] https://hg.mozilla.org/releases/mozilla-release/annotate/d2e449c73dac/gfx/layers/opengl/CompositorOGL.cpp#l111
Flags: needinfo?(petru.lingurar)
Thanks Petru.

Ryan, it looks like the crashes aren't related to a code change, per Petru's analysis. Do we have an idea if the crash is more prevalent on a set of devices or OS versions?
Flags: needinfo?(ryanvm)
Petru's analysis only reflects the crash data purge in late December. Note that the bug was filed a month prior in November.
Flags: needinfo?(ryanvm)
Ah, I wasn't aware of the 'purge'. Petru, can you spend some time trying to repro?
Flags: needinfo?(petru.lingurar)
Indeed, the first crash appeared in September [1], but the crashes have only become more prevalent in the last few months.
I've been trying to reproduce based on the few comments in the crash reports (tapping in the address bar, multi-window), with no success yet.

[1] First crash on 57.0b3  - https://crash-stats.mozilla.com/report/index/1eaed915-4848-4e95-b324-488e10180120
Flags: needinfo?(petru.lingurar)
Whiteboard: [Leanplum][61] → --do_not_change--[priority:high]
Marcia, is this still a frequent crasher?
Flags: needinfo?(mozillamarcia.knous)
Looking at the affected devices, I am seeing mostly emulator devices. Common signs of an emulator are known ARM devices reporting an x86 ABI and model names that mention an emulator or VMware.

vendor     model                      API level   CPU ABI        count %
unknown    AOSP on ARM Emulator       18 (REL)    armeabi-v7a    36    5.5%
samsung    SM-G960F                   22 (REL)    x86            25    3.8%
zte        Z982                       22 (REL)    x86            24    3.7%
samsung    SM-G950F                   26 (REL)    armeabi-v7a    18    2.7%
gmbh       VirtualBox                 19 (REL)    x86            16    2.4%
innotek    VirtualBox                 19 (REL)    x86            16    2.4%
samsung    GT-P5210                   19 (REL)    x86            15    2.3%
oppo       A37f                       22 (REL)    x86            14    2.1%
lge        Nexus 5X                   27 (REL)    armeabi-v7a    13    2.0%
inc        VMware Virtual Platform    19 (REL)    x86            12    1.8%
vmware     VMware Virtual Platform    19 (REL)    x86            12    1.8%
samsung    SM-A520F                   22 (REL)    x86            11    1.7%
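
The detection heuristic described above (a known ARM device reporting an x86 ABI, or a model name that mentions an emulator) could be sketched roughly as follows. This is only an illustration: `LooksLikeEmulator`, the hint strings, and the caller-supplied list of known ARM models are all assumptions made for the sketch and are not part of crash-stats or Gecko. Treating "virtualbox" as a hint is also an assumption, based on the VirtualBox rows in the table.

```cpp
#include <algorithm>
#include <cctype>
#include <initializer_list>
#include <set>
#include <string>

// Lower-case a copy of the string so that hint matching is case-insensitive.
std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Hypothetical heuristic: flag a crash-report row as a likely emulator.
// `knownArmModels` is an assumed, caller-maintained list of models that
// only ship with ARM CPUs.
bool LooksLikeEmulator(const std::string& model, const std::string& abi,
                       const std::set<std::string>& knownArmModels) {
  const std::string lowered = ToLower(model);
  // Sign 1: the model name itself mentions a virtualization product.
  for (const char* hint : {"emulator", "vmware", "virtualbox"}) {
    if (lowered.find(hint) != std::string::npos) return true;
  }
  // Sign 2: a device known to be ARM hardware reports an x86 ABI.
  return abi == "x86" && knownArmModels.count(model) > 0;
}
```

With such a helper, a row like "SM-G960F / x86" is flagged only if the caller lists SM-G960F as ARM-only hardware, while "Nexus 5X / armeabi-v7a" is never flagged.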
As Kevin notes there is a mix of emulator devices in the recent 62 data (including betas). One of the top crashing devices is SM-G965U, which is the Samsung Galaxy S9. There aren't many URLs to try to reproduce. I think we wait and see how this plays out in 62 volume since we just shipped, and we can reevaluate at a later time.
Flags: needinfo?(mozillamarcia.knous)
Volume in 62.0.1 is pretty low: 193 crashes so far.
Whiteboard: --do_not_change--[priority:high] → [priority:low]
Updating affected branches. While there are some emulator devices, a fair number of regular devices appear to crash as well. Volume is relatively low on 62/63/64.
Priority: -- → P1

Adding 65/66 as affected. Currently on 66 nightly this is the #6 top crash.

Tracking for 67 as it spiked on Nightly over the last few days. James, could we have somebody investigate what worsened the situation since Feb 22? Thanks.

Recent crashes appear to be a MOZ_CRASH() where we fail to create a GLContext for the compositor. I found the following messages from several logcats:

02-26 10:26:27.200 17406 18095 I Gecko : Attempting load of libEGL.so
02-26 10:26:27.220 17406 18095 I Gecko : [GFX1]: Flushing glGetError still 0x40514048 after 100 calls.
...
02-26 10:26:27.510 17406 18173 I Gecko : [GFX1]: Flushing glGetError still 0x40514048 after 100 calls.
02-26 10:26:27.510 17406 18173 I Gecko : [GFX1-]: Failed to create EGLContext!

That error value makes no sense to me and looks a lot like a pointer address, so I'm not sure what's going on.

Flags: needinfo?(snorp) → needinfo?(jgilbert)

Moving this to GFX since it's clearly GLContext/Compositor stuff.

Component: Widget: Android → Graphics

It's not a pointer:
https://searchfox.org/mozilla-central/rev/dbddac86aadf1d4871fb350bbe66db43728a9f81/gfx/gl/GLContext.cpp#2766

(E)GL library loading changed recently, so this might be fallout from that: Bug 1528396

Snorp, were these local logcats? If so, on what device(s), and is there an STR?

Flags: needinfo?(jgilbert) → needinfo?(snorp)

A significant number of the devices are emulators. See comment 12.

It doesn't happen locally for me; I got the logs from crash-stats: https://crash-stats.mozilla.com/report/index/adbf502b-212c-4d54-ac20-b9f520190226#tab-metadata

Flags: needinfo?(snorp)

(In reply to Jeff Gilbert [:jgilbert] from comment #20)

> It's not a pointer:
> https://searchfox.org/mozilla-central/rev/dbddac86aadf1d4871fb350bbe66db43728a9f81/gfx/gl/GLContext.cpp#2766

Yeah, but wtf is 0x40514048? That's not a known error AFAICT? And the value differs in reports.

Huh, nooo idea. None of those numbers are GLenums.
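
For context, the "Flushing glGetError still ... after 100 calls" message is what a bounded error-flush loop would print when glGetError never returns GL_NO_ERROR. The sketch below is not Gecko's actual code (see the searchfox link above for that). `FlushErrors`, `FlushResult`, and `IsKnownGLError` are hypothetical names, and the only detail taken from the log is the limit of 100 calls. The error codes are the standard GL values.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

constexpr std::uint32_t kGLNoError = 0x0000;  // GL_NO_ERROR

// True if `e` is one of the standard error codes glGetError may return.
bool IsKnownGLError(std::uint32_t e) {
  switch (e) {
    case 0x0500:  // GL_INVALID_ENUM
    case 0x0501:  // GL_INVALID_VALUE
    case 0x0502:  // GL_INVALID_OPERATION
    case 0x0503:  // GL_STACK_OVERFLOW
    case 0x0504:  // GL_STACK_UNDERFLOW
    case 0x0505:  // GL_OUT_OF_MEMORY
    case 0x0506:  // GL_INVALID_FRAMEBUFFER_OPERATION
    case 0x0507:  // GL_CONTEXT_LOST
      return true;
    default:
      return false;
  }
}

struct FlushResult {
  bool drained;             // true if GL_NO_ERROR was eventually returned
  std::uint32_t lastError;  // last value returned by the getter
  int calls;                // number of getter calls made
};

// Hypothetical bounded flush: call `getError` (standing in for glGetError)
// until it reports GL_NO_ERROR, giving up after `maxCalls` attempts.
FlushResult FlushErrors(const std::function<std::uint32_t()>& getError,
                        int maxCalls = 100) {
  assert(maxCalls > 0);
  FlushResult result{false, kGLNoError, 0};
  while (result.calls < maxCalls) {
    result.lastError = getError();
    ++result.calls;
    if (result.lastError == kGLNoError) {
      result.drained = true;
      break;
    }
  }
  return result;
}
```

A getter that always returns 0x40514048 leaves `drained` false after 100 calls, and `IsKnownGLError(0x40514048)` is false, which matches the observation that the logged values are not GLenums.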

Here's the data isolated to the spiking build IDs:
https://crash-stats.mozilla.com/signature/?product=FennecAndroid&build_id=%3E%3D20190225102402&signature=mozilla%3A%3Alayers%3A%3ACompositorOGL%3A%3AInitialize&date=%3E%3D2019-02-19T23%3A01%3A00.000Z&date=%3C2019-02-26T23%3A01%3A00.000Z#aggregations

1   GT-I8552B                  25   6.25 %
2   SO-04E                     18   4.50 %
3   SHV-E300K                  17   4.25 %
4   Vodafone Smart Tab III 10  16   4.00 %
5   GT-I9100                   15   3.75 %
6   AOSP on ARM Emulator       12   3.00 %
7   M4 SS4040                  12   3.00 %
8   GT-P5110                   10   2.50 %
9   PSP5307DUO                 10   2.50 %
10  V865M                      10   2.50 %

There are multiple runs of users trying to restart the browser and accumulating crashes. (eesh, sorry all :( )

Interestingly that pointer-looking value changes across users, but is consistent across runs per-user.

It's crazy bizarre for glGetError to return any value like that. It's like we're hitting some shim and it's immediately returning to us, giving us a pointer-like value that happened to already be on the stack.

I've re-vetted our EGL loading code, and the only thing I can think of is that we try eglGetProcAddress before we try to load from the library, which is the opposite order of what we used to do. (and we no longer try to load from the process)

EGL <= 1.4:

eglGetProcAddress may not be queried for core (non-extension) functions in EGL or client APIs.

EGL >= 1.5:

eglGetProcAddress may be queried for all EGL and client API functions supported by the implementation (whether those functions are extensions or not, and whether they are supported by the current client API context or not).

If EGL 1.4's eglGetProcAddress is, on some hardware, giving us a pfn to a dummy thunk that logs an error somewhere and returns void, we might get this behavior.

The top three devices (all I checked) are all circa 2013, which does predate EGL 1.5 (2014).

Thing is, I think we want to prefer to use (egl/wgl)GetProcAddress first before trying to dlsym from the library, but maybe we can try to reverse this?
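
The reversed lookup order floated above (ask the library's export table first, and only fall back to (egl/wgl)GetProcAddress if that fails) could look roughly like this. It is a minimal sketch and not the actual Gecko loader: `ResolveSymbol` and the two `Lookup` callbacks are hypothetical stand-ins for dlsym on the loaded library and for eglGetProcAddress, injected as callbacks so the ordering can be exercised without a real EGL driver.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Generic function-pointer type, as a dlsym-style lookup would hand back.
using SymbolPtr = void (*)();
// A lookup maps a symbol name to a pointer, or nullptr when it is not found.
using Lookup = std::function<SymbolPtr(const std::string&)>;

// Hypothetical helper: prefer the library's own exports, so that core
// functions never go through a GetProcAddress implementation that may hand
// back a dummy thunk for them (as EGL <= 1.4 permits no such query).
SymbolPtr ResolveSymbol(const std::string& name, const Lookup& fromLibrary,
                        const Lookup& fromGetProcAddress) {
  assert(!name.empty());
  if (fromLibrary) {
    if (SymbolPtr fn = fromLibrary(name)) {
      return fn;  // Found as a real export; do not consult GetProcAddress.
    }
  }
  if (fromGetProcAddress) {
    return fromGetProcAddress(name);  // Extension functions end up here.
  }
  return nullptr;
}
```

With this ordering, a core function such as glGetError resolves from the library export, and only names the library does not export are passed to the GetProcAddress-style fallback.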

All told, this is a relatively small number of long-tail crashes.

Assignee: nobody → jgilbert
Severity: critical → major
Depends on: 1528396
Pushed by jgilbert@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b437ff8ed47c
dlsym from lib before wsiGetProcAddress. r=snorp
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

I'll check back tomorrow, but 11 hours in, 1 crash from 1 reporter is encouraging!

Flags: needinfo?(jgilbert)

Is this something we should consider backporting to Beta for Fennec 66?

So we sort of commandeered this bug for this crash spike, which is actually a different bug that is 67-only.
I'll duplicate this bug so we keep tracking the low-volume crash bug.

Blocks: 1532456
Flags: needinfo?(jgilbert)
Summary: Crash in mozilla::layers::CompositorOGL::Initialize → (spike in 67 of) Crash in mozilla::layers::CompositorOGL::Initialize

We're back to baseline crash rate, which we are now tracking in bug 1532456.
This spike in crash rate was successfully fixed by the patch.
