Closed Bug 884047 Opened 6 years ago Closed 6 years ago

crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ libEGL

Categories

(Core :: Graphics: Layers, defect, critical)

24 Branch
ARM
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla28
Tracking Status
firefox23 --- unaffected
firefox24 + wontfix
firefox25 + wontfix
firefox26 + wontfix
firefox27 --- wontfix
firefox28 --- fixed
fennec + ---

People

(Reporter: scoobidiver, Unassigned)

References

Details

(Keywords: crash, regression, topcrash-android-armv7, Whiteboard: [native-crash][fixed in 28+ by bug 925608])

Crash Data

Attachments

(1 file)

It first showed up in 24.0a1/20130610. The regression range might be (low volume):
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=efbe547a7972&tochange=252b1ac4d537

Signature 	mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*) More Reports Search
UUID	23f50911-f491-4d7c-939f-638952130617
Date Processed	2013-06-17 07:07:29
Uptime	70
Install Age	14.2 hours since version was first installed.
Install Time	2013-06-16 16:53:17
Product	FennecAndroid
Version	24.0a1
Build ID	20130616031139
Release Channel	nightly
OS	Android
OS Version	0.0.0 Linux 3.0.31-889555 #3 SMP PREEMPT Thu Jan 31 14:48:54 KST 2013 armv7l samsung/GT-I9100/GT-I9100:4.1.2/JZO54K/I9100XWLSD:user/release-keys
Build Architecture	arm
Build Architecture Info	ARMv0
Crash Reason	SIGSEGV
Crash Address	0x640064
App Notes 	
AdapterDescription: 'ARM -- Mali-400 MP -- OpenGL ES 2.0 -- Model: GT-I9100, Product: GT-I9100, Manufacturer: samsung, Hardware: smdk4210'
GL Layers! EGL? EGL+ GL Context? GL Context+ GL Layers+ 
samsung GT-I9100
samsung/GT-I9100/GT-I9100:4.1.2/JZO54K/I9100XWLSD:user/release-keys
Processor Notes 	sp-processor01_phx1_mozilla_com_15474:2012; non-integer value of "SecondsSinceLastCrash"; exploitability tool: ERROR: unable to analyze dump
EMCheckCompatibility	True
Adapter Vendor ID	ARM
Adapter Device ID	Mali-400 MP
Device	samsung GT-I9100
Android API Version	16 (REL)
Android CPU ABI	armeabi-v7a

Frame 	Module 	Signature 	Source
0 	libEGL.so 	libEGL.so@0xabf4 	
1 	libEGL.so 	libEGL.so@0xa9e3 	
2 	libEGL.so 	libEGL.so@0x2212a 	
3 	libEGL.so 	libEGL.so@0x22106 	
4 	libEGL.so 	libEGL.so@0xc995 	
5 	libEGL.so 	libEGL.so@0x22106 	
6 	libEGL.so 	libEGL.so@0xc8a7 	
7 	libxul.so 	mozilla::gl::GLLibraryEGL::fMakeCurrent 	gfx/gl/GLLibraryEGL.h:164
8 	libxul.so 	mozilla::gl::GLContextEGL::RenewSurface 	gfx/gl/GLContextProviderEGL.cpp:547
9 	libxul.so 	mozilla::layers::CompositorOGL::Resume 	gfx/layers/opengl/CompositorOGL.cpp:1396
10 	libxul.so 	mozilla::layers::CompositorOGL::Pause 	gfx/layers/opengl/CompositorOGL.cpp:1388
11 	libxul.so 	mozilla::layers::CompositorParent::ResumeComposition 	gfx/layers/ipc/CompositorParent.cpp:299
12 	libxul.so 	mozilla::layers::CompositorParent::SetEGLSurfaceSize 	gfx/layers/ipc/CompositorParent.cpp:333
13 	libxul.so 	RunnableMethod<mozilla::ipc::AsyncChannel, void 	ipc/chromium/src/base/tuple.h:400
14 	libxul.so 	mozilla::layers::CompositorParent::SetEGLSurfaceSize 	gfx/layers/ipc/CompositorParent.cpp:333
15 	libxul.so 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:337
16 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:345
17 	libxul.so 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:445
18 	libxul.so 	base::MessagePumpDefault::Run 	ipc/chromium/src/base/message_pump_default.cc:23
19 	libxul.so 	MessageLoop::RunInternal 	ipc/chromium/src/base/message_loop.cc:219
20 	libxul.so 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:212
21 	libxul.so 	base::Thread::ThreadMain 	ipc/chromium/src/base/thread.cc:160

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Agl%3A%3AGLLibraryEGL%3A%3AfMakeCurrent%28void*%2C+void*%2C+void*%2C+void*%29
More reports also at:
https://crash-stats.mozilla.com/report/list?product=FennecAndroid&signature=_ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE
Crash Signature: [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] → [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] [@ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE ]
Summary: crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ libEGL → crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE
I am not sure both signatures are related so I am removing the added one.
Crash Signature: [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] [@ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE ] → [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)]
Summary: crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE → crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ libEGL
It's #5 crasher in 24.0a2 and #15 in 25.0a1.
tracking-fennec: --- → ?
Keywords: topcrash
Keywords: needURLs
Kairo looks like "Device samsung GT-I9100" would be one of the hardware here are there any other device/OS correlations for QA to try and reproduce here ?

Also adding needsinfo on :snorp to see if anything stands out from changeset as a culprit or if there were any major graphic related landings in FX24 which may have caused this.
Flags: needinfo?(snorp)
Flags: needinfo?(kairo)
Nope I don't know of anything specifically, though I think maybe there were some compositor pause/resume changes? Kats?
Flags: needinfo?(snorp) → needinfo?(bugmail.mozilla)
Crashes in a week of Aurora:

mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*) 	6
Samsung GT-I9300 	2
Samsung GT-I9100 	2
Unknown Android 	1
HTC One 	1
Flags: needinfo?(kairo)
Assignee: nobody → bugmail.mozilla
tracking-fennec: ? → 24+
There's nothing related in the regression range in comment 0, as far as I can tell. I don't think anything related to compositor pause/resume landed in the aurora timeframe, actually.

I looked at the crash dump and the code and the code still makes sense to me (i.e. it has changed in unexpected ways). The UI thread should be blocked until the compositor renews the surface (or fails to do so) so the surface shouldn't disappear while in the middle of the compositor renew call. I also looked at a couple of submitted logcats to see if there was anything weird there and I didn't see anything that stood out.

I can stick some asserts and such in to try to get a better feel for what's going on but it's going to be slow and painful debugging via crash-stats. If we can get STR on this it would help a lot.
Adding qawanted to get some help with STR here.
Keywords: qawanted
Flags: needinfo?(bugmail.mozilla)
Keywords: qawantedsteps-wanted
Hey Kats,

I've looked at the crash-reports and there are no URLS or bug comments that would help QA to repro or get STR here.Do we have any specific ideas as to under what conditions this code path may be hit so QA can invest sometime in those scenario's else they have nothing else to get the steps here ?
The code that is shown in comment 0 runs pretty much only when the GL surface that web content is shown on is resized or created/destroyed. So in terms of testing I would try various combinations of:

1) Rotating the device (this resizes the surface)
2) Showing/hiding the virtual keyboard (this also resizes the surface)
3) Putting web content in the background, either by putting Fennec in the background, or by bringing up other content on top of it, like the awesomescreen or the settings pane. The surface is destroyed when the activity is in the background and is re-created when it comes back to the foreground.

Combinations of the above, probably with very little time in between, would be the best to try and reproduce this.
Crash Signature: [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] → [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] [@ @0x0 | mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*) ]
Keywords: needURLs
Here is a comment: "Attempt to load phone number as page. Get error page. Attempt to Google for number using suggest picker. Crash."
Adding qawanted to give this bug a shot for steps-wanted given comment #10/11.
Keywords: qawanted
needinfoing :AAron to helpout here.
Flags: needinfo?(aaron.train)
No such luck having spent more than two hours using Beta on my GT-I9100 surfing, rotating and trying to trigger this crash.
Flags: needinfo?(aaron.train)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> There's nothing related in the regression range in comment 0, as far as I
> can tell. I don't think anything related to compositor pause/resume landed
> in the aurora timeframe, actually.
> 
> I looked at the crash dump and the code and the code still makes sense to me
> (i.e. it has changed in unexpected ways). The UI thread should be blocked
> until the compositor renews the surface (or fails to do so) so the surface
> shouldn't disappear while in the middle of the compositor renew call. I also
> looked at a couple of submitted logcats to see if there was anything weird
> there and I didn't see anything that stood out.
> 
> I can stick some asserts and such in to try to get a better feel for what's
> going on but it's going to be slow and painful debugging via crash-stats. If
> we can get STR on this it would help a lot.

Kats, given QA is tried hard but is unable to reproduce this :(, can we please try some investigation by sticking some asserts as you suggest.
Is it possible for a nested event loop (on the main thread) to cause the state issue you're seeing?
Flags: needinfo?(bugmail.mozilla)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #16)
> Is it possible for a nested event loop (on the main thread) to cause the
> state issue you're seeing?

No, I don't think so. The Java UI thread should be blocked for the surface resume to finish on the compositor thread so even if the main thread goes and does other things it shouldn't unblock the UI thread, and the surface shouldn't disappear.

(In reply to bhavana bajaj [:bajaj] from comment #15)
> Kats, given QA is tried hard but is unable to reproduce this :(, can we
> please try some investigation by sticking some asserts as you suggest.

Ok, I'll spend some time putting together a patch for this.
Flags: needinfo?(bugmail.mozilla)
Attachment #799117 - Flags: review?(snorp) → review+
Note to self: none of the crashes with this signature up to Sep 11, 2013 18:26 on builds with this patch have logcat info.
Comment on attachment 799117 [details] [diff] [review]
Diagnostic logging

We want to uplift this to 25 (which goes to beta next week) so that hopefully the higher user population there will give us more data faster.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): unknown
User impact if declined: we'll have to wait longer to get useful info from this patch
Testing completed (on m-c, etc.): on m-c
Risk to taking this patch (and alternatives if risky): low risk, mobile only
String or IDL/UUID changes made by this patch: none
Attachment #799117 - Flags: approval-mozilla-aurora?
Attachment #799117 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Thanks!
kats, did you get the data you needed? What are the next steps here?
Flags: needinfo?(bugmail.mozilla)
I looked through all the crash reports in the last week and found a few with logcat:

https://crash-stats.mozilla.com/report/index/10bad6e2-9181-42eb-ab3c-09ffc2131003
https://crash-stats.mozilla.com/report/index/7494eef9-3221-444f-b97f-decd92131006
https://crash-stats.mozilla.com/report/index/ead10324-24de-4664-b004-08b692131003
https://crash-stats.mozilla.com/report/index/f97fe05b-34d3-4f23-a3f8-76e692131002
https://crash-stats.mozilla.com/report/index/ebcd6f31-f4a2-40fa-ac32-bbee92131002

A couple of these have a bunch of GL error 0x506 messages but many do not. The first crash report listed above is the simplest, in that things are working fine until we get one call to resumeCompositor and then it crashes shortly after. The resumeCompositor codepath is run because the surface changes size. This is something we normally handle just fine, and so based on the code and output I don't think this is a problem in the Java code managing the compositor. Somebody more familiar with the C++ GL code needs to take over here, I think. bjacob, do you have any spare cycles to look at this?
Flags: needinfo?(bugmail.mozilla) → needinfo?(bjacob)
I have to say that I'm not at all an expert of the surface-renewal code around here and I just happened to be the sucker who worked on bug 90020 ;-) !

With that said, I made a debug build and tried reproducing but ran into an assertion in the JS engine, which I filed as bug 925024.

At the moment I'm trying to debug past that by disabling that assertion, but, no promises.
Er. bug 900020.
Flags: needinfo?(bjacob)
I tried really hard to reproduce in a non-optimized debug build following the suggestions in comment 10. I really can't reproduce.
I don't think we'll be able to reproduce (removing qawanted), which is why we landed the logcat patch. Any hardening or speculative fixes we can try? Or does the baton need to be passed?
Keywords: qawanted
Can I see the logcat for some crash? Comment 26 alludes to that, but I don't understand how to view the logcat.
(As discussed on IRC you can view the logcat from the raw JSON dump on the "raw dump" tab of the crash report. You need to be logged in to crash-stats and have some magic permission bit set in your LDAP.)
(In reply to Alex Keybl [:akeybl] from comment #30)
> Any hardening or speculative fixes we can try?

I got nothing :(
So, one thing that is suspiscious here is that in GLContextEGL::RenewSurface we are exiting without doing anything if we already had a surface, and that is happening in particular on orientation change.

But I don't currently understand whether that is normal or not, and I still can't reproduce crashes. So the way forward is probably to first understand how these things work and what we should be doing in theory, and what we are doing in practice. Filed bug 925608 for that. Once we understand that clearly, maybe we'll have ideas of patches to try here.
Just filed bug 925645 which may or may not be related.
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
tracking-fennec: 24+ → ?
tracking-fennec: ? → +
This is one of the major regressions that has made our crash rate worsen between the 23 and 24 releases, and it's now the #5 topcrash on 25 release. Any chance for progress here?
Benoit Jacob has been looking into a family of related problems with our surface renewal. I expect that his changes in bug 925608 will fix this.
Assignee: bugmail.mozilla → nobody
Now that bug 925608 is fixed, I would appreciate testing on the latest Nightly build to see if the issues reported here still persist.
We'll need to watch crash-stats to verify, of course, as this was filed "just" based on data.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #40)
> We'll need to watch crash-stats to verify, of course, as this was filed
> "just" based on data.

Any news here?  Is this still a topcrash worth tracking?
Flags: needinfo?(kairo)
I don't see this crash on 28 or 27 but that is not helpful as only Firefox 28 has bug 925608 fixed. So this would need to be in at least beta to get the numbers of users to rank this crash.
Flags: needinfo?(kairo)
We're now heading to final Beta so we'll have to wontfix this over taking last-minute speculative landings to the branch. Since it did ship in the past two releases as well, we know we can live with this until we ship the fix in bug 925608 (which should probably be uplifted to Aurora).
Actually, according to https://bugzilla.mozilla.org/show_bug.cgi?id=925608#c43 we will not uplift to Aurora and let it ride on 28.
According to crash-stats there are no crashes on 28.0b*, 29* or 30* in the last 4 weeks with fMakeCurrent in the signature. I think we can call this fixed by 925608.
Status: NEW → RESOLVED
Closed: 6 years ago
No longer depends on: 925024
Resolution: --- → FIXED
Whiteboard: [native-crash][leave open] → [native-crash][fixed in 28+ by bug 925608]
Target Milestone: --- → mozilla28
You need to log in before you can comment on or make changes to this bug.