Closed Bug 884047 Opened 6 years ago Closed 6 years ago
crash in mozilla::gl::GLLibrary
EGL::f Make Current @ lib EGL
It first showed up in 24.0a1/20130610. The regression range might be (low volume): http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=efbe547a7972&tochange=252b1ac4d537 Signature mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*) More Reports Search UUID 23f50911-f491-4d7c-939f-638952130617 Date Processed 2013-06-17 07:07:29 Uptime 70 Install Age 14.2 hours since version was first installed. Install Time 2013-06-16 16:53:17 Product FennecAndroid Version 24.0a1 Build ID 20130616031139 Release Channel nightly OS Android OS Version 0.0.0 Linux 3.0.31-889555 #3 SMP PREEMPT Thu Jan 31 14:48:54 KST 2013 armv7l samsung/GT-I9100/GT-I9100:4.1.2/JZO54K/I9100XWLSD:user/release-keys Build Architecture arm Build Architecture Info ARMv0 Crash Reason SIGSEGV Crash Address 0x640064 App Notes AdapterDescription: 'ARM -- Mali-400 MP -- OpenGL ES 2.0 -- Model: GT-I9100, Product: GT-I9100, Manufacturer: samsung, Hardware: smdk4210' GL Layers! EGL? EGL+ GL Context? GL Context+ GL Layers+ samsung GT-I9100 samsung/GT-I9100/GT-I9100:4.1.2/JZO54K/I9100XWLSD:user/release-keys Processor Notes sp-processor01_phx1_mozilla_com_15474:2012; non-integer value of "SecondsSinceLastCrash"; exploitability tool: ERROR: unable to analyze dump EMCheckCompatibility True Adapter Vendor ID ARM Adapter Device ID Mali-400 MP Device samsung GT-I9100 Android API Version 16 (REL) Android CPU ABI armeabi-v7a Frame Module Signature Source 0 libEGL.so libEGL.so@0xabf4 1 libEGL.so libEGL.so@0xa9e3 2 libEGL.so libEGL.so@0x2212a 3 libEGL.so libEGL.so@0x22106 4 libEGL.so libEGL.so@0xc995 5 libEGL.so libEGL.so@0x22106 6 libEGL.so libEGL.so@0xc8a7 7 libxul.so mozilla::gl::GLLibraryEGL::fMakeCurrent gfx/gl/GLLibraryEGL.h:164 8 libxul.so mozilla::gl::GLContextEGL::RenewSurface gfx/gl/GLContextProviderEGL.cpp:547 9 libxul.so mozilla::layers::CompositorOGL::Resume gfx/layers/opengl/CompositorOGL.cpp:1396 10 libxul.so mozilla::layers::CompositorOGL::Pause gfx/layers/opengl/CompositorOGL.cpp:1388 11 libxul.so mozilla::layers::CompositorParent::ResumeComposition gfx/layers/ipc/CompositorParent.cpp:299 12 libxul.so mozilla::layers::CompositorParent::SetEGLSurfaceSize gfx/layers/ipc/CompositorParent.cpp:333 13 libxul.so RunnableMethod<mozilla::ipc::AsyncChannel, void ipc/chromium/src/base/tuple.h:400 14 libxul.so mozilla::layers::CompositorParent::SetEGLSurfaceSize gfx/layers/ipc/CompositorParent.cpp:333 15 libxul.so MessageLoop::RunTask ipc/chromium/src/base/message_loop.cc:337 16 libxul.so MessageLoop::DeferOrRunPendingTask ipc/chromium/src/base/message_loop.cc:345 17 libxul.so MessageLoop::DoWork ipc/chromium/src/base/message_loop.cc:445 18 libxul.so base::MessagePumpDefault::Run ipc/chromium/src/base/message_pump_default.cc:23 19 libxul.so MessageLoop::RunInternal ipc/chromium/src/base/message_loop.cc:219 20 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:212 21 libxul.so base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:160 More reports at: https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Agl%3A%3AGLLibraryEGL%3A%3AfMakeCurrent%28void*%2C+void*%2C+void*%2C+void*%29
More reports also at: https://crash-stats.mozilla.com/report/list?product=FennecAndroid&signature=_ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE
Crash Signature: [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] → [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] [@ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE ]
Summary: crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ libEGL → crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE
I am not sure both signatures are related so I am removing the added one.
Crash Signature: [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] [@ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE ] → [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)]
Summary: crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ _ZN7android13egl_display_t11loseCurrentEPNS_13egl_context_tE → crash in mozilla::gl::GLLibraryEGL::fMakeCurrent @ libEGL
It's #5 crasher in 24.0a2 and #15 in 25.0a1.
Kairo looks like "Device samsung GT-I9100" would be one of the hardware here are there any other device/OS correlations for QA to try and reproduce here ? Also adding needsinfo on :snorp to see if anything stands out from changeset as a culprit or if there were any major graphic related landings in FX24 which may have caused this.
Nope I don't know of anything specifically, though I think maybe there were some compositor pause/resume changes? Kats?
Flags: needinfo?(snorp) → needinfo?(bugmail.mozilla)
Crashes in a week of Aurora: mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*) 6 Samsung GT-I9300 2 Samsung GT-I9100 2 Unknown Android 1 HTC One 1
Assignee: nobody → bugmail.mozilla
tracking-fennec: ? → 24+
There's nothing related in the regression range in comment 0, as far as I can tell. I don't think anything related to compositor pause/resume landed in the aurora timeframe, actually. I looked at the crash dump and the code and the code still makes sense to me (i.e. it has changed in unexpected ways). The UI thread should be blocked until the compositor renews the surface (or fails to do so) so the surface shouldn't disappear while in the middle of the compositor renew call. I also looked at a couple of submitted logcats to see if there was anything weird there and I didn't see anything that stood out. I can stick some asserts and such in to try to get a better feel for what's going on but it's going to be slow and painful debugging via crash-stats. If we can get STR on this it would help a lot.
Adding qawanted to get some help with STR here.
Hey Kats, I've looked at the crash-reports and there are no URLS or bug comments that would help QA to repro or get STR here.Do we have any specific ideas as to under what conditions this code path may be hit so QA can invest sometime in those scenario's else they have nothing else to get the steps here ?
The code that is shown in comment 0 runs pretty much only when the GL surface that web content is shown on is resized or created/destroyed. So in terms of testing I would try various combinations of: 1) Rotating the device (this resizes the surface) 2) Showing/hiding the virtual keyboard (this also resizes the surface) 3) Putting web content in the background, either by putting Fennec in the background, or by bringing up other content on top of it, like the awesomescreen or the settings pane. The surface is destroyed when the activity is in the background and is re-created when it comes back to the foreground. Combinations of the above, probably with very little time in between, would be the best to try and reproduce this.
Crash Signature: [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] → [@ mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*)] [@ @0x0 | mozilla::gl::GLLibraryEGL::fMakeCurrent(void*, void*, void*, void*) ]
Here is a comment: "Attempt to load phone number as page. Get error page. Attempt to Google for number using suggest picker. Crash."
needinfoing :AAron to helpout here.
No such luck having spent more than two hours using Beta on my GT-I9100 surfing, rotating and trying to trigger this crash.
(In reply to Kartikaya Gupta (email:email@example.com) from comment #7) > There's nothing related in the regression range in comment 0, as far as I > can tell. I don't think anything related to compositor pause/resume landed > in the aurora timeframe, actually. > > I looked at the crash dump and the code and the code still makes sense to me > (i.e. it has changed in unexpected ways). The UI thread should be blocked > until the compositor renews the surface (or fails to do so) so the surface > shouldn't disappear while in the middle of the compositor renew call. I also > looked at a couple of submitted logcats to see if there was anything weird > there and I didn't see anything that stood out. > > I can stick some asserts and such in to try to get a better feel for what's > going on but it's going to be slow and painful debugging via crash-stats. If > we can get STR on this it would help a lot. Kats, given QA is tried hard but is unable to reproduce this :(, can we please try some investigation by sticking some asserts as you suggest.
Is it possible for a nested event loop (on the main thread) to cause the state issue you're seeing?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #16) > Is it possible for a nested event loop (on the main thread) to cause the > state issue you're seeing? No, I don't think so. The Java UI thread should be blocked for the surface resume to finish on the compositor thread so even if the main thread goes and does other things it shouldn't unblock the UI thread, and the surface shouldn't disappear. (In reply to bhavana bajaj [:bajaj] from comment #15) > Kats, given QA is tried hard but is unable to reproduce this :(, can we > please try some investigation by sticking some asserts as you suggest. Ok, I'll spend some time putting together a patch for this.
Attachment #799117 - Flags: review?(snorp) → review+
Whiteboard: [native-crash] → [native-crash][leave open]
Note to self: none of the crashes with this signature up to Sep 11, 2013 18:26 on builds with this patch have logcat info.
Comment on attachment 799117 [details] [diff] [review] Diagnostic logging We want to uplift this to 25 (which goes to beta next week) so that hopefully the higher user population there will give us more data faster. [Approval Request Comment] Bug caused by (feature/regressing bug #): unknown User impact if declined: we'll have to wait longer to get useful info from this patch Testing completed (on m-c, etc.): on m-c Risk to taking this patch (and alternatives if risky): low risk, mobile only String or IDL/UUID changes made by this patch: none
Attachment #799117 - Flags: approval-mozilla-aurora?
Attachment #799117 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
kats, did you get the data you needed? What are the next steps here?
I looked through all the crash reports in the last week and found a few with logcat: https://crash-stats.mozilla.com/report/index/10bad6e2-9181-42eb-ab3c-09ffc2131003 https://crash-stats.mozilla.com/report/index/7494eef9-3221-444f-b97f-decd92131006 https://crash-stats.mozilla.com/report/index/ead10324-24de-4664-b004-08b692131003 https://crash-stats.mozilla.com/report/index/f97fe05b-34d3-4f23-a3f8-76e692131002 https://crash-stats.mozilla.com/report/index/ebcd6f31-f4a2-40fa-ac32-bbee92131002 A couple of these have a bunch of GL error 0x506 messages but many do not. The first crash report listed above is the simplest, in that things are working fine until we get one call to resumeCompositor and then it crashes shortly after. The resumeCompositor codepath is run because the surface changes size. This is something we normally handle just fine, and so based on the code and output I don't think this is a problem in the Java code managing the compositor. Somebody more familiar with the C++ GL code needs to take over here, I think. bjacob, do you have any spare cycles to look at this?
Flags: needinfo?(bugmail.mozilla) → needinfo?(bjacob)
I have to say that I'm not at all an expert of the surface-renewal code around here and I just happened to be the sucker who worked on bug 90020 ;-) ! With that said, I made a debug build and tried reproducing but ran into an assertion in the JS engine, which I filed as bug 925024. At the moment I'm trying to debug past that by disabling that assertion, but, no promises.
Er. bug 900020.
I tried really hard to reproduce in a non-optimized debug build following the suggestions in comment 10. I really can't reproduce.
I don't think we'll be able to reproduce (removing qawanted), which is why we landed the logcat patch. Any hardening or speculative fixes we can try? Or does the baton need to be passed?
Can I see the logcat for some crash? Comment 26 alludes to that, but I don't understand how to view the logcat.
(As discussed on IRC you can view the logcat from the raw JSON dump on the "raw dump" tab of the crash report. You need to be logged in to crash-stats and have some magic permission bit set in your LDAP.)
(In reply to Alex Keybl [:akeybl] from comment #30) > Any hardening or speculative fixes we can try? I got nothing :(
So, one thing that is suspiscious here is that in GLContextEGL::RenewSurface we are exiting without doing anything if we already had a surface, and that is happening in particular on orientation change. But I don't currently understand whether that is normal or not, and I still can't reproduce crashes. So the way forward is probably to first understand how these things work and what we should be doing in theory, and what we are doing in practice. Filed bug 925608 for that. Once we understand that clearly, maybe we'll have ideas of patches to try here.
Just filed bug 925645 which may or may not be related.
topcrash is being replaced by more precise keywords per https://bugzilla.mozilla.org/show_bug.cgi?id=927557#c3
This is one of the major regressions that has made our crash rate worsen between the 23 and 24 releases, and it's now the #5 topcrash on 25 release. Any chance for progress here?
Benoit Jacob has been looking into a family of related problems with our surface renewal. I expect that his changes in bug 925608 will fix this.
Assignee: bugmail.mozilla → nobody
Now that bug 925608 is fixed, I would appreciate testing on the latest Nightly build to see if the issues reported here still persist.
We'll need to watch crash-stats to verify, of course, as this was filed "just" based on data.
(In reply to Robert Kaiser (:firstname.lastname@example.org) from comment #40) > We'll need to watch crash-stats to verify, of course, as this was filed > "just" based on data. Any news here? Is this still a topcrash worth tracking?
I don't see this crash on 28 or 27 but that is not helpful as only Firefox 28 has bug 925608 fixed. So this would need to be in at least beta to get the numbers of users to rank this crash.
We're now heading to final Beta so we'll have to wontfix this over taking last-minute speculative landings to the branch. Since it did ship in the past two releases as well, we know we can live with this until we ship the fix in bug 925608 (which should probably be uplifted to Aurora).
Actually, according to https://bugzilla.mozilla.org/show_bug.cgi?id=925608#c43 we will not uplift to Aurora and let it ride on 28.
According to crash-stats there are no crashes on 28.0b*, 29* or 30* in the last 4 weeks with fMakeCurrent in the signature. I think we can call this fixed by 925608.
You need to log in before you can comment on or make changes to this bug.