Closed Bug 785597 Opened 9 years ago Closed 9 years ago
crash in Android
1.64 KB, patch
|Details | Diff | Splinter Review|
35.64 KB, text/plain
4.10 KB, patch
|Details | Diff | Splinter Review|
5.32 KB, patch
|Details | Diff | Splinter Review|
2.02 KB, patch
|Details | Diff | Splinter Review|
6.02 KB, patch
|Details | Diff | Splinter Review|
There are seven crashes over the last four weeks. Signature _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion More Reports Search UUID d9810d76-f5ea-4398-b387-318132120825 Date Processed 2012-08-25 02:32:38 Uptime 8 Last Crash 13 seconds before submission Install Age 1.4 minutes since version was first installed. Install Time 2012-08-25 02:30:21 Product FennecAndroid Version 17.0a1 Build ID 20120824101755 Release Channel nightly OS Linux OS Version 0.0.0 Linux 3.1.10-digetx-thor-4.6 #411 SMP PREEMPT Mon Jun 11 23:56:23 EEST 2012 armv7l Build Architecture arm Build Architecture Info Crash Reason SIGSEGV Crash Address 0x0 App Notes AdapterDescription: 'NVIDIA Corporation -- NVIDIA Tegra -- OpenGL ES 2.0 14.01002 -- Model: Iconia A500, Product: full_a500, Manufacturer: Acer, Hardware: picasso' Acer Iconia A500 acer/a500_ww_gen1/picasso:4.0.4/IMM76L/1333032611:user/release-keys EMCheckCompatibility True Adapter Vendor ID NVIDIA Corporation Adapter Device ID NVIDIA Tegra Frame Module Signature Source 0 libdvm.so libdvm.so@0x58134 1 libdvm.so libdvm.so@0x58113 2 libxul.so _JNIEnv::CallVoidMethod jni.h:631 3 libxul.so AndroidGLController::SetGLVersion AndroidLayerViewWrapper.cpp:54 4 libxul.so mozilla::AndroidBridge::RegisterCompositor AndroidBridge.cpp:1219 5 libxul.so mozilla::gl::GLContextProviderEGL::CreateForWindow GLContextProviderEGL.cpp:2201 6 libxul.so mozilla::layers::LayerManagerOGL::CreateContext LayerManagerOGL.cpp:144 7 libxul.so mozilla::layers::CompositorParent::AllocPLayers LayerManagerOGL.h:75 8 libxul.so mozilla::layers::PCompositorParent::OnMessageReceived PCompositorParent.cpp:485 9 libxul.so mozilla::ipc::SyncChannel::OnDispatchMessage SyncChannel.cpp:143 10 libxul.so mozilla::ipc::RPCChannel::OnMaybeDequeueOne RPCChannel.cpp:400 11 libxul.so RunnableMethod<mozilla::ipc::RPCChannel, bool , Tuple0>::Run tuple.h:383 12 libxul.so mozilla::ipc::RPCChannel::DequeueTask::Run RPCChannel.h:425 13 libxul.so MessageLoop::RunTask message_loop.cc:326 14 libxul.so MessageLoop::DeferOrRunPendingTask message_loop.cc:334 15 libxul.so MessageLoop::DoWork message_loop.cc:434 16 libxul.so base::MessagePumpDefault::Run message_pump_default.cc:23 17 libxul.so MessageLoop::RunInternal message_loop.cc:208 18 libxul.so MessageLoop::Run message_loop.cc:201 19 libxul.so base::Thread::ThreadMain thread.cc:156 20 libxul.so ThreadFunc platform_thread_posix.cc:31 21 libc.so libc.so@0x12f3a 22 libc.so libc.so@0x12a66 More reports at: https://crash-stats.mozilla.com/report/list?signature=_JNIEnv%3A%3ACallVoidMethod+|+AndroidGLController%3A%3ASetGLVersion
The only possibility I can think of here is that the call to NewGlobalRef in AndroidGLController::Acquire returns null, so mJObj is null on the call to SetGLVersion. There isn't really any sane way to recover from that situation; putting in a guard here will will just move the crash elsewhere, if that's the cause.
Also on the 17 branch this hasn't happened since the July 18 build, so there's a good chance it was caused by something else and has since been fixed. We should wait and see if it there are any crashes from this in newer code before doing anything.
(In reply to Kartikaya Gupta (:kats) from comment #2) > Also on the 17 branch this hasn't happened since the July 18 build No, the latest crashes happened in 17.0a1/20120824, which is recent.
Oh, my mistake. Those were marked as "dup" so I ignored them. But they don't appear to be dups of anything sane.
Whiteboard: [native-crash] → [native-crash][startupcrash]
It still happens in the trunk: bp-92d3b851-5d72-496f-bf03-b72d12121105.
Crash Signature: [@ _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion] → [@ _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion] [@ JNI_CreateJavaVM | _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion] [@ _JNIEnv::CallVoidMethod(_jobject*, _jmethodID*, ...) | AndroidGLController::SetGLVersion(int)]
A small subset of Avi's patch in bug 820167 reliably triggers this crash: https://tbpl.mozilla.org/?tree=Try&rev=82d078ed3eca
This appears to be caused by the compositor registration (which is triggered by the GetLayerManager call) happening before LayerView creation: 12-19 17:06:20.777 I/Gecko ( 3903): AndroidBridge::RegisterCompositor 12-19 17:06:20.777 E/GeckoLayerView( 3903): Error registering compositor! 12-19 17:06:20.777 E/GeckoLayerView( 3903): java.lang.NullPointerException 12-19 17:06:20.777 E/GeckoLayerView( 3903): at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:349) 12-19 17:06:20.777 E/GeckoLayerView( 3903): at dalvik.system.NativeStart.run(Native Method) 12-19 17:06:20.777 E/GeckoLayerView( 3903): at dalvik.system.NativeStart.run(Native Method) 12-19 17:06:20.777 F/libc ( 3903): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 3944 (Compositor)
Actually it's the mLayerView.mListener that's null. This should be easy enough to fix.
The simple fix didn't work :( 12-19 17:28:40.808 E/libEGL ( 4083): EGLNativeWindowType 0x5d33c028 already connected to another API 12-19 17:28:40.808 E/libEGL ( 4083): eglCreateWindowSurface:298 error 300b (EGL_BAD_NATIVE_WINDOW) 12-19 17:28:40.808 W/System.err( 4083): org.mozilla.gecko.gfx.GLController$GLControllerException: EGL window surface could not be created! Error 12299 12-19 17:28:40.808 W/System.err( 4083): at org.mozilla.gecko.gfx.GLController.provideEGLSurface(GLController.java:176) 12-19 17:28:40.808 W/System.err( 4083): at dalvik.system.NativeStart.run(Native Method) 12-19 17:28:40.808 W/System.err( 4083): at dalvik.system.NativeStart.run(Native Method) 12-19 17:28:40.816 I/Gecko ( 4083): WARNING: Failed to create LayerManagerOGL context: file /Users/kats/zspace/mozilla-git/gfx/layers/opengl/LayerManagerOGL.cpp, line 469 12-19 17:28:40.816 I/Gecko ( 4083): ###!!! ABORT: We need a context on Android: file /Users/kats/zspace/mozilla-git/gfx/layers/opengl/LayerManagerOGL.cpp, line 497 12-19 17:28:40.816 F/libc ( 4083): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 4130 (Compositor) I'll need to take a closer look at how all the compositor startup binding code works. CC'ing snorp as well for any ideas here.
After some more investigation the problem appears to be that when GetLayerManager is called really early on, it is called on an nsWindow instance that is different from the nsWindow instance in the "steady state" later. I don't know why we create a bunch of different nsWindow instances on startup but I will look into that. (The different nsWindow instances have different layer manager variables and so we end up creating two different compositors, resulting in the crash).
So the nsWindow creation actually seems relatively sane after more investigation. I'm documenting what I found here so that it's recorded somewhere. 6 nsWindow instances are created on Fennec startup; see the attached backtraces to nsWindow::nsWindow(). The first one is the hidden window (I don't know exactly what this is used for, but apparently it has some useful purpose); the second is the content window inside that hidden window to load about:blank. The third is the main XUL window and the fourth is again the corresponding about:blank content window inside. The fifth is the actual hiddenWindow.html resource file that is loaded and replaces the about:blank that was the 2nd window created. And finally the sixth window created is for browser.xul and it replaces the about:blank that was the 4th window created. With the failure patch (attachment #694032 [details] [diff] [review]) is applied, the first call to nsWindow::GetLayerManager happens on the hiddenWindow.html window, and ends up initializing the graphics stack there even though it's a hidden window. The next call to GetLayerManager (which is what happens normally) is called on the browser.xul window, creates a second compositor, and crashes. I have patches that seem to fix this in a reasonable way; coming shortly.
Also scope some things down and fix up documentation.
Attachment #694490 - Flags: review?(snorp)
I don't know why we were doing this and I don't care.
Attachment #694491 - Flags: review?(snorp)
This makes GetLayerManager when called on the hiddenWindow.html window to just return null and not initialize a bunch of stuff improperly.
Comment on attachment 694492 [details] [diff] [review] (3/4) - Don't create a layer manager for a hidden window Actually this might be wrong. TopWindow() doesn't return what I thought it did. Let me test more and fix.
Ok, this seems to do the job. I don't need part 4 after all.
Attachment #694490 - Attachment description: (1/4) Cleanup - delete unused code → (1/3) - Cleanup - delete unused code
Attachment #694491 - Attachment description: (2/4) - Delete even more useless code → (2/3) - Delete even more useless code
Attachment #694490 - Flags: review?(snorp) → review+
Attachment #694491 - Flags: review?(snorp) → review+
Comment on attachment 694514 [details] [diff] [review] (3/3) Don't create a layer manager for a hidden window Review of attachment 694514 [details] [diff] [review]: ----------------------------------------------------------------- not sure I fully understand everything going on here, but I guess it makes sense enough to r+
Attachment #694514 - Flags: review?(snorp) → review+
(In reply to Kartikaya Gupta (email:firstname.lastname@example.org) from comment #12) > The first one is the hidden > window (I don't know exactly what this is used for, but apparently it has > some useful purpose) mfinkle pointed me to bug 71895 which is trying to rip this out.
https://hg.mozilla.org/mozilla-central/rev/7315385d2b38 https://hg.mozilla.org/mozilla-central/rev/3696e9df6e14 https://hg.mozilla.org/mozilla-central/rev/d9611ad3d8bf
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
This is still crashing with Avi's patch in bug 820167: https://tbpl.mozilla.org/?tree=Try&rev=ba3232b40309
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Also crashes with the "reliable trigger" patch in this bug: https://tbpl.mozilla.org/?tree=Try&rev=fd52a72f6f36
Interesting, I wasn't seeing this crash when I ran it locally. It looks like this one is even more timing-dependent since not all of the try runs are failing. I'll put up a patch to guard against the NPE that is causing this.
An update: I have a patch  that fixes the crash and that works perfectly when I build it locally. However when I make a try build out of it (e.g. ) and try to run on that on my device I get strange behaviour that looks like the compositor is always paused. Basically nothing gets painted to the screen, and I can't reproduce it on a local build so I don't have any good way of debugging it. I'm not sure how to move forward with this.  https://hg.mozilla.org/try/rev/d20181d0dc75  https://tbpl.mozilla.org/?tree=Try&rev=aed61693c996
I'm able to artificially reproduce the crash in comment #23 locally if I bump some stuff off to another thread and make it sleep() first. However my patch in comment #25 fixes the crash with no other apparent side-effects. I'm not sure why the try build corresponding to that patch had the weird blank screen issues. I'll try a few more things.
Ugh. Turns out this is being caused by the SetCompositor(NULL, NULL) in the nsWindow destructor. The early call to GetLayerManager() means that the compositor gets created early, and is then nulled out when the nsWindow destructor runs on windows #5 and #6 (see comment 12). Without the early call, the compositor is only created after those windows are destroyed, and so the compositor is never nulled out and the problem doesn't happen.
Attachment #694490 - Flags: checkin+
Attachment #694491 - Flags: checkin+
Attachment #694514 - Flags: checkin+
Pushed this to try at https://tbpl.mozilla.org/?tree=Try&rev=7e70d0658d03, will request review once that's green. It worked locally. I don't know why we null out the compositor in the destructor but given that we store the compositor as a static it seems like a bad idea to be nulling it out from an nsWindow instance (of which we have a bunch). As for the listener hunk, I verified that none of the listener methods actually gets called before gecko is ready during normal startup flow, and if they do, then we want to process them normally. I also considered taking out the null checks for mListener in LayerView but it looks like there's still a period between LayerView construction and LayerView.initializeView() getting called where mListener will be null and I didn't want to risk hitting that NPE.
That weird INJECT_EVENTS permission error still seems to be happening, not sure what's going on there. Looks like it happened before, back in bug 766445, but went away by itself.
And it started happening on inbound as well, bug 825426. I can reproduce it (slightly different error message) when I run the tests locally on a GN so I'll try to debug it.
Depends on: 825426
So what I think is happening here is ScheduleResumeComposition, which was intended to only get called from the java UI thread, is now getting called from the gecko thread (as of bug 797942 / cset ef02503c4387). The implementation of this function eventually calls back into java to get a GL surface, but since it's not on the java UI thread, the surface could have disappeared by then. This happens if you open the awesomescreen during startup. This locks up basically all of the threads until the awesome screen is dismissed, which may not even be possible because of deadlock.
This waitForValidSurface() code has irked me since it came into existence because it causes the compositor thread to block on java. And since the gecko thread blocks on the compositor thread in CompositorParent::ScheduleResumeOnCompositorThread, the gecko thread gets blocked too. Getting rid of it and letting the compositor remain in a paused state if a surface is not available seems to fix the problem. I have a small try run at https://tbpl.mozilla.org/?tree=Try&rev=200a4f0ded28 which shows the testAllPagesTab is fixed; pushed a more comprehensive try run to https://tbpl.mozilla.org/?tree=Try&rev=e972c0fb8292. Will request review if that's good.
Depends on: 826300
Comment on attachment 697437 [details] [diff] [review] (5/3) Kill the deadlock This patch almost works, but not quite. I am moving this fix into bug 826300 since it's a pretty large bug in its own right.
Attachment #697437 - Attachment is obsolete: true
No longer depends on: 825426
In addition to moving the setListener call earlier in java-land, I'm also making some changes to the c++ code to prevent creation of multiple layer managers. The previous patches were insufficient for fixing pageloader-based talos tests, because those tests create yet another nsWindow instance and the check that I added for hidden windows doesn't guard against that. With this patch I keep a static layer manager that corresponds to the static compositor and is shared across windows that are using OMTC. This has to be nulled out on shutdown because otherwise there's some shutdown crash because of leaked objects. Try run with this patch and the one from bug 826300 is at https://tbpl.mozilla.org/?tree=Try&rev=f846a02b7c2e
Comment on attachment 698757 [details] [diff] [review] (4/3) Allow the compositor to be created earlier Review of attachment 698757 [details] [diff] [review]: ----------------------------------------------------------------- looks good
Attachment #698757 - Flags: review?(snorp) → review+
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Comment on attachment 698757 [details] [diff] [review] (4/3) Allow the compositor to be created earlier [Approval Request Comment] Bug caused by (feature/regressing bug #): none User impact if declined: crashes with the signature in bug 834243 Testing completed (on m-c, etc.): on m-c Risk to taking this patch (and alternatives if risky): low risk, android only String or UUID changes made by this patch: none Note that the first three patches on this bug are already on aurora because they landed when m-c was FF20.
Attachment #698757 - Flags: approval-mozilla-aurora?
Attachment #698757 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Backed out for turning Android tests orange. https://hg.mozilla.org/releases/mozilla-aurora/rev/019452f32c21
I haven't been able to figure out why the above push made everything orange. My initial thought was that it needed bug 826300 as well but a try push with both patches was also just as orange. I can run the build from aurora on my device directly without issue, and a debug build I made using the same cset also works fine locally. Without a stack for the segfault I'm not sure I can figure out why this is crashing.
After some more investigation it looks like everything goes orange because the first call to renderRequested() crashes. Based on the crash info in e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=19424393&tree=Try&full=1#error0 and an disassembly of libmozglue.so I think the problem is that GeckoAppShell.scheduleComposite's native implemention in AndroidJNI.cpp calls nsWindow::ScheduleComposite(), and if libxul isn't loaded yet (which it doesn't appear to be) then this call resolves to 0x0 and it crashes. It's odd that this is getting tickled on aurora tests but nowhere else though.
Thinking about this more, I think this might be a legitimate startup crash that could occur in the wild and not get reported since it happens so early (before libxul is fully loaded). I will file a separate bug for this.
Comment on attachment 698757 [details] [diff] [review] (4/3) Allow the compositor to be created earlier Based on the number of problems this patch has turned up and the depth of the rabbit hole I've been pulled in to, I no longer think this is safe for uplift. Dropping the aurora approval flag.
Attachment #698757 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.