crash in AndroidGLController::SetGLVersion

RESOLVED FIXED in Firefox 21

Status

()

defect
--
critical
RESOLVED FIXED
7 years ago
6 years ago

People

(Reporter: scoobidiver, Assigned: kats)

Tracking

({crash})

Trunk
mozilla21
ARM
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox17 wontfix, firefox18 wontfix, firefox19 affected, firefox20 affected, firefox21 fixed)

Details

(Whiteboard: [native-crash][startupcrash], crash signature)

Attachments

(6 attachments, 3 obsolete attachments)

1.64 KB, patch
Details | Diff | Splinter Review
35.64 KB, text/plain
Details
4.10 KB, patch
snorp
: review+
kats
: checkin+
Details | Diff | Splinter Review
5.32 KB, patch
snorp
: review+
kats
: checkin+
Details | Diff | Splinter Review
2.02 KB, patch
snorp
: review+
kats
: checkin+
Details | Diff | Splinter Review
6.02 KB, patch
snorp
: review+
Details | Diff | Splinter Review
There are seven crashes over the last four weeks.

Signature 	_JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion More Reports Search
UUID	d9810d76-f5ea-4398-b387-318132120825
Date Processed	2012-08-25 02:32:38
Uptime	8
Last Crash	13 seconds before submission
Install Age	1.4 minutes since version was first installed.
Install Time	2012-08-25 02:30:21
Product	FennecAndroid
Version	17.0a1
Build ID	20120824101755
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 3.1.10-digetx-thor-4.6 #411 SMP PREEMPT Mon Jun 11 23:56:23 EEST 2012 armv7l
Build Architecture	arm
Build Architecture Info	
Crash Reason	SIGSEGV
Crash Address	0x0
App Notes 	
AdapterDescription: 'NVIDIA Corporation -- NVIDIA Tegra -- OpenGL ES 2.0 14.01002 -- Model: Iconia A500, Product: full_a500, Manufacturer: Acer, Hardware: picasso'
Acer Iconia A500
acer/a500_ww_gen1/picasso:4.0.4/IMM76L/1333032611:user/release-keys
EMCheckCompatibility	True
Adapter Vendor ID	NVIDIA Corporation
Adapter Device ID	NVIDIA Tegra

Frame 	Module 	Signature 	Source
0 	libdvm.so 	libdvm.so@0x58134 	
1 	libdvm.so 	libdvm.so@0x58113 	
2 	libxul.so 	_JNIEnv::CallVoidMethod 	jni.h:631
3 	libxul.so 	AndroidGLController::SetGLVersion 	AndroidLayerViewWrapper.cpp:54
4 	libxul.so 	mozilla::AndroidBridge::RegisterCompositor 	AndroidBridge.cpp:1219
5 	libxul.so 	mozilla::gl::GLContextProviderEGL::CreateForWindow 	GLContextProviderEGL.cpp:2201
6 	libxul.so 	mozilla::layers::LayerManagerOGL::CreateContext 	LayerManagerOGL.cpp:144
7 	libxul.so 	mozilla::layers::CompositorParent::AllocPLayers 	LayerManagerOGL.h:75
8 	libxul.so 	mozilla::layers::PCompositorParent::OnMessageReceived 	PCompositorParent.cpp:485
9 	libxul.so 	mozilla::ipc::SyncChannel::OnDispatchMessage 	SyncChannel.cpp:143
10 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	RPCChannel.cpp:400
11 	libxul.so 	RunnableMethod<mozilla::ipc::RPCChannel, bool , Tuple0>::Run 	tuple.h:383
12 	libxul.so 	mozilla::ipc::RPCChannel::DequeueTask::Run 	RPCChannel.h:425
13 	libxul.so 	MessageLoop::RunTask 	message_loop.cc:326
14 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	message_loop.cc:334
15 	libxul.so 	MessageLoop::DoWork 	message_loop.cc:434
16 	libxul.so 	base::MessagePumpDefault::Run 	message_pump_default.cc:23
17 	libxul.so 	MessageLoop::RunInternal 	message_loop.cc:208
18 	libxul.so 	MessageLoop::Run 	message_loop.cc:201
19 	libxul.so 	base::Thread::ThreadMain 	thread.cc:156
20 	libxul.so 	ThreadFunc 	platform_thread_posix.cc:31
21 	libc.so 	libc.so@0x12f3a 	
22 	libc.so 	libc.so@0x12a66

More reports at:
https://crash-stats.mozilla.com/report/list?signature=_JNIEnv%3A%3ACallVoidMethod+|+AndroidGLController%3A%3ASetGLVersion
The only possibility I can think of here is that the call to NewGlobalRef in AndroidGLController::Acquire returns null, so mJObj is null on the call to SetGLVersion. There isn't really any sane way to recover from that situation; putting in a guard here will will just move the crash elsewhere, if that's the cause.
Also on the 17 branch this hasn't happened since the July 18 build, so there's a good chance it was caused by something else and has since been fixed. We should wait and see if it there are any crashes from this in newer code before doing anything.
(In reply to Kartikaya Gupta (:kats) from comment #2)
> Also on the 17 branch this hasn't happened since the July 18 build
No, the latest crashes happened in 17.0a1/20120824, which is recent.
Oh, my mistake. Those were marked as "dup" so I ignored them. But they don't appear to be dups of anything sane.
Whiteboard: [native-crash] → [native-crash][startupcrash]
It still happens in the trunk: bp-92d3b851-5d72-496f-bf03-b72d12121105.
Crash Signature: [@ _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion] → [@ _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion] [@ JNI_CreateJavaVM | _JNIEnv::CallVoidMethod | AndroidGLController::SetGLVersion] [@ _JNIEnv::CallVoidMethod(_jobject*, _jmethodID*, ...) | AndroidGLController::SetGLVersion(int)]
Assignee: nobody → bugmail.mozilla
Blocks: 820167
A small subset of Avi's patch in bug 820167 reliably triggers this crash:

https://tbpl.mozilla.org/?tree=Try&rev=82d078ed3eca
This appears to be caused by the compositor registration (which is triggered by the GetLayerManager call) happening before LayerView creation:

12-19 17:06:20.777 I/Gecko   ( 3903): AndroidBridge::RegisterCompositor
12-19 17:06:20.777 E/GeckoLayerView( 3903): Error registering compositor!
12-19 17:06:20.777 E/GeckoLayerView( 3903): java.lang.NullPointerException
12-19 17:06:20.777 E/GeckoLayerView( 3903): 	at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:349)
12-19 17:06:20.777 E/GeckoLayerView( 3903): 	at dalvik.system.NativeStart.run(Native Method)
12-19 17:06:20.777 E/GeckoLayerView( 3903): 	at dalvik.system.NativeStart.run(Native Method)
12-19 17:06:20.777 F/libc    ( 3903): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 3944 (Compositor)
Actually it's the mLayerView.mListener that's null. This should be easy enough to fix.
The simple fix didn't work :(

12-19 17:28:40.808 E/libEGL  ( 4083): EGLNativeWindowType 0x5d33c028 already connected to another API
12-19 17:28:40.808 E/libEGL  ( 4083): eglCreateWindowSurface:298 error 300b (EGL_BAD_NATIVE_WINDOW)
12-19 17:28:40.808 W/System.err( 4083): org.mozilla.gecko.gfx.GLController$GLControllerException: EGL window surface could not be created! Error 12299
12-19 17:28:40.808 W/System.err( 4083): 	at org.mozilla.gecko.gfx.GLController.provideEGLSurface(GLController.java:176)
12-19 17:28:40.808 W/System.err( 4083): 	at dalvik.system.NativeStart.run(Native Method)
12-19 17:28:40.808 W/System.err( 4083): 	at dalvik.system.NativeStart.run(Native Method)
12-19 17:28:40.816 I/Gecko   ( 4083): WARNING: Failed to create LayerManagerOGL context: file /Users/kats/zspace/mozilla-git/gfx/layers/opengl/LayerManagerOGL.cpp, line 469
12-19 17:28:40.816 I/Gecko   ( 4083): ###!!! ABORT: We need a context on Android: file /Users/kats/zspace/mozilla-git/gfx/layers/opengl/LayerManagerOGL.cpp, line 497
12-19 17:28:40.816 F/libc    ( 4083): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 4130 (Compositor)

I'll need to take a closer look at how all the compositor startup binding code works. CC'ing snorp as well for any ideas here.
After some more investigation the problem appears to be that when GetLayerManager is called really early on, it is called on an nsWindow instance that is different from the nsWindow instance in the "steady state" later. I don't know why we create a bunch of different nsWindow instances on startup but I will look into that. (The different nsWindow instances have different layer manager variables and so we end up creating two different compositors, resulting in the crash).
So the nsWindow creation actually seems relatively sane after more investigation. I'm documenting what I found here so that it's recorded somewhere. 6 nsWindow instances are created on Fennec startup; see the attached backtraces to nsWindow::nsWindow(). The first one is the hidden window (I don't know exactly what this is used for, but apparently it has some useful purpose); the second is the content window inside that hidden window to load about:blank. The third is the main XUL window and the fourth is again the corresponding about:blank content window inside. The fifth is the actual hiddenWindow.html resource file that is loaded and replaces the about:blank that was the 2nd window created. And finally the sixth window created is for browser.xul and it replaces the about:blank that was the 4th window created.

With the failure patch (attachment #694032 [details] [diff] [review]) is applied, the first call to nsWindow::GetLayerManager happens on the hiddenWindow.html window, and ends up initializing the graphics stack there even though it's a hidden window. The next call to GetLayerManager (which is what happens normally) is called on the browser.xul window, creates a second compositor, and crashes.

I have patches that seem to fix this in a reasonable way; coming shortly.
Also scope some things down and fix up documentation.
Attachment #694490 - Flags: review?(snorp)
I don't know why we were doing this and I don't care.
Attachment #694491 - Flags: review?(snorp)
This makes GetLayerManager when called on the hiddenWindow.html window to just return null and not initialize a bunch of stuff improperly.
Attachment #694492 - Flags: review?(snorp)
Comment on attachment 694492 [details] [diff] [review]
(3/4) - Don't create a layer manager for a hidden window

Actually this might be wrong. TopWindow() doesn't return what I thought it did. Let me test more and fix.
Attachment #694492 - Flags: review?(snorp)
Ok, this seems to do the job. I don't need part 4 after all.
Attachment #694492 - Attachment is obsolete: true
Attachment #694514 - Flags: review?(snorp)
Attachment #694490 - Attachment description: (1/4) Cleanup - delete unused code → (1/3) - Cleanup - delete unused code
Attachment #694491 - Attachment description: (2/4) - Delete even more useless code → (2/3) - Delete even more useless code
Attachment #694490 - Flags: review?(snorp) → review+
Attachment #694491 - Flags: review?(snorp) → review+
Comment on attachment 694514 [details] [diff] [review]
(3/3) Don't create a layer manager for a hidden window

Review of attachment 694514 [details] [diff] [review]:
-----------------------------------------------------------------

not sure I fully understand everything going on here, but I guess it makes sense enough to r+
Attachment #694514 - Flags: review?(snorp) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12)
> The first one is the hidden
> window (I don't know exactly what this is used for, but apparently it has
> some useful purpose)

mfinkle pointed me to bug 71895 which is trying to rip this out.
This is still crashing with Avi's patch in bug 820167:

https://tbpl.mozilla.org/?tree=Try&rev=ba3232b40309
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Also crashes with the "reliable trigger" patch in this bug:

https://tbpl.mozilla.org/?tree=Try&rev=fd52a72f6f36
Interesting, I wasn't seeing this crash when I ran it locally. It looks like this one is even more timing-dependent since not all of the try runs are failing. I'll put up a patch to guard against the NPE that is causing this.
An update: I have a patch [1] that fixes the crash and that works perfectly when I build it locally. However when I make a try build out of it (e.g. [2]) and try to run on that on my device I get strange behaviour that looks like the compositor is always paused. Basically nothing gets painted to the screen, and I can't reproduce it on a local build so I don't have any good way of debugging it. I'm not sure how to move forward with this.

[1] https://hg.mozilla.org/try/rev/d20181d0dc75
[2] https://tbpl.mozilla.org/?tree=Try&rev=aed61693c996
I'm able to artificially reproduce the crash in comment #23 locally if I bump some stuff off to another thread and make it sleep() first. However my patch in comment #25 fixes the crash with no other apparent side-effects. I'm not sure why the try build corresponding to that patch had the weird blank screen issues. I'll try a few more things.
Ugh. Turns out this is being caused by the SetCompositor(NULL, NULL) in the nsWindow destructor. The early call to GetLayerManager() means that the compositor gets created early, and is then nulled out when the nsWindow destructor runs on windows #5 and #6 (see comment 12). Without the early call, the compositor is only created after those windows are destroyed, and so the compositor is never nulled out and the problem doesn't happen.
Pushed this to try at https://tbpl.mozilla.org/?tree=Try&rev=7e70d0658d03, will request review once that's green. It worked locally. I don't know why we null out the compositor in the destructor but given that we store the compositor as a static it seems like a bad idea to be nulling it out from an nsWindow instance (of which we have a bunch).

As for the listener hunk, I verified that none of the listener methods actually gets called before gecko is ready during normal startup flow, and if they do, then we want to process them normally. I also considered taking out the null checks for mListener in LayerView but it looks like there's still a period between LayerView construction and LayerView.initializeView() getting called where mListener will be null and I didn't want to risk hitting that NPE.
That weird INJECT_EVENTS permission error still seems to be happening, not sure what's going on there. Looks like it happened before, back in bug 766445, but went away by itself.
And it started happening on inbound as well, bug 825426. I can reproduce it (slightly different error message) when I run the tests locally on a GN so I'll try to debug it.
So what I think is happening here is ScheduleResumeComposition, which was intended to only get called from the java UI thread, is now getting called from the gecko thread (as of bug 797942 / cset ef02503c4387). The implementation of this function eventually calls back into java to get a GL surface, but since it's not on the java UI thread, the surface could have disappeared by then. This happens if you open the awesomescreen during startup. This locks up basically all of the threads until the awesome screen is dismissed, which may not even be possible because of deadlock.
Posted patch (5/3) Kill the deadlock (obsolete) — Splinter Review
This waitForValidSurface() code has irked me since it came into existence because it causes the compositor thread to block on java. And since the gecko thread blocks on the compositor thread in CompositorParent::ScheduleResumeOnCompositorThread, the gecko thread gets blocked too. Getting rid of it and letting the compositor remain in a paused state if a surface is not available seems to fix the problem. I have a small try run at https://tbpl.mozilla.org/?tree=Try&rev=200a4f0ded28 which shows the testAllPagesTab is fixed; pushed a more comprehensive try run to https://tbpl.mozilla.org/?tree=Try&rev=e972c0fb8292. Will request review if that's good.
Comment on attachment 697437 [details] [diff] [review]
(5/3) Kill the deadlock

This patch almost works, but not quite. I am moving this fix into bug 826300 since it's a pretty large bug in its own right.
Attachment #697437 - Attachment is obsolete: true
In addition to moving the setListener call earlier in java-land, I'm also making some changes to the c++ code to prevent creation of multiple layer managers. The previous patches were insufficient for fixing pageloader-based talos tests, because those tests create yet another nsWindow instance and the check that I added for hidden windows doesn't guard against that. With this patch I keep a static layer manager that corresponds to the static compositor and is shared across windows that are using OMTC. This has to be nulled out on shutdown because otherwise there's some shutdown crash because of leaked objects.

Try run with this patch and the one from bug 826300 is at https://tbpl.mozilla.org/?tree=Try&rev=f846a02b7c2e
Attachment #696358 - Attachment is obsolete: true
Attachment #698757 - Flags: review?(snorp)
Blocks: 828126
Comment on attachment 698757 [details] [diff] [review]
(4/3) Allow the compositor to be created earlier

Review of attachment 698757 [details] [diff] [review]:
-----------------------------------------------------------------

looks good
Attachment #698757 - Flags: review?(snorp) → review+
Target Milestone: mozilla20 → mozilla21
https://hg.mozilla.org/mozilla-central/rev/dcaee859dcf5
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Blocks: 807606
Comment on attachment 698757 [details] [diff] [review]
(4/3) Allow the compositor to be created earlier

[Approval Request Comment]
Bug caused by (feature/regressing bug #): none
User impact if declined: crashes with the signature in bug 834243
Testing completed (on m-c, etc.): on m-c
Risk to taking this patch (and alternatives if risky): low risk, android only
String or UUID changes made by this patch: none

Note that the first three patches on this bug are already on aurora because they landed when m-c was FF20.
Attachment #698757 - Flags: approval-mozilla-aurora?
Blocks: 834243
Attachment #698757 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I haven't been able to figure out why the above push made everything orange. My initial thought was that it needed bug 826300 as well but a try push with both patches was also just as orange. I can run the build from aurora on my device directly without issue, and a debug build I made using the same cset also works fine locally. Without a stack for the segfault I'm not sure I can figure out why this is crashing.
After some more investigation it looks like everything goes orange because the first call to renderRequested() crashes. Based on the crash info in e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=19424393&tree=Try&full=1#error0 and an disassembly of libmozglue.so I think the problem is that GeckoAppShell.scheduleComposite's native implemention in AndroidJNI.cpp calls nsWindow::ScheduleComposite(), and if libxul isn't loaded yet (which it doesn't appear to be) then this call resolves to 0x0 and it crashes.

It's odd that this is getting tickled on aurora tests but nowhere else though.
Thinking about this more, I think this might be a legitimate startup crash that could occur in the wild and not get reported since it happens so early (before libxul is fully loaded). I will file a separate bug for this.
Comment on attachment 698757 [details] [diff] [review]
(4/3) Allow the compositor to be created earlier

Based on the number of problems this patch has turned up and the depth of the rabbit hole I've been pulled in to, I no longer think this is safe for uplift. Dropping the aurora approval flag.
Attachment #698757 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.