Closed Bug 834243 Opened 11 years ago Closed 9 years ago

crash in mozilla::layers::LayerManagerOGL/CompositorOGL::Initialize with abort message: "We need a context on Android"

Categories

(Core Graveyard :: Widget: Android, defect, P5)

All
Android
defect

Tracking

(firefox20+ wontfix, firefox21+ affected, firefox22 affected, firefox23 affected, firefox24 affected, firefox25 affected, fennec+)

RESOLVED INCOMPLETE
Tracking Status
firefox20 + wontfix
firefox21 + affected
firefox22 --- affected
firefox23 --- affected
firefox24 --- affected
firefox25 --- affected
fennec + ---

People

(Reporter: scoobidiver, Assigned: bjacob)

References

Details

(Keywords: crash, Whiteboard: [native-crash][startupcrash][leave open])

Crash Data

Attachments

(5 files, 6 obsolete files)

1.07 KB, patch
kats
: review+
Details | Diff | Splinter Review
1.21 KB, patch
vlad
: review+
Details | Diff | Splinter Review
4.72 KB, patch
snorp
: review+
Details | Diff | Splinter Review
5.71 KB, patch
snorp
: review+
Details | Diff | Splinter Review
1.86 KB, patch
BenWa
: review+
Details | Diff | Splinter Review
There are still crashes after the fix of bug 785597.
It's #10 top crasher in 20.0a2 and #7 in 21.0a1.

Signature 	mozalloc_abort(char const*) | NS_DebugBreak_P | mozilla::layers::LayerManagerOGL::Initialize(nsRefPtr<mozilla::gl::GLContext>, bool) More Reports Search
UUID	9a089ff3-30f0-4bf7-8cae-f24d12130123
Date Processed	2013-01-23 23:31:24
Uptime	9
Last Crash	57 seconds before submission
Install Age	2.1 hours since version was first installed.
Install Time	2013-01-23 21:26:39
Product	FennecAndroid
Version	21.0a1
Build ID	20130123052720
Release Channel	nightly
OS	Android
OS Version	0.0.0 Linux 3.0.8 #1 SMP PREEMPT Thu Apr 19 16:30:15 CST 2012 armv7l Huawei/MediaPad/hws7300u:4.0.3/HuaweiMediaPad/S7301uV1R2C232B002:user/release-keys
Build Architecture	arm
Build Architecture Info	
Crash Reason	SIGSEGV
Crash Address	0x0
App Notes 	
AdapterDescription: 'Qualcomm -- Adreno (TM) 220 -- OpenGL ES 2.0 2184622 -- Model: HUAWEI MediaPad, Product: MediaPad, Manufacturer: HUAWEI, Hardware: hws7300u'
xpcom_runtime_abort(###!!! ABORT: We need a context on Android: file ../../../gfx/layers/opengl/LayerManagerOGL.cpp, line 498)
HUAWEI HUAWEI MediaPad
Huawei/MediaPad/hws7300u:4.0.3/HuaweiMediaPad/S7301uV1R2C232B002:user/release-keys
Processor Notes 	sp-processor02.phx1.mozilla.com_15785:2008; exploitablity tool: ERROR: unable to analyze dump
EMCheckCompatibility	True
Adapter Vendor ID	Qualcomm
Adapter Device ID	Adreno (TM) 220
Device	HUAWEI HUAWEI MediaPad
Android API Version	15 (REL)
Android CPU ABI	armeabi-v7a

Frame 	Module 	Signature 	Source
0 	libmozalloc.so 	mozalloc_abort 	mozalloc_abort.cpp:30
1 	libxul.so 	NS_DebugBreak_P 	nsDebugImpl.cpp:422
2 	libxul.so 	mozilla::layers::LayerManagerOGL::Initialize 	LayerManagerOGL.cpp:498
3 	libxul.so 	mozilla::layers::LayerManagerOGL::Initialize 	LayerManagerOGL.cpp:60
4 	libxul.so 	mozilla::layers::CompositorParent::AllocPLayers 	CompositorParent.cpp:1150
5 	libxul.so 	mozilla::layers::PCompositorParent::OnMessageReceived 	PCompositorParent.cpp:591
6 	libxul.so 	mozilla::ipc::SyncChannel::OnDispatchMessage 	SyncChannel.cpp:145
7 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	RPCChannel.cpp:400
8 	libxul.so 	RunnableMethod<IPC::ChannelProxy::Context, void 	tuple.h:383
9 	libxul.so 	mozilla::ipc::RPCChannel::DequeueTask::Run 	RPCChannel.h:425
10 	libxul.so 	MessageLoop::RunTask 	message_loop.cc:333
11 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	message_loop.cc:341
12 	libxul.so 	MessageLoop::DoWork 	message_loop.cc:441
13 	libxul.so 	base::MessagePumpDefault::Run 	message_pump_default.cc:23
14 	libxul.so 	MessageLoop::RunInternal 	message_loop.cc:215
15 	libxul.so 	MessageLoop::Run 	message_loop.cc:208
16 	libxul.so 	base::Thread::ThreadMain 	thread.cc:156
17 	libxul.so 	ThreadFunc 	platform_thread_posix.cc:39
18 	libc.so 	libc.so@0x1317e 	
19 	libc.so 	libc.so@0x12cd2 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char+const*%29+|+NS_DebugBreak_P+|+mozilla%3A%3Alayers%3A%3ALayerManagerOGL%3A%3AInitialize%28nsRefPtr%3Cmozilla%3A%3Agl%3A%3AGLContext%3E%2C+bool%29
The crash is caused by failing to initialize a valid GL context. Looking at the logcat from the crash report at bp-a22c95d0-f6bf-4a71-ae9e-001282130116 (which is the only one I found since bug 785597 landed and is on API 16+), I see this:

16:34:10.769 E GeckoConsole: Attempt to use JS function on a different thread calling nsIDirectoryServiceProvider.getFile. JS objects may not be shared across threads.
16:34:29.035 D GeckoLayerClient: Window-size changed to (720,1038)
16:34:29.105 I Gecko   : Logging GL tracing output to /data/data/org.mozilla.fennec/firefox.trace
16:34:29.105 I Gecko   : Attempting load of /data/local/egltrace.so
16:34:29.105 I Gecko   : Attempting load of libEGL.so
16:34:29.105 W System.err: java.lang.IllegalArgumentException: Make sure the SurfaceView or associated SurfaceHolder has a valid Surface
16:34:29.105 W System.err:  at com.google.android.gles_jni.EGLImpl._eglCreateWindowSurface(Native Method)
16:34:29.105 W System.err:  at com.google.android.gles_jni.EGLImpl.eglCreateWindowSurface(EGLImpl.java:90)
16:34:29.105 W System.err:  at org.mozilla.gecko.gfx.GLController.provideEGLSurface(GLController.java:145)
16:34:29.105 W System.err:  at dalvik.system.NativeStart.run(Native Method)
16:34:29.105 W System.err:  at dalvik.system.NativeStart.run(Native Method)
16:34:29.105 I Gecko   : ###!!! ABORT: We need a context on Android: file ../../../gfx/layers/opengl/LayerManagerOGL.cpp, line 498
16:34:29.105 E Gecko   : mozalloc_abort: ###!!! ABORT: We need a context on Android: file ../../../gfx/layers/opengl/LayerManagerOGL.cpp, line 498

So the root cause is that _eglCreateWindowSurface is failing. Not sure why that is happening though.
Component: Graphics: Layers → Widget: Android
Also for the record crash bp-c084b2d0-3261-4f0d-8629-e52b12130119 is an instance of this crash on FF20. A lot of the other crashes on 20 with this signature should be fixed by bug 785597 which is not on that branch (as of now anyway).
Tracking - can you nominate bug 785597 for uplift so we can determine if it does fix the issues here?
Assignee: nobody → bugmail.mozilla
Depends on: 785597
Crash Signature: [@ mozalloc_abort(char const*) | NS_DebugBreak_P | mozilla::layers::LayerManagerOGL::Initialize(nsRefPtr<mozilla::gl::GLContext>, bool)] → [@ mozalloc_abort(char const*) | NS_DebugBreak_P | mozilla::layers::LayerManagerOGL::Initialize(nsRefPtr<mozilla::gl::GLContext>, bool)] [@ mozalloc_abort | nsACString_internal::ReplaceASCII]
Crash Signature: [@ mozalloc_abort(char const*) | NS_DebugBreak_P | mozilla::layers::LayerManagerOGL::Initialize(nsRefPtr<mozilla::gl::GLContext>, bool)] [@ mozalloc_abort | nsACString_internal::ReplaceASCII] → [@ mozalloc_abort(char const*) | NS_DebugBreak_P | mozilla::layers::LayerManagerOGL::Initialize(nsRefPtr<mozilla::gl::GLContext> bool)] [@ mozalloc_abort | nsACString_internal::ReplaceASCII] [@ mozalloc_abort | nsACString_internal::ReplaceASCII(unsigned…
Blocks: 839032
Crash Signature: bool)] [@ mozalloc_abort | nsACString_internal::ReplaceASCII] [@ mozalloc_abort | nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const*, unsigned int)] → bool)] [@ mozalloc_abort | nsACString_internal::ReplaceASCII] [@ mozalloc_abort | nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const* unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | nsACString_internal::ReplaceAS…
Given that the patches in bug 785597 were determined to be too risky for uplift - is there any reason to continue to track here for FF20?  Are there any other options to help keep this topcrash from shipping with 20?
Flags: needinfo?
Crashes have stopped since 21.0a1/20130208 and 20.0a2/20130209. The working ranges are:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=04e13fc9dbff&tochange=fcf79680a057
http://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=2b7ea57afb90&tochange=0b515f5a44d4
It might be fixed by Sriran's fixes although I don't know why.
Flags: needinfo?
(In reply to Scoobidiver from comment #6)
> Crashes have stopped since 21.0a1/20130208 and 20.0a2/20130209.
My bad. The signature has morphed.

It's #7 top crasher in 21.0a1 where bug 785597 is fixed and #9 top crasher in 20.0a2 so tracking for 20 is still valid although the fix of bug 785597 can't be uplifted.

Since the fix of bug 838603 (forced OpenGL ES on Android), crashes have "An error occurred earlier while querying gfx info: eglChooseConfig returned zero OpenGL ES2 configs. Maybe this device does not support OpenGL ES2?" in App Notes maybe because some crashes (bug 771774, bug 778175?) have merged here.
(In reply to Scoobidiver from comment #7)
> (In reply to Scoobidiver from comment #6)
> > Crashes have stopped since 21.0a1/20130208 and 20.0a2/20130209.
> My bad. The signature has morphed.
> 
> It's #7 top crasher in 21.0a1 where bug 785597 is fixed and #9 top crasher
> in 20.0a2 so tracking for 20 is still valid although the fix of bug 785597
> can't be uplifted.
> 

What has it morphed to? I'm having trouble following all the different variations of crashes and how they are grouped. Can you link to a specific crash report (or list of crash reports) that I can look at?

> Since the fix of bug 838603 (forced OpenGL ES on Android), crashes have "An
> error occurred earlier while querying gfx info: eglChooseConfig returned
> zero OpenGL ES2 configs. Maybe this device does not support OpenGL ES2?" in
> App Notes maybe because some crashes (bug 771774, bug 778175?) have merged
> here.

I think these crashes are from devices that were originally blocklisted but that bug 838603 force-enabled. Presumably once the reverted blocklist takes effect these should die down.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #8)
> What has it morphed to? I'm having trouble following all the different
> variations of crashes and how they are grouped. Can you link to a specific
> crash report (or list of crash reports) that I can look at?
This bug gathers all crash signatures with the "We need a context on Android" abort message whatever the cause. Note that some stack traces seem buggy although the one in comment 0 is right.

The previous crashes were: https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char+const*%29+|+NS_DebugBreak_P+|+mozilla%3A%3Alayers%3A%3ALayerManagerOGL%3A%3AInitialize%28nsRefPtr%3Cmozilla%3A%3Agl%3A%3AGLContext%3E%2C+bool%29
Now they are: https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char+const*%29+|+NS_DebugBreak_P+|+nsACString_internal%3A%3AReplaceASCII%28unsigned+int%2C+unsigned+int%2C+char+const*%2C+unsigned+int%29
tracking-fennec: ? → 20+
I was able to reproduce the problem in comment #2 while investigating bug 842946. I believe the problem is that we get the surface creation notification on the UI thread, and so GLController.surfaceChanged runs on the UI thread. It sets the mSurfaceValid to true and notifies the compositor thread waiting in GLController.waitForValidSurface. The compositor thread then goes ahead and tries to grab the surface by calling GLController.provideEGLSurface(). However the UI thread is still doing *something* and the surface isn't fully created yet, so the code in provideEGLSurface() ends up throwing the IllegalArgumentException. I'm trying to come up with a good fix for this.
Blocks: 842946
I want to test this a bit more tomorrow but any feedback in the meantime is more than welcome.
Attachment #716260 - Flags: feedback?(snorp)
In particular, the things I would like feedback on are edge cases where the surface becomes invalid or changes between the provideEGLSurface call and the stuff that runs on the UI thread. There might be a better way to organize this code overall to avoid such edge cases.
Crash Signature: unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const*, unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P] → unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const* unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P] [@ mozalloc_abort | __sread] [@ mozalloc_abort…
Comment on attachment 716260 [details] [diff] [review]
Do surface stuff on the UI thread

Removing feedback flag; we discussed this offline and decided to tackle this via bug 844275 (marked as a blocker). I'll leave this patch here for now as it may be an acceptable short-term fix.
Attachment #716260 - Flags: feedback?(snorp)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> Comment on attachment 716260 [details] [diff] [review]
> Do surface stuff on the UI thread
> 
> Removing feedback flag; we discussed this offline and decided to tackle this
> via bug 844275 (marked as a blocker) 

What is this blocking? I don't see any indication in that bug that it's targeted for a particular release.  Will this bug still be in the FF20 timeline or should we untrack?
(In reply to Lukas Blakk [:lsblakk] from comment #14)
> What is this blocking? I don't see any indication in that bug that it's
> targeted for a particular release.  Will this bug still be in the FF20
> timeline or should we untrack?

Sorry, by "marked as a blocker" I meant that I put in the "depends on" list of this bug (i.e. that bug blocks this one). I would say to leave this bug on the FF20 timeline for now; I will know within a day or two if my fix in bug 844275 is viable and can be uplifted.
Just an update: my patch queue is getting pretty large and I don't think I'd be comfortable with uplifting all of it to beta. I'd be fine with uplifting it to aurora though, so maybe we should retarget this bug to FF21. If the crash volume for this bug is high enough to warrant it, I could look at making a smaller change to FF20 that takes care of some of the crashes.
This should be fixed in the March 1 nightly with the landing of bug 844275. Please reopen if this still happens.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
See bp-27b2d439-e4e0-4d54-bbe0-706e62130301.
Status: RESOLVED → REOPENED
Crash Signature: unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P] [@ mozalloc_abort | __sread] [@ mozalloc_abort | pthread_mutex_unlock] → unsigned int)] [@ mozalloc_abort(char const*) | NS_DebugBreak_P] [@ mozalloc_abort | __sread] [@ mozalloc_abort | pthread_mutex_unlock] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | TParseContext::parseMatrixFields(std::basic_string<char std::cha…
Resolution: FIXED → ---
Ugh. From the logcat I see that the Gecko thread is busy when we go to create the compositor, so even though that's supposed to be a sync operation we abort it and then things go bad. I have filed bug 846774 to try and figure out why this is happening.
Depends on: 846774
We're not able to take bug 846774 up to beta but it will be resolved in Aurora so moving tracking to FF21 and wontfixing for FF20.
(In reply to Scoobidiver from comment #18)
> See bp-27b2d439-e4e0-4d54-bbe0-706e62130301.

How did you find this crash report? The crash signature on it is completely different. I'd like to know if landing bug 846774 took care of this bug, but when I search for any crashes on 22.0a1 with LayerManagerOGL in the signature, nothing shows up. Is there a way to search through the app notes?
tracking-fennec: 20+ → 21+
needinfo'ing scoobidiver for comment 21
Flags: needinfo?(scoobidiver)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #21)
> How did you find this crash report?
It's over sized but I have no other ways to refine because of buggy stack traces (again): https://crash-stats.mozilla.com/query/query?product=FennecAndroid&version=FennecAndroid%3A22.0a1&query_search=signature&query_type=startswith&query=mozalloc_abort&do_query=1

> Is there a way to search through the app notes?
I don't have the tools to do that but Benoit (see bug 771774 comment 28) or Robert can.
Flags: needinfo?(scoobidiver)
Ok, thanks.

I dug through the crash data for march 15, and found 12 crashes for 22.0a1 that had "We need a context on Android" in the app notes. Some were on old builds and some were duplicates, but all of them had a thread in the list of threads that was in the middle of a call to CreateCompositor.

Taking [1] as a typical example, I see thread 10 is the gecko thread, in the middle of a call to CreateCompositor triggered by the compositor-resume event posted from java. The crashing thread is the compositor thread, trying to create the compositor. The ContextProviderEGL::CreateForWindow call returned null, which results in the abort, but that call returned null because the surface it obtained from java was null (via AndroidBridge::ProvideEGLSurface). Based on the logcat from [2] which is another crash of this type, the surface could be null because the java thread times out while waiting for the compositor creation and lets the UI thread continue. This appears to be a different manifestation of the same fault behind bug 797615 which is also still not fixed after all of my patches have landed.

[1] https://crash-stats.mozilla.com/report/index/ec012762-ff76-4b62-baca-a643b2130315
[2] https://crash-stats.mozilla.com/report/index/13235f26-24fb-4d18-a471-aac5b2130315
Blocks: 797615
No longer blocks: 797615
Depends on: 797615
Crash Signature: , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] → , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak]
Depends on: 852467
is this patch ready for review?
Flags: needinfo?(bugmail.mozilla)
Comment on attachment 716260 [details] [diff] [review]
Do surface stuff on the UI thread

No, this patch has been obsoleted by bug 844275. The thing blocking this now is bug 852467.
Attachment #716260 - Attachment is obsolete: true
Flags: needinfo?(bugmail.mozilla)
Depends on: 858926
Bug 852467, which this depends on, is marked 22+ (but is as yet unassigned). I don't think this will be fixed in 21.
tracking-fennec: 21+ → ?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #28)
> Bug 852467, which this depends on, is marked 22+ (but is as yet unassigned).
> I don't think this will be fixed in 21.

Looks like Bug 852467 changed state recently and is tracking 21+ now.So I am going to track the blocker bug per that.Let me know if I understood something wrong here.

We still have a few beta's to roll and could try speculative fixes if need be to help resolve this top-crasher.
tracking-fennec: ? → 22+
Crash Signature: , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] → , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ]
Crash Signature: , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ] → , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ] [@ libmozalloc.so@0x9e2 ]
Crash Signature: , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ] [@ libmozalloc.so@0x9e2 ] → , std::char_traits<char>, pool_allocator<char> > const&, int, TMatrixFields&, int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ] [@ libmozalloc.so@0x9e2 ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | nsACString_inte…
Depends on: 864103
It has been replaced by bug 864103.
Target Milestone: mozilla22 → ---
(In reply to Scoobidiver from comment #30)
> It has been replaced by bug 864103.

Isn't that a dupe, basically?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #31)
> (In reply to Scoobidiver from comment #30)
> > It has been replaced by bug 864103.
> Isn't that a dupe, basically?
Same as bug 595351 and bug 858032.
(In reply to Scoobidiver from comment #32)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #31)
> > (In reply to Scoobidiver from comment #30)
> > > It has been replaced by bug 864103.
> > Isn't that a dupe, basically?
> Same as bug 595351 and bug 858032.

Umm, not, not at all. Or did you see LayerManagerOGL being replaced with a completely different implementation that works completely differently?
Layer managers and compositor OGL are two different layer functions. One crashes later than the other. That might explain why the volume is lower in bug 864103 (likely not a top crasher) than in this bug.
20130421: 1
20130420: 0
20130419: 1
20130418: 0
20130417: 0
20130416: 0
20130415: 0
20130414: 0
20130413: 3
20130412: 1
20130411: 1 <-- bug 864103
20130410: 0
20130409: 0
20130408: 5
20130407: 1
20130406: 3
20130405: 0
20130403: 2
20130402: 3
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #33)
> (In reply to Scoobidiver from comment #32)
> > (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #31)
> > > (In reply to Scoobidiver from comment #30)
> > > > It has been replaced by bug 864103.
> > > Isn't that a dupe, basically?
> > Same as bug 595351 and bug 858032.
> 
> Umm, not, not at all. Or did you see LayerManagerOGL being replaced with a
> completely different implementation that works completely differently?

I've not been following this bug and may be off here, but just so you know:

LayerManagerOGL only exists in old-layers. Android already uses new-layers. I'm referring to the layers-refactoring that recently landed: bug 825928.
tracking-fennec: 22+ → ?
22+ doesn't line up with tracking-firefox21+
tracking-fennec: ? → 21+
We should keep an eye on the crash volume here to see if bug 852467 helped at all.
Looks like there is at least https://crash-stats.mozilla.com/report/index/2afb873d-1da3-4a92-98af-836f72130428 which includes that fix and still crashed with this in the App Notes.
No longer depends on: 864103
Summary: crash in mozilla::layers::LayerManagerOGL::Initialize with abort message: "We need a context on Android" → crash in mozilla::layers::LayerManagerOGL/CompositorOGL::Initialize with abort message: "We need a context on Android"
It's #2 crasher in 22.0a2 and #27 in 23.0a1. The improvement is likely due to bug 825928 (see comment 34).
I'm not yet convinced it was bug 825928 that made the difference. The crash volume on 23 is too low IMO to really tell. I grabbed all the crash data for April and grepped all the crashes on Fennec 23 with "we need a context on Android" in the app notes and tallied them by build ID:

   4 20130402030843
   2 20130403030841
   3 20130403090950
   3 20130406030922
   4 20130407030841
   5 20130408030928
   2 20130410065939
   1 20130411030925
   1 20130412030828
   3 20130413030927
   1 20130416030901
   6 20130419030957
   1 20130420031010
   1 20130421031002
   1 20130422030937
   1 20130427030919

While it does show a decrease around April 8, the numbers aren't high enough to inspire much confidence. I was also looking at bug 797615 (intermittent orange) which I believe was caused by the same underlying issue, and the last of those happened on April 16 (see my comment 515 on that bug for details). At this point I can't say for sure what landed that made the difference.
(In reply to Kartikaya Gupta (email:kats@mozilla.com, away May2-May20) from comment #37)
> We should keep an eye on the crash volume here to see if bug 852467 helped
> at all.

:kats, Bug 852467 which was blocking the investigation here is on m-c as of 4/26, i think its too late to uplift it to beta(fx21) at this point of time, but we could definitely check with :michal for aurora uplift if you think that will help here ?
Based on the data in comment 41 it looks like the crash volume may have gone down, but it appears to have happened before bug 852467 even landed. I'm not confident that uplifting bug 852467 will help significantly. However it may be worth uplifting anyway if it's low-risk, because we'll get a better idea of how much it helps. This crash appears to be around 6-10% of all crashes (percentage taken per-buildid) on Aurora right now.
tracking-fennec: 21+ → +
In 21.0, almost all crashes occur on devices with AMD Z430 GPU running Gingerbread.
Crash Signature: , int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ] [@ libmozalloc.so@0x9e2 ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | nsACString_internal::AppendFunc(void*, char const*, unsigned int) ] → , int) ] [@ mozalloc_abort(char const*) | NS_DebugBreak] [@ libmozalloc.so@0x9aa ] [@ libmozalloc.so@0x9e2 ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | nsACString_internal::AppendFunc(void*, char const*, unsigned int) ] [@ mozalloc_abort(char…
Crash Signature: ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | 0 (deleted)@0x11f911f ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | libsurfaceflinger_client.so@0x15559 ] → ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | 0 (deleted)@0x11f911f ] [@ mozalloc_abort(char const*) | NS_DebugBreak_P | libsurfaceflinger_client.so@0x15559 ] [@ mozalloc_abort(char const*) | NS_DebugBreak | nsACString_internal::ReplaceASCII(un…
A comment says: "I was on www.inside-handy.de and has opened a second tab. I was tiping about: in the adress bar and than I tapped on Firefox:Über ihren Browser (about:firefox). Than Nightly was down." (see bp-41459da5-e634-4331-8713-dcb102130611).
Crash Signature: nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const*, unsigned int) ] → nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const*, unsigned int) ] [@ mozalloc_abort(char const*) ] [@ mozalloc_abort(char const*) | NS_DebugBreak | nsDebugImpl::QueryInterface(nsID const&, void**) ]
Hardware: ARM → All
Aaronmt or kbrosnan, would you have time to take a look into reproducing using comment 45?
Flags: needinfo?(aaron.train)
Flags: needinfo?(kbrosnan)
At last a non-buggy stack trace because of bug 884300.

Reports at:
https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort%28char+const*%29+|+NS_DebugBreak+|+mozilla%3A%3Alayers%3A%3ACompositorOGL%3A%3AInitialize%28%29
Crash Signature: nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const*, unsigned int) ] [@ mozalloc_abort(char const*) ] [@ mozalloc_abort(char const*) | NS_DebugBreak | nsDebugImpl::QueryInterface(nsID const&, void**) ] → nsACString_internal::ReplaceASCII(unsigned int, unsigned int, char const*, unsigned int) ] [@ mozalloc_abort(char const*) ] [@ mozalloc_abort(char const*) | NS_DebugBreak | nsDebugImpl::QueryInterface(nsID const&, void**) ] [@ mozalloc_abort(char cons…
Crashes stopped after 25.0a1/20130626, 24.0a2/20130701 and almost stopped after 23.0b1 (currently 7 crashes in 23.0b2). The working windows might be (startup crash):
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cc80aa0c7c13&tochange=3b955f306226
http://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=30a7b726833d&tochange=3f9d6dbaf669
http://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=165aad98ce0a&tochange=96daccbcd1f9
Possible fixes: bug 884057 (most likely), bug 887097, or bug 886077.
Flags: needinfo?(kbrosnan)
Flags: needinfo?(aaron.train)
Whiteboard: [native-crash] → [native-crash][startupcrash]
Scoobidiver, does that mean we effectively can close this one as fixed (which would be awesome)?

I somewhat suspect that it could be bug bug 887097, actually, but the devs would know better. :)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> Scoobidiver, does that mean we effectively can close this one as fixed
> (which would be awesome)?
I think we should remove the topcrash keyword once 23.0 is in Release.
(In reply to Scoobidiver from comment #50)
> I think we should remove the topcrash keyword once 23.0 is in Release.

I care even more about if we should mark it fixed, actually - probably because you are so awesome in maintaining the topcrash keyword. ;-)
We can't mark it as fixed because it's #86 crasher in 23.0b2 and has a good chance to keep its rank in future Betas (~75 kADU) despite Aurora and Nightly (~1.5 kADU) being unaffected.
The only way to show this big win is by removing the topcrash keyword when 23.0 switches to Release.
I also suspect this was fixed by bug 887097. Or more precisely, I think bug 887097 turned most of the crashes with this signature into ANR reports instead, which is more useful because it has gecko stacks that are closer to the root cause. See for example bug 889881 and bug 889564 which identified two potential causes of this.
Depends on: 887097
It's no longer a top crasher in current channels.
Keywords: topcrash
This is the absolute topcrash by a large margin on 26.0a1 Nightly.
Keywords: topcrash
This signature has shot up sharply in the last days and is now dominating the 28.0a1 Nightly stats. What's up here?
In channel Kats mentioned that he believes this is related to bjacob's recent rewrite of the related code
Flags: needinfo?(bjacob)
Interesting; very possibly related to my recent changes in bug 925608 and bug 937204, but I can't reproduce locally. Should be very easy to fix if we can reproduce or at least find a useful stack in a crash report.
Flags: needinfo?(bjacob)
Hah, indeed there is a spike on 28.0a1:

$ for d in `cat dates`; do echo "$d: `zcat $d-pub-crashdata.csv.gz | grep "We need a context on Android" | grep 28\\.0a1 | wc -l`"; done
20131112: 0
20131113: 0
20131114: 0
20131115: 0
20131116: 32
20131117: 123
I'm not having an easy time reproducing this. Vlad got it once on a Kobo arc 10HD, but it since evaded out attempts to reproduce in GDB...

If anyone has definite STR to share on a particular device, that's what would help the most.
Here's a breakdown of devices that gave these crashes yesterday, sorted by decreasing number of crash reports (the number in front of each of them).

$ zcat 20131117-pub-crashdata.csv.gz | grep "We need a context on Android" | grep 28\\.0a1 | cut -d \| -f 6- | sed 's|\(-keys\).*|\1|g' | sort | uniq -c | sort -rn
      8  asus Nexus 7 | google/nakasi/grouper:4.3/JWR66Y/776638:user/release-keys
      6  samsung SAMSUNG-SGH-I727 | samsung/SGH-I727/SGH-I727:4.1.2/JZO54K/I727UCMC1:user/release-keys
      6  samsung GT-P3110 | samsung/espressowifixx/espressowifi:4.1.1/JRO03C/P3110XXCLK7:user/release-keys
      6  samsung GT-I9100 | samsung/GT-I9100/GT-I9100:4.0.3/IML74K/XXLPQ:user/release-keys
      5  samsung GT-N7000 | samsung/GT-N7000/GT-N7000:4.0.3/IML74K/ZCLP6:user/release-keys
      5  motorola/XT890_rtcoreeu/smi:4.1.2/9.8.2I-50_SMI-26/1358465787:user/release-keys
      5  LGE Nexus 4 | google/occam/mako:4.4/KRT16O/907817:user/release-keys
      4  LGE Nexus 5 | google/hammerhead/hammerhead:4.4/KRT16M/893803:user/release-keys
      3  Sony C6603 | Sony/C6603_1270-7689/C6603:4.2.2/10.3.1.A.2.67/vPd3rg:user/release-keys
      3  nothumb Build | asus Nexus 7 | google/nakasi/grouper:4.3/JWR66Y/776638:user/release-keys
      3  LGE Nexus 4 | google/occam/mako:4.3/JWR66Y/776638:user/release-keys
      3  lge LG-LS860 | lge/l2s_SPR_US/l2s:4.0.4/IMM76I/LS860ZV7.c755138:user/release-keys
      2  samsung SCH-I605 | Verizon/t0ltevzw/t0ltevzw:4.1.2/JZO54K/I605VRAMC3:user/release-keys
      2  samsung SAMSUNG-SGH-I317 | samsung/t0lteatt/t0lteatt:4.1.2/JZO54K/I317UCAMH3:user/release-keys
      2  samsung GT-N7000 | samsung/GT-N7000/GT-N7000:4.1.2/JZO54K/N7000DXLSE:user/release-keys
      2  samsung GT-I9105P | samsung/s2vepxx/s2vep:4.1.2/JZO54K/I9105PXXAMC2:user/release-keys
      2  samsung Galaxy Nexus | google/yakju/maguro:4.3/JWR66Y/776638:user/release-keys
      2  samsung Galaxy Nexus | google/mysid/toro:4.2.2/JDQ39/573038:user/release-keys
      2  rockchip ICOO D70Pro II | rk30sdk/rk30sdk/rk30sdk:4.1.1/JRO03H/20121219.102045:eng/release-keys
      2  nothumb Build | Sony ST26i | orange_pl/ST26i_1267-4750/ST26i:4.1.2/11.2.A.0.31/07_96g:user/release-keys
      2  nothumb Build | motorola XT910 | motorola/XT910_amxar/umts_spyder:4.1.2/9.8.2O-124_SPUL-17/1362188864:user/release-keys
      2  LGE LG-LS696 | sprint_lge/m3s_sprint_us/m3s:2.3.7/ZVF.GWK74/48205F09:user/release-keys
      2  LGE LG-E975 | lge/geehrc_open_eu/geehrc:4.1.2/JZO54K/E97510b.1360741472:user/release-keys
      2  HTC HTC One X+ | telus_wwe/evitareul/evitareul:4.1.1/JRO03C/164725.1:user/release-keys
      2  ET102 M3 | ET102/DS975/DS975:4.1.1/JRO03H/20130130.161349:eng/release-keys
      2  asus Nexus 7 | google/razor/flo:4.3/JSS15R/804956:user/release-keys
      2  Asus Nexus 7 | google/nakasi/grouper:4.4/KRT16M/573038:user/release-keys
      1  STAR B943 | ASK_SP460_HD/V6/V6:4.2.1/JOP40D/1364387290:user/test-keys
      1  samsung SHV-E210S | samsung/c1skt/c1skt:4.1.2/JZO54K/E210SKSJMJ2:user/release-keys
      1  samsung SGH-T889 | samsung/t0ltetmo/t0ltetmo:4.1.2/JZO54K/T889UVBMB4:user/release-keys
      1  Samsung SGH-I747 | samsung/d2uc/d2att:4.0.4/IMM76D/I747UCALH9:user/release-keys
      1  Samsung SAMSUNG-SGH-I747 | samsung/d2uc/d2att:4.0.4/IMM76D/I747UCALEM:user/release-keys
      1  samsung Nexus 10 | google/mantaray/manta:4.3/JWR66Y/776638:user/release-keys
      1  samsung GT-P7500 | samsung/GT-P7500/GT-P7500:4.0.4/IMM76D/XXLQ8:user/release-keys
      1  samsung GT-N7105 | samsung/t0ltexx/t0lte:4.1.2/JZO54K/N7105XXDLL4:user/release-keys
      1  samsung GT-N7100 | samsung/t03gxx/t03g:4.1.2/JZO54K/N7100XXDMF2:user/release-keys
      1  samsung GT-N7100 | samsung/t03gxx/t03g:4.1.2/JZO54K/N7100XXDMC3:user/release-keys
      1  samsung GT-N7100 | samsung/t03gxx/t03g:4.1.1/JRO03C/N7100XXALJ3:user/release-keys
      1  samsung GT-N7000 | samsung/GT-N7000/GT-N7000:4.1.2/JZO54K/N7000XXLSZ:user/release-keys
      1  samsung GT-I9300 | samsung/m0xx/m0:4.3/JSS15J/I9300XXUGMJ9:user/release-keys
      1  samsung GT-I9300 | samsung/m0xx/m0:4.2.2/JDQ39/I9300XXUFME7:user/release-keys
      1  samsung GT-I9300 | samsung/m0xx/m0:4.1.2/JZO54K/I9300XXELL1:user/release-keys
      1  samsung GT-I9300 | samsung/m0xx/m0:4.1.1/JRO03C/I9300XXDLIB:user/release-keys
      1  samsung GT-I8190L | samsung/goldenub/golden:4.1.2/JZO54K/I8190LUBAMH2:user/release-keys
      1  nothumb Build | WST S5301 | alps/H958/H958:4.0.4/IMM76D/1359943407:user/test-keys
      1  nothumb Build | samsung SGH-T589W | samsung/SGH-T589W/SGH-T589W:2.3.6/GINGERBREAD/UQLD4:user/release-keys
      1  nothumb Build | samsung SAMSUNG-SGH-I747 | samsung/d2uc/d2att:4.1.2/JZO54K/I747UCDMG2:user/release-keys
      1  nothumb Build | Samsung GT-S5830 | Samsung/cm_cooper/cooper:4.3.1/JLS36I/c914381532:userdebug/test-keys
      1  nothumb Build | LGE Nexus 5 | google/hammerhead/hammerhead:4.4/KRT16M/893803:user/release-keys
      1  nothumb Build | LGE LG-E510f | lge/lge_univa/univa_tcl-mx:2.3.4/GRJ22/V10a.4810f43e:user/release-keys
      1  LGE Nexus 4 | google/occam/mako:4.3/JSS15J/737497:user/release-keys
      1  LGE Nexus 4 | google/occam/mako:4.2.2/JDQ39/573038:user/release-keys
      1  LGE LG G2 | LGE/pa_g2/g2:4.3.1/JLS36I/eng.houstonn.20131029.135121:userdebug/test-keys
      1  LGE LG-E739 | lge/e739/e739:2.3.6/GRK39F/V10u-Jan-23-2013.2EDE7D0735:user/release-keys
      1  LGE LG-E450f | lge/vee5ss_tcl_mx/vee5ss:4.1.2/JZO54K/498462d0f7:user/release-keys
      1  HTC HTC One X | cingular_us/evita/evita:4.0.3/IML74K/54373.1:user/release-keys
      1  HTC HTC One S | o2_de/ville/ville:4.1.1/JRO03C/128736.8:user/release-keys
      1  HTC HTC Glacier | tmobile/htc_glacier/glacier:2.3.4/GRJ22/347621.1:user/release-keys
      1  HTC HTC first | cingular_us/mystul/mystul:4.1.2/JZO54K/180011.4:user/release-keys
      1  bq bq Edison 2 3G | bq/bq_Edison2_3G/bq_rk:4.1.1/1.0.2_20130809-08:25/39:user/release-keys
      1  Asus Nexus 7 | google/nakasi/grouper:4.3/JWR66V/573038:user/release-keys
      1  Amazon KFSOWI | Amazon/soho/soho:4.2.2/JDQ39/11.3.0.5_user_305083920:user/release-keys
      1  Acer V370 | acer/c11_ww_gen1/C11:4.2.2/JDQ39/1382521727:user/release-keys
All of these crashes seem to be startup crashes, with an uptime typically around 4 or 5 seconds.
Also, these 123 crashes yesterday are 2/3 of the 181 crash reports we got on Android 28.0a1. Together with the fact that it's a startup crash, it's very very critical...

I suspect that I may have been overly aggressive simplifying the Java-side surface creation code in parts 4/8 and 5/8 in bug 925608.

I'm hesitating between 2 options at the moment:
 - back out bug 925608 altogether, to limit damage
 - or back out only parts 5/8 -- 8/8, which should preserve most of the benefits of bug 925608's patches and tell us specificly if the issue is with part 5/8, but on the other hand, if we're crashing all our nightly users on startup, soon we won't have any user left :-(
Although just because this crash is 2/3 of our android nightly crashes, doesn't imply that we're crashing 2/3 of our users ;-) maybe I was a bit over-dramatic in comment 64.
Another thing to note is that the present bug was filed a very long time ago, so the problem "always" existed, it just was far less prevalent.

And indeed the old Java code was complicated and had comments hinting at surface creation being fragile and complicated.

Maybe the real way out is instead to accept that surface creation can fail and find a way to make it not be a big deal. For example, if it fails, maybe we can try again soon later, maybe even we'll get another surfaceChanged event so we'll automatically try again then.
This runnable business is how things were done prior to my changed in bug 925608.

The crashes here are very probably caused by failure to create our EGLSurface (which would give exactly this GLContext initialization error as in the crash reports).

We're trying to create an EGLSurface against mView, and the old code deferred that to a runnable on mView's event loop, with a comment explaining how that was motivated by EGLSurface creation failures when the view wasn't ready yet, and my patch changed that to happen immediately instead. Maybe that was a bad change; this patch reverts it.

Tested locally on one device, works fine here (does not make a visible difference to me, I can't reproduce the bug here anyway).
Attachment #8334104 - Flags: review?(bugmail.mozilla)
Comment on attachment 8334104 [details] [diff] [review]
restore the code to defer updateCompositor to mView's event loop

Review of attachment 8334104 [details] [diff] [review]:
-----------------------------------------------------------------

This should probably have a comment, but otherwise r=me.
Attachment #8334104 - Flags: review?(bugmail.mozilla) → review+
Landed on B2G-Inbound, with a comment, together with a backout of part 8/8 of bug 925608 which is another suspect.

remote:   https://hg.mozilla.org/integration/b2g-inbound/rev/06b0905b5c16
remote:   https://hg.mozilla.org/integration/b2g-inbound/rev/f7c711424299
Whiteboard: [native-crash][startupcrash] → [native-crash][startupcrash][leave open]
It's not improving...

bjacob:~/crash-stats$ for d in `cat dates`; do echo "$d: `zcat $d-pub-crashdata.csv.gz | grep "We need a context on Android" | grep 28\\.0a1 | wc -l`"; done
20131114: 0
20131115: 0
20131116: 32
20131117: 123
20131118: 141
20131119: 203
Here is my best attempt to explain the crash.

Before my patches, GLController.surfaceChanged (which is now renamed to serverSurfaceChanged) would not attempt to create the Gecko compositor before it succeeds creating its EGL surface. Now it creates the Gecko compositor immediately and lets it try to create its EGL surface, and that's what's failing here.

So the question is why are we calling GLController.serverSurfaceChanged too early?

The present patch removes a hunk that I could never understand, and the comment above that function simply doesn't match what that function is doing. That hunk was calling GLController.serverSurfaceChanged specifically if the serversurface didn't exist yet. That doesn't seem to be possibly useful: in that case, we will get a surfaceChanged callback anyway before anything can make it to the screen, so why bother calling it manually before?
Attachment #8335280 - Flags: review?(vladimir)
Attachment #8335280 - Flags: review?(bugmail.mozilla)
Here's a more conservative patch that does nothing else but drop that surfaceChanged call.
Attachment #8335280 - Attachment is obsolete: true
Attachment #8335280 - Flags: review?(vladimir)
Attachment #8335280 - Flags: review?(bugmail.mozilla)
Attachment #8335282 - Flags: review?(vladimir)
Attachment #8335282 - Flags: review?(bugmail.mozilla)
Comment on attachment 8335282 [details] [diff] [review]
dont-call-surfacechanged-from-onSizeChanged

sorry about the noise, i'll think a bit longer and try on a couple devices first.
Attachment #8335282 - Flags: review?(vladimir)
Attachment #8335282 - Flags: review?(bugmail.mozilla)
Pls review this as :kats is on PTO and you've been looking at this code.
Attachment #8335282 - Attachment is obsolete: true
Attachment #8335595 - Flags: review?(vladimir)
Fingers crossed:

https://hg.mozilla.org/integration/mozilla-inbound/rev/038356d89dc2

If this doesn't suffice, we can start considering doing sub-optimal things, such as: creating a EGLSurface and verifying that it succeeds, before creating the compositor. That would mimick the behavior of the code before my patches. It wouldn't slow down startup by much, but it would be a step away from trying to understand what the real problem is. Hopefully I'm not abusing the Nightly channel too much for crazy experiments here. Let me know if you'd rather have us give up on trying to understand things and rather do all what we can to avoid crashing.
KaiRo: I would love to know as early as possible if that changeset fixed that crash, i.e., if there still are crashes with "We need a context on Android" assertion message with build IDs >= 20131121. If you could provide me with that information at some time tomorrow (nov. 21) then that would allow me to iterate in 1 day instead of 2 days currently (using daily crashdata csv files).
Flags: needinfo?(kairo)
Thanks; looks like https://crash-stats.mozilla.com/search/?product=FennecAndroid&version=28.0a1&app_notes=need&build_id=%3E%3D20131121000000&_facets=signature&_columns=date&_columns=signature&_columns=build_id&_columns=app_notes is still giving crashes.

I also checked that the changeset was in this Nightly.

So, let's try the next thing here, which is to restore the old behavior of ensuring that we actually can create an EGLSurface for the window, before we go on and try to create the compositor, which will then use that already-created surface.
Comment on attachment 8336366 [details] [diff] [review]
Revert to old behavior of holding off from starting the compositor until we have successfully preallocated its EGLSurface

Review of attachment 8336366 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on skype, this one looks good to me.
Attachment #8336366 - Flags: review?(snorp) → review+
/o\

The weird/interesting thing is that it only fails on Android 2.2... need to test on a real 2.2 device.
So here is the story of why this failed on Android 2.x (both 2.2 and 2.3).

Android 2.x differs from newer Android versions in that it doesn't initialize EGL for us. There, if we don't call eglInitialize, nothing will do it for us, and our EGL calls then just fail.

Calling eglInitialize used to be done by the GfxInfoThread. With bug 93720 eliminating this thread, EGL initialization moved to the gecko compositor. But the patch here that just bounced was relying on being able to make EGL calls in Java before the compositor is created. That's what failed on Android 2.x.

So we have to bring back EGL initialization to Java, but measurements show that to be a very expensive operation (see comment in next patch, e.g. 80 ms on Nexus S) so it is important to keep it on a separate thread, like was the case with the defunct GfxInfoThread.
Comment on attachment 8336922 [details] [diff] [review]
Revert to calling eglInitialize in Java; do it first in a EGLPreloadingThread to avoid blocking the UI thread for a long time

Review of attachment 8336922 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/gfx/GLController.java
@@ +45,5 @@
> +
> +    @Override
> +    public void run()
> +    {
> +        long nanoTimeBefore = System.nanoTime();

I don't think we need this timing stuff do we?

@@ +208,5 @@
>      private void initEGL() {
>          if (mEGL != null) {
>              return;
>          }
> +        long nanoTimeBefore = System.nanoTime();

same

@@ +219,5 @@
> +        try {
> +            mEGLPreloadingThread.join();
> +        } catch (InterruptedException e) {
> +            Log.w(LOGTAG, "Thread interrupted", e);
> +            Thread.currentThread().interrupt();

Why is this necessary?

@@ +239,5 @@
> +        // Also note that while calling eglInitialize isn't necessary on Android 4.x
> +        // (at least Android's HardwareRenderer does it for us already), it is necessary
> +        // on Android 2.x.
> +        int[] returnedVersion = new int[2];
> +        if (!mEGL.eglInitialize(mEGLDisplay, returnedVersion)) {

This seems wrong to me. We know we kicked off the preload thread, and we joined it up above so we know it ran. Maybe just ask EGLPreloadingThread whether it succeded or not and try again here if it failed?
Attachment #8336922 - Flags: review?(snorp) → review-
Attachment #8336972 - Flags: review?(snorp) → review+
Still orange but one of the try runs has an interesting clue:

11-22 12:58:06.865 W/GeckoGLController( 1855): GLController::updateCompositor with mCompositorCreated=false
11-22 12:58:06.865 W/GeckoGLController( 1855): EGL initialization took 30000 ns
11-22 12:58:06.865 W/dalvikvm( 1855): threadid=1: thread exiting with uncaught exception (group=0x4001d848)
11-22 12:58:06.865 E/GeckoAppShell( 1855): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 1 ("main")
11-22 12:58:06.865 E/GeckoAppShell( 1855): java.lang.IllegalArgumentException: Make sure the SurfaceView or associated SurfaceHolder has a valid Surface
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at com.google.android.gles_jni.EGLImpl._eglCreateWindowSurface(Native Method)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at com.google.android.gles_jni.EGLImpl.eglCreateWindowSurface(EGLImpl.java:87)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at org.mozilla.gecko.gfx.GLController.AttemptPreallocateEGLSurfaceForCompositor(GLController.java:310)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at org.mozilla.gecko.gfx.GLController.updateCompositor(GLController.java:189)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at org.mozilla.gecko.gfx.GLController$1.run(GLController.java:171)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at android.os.Handler.handleCallback(Handler.java:587)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at android.os.Handler.dispatchMessage(Handler.java:92)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at android.os.Looper.loop(Looper.java:123)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at android.app.ActivityThread.main(ActivityThread.java:4627)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at java.lang.reflect.Method.invokeNative(Native Method)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at java.lang.reflect.Method.invoke(Method.java:521)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:868)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:626)
11-22 12:58:06.865 E/GeckoAppShell( 1855): 	at dalvik.system.NativeStart.main(Native Method)
Hah! The old code had a try { ... } around eglCreateWindowSurface.
It's coming out green!!!!!!!!!!!!!

\o/ !!!

/me runs naked around the stadium
Unsure why mcMerge didn't include the other cset, but it was merged to m-c also: https://hg.mozilla.org/mozilla-central/rev/c98cabc0d180

I *think* this'll be in time for the Nightly tonight, though I'm unsure if the tree closure and lack of power in the MV office could affect that.
Many thanks Wes for checking!
Still no crashes for today! Keep those fingers crossed!
The link in comment 101 shows that there have been exactly 3 crash reports with build ID >= 20131123, and they are all on Sony Ericsson LT15i and LT18i. So Friday's patches were definitely effective, and if the trend continues, we can file a separate bug for these Sony phones.
Yay! Thanks for that work, Benoit!
We already crash release builds on Android when GLContext::CreateForWindow fails, see:

http://dxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/CompositorOGL.cpp#379

That is the assertion that is crashing here.

As of current Nightlies, this only seem to reproduce on a few devices: Sony Ericsson LT15i and LT18i at the moment.

To understand a bit more about these failures, the present patch makes us crash a bit earlier with a more specific message, on Android only.
Attachment #8337825 - Flags: review?(bgirard)
Comment on attachment 8337825 [details] [diff] [review]
Add fatal assertions in CreateForWindow on Android to get better crash reports

Review of attachment 8337825 [details] [diff] [review]:
-----------------------------------------------------------------

We talked offline. IMO we should do the same for b2g. Let's avoid ifdef where we don't need them.
Attachment #8337825 - Flags: review?(bgirard) → review-
Attachment #8337897 - Flags: review? → review?(bgirard)
Attachment #8337825 - Attachment is obsolete: true
Comment on attachment 8337897 [details] [diff] [review]
Add fatal assertions in CreateForWindow to get better crash reports

Review of attachment 8337897 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/gl/GLContextProviderEGL.cpp
@@ +1124,5 @@
>  
>  already_AddRefed<GLContext>
>  GLContextProviderEGL::CreateForWindow(nsIWidget *aWidget)
>  {
> +    // The MOZ_CRASH in this function are to help track down bug 834243.

IMO, strictly my opinin. I'm inclined to say remove this and make the MOZ_CRASH permanent. If we can't recover from this failure on that platform then it should be a crash.
Attachment #8337897 - Flags: review?(bgirard) → review+
Here is a crash-stats search link for CreateForWindow crashes on Fennec Nightly (none so far):

https://crash-stats.mozilla.com/search/?product=FennecAndroid&version=28.0a1&signature=~CreateForWindow&_facets=signature&_columns=date&_columns=signature&_columns=build_id&_columns=app_notes

The previous search link for "We need a context on Android" stopped at 6 crash reports from 3 different devices, all of them Sony Ericsson LT15i and LT18i. I expect that with 87948e6a1f99 landed, the "We need a context on Android" crashes will be morphed into CreateForWindow crashes, whence the new search link.
...still no crash reports since 87948e6a1f99 landed... darn, I'd love to understand better the issue on these Sony devices.
(In reply to Benoit Jacob [:bjacob] from comment #112)
> ...still no crash reports since 87948e6a1f99 landed... darn, I'd love to
> understand better the issue on these Sony devices.

That's probably also because a new Socorro release landed yesterday that needs reindexing of all the search data, and there was a small failure doing that and it needed to be started over, so right now there's no search indexes available and therefore no results will be returned for any search. The current estimate is for search to fully work again for data from recent days in about 5h from when I'm writing this comment.
Got one crash report,

https://crash-stats.mozilla.com/report/index/0516a50b-c7b7-46c4-ac1f-0f55a2131127

it's on a LGE LG-P920, so different manufacturer and different GPU... the only constant is that the Android SDK version in all remaining crashes so far is 15.

The crash (as seen in the stack) is specifically EGLSurface creation failure which is very strange as that surface is pre-created by Java and we shouldn't hit this point without it already being successfully created; I'll try to add some more asserts to further refine the crashes.
So. The most likely place for this to fail is GLController::CreateEGLSurfaceForCompositorWrapper() in GeneratedJNIWrappers.cpp. The code is:

jobject GLController::CreateEGLSurfaceForCompositorWrapper() {
    JNIEnv *env = GetJNIForThread();
    if (!env) {
        ALOG_BRIDGE("Aborted: No env - %s", __PRETTY_FUNCTION__);
        return nullptr;
    }

    if (env->PushLocalFrame(1) != 0) {
        ALOG_BRIDGE("Exceptional exit of: %s", __PRETTY_FUNCTION__);
        env->ExceptionDescribe();
        env->ExceptionClear();
        return nullptr;
    }

    jobject temp = env->CallObjectMethod(wrapped_obj, jCreateEGLSurfaceForCompositorWrapper);

    if (env->ExceptionCheck()) {
        ALOG_BRIDGE("Exceptional exit of: %s", __PRETTY_FUNCTION__);
        env->ExceptionDescribe();
        env->ExceptionClear();
        env->PopLocalFrame(NULL);
        return nullptr;
    }

    jobject ret = static_cast<jobject>(env->PopLocalFrame(temp));
    return ret;
}

So that's three different ways that this could fail.

That's really unfortunate because if we fail here, we crash; and the whole code around this is organized to make sure that there is never a reason why this should fail (that's why we hold off from creating the compositor until we have successfully created a surface in Java).

In other words, if fallibility is unavoidable here, then our design is wrong since the beginning.

But maybe it's avoidable? Why should this ever fail?

I wanted to edit this function to add MOZ_CRASH's in these three error cases, so we get more specific crash reports. But I can't edit this generated file. What can I do to make the specific error case here apparent in crash reports?
Flags: needinfo?(snorp)
Flags: needinfo?(bugmail.mozilla)
Ah no, we have a race condition in GLController.java to resolve before it would make sense to look into that.

If the sequence of events is:

 1. Java surfaceChanged event, we create the surface and send a message to create the compositor
 2. Java surfaceDestroyed even, we destroy the surface
 3. Compositor gets created, wants to get the surface

Then the compositor will get a surface creation failure, because the preallocated surface that it was supposed to get has since been destroyed.

In that case, it seems that the right course of action is for compositor creation to gracefully fail and for Java to know that it has failed.

The problem is that at the moment, Java has no way to know what happened in gecko compositor creation. Java just fires a createcompositor event, and ignores anything that happens onwards on the gecko side.

We could fix that by unifying createcompositor and resumecompositor into a single event, so that the logic there would be entirely on the c++ side.
Flags: needinfo?(snorp)
Flags: needinfo?(bugmail.mozilla)
Assignee: bugmail.mozilla → bjacob
No longer depends on: 950203
filter on [mass-p5]
Priority: -- → P5
I'm resolving this bug as incomplete as there are zero reports against a currently supported version of Fennec. There were 87 reports against Fennec last week but all were with version 28 or older. Please reopen this bug report if you think there's something else to be done here.
Status: REOPENED → RESOLVED
Closed: 11 years ago9 years ago
Resolution: --- → INCOMPLETE
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: