Closed Bug 1108709 Opened 10 years ago Closed 9 years ago

Android crash in libOMX_Core.so@0x1a40 on Mali-400 MP and some PowerVR SGX 540 adapters

Categories

(Core :: Audio/Video, defect)

All
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
firefox34 --- wontfix
firefox35 + fixed
firefox36 + fixed
firefox37 + fixed
firefox38 --- fixed
relnote-firefox --- 35+

People

(Reporter: u279076, Assigned: snorp)

References

Details

(Keywords: crash, topcrash-android-armv7)

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-44e90151-6584-4e52-8da4-c0f2e2141205.
=============================================================
Ø 0 	libOMX_Core.so 	libOMX_Core.so@0x1a40 	
Ø 1 	libOMX_Core.so 	libOMX_Core.so@0x1a07 	
Ø 2 	libOMX_Core.so 	libOMX_Core.so@0x16cb 	
Ø 3 	libOMX_Core.so 	libOMX_Core.so@0x1693 	
Ø 4 	libstagefright_omx.so 	libstagefright_omx.so@0xf635 	
Ø 5 	libstagefrighthw.so 	libstagefrighthw.so@0x9e7 	
Ø 6 	libstagefrighthw.so 	libstagefrighthw.so@0x9f7 	
Ø 7 	libstagefright_omx.so 	libstagefright_omx.so@0xbec9 	
Ø 8 	libOMX_Core.so 	libOMX_Core.so@0x2682 	
Ø 9 	libc.so 	libc.so@0x1d58b 	
Ø 10 	libOMX_Core.so 	libOMX_Core.so@0x2682 	
Ø 11 	libOMX_Core.so 	libOMX_Core.so@0x263c 	
Ø 12 	libOMX_Core.so 	libOMX_Core.so@0x1bf7 	
Ø 13 	libOMX_Core.so 	libOMX_Core.so@0x263c 	
Ø 14 	libOMX_Core.so 	libOMX_Core.so@0x263c 	
Ø 15 	libOMX_Core.so 	libOMX_Core.so@0x1525 	
Ø 16 	libOMX_Core.so 	libOMX_Core.so@0x2653 	
Ø 17 	libOMX_Core.so 	libOMX_Core.so@0x179b 	
Ø 18 	libstagefright_omx.so 	libstagefright_omx.so@0xbfd5 	
Ø 19 	libstagefright_omx.so 	libstagefright_omx.so@0xb0cf 	
Ø 20 	libstagefright.so 	libstagefright.so@0xad8dd 	
Ø 21 	libstagefright.so 	libstagefright.so@0x836b9 	
Ø 22 	libstagefright_foundation.so 	libstagefright_foundation.so@0x7497 	
Ø 23 	libstagefright.so 	libstagefright.so@0xad8a3 	
Ø 24 	libstagefright_foundation.so 	libstagefright_foundation.so@0xa125 	
Ø 25 	libstagefright_foundation.so 	libstagefright_foundation.so@0x8409 	
Ø 26 	libstagefright_foundation.so 	libstagefright_foundation.so@0x842f 	
Ø 27 	libstagefright_foundation.so 	libstagefright_foundation.so@0xa125 	
Ø 28 	libstagefright.so 	libstagefright.so@0x8397f 	
Ø 29 	libstagefright_foundation.so 	libstagefright_foundation.so@0x6821 	
Ø 30 	libstagefright_foundation.so 	libstagefright_foundation.so@0xc01e 	
Ø 31 	libstagefright_foundation.so 	libstagefright_foundation.so@0x7609 	
32 	libutils.so 	androidCreateRawThreadEtc 	
33 		@0x63e5657e 	
34 	libutils.so 	_ZN7android6Thread3runEPKcij 	
35 		@0x64d5064e 	
Ø 36 	libc.so 	libc.so@0xe3da 	
Ø 37 	libc.so 	libc.so@0xdac6 
=============================================================
More reports: https://crash-stats.mozilla.com/report/list?product=FennecAndroid&signature=libOMX_Core.so%400x1a40

This crash has been showing up in early data for Fennec 35 Beta. It's currently #3 with 4% of our crashes (ADIs are only at 34K at this point though). The crash:installation ratio is roughly 2 to 1 right now.

I originally thought this was bug 808378, which was fixed in Fennec 25, but that's not the case. There are many reports here from multiple versions between Fennec 25 and 35.

Top Product Breakdown:
> FennecAndroid 	34.0 	59.41%
> FennecAndroid 	33.1 	17.89%
> FennecAndroid 	33.0 	3.91%
> FennecAndroid 	35.0b1 	3.44%
Again, this is mostly Mali-400 MP adapters, plus some PowerVR SGX 540 adapters, but nothing else.
Summary: crash in libOMX_Core.so@0x1a40 → Android crash in libOMX_Core.so@0x1a40 on Mali-400 MP and some PowerVR SGX 540 adapters
I'm getting one of the affected devices shipped to me (asus memopad 8 ME102A) in order to figure this out.
I received the device today and can reproduce the crash easily. I don't really have a better stack in the calling thread, but in another thread I have:

#0  0x4017582c in __futex_syscall3 () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libc.so
#1  0x4016b420 in __pthread_cond_timedwait_relative () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libc.so
#2  0x4016b47c in __pthread_cond_timedwait () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libc.so
#3  0x40038774 in android::ALooperRoster::postAndAwaitResponse(android::sp<android::AMessage> const&, android::sp<android::AMessage>*) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libstagefright_foundation.so
#4  0x40039388 in android::AMessage::postAndAwaitResponse(android::sp<android::AMessage>*) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libstagefright_foundation.so
#5  0x663f5722 in android::MediaCodec::PostAndAwaitResponse(android::sp<android::AMessage> const&, android::sp<android::AMessage>*) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libstagefright.so
#6  0x663f636c in android::MediaCodec::init(char const*, bool, bool) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libstagefright.so
#7  0x663f6438 in android::MediaCodec::CreateByType(android::sp<android::ALooper> const&, char const*, bool) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libstagefright.so
#8  0x5985321c in android::JMediaCodec::JMediaCodec(_JNIEnv*, _jobject*, char const*, bool, bool) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libmedia_jni.so
#9  0x598532ee in ?? () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libmedia_jni.so
#10 0x408a5294 in dvmPlatformInvoke () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libdvm.so
#11 0x408d4414 in dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*) () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libdvm.so
#12 0x408ae6a4 in dvmJitToInterpNoChain () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libdvm.so
#13 0x408ae6a4 in dvmJitToInterpNoChain () from /Users/snorp/source/jimdb-arm/lib/DAOKCT500586/system/lib/libdvm.so

So we're crashing in MediaCodec::CreateByType(). Not encouraging at all.
Assignee: nobody → snorp
The CreateDecoderByType() call does not crash if I call it from the main UI thread. The usage of the looper in the stack there may be a clue. I don't think we have a Looper active on the thread where we create the decoder right now. Maybe that will help?
Creating the decoder in a new thread in Java also works, even though there is no Looper. So that might not be related.

Creating the MediaCodec in Java *does* fail if I try it from GeckoThread.run(). So there is something about that thread that is making things pretty unhappy. I also noticed the following lines in logcat:

I/        (17310): new RKOMXPlugin
E/        (17310): A Component loader constructor fails. Exiting
F/libc    (17310): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 17353 (CodecLooper)
Aha. I found some random source code for the RKOMXPlugin here[0]. It tries to dlsym a bunch of OMX stuff. I'm guessing that our OmxPlugin backend is screwing that up.

[0] https://github.com/zerouid/device_rockchip_rk2818/blob/master/libstagefrighthw/RkOMXPlugin.cpp
Actually I am now thinking the custom linker might be causing a problem here.
Loading libnss3 causes the MediaCodec to later crash upon construction. I do not understand why just yet. There is no dlopen/dlsym activity while creating the MediaCodec, and we load libnss3 with CustomElf, so none of this really makes sense to me.
Mike do you have any idea what could be going on here? What happens if the system linker dlopens a library that uses sqlite3 after we've opened nss3 (which exports sqlite3 symbols) via the custom linker? That should be alright because the system linker only knows about the symbols it has loaded itself, right? How else could the loading of nss3 screw things up?
Flags: needinfo?(mh+mozilla)
After some remote-hands debugging through irc, this is where we are:
- There is a "A Component loader constructor fails. Exiting" message before the crash.
- The crash is voluntary, it's the "Exiting" part of the message above.
- The message is printed from libOMX_Core.so's RKOMX_Init function, after it called BOSA_ST_InitComponentLoader which returns an error code.
- BOSA_ST_InitComponentLoader calls dlopen and dlsym, I guess one of those is failing. We need to find which one, with what parameters, and why.
Flags: needinfo?(mh+mozilla)
What might be handy is looking at the system linker error logs with LD_DEBUG=5 (this needs to be set in the environment before dalvik initializes, so it requires a wrapper script).
I don't have root on this thing, but I'll see if I can get it and then try the LD_DEBUG stuff. What I was able to do just now, though, was figure out what stagefright/omx are trying to dlopen/dlsym.

dlsym("libstagefrighthw.so", "createOMXPlugin")
dlsym("libstagefrighthw.so", "ZN7android15createOMXPluginEv")

dlopen("libOMX_Core.so")

dlsym("libOMX_Core.so", "RKOMX_Init")
dlsym("libOMX_Core.so", "RKOMX_DeInit")
dlsym("libOMX_Core.so", "RKOMX_ComponentNameEnum")
dlsym("libOMX_Core.so", "RKOMX_GetHandle")
dlsym("libOMX_Core.so", "RKOMX_FreeHandle")
dlsym("libOMX_Core.so", "RKOMX_GetRolesOfComponent")

Then crash. If I had to guess, one of those or maybe just the last one fails for some reason. Looking further.
If I dlopen/dlsym those same things before and after nss is loaded I get the same results. Might be barking up the wrong tree?
Oh, I should add that I can resolve all of those symbols except RKOMX_DeInit.
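
The probing described above (open a library, then check which symbols resolve) can be reproduced roughly as follows. This is only a sketch using Python's ctypes on a POSIX system, not the tooling used in this bug; the helper name probe_symbols is hypothetical, and libc symbols stand in for the RKOMX_* names, since libOMX_Core.so only exists on the affected devices.

```python
import ctypes


def probe_symbols(library, names):
    """Return {name: bool} saying whether each symbol resolves in `library`.

    `library` is handed to ctypes.CDLL (dlopen); None means the symbols
    already loaded into the current process.
    """
    handle = ctypes.CDLL(library)  # dlopen(); raises OSError if it fails
    results = {}
    for name in names:
        try:
            getattr(handle, name)  # attribute lookup performs the dlsym()
            results[name] = True
        except AttributeError:
            results[name] = False
    return results


if __name__ == "__main__":
    # On the device this would be "libOMX_Core.so" with the RKOMX_* names.
    print(probe_symbols(None, ["fopen", "chdir", "no_such_symbol_12345"]))
```

Running the same probe before and after another library is loaded gives a quick way to compare results, which is the check described in the comments above.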
The question was what dlopen/dlsym are happening *from* libOMX_Core.so's BOSA_ST_InitComponentLoader
AFAICT, no dlopen/dlsym is happening in BOSA_ST_InitComponentLoader.
Can you trace through BOSA_ST_InitComponentLoader with nexti and see what function calls it does and how it exits?
And, if possible, what's different when nss is not loaded.
It turns out that BOSA_ST_InitComponentLoader (or actually BOSA_AddComponentLoader, I believe) tries to fopen("system/lib/registry", "r"). This works before loadNSSLibs() because the cwd is "/", but loadNSSLibs() changes it to GRE_HOME.

This fix is sad, but we can't intercept fopen() since the OMX code is loaded by the system linker.
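
The failure mode is that a relative path is resolved against the process-wide working directory, so code loaded later inherits whatever chdir() happened earlier. A minimal sketch of that behaviour, assuming temporary directories as stand-ins for "/" and GRE_HOME (the function names are hypothetical):

```python
import os
import tempfile


def can_open_registry(cwd):
    """Try to open the relative path "system/lib/registry" from `cwd`."""
    previous = os.getcwd()
    os.chdir(cwd)
    try:
        with open("system/lib/registry", "r"):
            return True
    except FileNotFoundError:
        return False
    finally:
        os.chdir(previous)  # leave the process cwd as we found it


def demo():
    with tempfile.TemporaryDirectory() as fake_root, \
            tempfile.TemporaryDirectory() as fake_gre_home:
        # fake_root stands in for "/", fake_gre_home for GRE_HOME.
        os.makedirs(os.path.join(fake_root, "system", "lib"))
        with open(os.path.join(fake_root, "system", "lib", "registry"), "w") as f:
            f.write("placeholder\n")
        return can_open_registry(fake_root), can_open_registry(fake_gre_home)


if __name__ == "__main__":
    print(demo())
```

demo() returns (True, False): the same relative open succeeds from the stand-in root and fails once the working directory points elsewhere.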
Attachment #8557993 - Flags: review?(mh+mozilla)
Things seem to work fine without the chdir at all, so let's just remove it instead.
Attachment #8557993 - Attachment is obsolete: true
Attachment #8557993 - Flags: review?(mh+mozilla)
Attachment #8558176 - Flags: review?(mh+mozilla)
Attachment #8558176 - Flags: review?(mh+mozilla) → review+
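
For illustration only (this is not the attached patch, whose contents are not shown here): one general way to avoid depending on the working directory is to build absolute paths for the libraries instead of calling chdir() and loading them by relative name. The helper name and the GRE_HOME value below are made up.

```python
import os


def library_paths(gre_home, names):
    """Build absolute paths for libraries under gre_home without chdir()."""
    base = os.path.abspath(gre_home)
    return [os.path.join(base, name) for name in names]


if __name__ == "__main__":
    before = os.getcwd()
    # "/data/app/example/lib" is a made-up GRE_HOME for illustration.
    print(library_paths("/data/app/example/lib", ["libnss3.so"]))
    assert os.getcwd() == before  # the process cwd was never touched
```

Because the working directory is never changed, code loaded later by the system linker still sees the original cwd.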
https://hg.mozilla.org/mozilla-central/rev/d48928891b94
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Comment on attachment 8558176 [details] [diff] [review]
Don't chdir on Android

Approval Request Comment
[Feature/regressing bug #]: 1014614
[User impact if declined]: Crashes when attempting to play video on any Rockchip device
[Describe test coverage new/current, TreeHerder]: nightly
[Risks and why]: fairly low risk, only removes an unnecessary chdir()
[String/UUID change made/needed]: none
Attachment #8558176 - Flags: approval-mozilla-release?
Attachment #8558176 - Flags: approval-mozilla-beta?
Attachment #8558176 - Flags: approval-mozilla-aurora?
I'm tracking for 36+ and flagging for 35. I haven't heard feedback about video playback on Android in 34 or 35. Without strong justification, I think this fix may be suitable for a ride along but it does not seem like a driver for a point release.
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #24)
> I'm tracking for 36+ and flagging for 35. I haven't heard feedback about
> video playback on Android in 34 or 35. Without strong justification, I think
> this fix may be suitable for a ride along but it does not seem like a driver
> for a point release.

It is the #1 top crasher for 35, and results in an instant crash when playing any mp4 video on any rockchip device. I think it is justified for a point release.
Attachment #8558176 - Flags: approval-mozilla-beta?
Attachment #8558176 - Flags: approval-mozilla-beta+
Attachment #8558176 - Flags: approval-mozilla-aurora?
Attachment #8558176 - Flags: approval-mozilla-aurora+
NI Lukas to make sure she sees that.
Flags: needinfo?(lsblakk)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #25)
> (In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #24)
> > I'm tracking for 36+ and flagging for 35. I haven't heard feedback about
> > video playback on Android in 34 or 35. Without strong justification, I think
> > this fix may be suitable for a ride along but it does not seem like a driver
> > for a point release.
> 
> It is the #1 top crasher for 35, and results in an instant crash when
> playing any mp4 video on any rockchip device. I think it is justified for a
> point release.

"topcrash" on Android is relative - Kevin, can you weigh in on the overall crash volume and how this compares in volume (esp. since it's for certain devices) to other crashes we've done/not done a dot release for?
Flags: needinfo?(lsblakk) → needinfo?(kbrosnan)
Release Note Request (optional, but appreciated)
[Why is this notable]: Fixed the number 1 top crash that
[Suggested wording]: "Fixed crash with video playback on Asus MeMO Pad 10 or 8, Tesco Hudl, Lenovo Lifetab E models and several other devices running the PowerVR SGX 540 GPU"
[Links (documentation, blog post, etc)]: only this bug



This is the number one top crash on release (24,050 crashes over 7d) and beta. On release it is the number one top crash, with nearly double the crashes of the #2 signature (~13,000). This crash is very device specific, and users with the affected video chipset are experiencing abnormally high crash rates.

Top 10 Affected devices:
asus      K00F (Asus MeMO Pad 10)
asus      K00L (Asus MeMO Pad 8)
Tesco     Hudl HT7S3
LENOVO    LIFETAB_E10320
bq        bq Edison 2 Quad Core
LENOVO    LIFETAB_E10316
LENOVO    LIFETAB_E10312
HUAWEI    MediaPad 7 Youth
Iriver    tolino tab 8.9
LENOVO    LIFETAB_E7316
relnote-firefox: --- → ?
Flags: needinfo?(kbrosnan)
The GPU is actually not a factor here. The affected devices all have a Rockchip SoC, which can apparently use either PowerVR or Mali for the GPU.
That would make the suggested relnote: "Fixed crash with video playback on Asus MeMO Pad 10 or 8, Tesco Hudl, Lenovo Lifetab E models and several other devices running the Rockchip SoC"
Oh, and this is almost 8% of all crashes in Firefox 35 for Android, FWIW.
A bit late on this thread, but here are all the verbatims since January for the top 3 devices Kevin mentioned: https://docs.google.com/a/mozilla.com/spreadsheets/d/1PoJtckgaIH88MWFS4qpOAsHKwrbs7F5LTNtxOzJh-oc/edit#gid=0
I've bolded the most relevant entries

Also, the frequency of complaints with both video+stability words is much higher on these devices than in our general feedback for Android:
+--------------+---------------+-----------+
| is_top_three | pct_crash_vid | responses |
+--------------+---------------+-----------+
|        False |        1.4670 |      7703 |
|         True |       12.5000 |        16 |
+--------------+---------------+-----------+

Let me know if I can provide any other relevant information.

Thanks
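
As a rough cross-check of the table above, the percentages can be turned back into approximate response counts and a rate ratio. This is only arithmetic on the figures quoted in the table; the helper name is hypothetical and the counts are rounded.

```python
def implied_count(pct, responses):
    """Number of responses implied by a percentage of a total."""
    return round(pct / 100 * responses)


top_three = implied_count(12.5000, 16)       # feedback from the top three devices
everyone_else = implied_count(1.4670, 7703)  # all other Android feedback
rate_ratio = 12.5000 / 1.4670

if __name__ == "__main__":
    print(top_three, everyone_else, round(rate_ratio, 1))
```

This works out to about 2 of 16 responses versus about 113 of 7703, a rate roughly 8.5 times higher on the top three devices, though it rests on only 16 responses.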
I'm convinced - let's get this uplifted and I can kick off builds first thing in my morning tomorrow (am currently in Germany so it will be very early tomorrow for PST).
Flags: needinfo?(snorp)
Attachment #8558176 - Flags: approval-mozilla-release? → approval-mozilla-release+
I have verified that this bug is in fact fixed in the 35.0.1 build.
FWIW, https://crash-stats.mozilla.com/report/list?signature=libsomxcore.so%400x1732 and https://crash-stats.mozilla.com/report/list?signature=libOMX_Core.so%400x1ab8 are crashes that are going down in overall 35.* stats together with the main one on this bug.
This has been relnoted:
"35.0.1 Fixed crash with video playback on Asus MeMO Pad 10 or 8, Tesco Hudl, Lenovo Lifetab E models and several other devices running the Rockchip SoC"