startup Crash in [@ libc.so | art::Runtime::Abort]
Categories
(GeckoView :: General, defect)
Tracking
(firefox146 fixed, firefox147 fixed, firefox148 fixed)
People
(Reporter: aryx, Assigned: tnikkel)
References
(Regression)
Details
(4 keywords)
Crash Data
Attachments
(3 files)
|
48 bytes,
text/x-phabricator-request
|
Details | |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-beta+
|
Details |
|
48 bytes,
text/x-phabricator-request
|
phab-bot
:
approval-mozilla-release+
|
Details |
This crash signature turned more frequent with the release of Firefox for Android 145. Only API versions 31-36 are reported. 80% of crashes in the first minute after launch.
6/6 reports checked had corrupted crash stacks.
Crash report: https://crash-stats.mozilla.org/report/index/0ab39e22-837e-43cf-895a-45d800251211
Reason:
SIGABRT / SI_QUEUE
Top 10 frames:
0 libc.so libc.so@0x950b0
1 libart.so art::Runtime::Abort(char const*)
2 libbase.so libbase.so@0x16188
3 libbase.so libbase.so@0x15730
4 libart.so art::JavaVMExt::AddGlobalRef(art::Thread*, art::ObjPtr<art::mirror::Object>)
5 libart.so art::JNI<false>::NewGlobalRef(_JNIEnv*, _jobject*)
6 libxul.so _JNIEnv::NewGlobalRef(_jobject*) /builds/worker/fetches/android-ndk/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/jni.h:548
6 libxul.so mozilla::jni::GlobalRef<mozilla::java::GeckoResult>::NewGlobalRef(_JNIEnv*, _... widget/android/jni/Refs.h:514
6 libxul.so mozilla::jni::GlobalRef<mozilla::java::GeckoResult>::GlobalRef(mozilla::jni::... widget/android/jni/Refs.h:540
6 libxul.so mozilla::widget::ImageDecoderSupport::Decode(mozilla::jni::StringParam const&... widget/android/ImageDecoderSupport.cpp:99
Comment 1•5 months ago
|
||
Bug 1988124 made changes to ImageDecoderSupport.cpp during the Fx145 cycle. I can't NI tschuster, but maybe tnikkel has cycles to look?
Updated•5 months ago
|
Comment 2•5 months ago
•
|
||
We need to find someone that knows about Android JNI stuff. To me that code seems harmless. Maybe using GlobalRef() twice with the same LocalRef() is not allowed, somehow?
Edit: The line actually points to the first result = java::GeckoResult::GlobalRef(result), so probably not that.
Updated•5 months ago
|
Updated•5 months ago
|
Updated•5 months ago
|
Comment 3•5 months ago
|
||
I believe https://github.com/mozilla-firefox/firefox/commit/339eb8c6dc15 should cleanly revert as well.
| Assignee | ||
Comment 4•5 months ago
|
||
If we need something quick then that might be the best option.
Comment 5•5 months ago
|
||
This assertion is global value's capacity reaches max (JNI ERROR ....).
In this situation, we cannot use Java's local reference since this is promise task. Currently although this code uses 2 lambda functions (with 2 global ref allocation), I guess that it is better to use ResolveOrRejectValue&& to merge it with one lambda function at least.
Or, we don't pass GeckoResult as parameter (using something unique id), then Java side has a something map data for GeckoResult and unique id.
| Assignee | ||
Updated•5 months ago
|
| Assignee | ||
Comment 6•5 months ago
|
||
Sounds like a forward fix would be a bit involved and a little unclear if it would fix the issue?
I've got the revert on try server https://treeherder.mozilla.org/jobs?repo=try&revision=c397f9cc2060fe6b3159f9eb0b12ea014a7e36fd
Comment 7•5 months ago
|
||
Sounds like every time we add a new GlobalRef we could now run into crashes? That seems like a pretty big footgun to me. Thanks Timothy for trying out the revert.
| Assignee | ||
Comment 8•5 months ago
|
||
Updated•5 months ago
|
| Assignee | ||
Updated•5 months ago
|
| Assignee | ||
Comment 9•5 months ago
|
||
Okay, so I guess the plan will be land the back out on central, and then request uplift to beta and release? And file a followup to re-land this code with a fix to avoid this issue? Does that sound good?
Comment 10•5 months ago
|
||
The old code also uses one GlobalRef, isn't it possible that we would still crash, because we are so close to some internal limit?
| Assignee | ||
Comment 11•5 months ago
|
||
This is the only code that has landed in the crashing file so we're pretty sure it is this changeset, even if we don't fully understand why? ie we are pretty confident the backout will fix the crashes?
Comment 12•5 months ago
|
||
What I gleaned from comment 5, is that this is some kind of global limit. So other code running before the ImageDecoder might have been added, which also uses GlobalRef and then this code is blamed for exceeding the global limit. In that case the revert might not help, because the actual problem is in another code (or rather having a global limit).
(I think we should still try to revert this anyway, but as a said, this is a huge footgun)
Comment 13•5 months ago
|
||
Many code capture "jobject" as Java's GlobalRef in lambda function. If promise is resolved or rejected definitely, it won't be leaked. Although I have fixed possible global ref leak in webrtc code last week (bug 2003417), I don't know another leak issue yet...
| Assignee | ||
Comment 14•5 months ago
|
||
(In reply to Tom Schuster (MoCo) from comment #12)
What I gleaned from comment 5, is that this is some kind of global limit. So other code running before the ImageDecoder might have been added, which also uses GlobalRef and then this code is blamed for exceeding the global limit. In that case the revert might not help, because the actual problem is in another code (or rather having a global limit).
It could be other code, but the crash spike started with the release of 145, in which bug 1988124 landed, so there would have to be something else in 145 that pushed it over the limit.
(I think we should still try to revert this anyway, but as a said, this is a huge footgun)
We should fix the footgun, but right now we need as quick and simple of a fix as possible to fix this top crash. It doesn't block us working on a fix on the footgun.
Comment 15•5 months ago
|
||
Before landing bug 1988124, we hold GeckoResult as global ref. If we still use MozPromise, we can try global weak ref (mozilla::jni::Object::WeakRef etc) instead, but I don't test it yet.
Comment 16•5 months ago
|
||
Comment 17•5 months ago
|
||
| bugherder | ||
| Assignee | ||
Comment 18•5 months ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D276153
Updated•5 months ago
|
Comment 19•5 months ago
|
||
firefox-beta Uplift Approval Request
- User impact if declined: top crash
- Code covered by automated testing: no
- Fix verified in Nightly: no
- Needs manual QE test: no
- Steps to reproduce for manual QE testing:
- Risk associated with taking this patch: low
- Explanation of risk level: backout of patch to previous state we had for quite a while
- String changes made/needed: none
- Is Android affected?: yes
Comment 20•5 months ago
|
||
firefox-release Uplift Approval Request
- User impact if declined: top crash
- Code covered by automated testing: no
- Fix verified in Nightly: no
- Needs manual QE test: no
- Steps to reproduce for manual QE testing:
- Risk associated with taking this patch: low
- Explanation of risk level: backout of patch to previous state we had for quite a while
- String changes made/needed: none
- Is Android affected?: yes
| Assignee | ||
Comment 21•5 months ago
|
||
Original Revision: https://phabricator.services.mozilla.com/D276153
Updated•5 months ago
|
Updated•5 months ago
|
Comment 22•5 months ago
|
||
| uplift | ||
Updated•5 months ago
|
Updated•5 months ago
|
Comment 23•5 months ago
|
||
| uplift | ||
Updated•3 months ago
|
Description
•