Closed Bug 2005500 Opened 5 months ago Closed 5 months ago

startup Crash in [@ libc.so | art::Runtime::Abort]

Categories

(GeckoView :: General, defect)

Unspecified
Android
defect

Tracking

(firefox146 fixed, firefox147 fixed, firefox148 fixed)

RESOLVED FIXED
148 Branch
Tracking Status
firefox146 --- fixed
firefox147 --- fixed
firefox148 --- fixed

People

(Reporter: aryx, Assigned: tnikkel)

References

(Regression)

Details

(4 keywords)

Crash Data

Attachments

(3 files)

This crash signature turned more frequent with the release of Firefox for Android 145. Only API versions 31-36 are reported. 80% of crashes in the first minute after launch.

6/6 reports checked had corrupted crash stacks.

Crash report: https://crash-stats.mozilla.org/report/index/0ab39e22-837e-43cf-895a-45d800251211

Reason:

SIGABRT / SI_QUEUE

Top 10 frames:

0  libc.so  libc.so@0x950b0
1  libart.so  art::Runtime::Abort(char const*)
2  libbase.so  libbase.so@0x16188
3  libbase.so  libbase.so@0x15730
4  libart.so  art::JavaVMExt::AddGlobalRef(art::Thread*, art::ObjPtr<art::mirror::Object>)
5  libart.so  art::JNI<false>::NewGlobalRef(_JNIEnv*, _jobject*)
6  libxul.so  _JNIEnv::NewGlobalRef(_jobject*)  /builds/worker/fetches/android-ndk/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/jni.h:548
6  libxul.so  mozilla::jni::GlobalRef<mozilla::java::GeckoResult>::NewGlobalRef(_JNIEnv*, _...  widget/android/jni/Refs.h:514
6  libxul.so  mozilla::jni::GlobalRef<mozilla::java::GeckoResult>::GlobalRef(mozilla::jni::...  widget/android/jni/Refs.h:540
6  libxul.so  mozilla::widget::ImageDecoderSupport::Decode(mozilla::jni::StringParam const&...  widget/android/ImageDecoderSupport.cpp:99

Bug 1988124 made changes to ImageDecoderSupport.cpp during the Fx145 cycle. I can't NI tschuster, but maybe tnikkel has cycles to look?

Component: General → Graphics: ImageLib
Flags: needinfo?(tnikkel)
Product: Firefox for Android → Core
Keywords: topcrash

We need to find someone that knows about Android JNI stuff. To me that code seems harmless. Maybe using GlobalRef() twice with the same LocalRef() is not allowed, somehow?

Edit: The line actually points to the first result = java::GeckoResult::GlobalRef(result), so probably not that.

Flags: needinfo?(bugzeeeeee)
Flags: needinfo?(m_kato)
Component: Graphics: ImageLib → General
Product: Core → GeckoView

I believe https://github.com/mozilla-firefox/firefox/commit/339eb8c6dc15 should cleanly revert as well.

If we need something quick then that might be the best option.

This assertion is global value's capacity reaches max (JNI ERROR ....).

In this situation, we cannot use Java's local reference since this is promise task. Currently although this code uses 2 lambda functions (with 2 global ref allocation), I guess that it is better to use ResolveOrRejectValue&& to merge it with one lambda function at least.

Or, we don't pass GeckoResult as parameter (using something unique id), then Java side has a something map data for GeckoResult and unique id.

Flags: needinfo?(m_kato)
Keywords: regression
Regressed by: 1988124

Sounds like a forward fix would be a bit involved and a little unclear if it would fix the issue?

I've got the revert on try server https://treeherder.mozilla.org/jobs?repo=try&revision=c397f9cc2060fe6b3159f9eb0b12ea014a7e36fd

Sounds like every time we add a new GlobalRef we could now run into crashes? That seems like a pretty big footgun to me. Thanks Timothy for trying out the revert.

Assignee: nobody → tnikkel
Status: NEW → ASSIGNED
Flags: needinfo?(tnikkel)

Okay, so I guess the plan will be land the back out on central, and then request uplift to beta and release? And file a followup to re-land this code with a fix to avoid this issue? Does that sound good?

The old code also uses one GlobalRef, isn't it possible that we would still crash, because we are so close to some internal limit?

This is the only code that has landed in the crashing file so we're pretty sure it is this changeset, even if we don't fully understand why? ie we are pretty confident the backout will fix the crashes?

What I gleaned from comment 5, is that this is some kind of global limit. So other code running before the ImageDecoder might have been added, which also uses GlobalRef and then this code is blamed for exceeding the global limit. In that case the revert might not help, because the actual problem is in another code (or rather having a global limit).

(I think we should still try to revert this anyway, but as a said, this is a huge footgun)

Many code capture "jobject" as Java's GlobalRef in lambda function. If promise is resolved or rejected definitely, it won't be leaked. Although I have fixed possible global ref leak in webrtc code last week (bug 2003417), I don't know another leak issue yet...

(In reply to Tom Schuster (MoCo) from comment #12)

What I gleaned from comment 5, is that this is some kind of global limit. So other code running before the ImageDecoder might have been added, which also uses GlobalRef and then this code is blamed for exceeding the global limit. In that case the revert might not help, because the actual problem is in another code (or rather having a global limit).

It could be other code, but the crash spike started with the release of 145, in which bug 1988124 landed, so there would have to be something else in 145 that pushed it over the limit.

(I think we should still try to revert this anyway, but as a said, this is a huge footgun)

We should fix the footgun, but right now we need as quick and simple of a fix as possible to fix this top crash. It doesn't block us working on a fix on the footgun.

Before landing bug 1988124, we hold GeckoResult as global ref. If we still use MozPromise, we can try global weak ref (mozilla::jni::Object::WeakRef etc) instead, but I don't test it yet.

See Also: → 2005723
Pushed by tnikkel@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/2239f1d5ffba https://hg.mozilla.org/integration/autoland/rev/79bddf5152a1 Backed out ee76a3709456 and bcbd97f8ecbb of bug 1988124 for causing crashes. r=tschuster,geckoview-reviewers,m_kato
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → 148 Branch
Attachment #9532701 - Flags: approval-mozilla-beta?

firefox-beta Uplift Approval Request

  • User impact if declined: top crash
  • Code covered by automated testing: no
  • Fix verified in Nightly: no
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing:
  • Risk associated with taking this patch: low
  • Explanation of risk level: backout of patch to previous state we had for quite a while
  • String changes made/needed: none
  • Is Android affected?: yes

firefox-release Uplift Approval Request

  • User impact if declined: top crash
  • Code covered by automated testing: no
  • Fix verified in Nightly: no
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing:
  • Risk associated with taking this patch: low
  • Explanation of risk level: backout of patch to previous state we had for quite a while
  • String changes made/needed: none
  • Is Android affected?: yes
Attachment #9532702 - Flags: approval-mozilla-release?
Attachment #9532701 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9532702 - Flags: approval-mozilla-release? → approval-mozilla-release+
Flags: needinfo?(bugzeeeeee)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: