Open Bug 1804115 Opened 2 years ago Updated 1 year ago

kotlinx.coroutines.ChildHandleNode Crash in [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild]

Categories

(Fenix :: Crash Reporting, defect, P2)

Unspecified
Android
defect

Tracking

(firefox107 unaffected, firefox108 unaffected, firefox109 wontfix, firefox110 wontfix, firefox111 wontfix, firefox112 wontfix, firefox113 wontfix, firefox114 wontfix, firefox115 wontfix, firefox121 wontfix, firefox122 wontfix, firefox123 wontfix)

People

(Reporter: cpeterson, Unassigned)

Details

(Keywords: crash, leave-open, regression, Whiteboard: [geckoview:m114?])

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/b89f5422-05d2-4c9d-b821-c64370221201

This crash signature spiked in 103 and 104 and then disappeared in 105-108, but a few reports have just appeared in Nightly 109. Is this a new regression in 109?

Over the last six months, there have been 1753 crash reports and 100% of them are from API 21 (Android 5.0 Lollipop) and 99% are from 32-bit ARM.
Java stack trace:

java.lang.NoClassDefFoundError: kotlinx.coroutines.ChildHandleNode
	at kotlinx.coroutines.JobSupport.attachChild(JobSupport.kt:1)
	at kotlinx.coroutines.JobSupport.initParentJob(JobSupport.kt:11)
	at kotlinx.coroutines.AbstractCoroutine.<init>(AbstractCoroutine.kt:12)
	at kotlinx.coroutines.StandaloneCoroutine.<init>(Builders.common.kt:1)
	at kotlinx.coroutines.BuildersKt.launch(Unknown Source)
	at kotlinx.coroutines.BuildersKt.launch$default(Unknown Source)
	at mozilla.components.lib.crash.CrashReporter.submitCaughtException(CrashReporter.kt:78)
	at org.mozilla.fenix.experiments.NimbusSetupKt$createNimbus$1$1.invoke(NimbusSetup.kt:68)
	at org.mozilla.experiments.nimbus.AbstractNimbusBuilder.build(NimbusBuilder.kt:124)
	at org.mozilla.fenix.components.Analytics$experiments$2.invoke(Analytics.kt:152)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:21)
	at org.mozilla.fenix.components.Analytics.getExperiments(Analytics.kt:3)
	at org.mozilla.fenix.FenixApplication.onCreate(FenixApplication.kt:125)
	at android.app.Instrumentation.callApplicationOnCreate(Instrumentation.java:1020)
	at android.app.ActivityThread.handleBindApplication(ActivityThread.java:5007)
	at android.app.ActivityThread.access$1600(ActivityThread.java:172)
	at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1483)
	at android.os.Handler.dispatchMessage(Handler.java:102)
	at android.os.Looper.loop(Looper.java:145)
	at android.app.ActivityThread.main(ActivityThread.java:5832)
	at java.lang.reflect.Method.invoke(Native Method)
	at java.lang.reflect.Method.invoke(Method.java:372)
	at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1399)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1194)

The severity field is not set for this bug.
:cpeterson, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(cpeterson)

Looks like there was a big spike in 103, but the current crash volume is very low.

Severity: -- → S3
Flags: needinfo?(cpeterson)
Priority: -- → P5

Charlie, this crash's volume increased when 109 hit release on Jan 17.

Did you expect PR https://github.com/mozilla-mobile/fenix/pull/27951 would fix this crash? You added that PR to this bug's "See Also" list, but then removed it.

RyanVM suspects this crash is a regression from https://github.com/mozilla-mobile/fenix/pull/27934.

Component: General → Experimentation and Telemetry
Keywords: regression
Priority: P5 → P3

I added 27951 initially thinking it might be related, but through our conversation in slack I believe we determined it likely was not.

Trying to repro it now, running into issues with that.

Okay so still not able to reproduce the specific issue, but I analyzed the call chain and here's where I'm at now:

  1. The error reporter is bound to the Nimbus object during setup
  2. An error of indeterminate origin occurs within Nimbus. Most likely Rust.
  3. CrashReporter.submitCaughtException (in CrashReporter.kt) launches a coroutine
  4. We ultimately end up in initParentJob (part of the coroutine library) where it tries to attach the current coroutine to the parent job, and fails because for some reason the kotlinx.coroutines.ChildHandleNode class cannot be found.

So while yes, there is definitely some error coming back from the Nimbus Rust side of things, I'm currently unable to determine what the error is since the crash reporter's coroutine launch is crashing the application.
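
To illustrate the masking described above, here is a minimal plain-Java sketch (the real code is Kotlin and launches a coroutine). Everything in it is hypothetical: `MaskedErrorDemo`, the `ErrorReporter` interface, and the `IllegalStateException` standing in for the unknown Nimbus error. It only shows that when the reporter itself throws while handling a caught error, the reporter's failure is what surfaces and the original error is never recorded.

```java
final class MaskedErrorDemo {
    /** Hypothetical stand-in for the error reporter bound to Nimbus during setup. */
    interface ErrorReporter {
        void report(Throwable error);
    }

    /** Returns the simple class name of whatever escapes setup, or "setup ok". */
    static String runSetup(ErrorReporter reporter) {
        try {
            try {
                // Stand-in for the unknown error coming back from Nimbus.
                throw new IllegalStateException("nimbus failure (hypothetical)");
            } catch (IllegalStateException original) {
                // The caught error is handed to the reporter, which may itself throw.
                reporter.report(original);
            }
            return "setup ok";
        } catch (Throwable surfaced) {
            // On a real device nothing catches this, so the app crashes here.
            return surfaced.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Simulates the reporter failing the way the stack trace shows.
        ErrorReporter broken = error -> {
            throw new NoClassDefFoundError("kotlinx.coroutines.ChildHandleNode");
        };
        System.out.println(runSetup(broken));
    }
}
```

With the simulated broken reporter, `runSetup` returns `NoClassDefFoundError`, not the original `IllegalStateException`, which matches what the crash reports show.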

Looking at the stack trace, I'd agree with :charlie's analysis: there are two errors here, one happening somewhere in the Nimbus Rust SDK, and one in the error reporter passed in to Nimbus, i.e. the CrashReporter.

This PR catches error reporter errors and falls back to logging the Nimbus errors locally, which should give us extra information about the Nimbus error if we're lucky enough to get the logs from a device.
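
A rough sketch of that idea, in plain Java rather than the actual Kotlin patch. The names `GuardedReporter`, `guarded`, and the fallback logger are hypothetical and not taken from the PR. The one point worth noting is that `NoClassDefFoundError` is an `Error`, not an `Exception`, so the guard has to catch `Throwable`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class GuardedReporter {
    /**
     * Wraps a reporter so that any Throwable it raises is caught, and both the
     * reporter failure and the original error go to a local fallback log.
     */
    static Consumer<Throwable> guarded(Consumer<Throwable> reporter,
                                       Consumer<String> fallbackLog) {
        return error -> {
            try {
                reporter.accept(error);
            } catch (Throwable reporterFailure) {
                // Catch Throwable: NoClassDefFoundError would slip past catch (Exception).
                fallbackLog.accept("error reporter failed: " + reporterFailure);
                fallbackLog.accept("original error: " + error);
            }
        };
    }

    public static void main(String[] args) {
        // Simulated reporter that fails the same way as in the crash reports.
        Consumer<Throwable> broken = error -> {
            throw new NoClassDefFoundError("kotlinx.coroutines.ChildHandleNode");
        };
        List<String> log = new ArrayList<>();
        Consumer<Throwable> safe = guarded(broken, log::add);

        // Hypothetical Nimbus error; the real one is unknown.
        safe.accept(new IllegalStateException("nimbus failure (hypothetical)"));
        log.forEach(System.out::println);
    }
}
```

In this sketch the application keeps running and the original error ends up in the local log, which is only useful if someone can retrieve the logs from an affected device.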

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on beta

:cpeterson, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(cpeterson)
Keywords: topcrash

(In reply to James Hugman [:jhugman] [@jhugman] from comment #7)

This PR catches error reporter errors and falls back to logging the Nimbus errors locally, which should give us extra information about the Nimbus error if we're lucky enough to get the logs from a device.

James, so there's nothing we can do for this bug unless someone can reproduce the error locally and capture the device log?

It's interesting that 100% of these crash reports are from Android API 21 (Android 5.0 Lollipop) and 99% are from 32-bit ARM.

Flags: needinfo?(cpeterson) → needinfo?(jhugman)
Priority: P3 → P2

:cpeterson

James, so there's nothing we can do for this bug unless someone can reproduce the error locally and capture the device log?

I think there are two bugs here. The first is in CrashReporter.submitCaughtException; this would be triggered any time there is a caught exception on Lollipop 32-bit ARM devices.

The second is that Nimbus is reporting caught exceptions rather frequently, and trusting that the crash reporter won't itself crash.

To fix the first issue, I don't think there's anything to do unless someone can reproduce the error locally.

The attached PR addresses the second issue, which makes the crash harder to reproduce locally in a release APK.

Flags: needinfo?(jhugman)
Crash Signature: [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild(JobSupport.kt:1)] → [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild]
Summary: kotlinx.coroutines.ChildHandleNode Crash in [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild(JobSupport.kt:1)] → kotlinx.coroutines.ChildHandleNode Crash in [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild]
Crash Signature: [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild] → [@ java.lang.NoClassDefFoundError: at kotlinx.coroutines.JobSupport.attachChild(JobSupport.kt:1)]

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Hi James, I see one crash report from Fenix 112, which should have the A-S release including the PR from comment 7. Does anything look interesting in the crash report?
https://crash-stats.mozilla.org/report/index/d4e6f3cf-ebee-4317-942d-434980230322

Flags: needinfo?(jhugman)

Hi :RyanVM

This appears to be the same root cause: a NoClassDefFoundError, affecting only Samsung armeabi-v7a Android API 21 devices.

The patch from comment 7 missed the call into Rust to construct the NimbusClient, which is causing the errorReporter to be called. The patch in comment 13 fixes this.

I'm still interested in Nimbus exceptions! I'd still really like to know what was being reported to the exception handler before it dies.

Flags: needinfo?(jhugman)

(In reply to James Hugman [:jhugman] [@jhugman] from comment #14)

I'm still interested in Nimbus exceptions! I'd still really like to know what was being reported to the exception handler before it dies.

Charlie, is this something you can help with? It's not entirely clear to me what the next step is here with James' new diagnostic patch landed.

Flags: needinfo?(chumphreys)

To be clear: without a way of reporting these back to our Crash Reporting infrastructure, I don't think it's possible to find out which caught Nimbus exception prompts us to call submitCaughtException.

When the crash reporter is fixed, then the Nimbus team can continue investigating. Suggest pinging the Fenix team to prioritise and investigate.

/cc :brclark / :jmahon

Flags: needinfo?(jmahon)
Flags: needinfo?(chumphreys)
Flags: needinfo?(brclark)

IIUC, Charlie's comment 6 describes the problem that is causing the crash reporter crash in comment 0:

  4. We ultimately end up in initParentJob (part of the coroutine library) where it tries to attach the current coroutine to the parent job, and fails because for some reason the kotlinx.coroutines.ChildHandleNode class cannot be found.

So while yes, there is definitely some error coming back from the Nimbus Rust side of things, I'm currently unable to determine what the error is since the crash reporter's coroutine launch is crashing the application.

Component: Experimentation and Telemetry → Crash Reporting
Flags: needinfo?(jmahon)
Flags: needinfo?(brclark)
Whiteboard: [geckoview:m114?]

The leave-open keyword is set and there has been no activity for 6 months.
:royang, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(royang)

Waiting for 121 release to see if this crash is still an issue.

Flags: needinfo?(royang)
See Also: → 1872192

We're still receiving crash reports from 122 and 123. This crash looks like yet another case of the kotlinx.coroutines crashes (like bug 1844964, bug 1804115, and bug 1851704) on Samsung devices running Android 5.0 or 5.1, which are a known issue in the kotlinx.coroutines library: https://github.com/Kotlin/kotlinx.coroutines/issues/490
