Open Bug 1851704 Opened 2 years ago Updated 1 year ago

Crash in [@ java.lang.NoSuchFieldException: at java.lang.Class.getDeclaredField(Class.java)] org.mozilla.fenix.wallpapers.WallpapersUseCases$DefaultInitializeWallpaperUseCase

Categories

(Firefox for Android :: Homepage, defect, P3)

ARM
Android
defect

Tracking

()

Tracking Status
firefox117 --- wontfix
firefox118 --- wontfix
firefox119 --- wontfix
firefox120 --- wontfix
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix
firefox124 --- wontfix
firefox125 --- wontfix
firefox126 --- wontfix

People

(Reporter: cpeterson, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [fxdroid])

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/83d0b7d9-fca4-42ad-ac01-196840230905

Exception: java.lang.NoSuchFieldException
Message: _decision

Frame   Module  Function  Source
0   java.lang.Class   getDeclaredField  Class.java:929
1   java.util.concurrent.atomic.AtomicIntegerFieldUpdater$AtomicIntegerFieldUpdaterImpl   <init>  AtomicIntegerFieldUpdater.java:251
2   java.util.concurrent.atomic.AtomicIntegerFieldUpdater   newUpdater  AtomicIntegerFieldUpdater.java:49 
3   kotlinx.coroutines.DispatchedCoroutine  <clinit>  Builders.common.kt:5
4   kotlinx.coroutines.BuildersKt   withContext   :90 
5   org.mozilla.fenix.wallpapers.WallpapersUseCases$DefaultInitializeWallpaperUseCase   invoke  WallpapersUseCases.kt:439 
6   org.mozilla.fenix.FenixApplication$downloadWallpapers$1   invokeSuspend   FenixApplication.kt:53
7   kotlin.coroutines.jvm.internal.BaseContinuationImpl   resumeWith  ContinuationImpl.kt:9 
8   kotlinx.coroutines.DispatchedTask   run   DispatchedTask.kt:112 
9   kotlinx.coroutines.scheduling.CoroutineScheduler$Worker   run   CoroutineScheduler.kt:96

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 10 AArch64 and ARM crashes on beta
  • Top 10 AArch64 and ARM crashes on release

For more information, please visit BugBot documentation.

Keywords: topcrash

This looks to be a known issue in the stack: https://github.com/Kotlin/kotlinx.coroutines/issues/1900

please check your proguard rules.
If your rules are in order, but some users still see the crash, it's likely to be an Android-specific issue and it's worth reporting to Google issue tracker

Do we use ProGuard, Redex, or any other dex optimizers that cause symbol-renaming issues like this?

Joe is this on your team's radar?

Flags: needinfo?(jmahon)

Wasn't actively being pursued. Adding it to our working board.

Component: General → Homepage
Flags: needinfo?(jmahon)
Whiteboard: [fxdroid]

(In reply to Chris Peterson [:cpeterson] from comment #3)

please check your proguard rules.
If your rules are in order, but some users still see the crash, it's likely to be an Android-specific issue and it's worth reporting to Google issue tracker

Do we use ProGuard, Redex, or any other dex optimizers that cause symbol-renaming issues like this?

Yes, we should look into ensuring that we are using the appropriate keep rules to avoid this crash if possible, however, this happening in the same release and API version 21 as bug 1844964 is suspect that we are suffering from another root cause that is masked by these two different crash reports.

See Also: → 1844964

Note that this bug might look like a regression in August, but more likely is that it's an existing crash that was misreported and buried in the "EMPTY: no frame data available" crash reports (bug 1847429). willkg added a Socorro workaround for these "empty" crash reports and started generating proper crash signatures for reports received after August 1.

This sadly seems like a WONTFIX issue. Looking at it with fresh eyes after the bug 1844964 investigation, and I no longer think it's related.

From this spicy thread in the coroutine library (Kotlin/kotlinx.coroutines#490), this is a known issue and there is no consensus on a solution as it's a bug on the Samsung Android API 21 devices that Samsung has not responded to.

What can we do? Not much without taking some incredible amount of energy to rewrite a Samsung-only API 21 patch which I'm not eager to do.

I agree with comment 7, that this is probably not a new crash because nothing has changed in this code path either.

Jonathan, if we can't avoid hitting this exception in Kotlin code, can we catch and ignore it lower in the call stack in WallpapersUseCases$DefaultInitializeWallpaperUseCase or FenixApplication$downloadWallpapers?

Also, if this exception has been a known Kotlin issue since 2018, why did we only start hitting it in Fx 117? Did we rewrite our wallpapers code to use coroutines in 117?

Frame   Module  Function  Source
0   java.lang.Class   getDeclaredField  Class.java:929
1   java.util.concurrent.atomic.AtomicIntegerFieldUpdater$AtomicIntegerFieldUpdaterImpl   <init>  AtomicIntegerFieldUpdater.java:251
2   java.util.concurrent.atomic.AtomicIntegerFieldUpdater   newUpdater  AtomicIntegerFieldUpdater.java:49 
3   kotlinx.coroutines.DispatchedCoroutine  <clinit>  Builders.common.kt:5
4   kotlinx.coroutines.BuildersKt   withContext   :90 
5   org.mozilla.fenix.wallpapers.WallpapersUseCases$DefaultInitializeWallpaperUseCase   invoke  WallpapersUseCases.kt:439 
6   org.mozilla.fenix.FenixApplication$downloadWallpapers$1   invokeSuspend   FenixApplication.kt:53
7   kotlin.coroutines.jvm.internal.BaseContinuationImpl   resumeWith  ContinuationImpl.kt:9 
8   kotlinx.coroutines.DispatchedTask   run   DispatchedTask.kt:112 
9   kotlinx.coroutines.scheduling.CoroutineScheduler$Worker   run   CoroutineScheduler.kt:96
Flags: needinfo?(jonalmeida942)

IIUC we can catch the exception in Kotlin code if we just want this specific crash to go away. However, that would imply a few things:

  • Users on those devices won't have access to a feature without any indication as to why. Our solve here would be to basically silently fail downloading wallpapers forever.
  • The underlying issue that Samsung has caused here is liable to appear in other places throughout the app, so we either
    1. Litter the codebase with exception handling (and random dead spots in the app for users)
    2. Undertake a pretty massive (and hacky, ultimately) technical initiative to solve it more generally

IMO the short-term solution is not worth the damage to the user experience or the long-term health of our code, and the long-term investment is not worth the effort given the (shrinking) size of the device segment of Samsung users on Android 5.0.x.

(In reply to Matt Tighe [:matt-tighe] from comment #10)

  • The underlying issue that Samsung has caused here is liable to appear in other places throughout the app, so we either

Thanks Matt!

Chris, I want to emphasize this point of Matt's: we will just be moving the problem to the next place in the code base where we need to use AtomicIntegerFieldUpdater in any of our libraries (everything kotlin with a suspend function).

(In reply to Chris Peterson [:cpeterson] from comment #9)

Also, if this exception has been a known Kotlin issue since 2018, why did we only start hitting it in Fx 117? Did we rewrite our wallpapers code to use coroutines in 117?

That's a good question, I forgot to reconsider that again.

The kotlinx.coroutines 1.7.0 update had a new API implementation for Channel (coroutine API that is used internally) which is faster, but touches on this buggy code in Samsung API 21 devices whenever it needs to access the java Atomic*FieldUpdater. One of the library author's have a better explanation of the problem here.

I don't think backing it out is feasible because we have to do this update eventually (our dependencies are also updating their dependencies to it too).

The wordpress app seems to have faced this issue too and updated their min SDK version to 24. Bumping up to API 22 alone would solve a lot of bugs like this for us (even in our native rust components).

Flags: needinfo?(jonalmeida942)

Seems like the simplest and most maintainable option would indeed be to bump the minSDK version. I just added an investigation task to help us determine if there are considerable downsides to doing so, let's circle back to this after doing that investigation, before closing this as WONTFIX.

I generally get nervous about WONTFIX'ing unresolved crashes, because those crashes will still continue to pop up in our logs, so it just perpetuates noise that makes it a little harder to notice other emerging issues. (that would be my main personal justification for catching and handling this one, even if it leaves the users with a broken experience - but your counterarguments to doing so are solid, as well).

Seems like the simplest and most maintainable option would indeed be to bump the minSDK version.

Since this crash only affects users with API 21, bumping the minimum API to 22 would leave these users with an unsupported (and still crashing) Firefox.

This telemetry query says that only 0.13% of Fenix users have API 21 (vs 2.37% with API 22):

https://sql.telemetry.mozilla.org/queries/80030#198699

Priority: P1 → P2
See Also: → 1872192
See Also: → 1804115
Priority: P2 → P3
Hardware: Unspecified → ARM

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit BugBot documentation.

Severity: S2 → S3
You need to log in before you can comment on or make changes to this bug.