Closed Bug 1733757 Opened 3 years ago Closed 3 years ago

Crash in [@ ffi_support::handle_map::ConcurrentHandleMap<T>::insert]

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Unspecified
Android
defect

Tracking

(firefox-esr91 unaffected, firefox93 unaffected, firefox94+ verified, firefox95+ verified)

VERIFIED FIXED
Tracking Status
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 + verified
firefox95 + verified

People

(Reporter: valentin, Assigned: janerik)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 2 obsolete files)

Crash report: https://crash-stats.mozilla.org/report/index/5e834290-9eb1-4d57-bad7-f90750211002

MOZ_CRASH Reason: Bug: next_index pointed at non-free list entry (or end of list)

Top 10 frames of crashing thread:

0 libxul.so RustMozCrash mozglue/static/rust/wrappers.cpp:18
1 libxul.so mozglue_static::panic_hook mozglue/static/rust/lib.rs:91
2 libxul.so core::ops::function::Fn::call library/core/src/ops/function.rs:70
3 libxul.so std::panicking::rust_panic_with_hook library/std/src/panicking.rs:626
4 libxul.so std::panicking::begin_panic::{{closure}} library/std/src/panicking.rs:542
5 libxul.so std::sys_common::backtrace::__rust_end_short_backtrace library/std/src/sys_common/backtrace.rs:141
6 libxul.so std::panicking::begin_panic library/std/src/panicking.rs:541
7 libxul.so ffi_support::handle_map::ConcurrentHandleMap<T>::insert third_party/rust/ffi-support/src/handle_map.rs:762
8 libxul.so glean_labeled_counter_metric_get third_party/rust/glean-ffi/src/labeled.rs:77
9 libjnidispatch.so libjnidispatch.so@0x10154 
Component: Glean Platform → Glean: SDK

I'm not able to reproduce in the emulator right now.
Talked to rpapa.
We're now at 1000+ crashes in crash-stats.

We landed the switch to GeckoView-provided Glean on 2021-09-30: https://github.com/mozilla-mobile/android-components/pull/11062
This got into Fenix shortly after.

Code-wise this panic comes from ffi-support: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L409
This shouldn't trigger under normal behavior.
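For illustration, here is a minimal sketch of the free-list invariant behind that panic message. This is an assumed simplification, not the actual ffi-support source: vacant slots form an in-place free list, each storing the index of the next vacant slot, and insert() pops the head of that list. Finding an occupied slot there means the map's internal bookkeeping is broken, which is what "Bug: next_index pointed at non-free list entry" reports.

    // Simplified sketch of a handle map with an in-place free list.
    enum Entry<T> {
        Occupied(T),
        Vacant { next_free: usize },
    }

    struct HandleMap<T> {
        entries: Vec<Entry<T>>,
        first_free: usize, // index of the first vacant slot, or entries.len() if none
    }

    impl<T> HandleMap<T> {
        fn insert(&mut self, value: T) -> usize {
            let index = self.first_free;
            if index == self.entries.len() {
                // No vacant slot available: grow the backing storage instead.
                self.entries.push(Entry::Occupied(value));
                self.first_free = self.entries.len();
                return index;
            }
            let next = match self.entries[index] {
                Entry::Vacant { next_free } => next_free,
                // The invariant violation the crash reports: the free-list head
                // points at a slot that is not actually free.
                Entry::Occupied(_) => {
                    panic!("Bug: next_index pointed at non-free list entry (or end of list)")
                }
            };
            self.first_free = next;
            self.entries[index] = Entry::Occupied(value);
            index
        }
    }

    fn main() {
        let mut map = HandleMap { entries: Vec::new(), first_free: 0 };
        let handle = map.insert("counter".to_string());
        println!("allocated handle {}", handle);
    }

Under normal operation the free list only ever points at vacant slots, so hitting this panic means the map's state got corrupted or one of its capacity/bookkeeping assumptions was violated.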

rpapa started a document to also look into recent test crashes: https://docs.google.com/document/d/11dLgCTIQ30edwsOCavnjL4a0Du5ksrFukNi7LaUfuiA/edit#

From latest log-cat

java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String okhttp3.mockwebserver.RecordedRequest.getPath()' on a null object reference
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent(BaselinePingTest.kt:92)
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent$default(BaselinePingTest.kt:83)
at org.mozilla.samples.glean.pings.BaselinePingTest.validateBaselinePing(BaselinePingTest.kt:125)
Assignee: nobody → jrediger
Priority: -- → P1

(In reply to Aaron Train [:aaronmt] from comment #2)

From latest log-cat

java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String okhttp3.mockwebserver.RecordedRequest.getPath()' on a null object reference
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent(BaselinePingTest.kt:92)
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent$default(BaselinePingTest.kt:83)
at org.mozilla.samples.glean.pings.BaselinePingTest.validateBaselinePing(BaselinePingTest.kt:125)

That looks like a test failure though (the test not receiving a ping when it expects one).
If there's no Gecko crash recorded in that test run, I suppose it's a different issue.


I tried creating a test on Fenix and running that on a physical device, but so far that didn't trigger a crash.

Crash Signature: [@ ffi_support::handle_map::ConcurrentHandleMap<T>::insert]
See Also: → 1734419

QA was able to reliably hit this, see https://github.com/mozilla-mobile/fenix/issues/21767

Flags: needinfo?(jrediger)

This is huge! I reproduced it on my test phone now, just by waiting 9 minutes.
That might help me drill down into what's happening. Thanks!

Flags: needinfo?(jrediger)

This includes a bugfix preventing a crash on Fenix,
a bugfix preventing a panic when trying to flush ping-lifetime data after shutdown,
as well as an API change to make glean::persist_ping_lifetime_data async.
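For context, a rough sketch of what "making persist_ping_lifetime_data async" means in practice: instead of writing ping-lifetime data synchronously on the caller's thread, the work is queued onto a worker. The Dispatcher and persist_to_disk names below are illustrative assumptions, not the real Glean internals.

    use std::sync::mpsc::{channel, Sender};
    use std::thread;

    // Tiny task queue standing in for a dispatcher.
    struct Dispatcher {
        sender: Sender<Box<dyn FnOnce() + Send>>,
    }

    impl Dispatcher {
        fn new() -> Self {
            let (sender, receiver) = channel::<Box<dyn FnOnce() + Send>>();
            thread::spawn(move || {
                // Run queued tasks until the sender side is dropped.
                for task in receiver {
                    task();
                }
            });
            Dispatcher { sender }
        }

        fn launch(&self, task: impl FnOnce() + Send + 'static) {
            // Ignore the error if the worker is already gone (e.g. after shutdown).
            let _ = self.sender.send(Box::new(task));
        }
    }

    // "Async" in the fire-and-forget sense: the caller returns immediately
    // instead of blocking (or panicking) on the write after shutdown.
    fn persist_ping_lifetime_data(dispatcher: &Dispatcher) {
        dispatcher.launch(|| persist_to_disk());
    }

    fn persist_to_disk() {
        // Placeholder for the actual write of ping-lifetime data.
    }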

Attached file Using this bug for n (obsolete) —
Attachment #9245266 - Attachment is obsolete: true
Attachment #9245267 - Attachment is obsolete: true

Using this bug for noting some of the context.

Some of it is captured in the extensive commit message in Glean, too.

Timeline:

  • 2020-05-26: We landed a commit to remove finalizers from our Kotlin classes.
    • Reason: Kotlin finalizers are unreliable, caused crashes and largely wouldn't be run under normal use anyway, except when the app quits.
  • 2020-05-27: Glean v26.0.0 was released, containing above fix.
  • 2021-09-30: We landed a PR in a-c to switch from libglean.so to a geckoview-provided library.
  • 2021-10-02: This issue is filed, showing a rapid increase in crashes reported from Fenix Nightly.
  • 2021-10-04: Due to the weekend I only get to take a look on Monday. Able to identify where the crash is happening, but unable to reproduce it or understand why it's happening.
  • 2021-10-05: Out of desperation, trying to add a test to Fenix exercising the relevant code paths and running it on physical devices in our Firebase account. No luck triggering a crash.
  • 2021-10-06: I pick up a loaner device, hoping to reproduce the crash (previously I had only used the Android Emulator).
  • 2021-10-07: QA is able to reproduce the crash and provides simple Steps to Reproduce.
    • Soon after, I'm able to reproduce the issue on my loaner phone at the end of my work day.
  • 2021-10-08: Picking up this work again, instrumenting the relevant code paths with more logging. While doing so I notice the size limitations of the library we're using.
    • This starts to give me an idea of what's going on.
    • Continuous testing allows me to track it down to the root cause: rapid re-creation of metric objects (see the sketch after this timeline).
    • The rest of the day is spent writing a fix and testing it out. By the end of the day my PR is up for review.
  • 2021-10-11: I land this work, do a Glean release and push it into mozilla-central right after.
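As a very rough illustration of that root cause (hypothetical code, not the real glean-ffi): every (re)creation of a metric object across the FFI inserts a fresh entry into a handle map with a bounded index space, so rapidly re-creating metric objects without freeing the old handles eventually exhausts the map and trips the panic from the crash signature.

    // Illustrative limit only; the real library's capacity is different.
    const MAX_HANDLES: usize = 1 << 16;

    struct BoundedHandleMap {
        live: usize,
    }

    impl BoundedHandleMap {
        fn insert(&mut self) -> usize {
            // Stand-in for the real free-list panic once the map is exhausted.
            assert!(self.live < MAX_HANDLES, "handle map exhausted");
            self.live += 1;
            self.live
        }
    }

    // Each call allocates a brand-new handle; nothing ever frees the old one.
    fn get_labeled_counter(map: &mut BoundedHandleMap) -> usize {
        map.insert()
    }

    fn main() {
        let mut map = BoundedHandleMap { live: 0 };
        // "Rapid re-creation of metric objects": e.g. page content repeatedly
        // triggering recordings that create a new metric object each time.
        for _ in 0..MAX_HANDLES {
            let _handle = get_labeled_counter(&mut map);
        }
        // The next creation exceeds the limit and panics, mirroring the crash.
        let _ = get_labeled_counter(&mut map);
    }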

Which channels are affected?

Fenix Nightly & Beta. Beta since the merge day last week.
We're planning to uplift the changes, which will fix Beta before it goes into Release.

Why didn't we notice this crash before?

That's unclear at the moment.
I think with the move to having Glean in GeckoView we're only now getting these crashes through the Gecko crash reporter.
Before that, I assume we would have gotten it in Sentry, but I can't find anything relevant there.
I'm reaching out to Android folks to see if they have an idea.

How frequent is it really?

I don't know for sure.
The rapid increase in crashes after my work was merged made this urgent in my eyes,
but I'm not sure about overall crash numbers for Fenix.
By raw numbers of affected users it's low compared to total users,
but this again might be biased by not getting data from everyone.

What's the impact on other metrics?

Given that I don't know how many users are really affected, it's hard to say.
To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.
It all depends on the website content and whether that triggers certain metric recordings.

(In reply to Jan-Erik Rediger [:janerik] from comment #15)

To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.

I got a lot of crashes by watching Youtube in the browser (sometimes in the background with the Video Background Play Fix addon).

(In reply to Valentin Gosu [:valentin] (he/him) from comment #16)

(In reply to Jan-Erik Rediger [:janerik] from comment #15)

To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.

I got a lot of crashes by watching Youtube in the browser (sometimes in the background with the Video Background Play Fix addon).

Thanks! That's indeed a more likely use case I hadn't considered.

Update on "Why didn't we notice this crash before?"

Because we didn't crash the full browser before!
I took Fenix 93.0.0-beta.4 for a test run in the emulator. After 9 minutes or so the logcat dumped this on me:

           libglean_ffi  E  ffi_support::error: Caught a panic calling rust code: "Bug: next_index pointed at non-free list entry (or end of list)"
                         E  glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): Bug: next_index pointed at non-free list entry (or end of list)
                         E  ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""
                         E  glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"
                         E  ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""

This gets repeated every time Glean tries to record anything, or really do anything.
That means that on older versions, since at least May 2020, we've likely been missing some data whenever the browser runs into this bug.
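The repeated PoisonError lines follow from the first panic: in Rust, a panic while a Mutex is held poisons it, and every later lock() returns Err(PoisonError). A minimal sketch of that cascade (assumed shape, not the actual Glean locking code):

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        let state = Arc::new(Mutex::new(0u32));

        // Simulate the first panic (the handle-map bug) happening while the lock is held.
        let poisoner = Arc::clone(&state);
        let _ = thread::spawn(move || {
            let _guard = poisoner.lock().unwrap();
            panic!("Bug: next_index pointed at non-free list entry (or end of list)");
        })
        .join();

        // Every later attempt to use the shared state now fails with PoisonError,
        // which is what each subsequent Glean call logs above.
        match state.lock() {
            Ok(_) => println!("unreachable in this sketch"),
            Err(e) => eprintln!("Glean-style failure: {}", e),
        }
    }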

It's also clearer to me now why:
We're using a dispatcher, running these tasks in another thread.
Glean panics, crashing that thread, but this does not affect the main application beyond that.
With the move to ship Glean through GeckoView, Rust panics get translated into aborts, ultimately hitting MOZ_CRASH and taking down the full browser.
That's a change in behavior we hadn't considered.
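A small sketch of that behavior change, under the assumption that the difference comes down to how panics are handled: with the default panic = "unwind" and a catching dispatcher, a panic only kills the worker thread; built so that panics abort (or with an aborting panic hook, as the RustMozCrash frame in the stack above shows for Gecko), the same panic takes down the whole process as a MOZ_CRASH.

    use std::thread;

    fn main() {
        let worker = thread::spawn(|| {
            // Stand-in for a Glean task running on the dispatcher thread.
            panic!("Bug: next_index pointed at non-free list entry (or end of list)");
        });

        // panic = "unwind": join() reports the panic as an Err and the app keeps running.
        // panic = "abort" (or an aborting panic hook): we never get here; the process is gone.
        match worker.join() {
            Err(_) => eprintln!("worker thread panicked, but the app keeps running"),
            Ok(_) => {}
        }
    }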

Glean should not panic under normal operation, only for actual bugs. So I do think the new behavior is right; it's just unfortunate we didn't have enough time to catch it in Nightly.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Please nominate this for Beta approval when you get a chance.

Flags: needinfo?(jrediger)

It seems I can't use the approval-mozilla-release flag here at the moment, so I'm going the manual way again.

Beta/Release Uplift Approval Request

  • User impact if declined: top crasher for Fenix on Nightly & Beta
  • Is this code covered by automated tests?: Yes, tests are in the Glean repository
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: -
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): New Glean version, with automated tests in the Glean repository. Additionally manually tested in local builds.
  • String changes made/needed:
Flags: needinfo?(jrediger) → needinfo?(ryanvm)

Note: Bug 1732988 also requires the Glean version upgrade uplift. That's covered by this request.

Comment on attachment 9245218 [details]
Bug 1733757 - Update to Glean v42.0.1. r?Dexter

Beta/Release Uplift Approval Request

  • User impact if declined: top crasher for Fenix on Nightly & Beta
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): New Glean version, with automated tests in the Glean repository. Additionally manually tested in local builds.
  • String changes made/needed:
Attachment #9245218 - Flags: approval-mozilla-beta?

We have flag support now, so I figured we could do it properly this time.

Flags: needinfo?(ryanvm)

Comment on attachment 9245218 [details]
Bug 1733757 - Update to Glean v42.0.1. r?Dexter

Approved for Desktop 94.0b5 & Fenix 94.0.0-beta.3.

Attachment #9245218 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

No new crashes on Beta or Nightly since this landed \m/

Status: RESOLVED → VERIFIED
See Also: → 1732988