Closed Bug 1733757 Opened 3 years ago Closed 3 years ago

Crash in [@ ffi_support::handle_map::ConcurrentHandleMap<T>::insert]

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Unspecified
Android
defect

Tracking

(firefox-esr91 unaffected, firefox93 unaffected, firefox94+ verified, firefox95+ verified)

VERIFIED FIXED
Tracking Status
firefox-esr91 --- unaffected
firefox93 --- unaffected
firefox94 + verified
firefox95 + verified

People

(Reporter: valentin, Assigned: janerik)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 2 obsolete files)

Crash report: https://crash-stats.mozilla.org/report/index/5e834290-9eb1-4d57-bad7-f90750211002

MOZ_CRASH Reason: Bug: next_index pointed at non-free list entry (or end of list)

Top 10 frames of crashing thread:

0 libxul.so RustMozCrash mozglue/static/rust/wrappers.cpp:18
1 libxul.so mozglue_static::panic_hook mozglue/static/rust/lib.rs:91
2 libxul.so core::ops::function::Fn::call library/core/src/ops/function.rs:70
3 libxul.so std::panicking::rust_panic_with_hook library/std/src/panicking.rs:626
4 libxul.so std::panicking::begin_panic::{{closure}} library/std/src/panicking.rs:542
5 libxul.so std::sys_common::backtrace::__rust_end_short_backtrace library/std/src/sys_common/backtrace.rs:141
6 libxul.so std::panicking::begin_panic library/std/src/panicking.rs:541
7 libxul.so ffi_support::handle_map::ConcurrentHandleMap<T>::insert third_party/rust/ffi-support/src/handle_map.rs:762
8 libxul.so glean_labeled_counter_metric_get third_party/rust/glean-ffi/src/labeled.rs:77
9 libjnidispatch.so libjnidispatch.so@0x10154 
Component: Glean Platform → Glean: SDK

I'm not able to reproduce in the emulator right now.
Talked to rpapa.
We're now at 1000+ crashes in crash-stats.

We landed the switch to GeckoView-provided Glean on 2021-09-30: https://github.com/mozilla-mobile/android-components/pull/11062
This got into Fenix shortly after.

Code-wise this panic comes from ffi-support: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L409
This shouldn't trigger under normal behavior.
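For illustration, here is a minimal sketch of the free-list invariant behind that panic message. This is an assumed simplification, not the actual ffi-support source: vacant slots form an in-place free list, each storing the index of the next vacant slot, and insert() pops the head of that list. Finding an occupied slot there means the map's internal bookkeeping is broken, which is what "Bug: next_index pointed at non-free list entry" reports.

    // Simplified sketch of a handle map with an in-place free list.
    enum Entry<T> {
        Occupied(T),
        Vacant { next_free: usize },
    }

    struct HandleMap<T> {
        entries: Vec<Entry<T>>,
        first_free: usize, // index of the first vacant slot, or entries.len() if none
    }

    impl<T> HandleMap<T> {
        fn insert(&mut self, value: T) -> usize {
            let index = self.first_free;
            if index == self.entries.len() {
                // No vacant slot available: grow the backing storage instead.
                self.entries.push(Entry::Occupied(value));
                self.first_free = self.entries.len();
                return index;
            }
            let next = match self.entries[index] {
                Entry::Vacant { next_free } => next_free,
                // The invariant violation the crash reports: the free-list head
                // points at a slot that is not actually free.
                Entry::Occupied(_) => {
                    panic!("Bug: next_index pointed at non-free list entry (or end of list)")
                }
            };
            self.first_free = next;
            self.entries[index] = Entry::Occupied(value);
            index
        }
    }

    fn main() {
        let mut map = HandleMap { entries: Vec::new(), first_free: 0 };
        let handle = map.insert("counter".to_string());
        println!("allocated handle {}", handle);
    }

Under normal operation the free list only ever points at vacant slots, so hitting this panic means the map's state got corrupted or one of its capacity/bookkeeping assumptions was violated.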

rpapa started a document to also look into recent test crashes: https://docs.google.com/document/d/11dLgCTIQ30edwsOCavnjL4a0Du5ksrFukNi7LaUfuiA/edit#

From latest log-cat

java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String okhttp3.mockwebserver.RecordedRequest.getPath()' on a null object reference
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent(BaselinePingTest.kt:92)
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent$default(BaselinePingTest.kt:83)
at org.mozilla.samples.glean.pings.BaselinePingTest.validateBaselinePing(BaselinePingTest.kt:125)
Assignee: nobody → jrediger
Priority: -- → P1

(In reply to Aaron Train [:aaronmt] from comment #2)

From latest log-cat

java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String okhttp3.mockwebserver.RecordedRequest.getPath()' on a null object reference
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent(BaselinePingTest.kt:92)
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent$default(BaselinePingTest.kt:83)
at org.mozilla.samples.glean.pings.BaselinePingTest.validateBaselinePing(BaselinePingTest.kt:125)

That looks like a test failure though (the test not receiving a ping when it expects one).
If there's no Gecko crash recorded in that test run, I suppose it's a different issue.


I tried creating a test on Fenix and running that on a physical device, but so far that didn't trigger a crash.

Crash Signature: [@ ffi_support::handle_map::ConcurrentHandleMap<T>::insert]
See Also: → 1734419

QA was able to reliably hit this, see https://github.com/mozilla-mobile/fenix/issues/21767

Flags: needinfo?(jrediger)

This is huge! I reproduced it on my test phone now, just by waiting 9 minutes.
That might help me drill down into what's happening. Thanks!

Flags: needinfo?(jrediger)

This includes a bugfix preventing a crash on Fenix,
a bugfix preventing a panic when trying to flush ping-lifetime data after shutdown,
as well as an API change to make glean::persist_ping_lifetime_data async.
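For context, a rough sketch of what "making persist_ping_lifetime_data async" means in practice: instead of writing ping-lifetime data synchronously on the caller's thread, the work is queued onto a worker. The Dispatcher and persist_to_disk names below are illustrative assumptions, not the real Glean internals.

    use std::sync::mpsc::{channel, Sender};
    use std::thread;

    // Tiny task queue standing in for a dispatcher.
    struct Dispatcher {
        sender: Sender<Box<dyn FnOnce() + Send>>,
    }

    impl Dispatcher {
        fn new() -> Self {
            let (sender, receiver) = channel::<Box<dyn FnOnce() + Send>>();
            thread::spawn(move || {
                // Run queued tasks until the sender side is dropped.
                for task in receiver {
                    task();
                }
            });
            Dispatcher { sender }
        }

        fn launch(&self, task: impl FnOnce() + Send + 'static) {
            // Ignore the error if the worker is already gone (e.g. after shutdown).
            let _ = self.sender.send(Box::new(task));
        }
    }

    // "Async" in the fire-and-forget sense: the caller returns immediately
    // instead of blocking (or panicking) on the write after shutdown.
    fn persist_ping_lifetime_data(dispatcher: &Dispatcher) {
        dispatcher.launch(|| persist_to_disk());
    }

    fn persist_to_disk() {
        // Placeholder for the actual write of ping-lifetime data.
    }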

Attached file Using this bug for n (obsolete) —
Attachment #9245266 - Attachment is obsolete: true
Attachment #9245267 - Attachment is obsolete: true

Using this bug for noting some of the context.

Some of it is captured in the extensive commit message in Glean, too.

Timeline:

  • 2020-05-26: We landed a commit to remove finalizers from our Kotlin classes.
    • Reason: Kotlin finalizers are unreliable, caused crashes and largely wouldn't be run under normal use anyway, except when the app quits.
  • 2020-05-27: Glean v26.0.0 was released, containing above fix.
  • 2021-09-30: We landed a PR in a-c to switch from libglean.so to a geckoview-provided library.
  • 2021-10-02: This issue is filed, showing a rapid increase in crashes reported from Fenix Nightly.
  • 2021-10-04: Due to the weekend I only get to take a look on Monday. Able to identify where the crash is happening, but unable to reproduce it or understand why it's happening.
  • 2021-10-05: Out of desperation, trying to add a test to Fenix exercising the relevant code paths and running it on physical devices in our Firebase account. No luck triggering a crash.
  • 2021-10-06: I pick up a loaner device, hoping to reproduce the crash (previously I had only used the Android Emulator).
  • 2021-10-07: QA is able to reproduce the crash and provides simple Steps to Reproduce.
    • Soon after, I'm able to reproduce the issue on my loaner phone at the end of my work day.
  • 2021-10-08: Picking up this work again, instrumenting the relevant code paths with more logging. While doing so I notice the size limitations of the library we're using.
    • This starts to give me an idea of what's going on.
    • Continuous testing allows me to track it down to the root cause: rapid re-creation of metric objects (see the sketch after this timeline).
    • The rest of the day is spent writing a fix and testing it out. By the end of the day my PR is up for review.
  • 2021-10-11: I land this work, do a Glean release and push it into mozilla-central right after.
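As a very rough illustration of that root cause (hypothetical code, not the real glean-ffi): every (re)creation of a metric object across the FFI inserts a fresh entry into a handle map with a bounded index space, so rapidly re-creating metric objects without freeing the old handles eventually exhausts the map and trips the panic from the crash signature.

    // Illustrative limit only; the real library's capacity is different.
    const MAX_HANDLES: usize = 1 << 16;

    struct BoundedHandleMap {
        live: usize,
    }

    impl BoundedHandleMap {
        fn insert(&mut self) -> usize {
            // Stand-in for the real free-list panic once the map is exhausted.
            assert!(self.live < MAX_HANDLES, "handle map exhausted");
            self.live += 1;
            self.live
        }
    }

    // Each call allocates a brand-new handle; nothing ever frees the old one.
    fn get_labeled_counter(map: &mut BoundedHandleMap) -> usize {
        map.insert()
    }

    fn main() {
        let mut map = BoundedHandleMap { live: 0 };
        // "Rapid re-creation of metric objects": e.g. page content repeatedly
        // triggering recordings that create a new metric object each time.
        for _ in 0..MAX_HANDLES {
            let _handle = get_labeled_counter(&mut map);
        }
        // The next creation exceeds the limit and panics, mirroring the crash.
        let _ = get_labeled_counter(&mut map);
    }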

Which channels are affected?

Fenix Nightly & Beta. Beta since the merge day last week.
We're planning to uplift the changes, which will fix Beta before it goes into Release.

Why didn't we notice this crash before?

That's unclear at the moment.
I think with the move to having Glean in GeckoView we're only now getting these crashes through the Gecko crash reporter.
Before that, I assume we would have gotten it in Sentry, but I can't find anything relevant there.
I'm reaching out to Android folks to see if they have an idea.

How frequent is it really?

I don't know for sure.
The rapid increase in crashes after my work was merged made this urgent in my eyes,
but I'm not sure about overall crash numbers for Fenix.
By raw numbers of affected users it's low compared to total users,
but this again might be biased by not getting data from everyone.

What's the impact on other metrics?

Given that I don't know how many users are really affected, it's hard to say.
To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.
It all depends on the website content and whether that triggers certain metric recordings.

(In reply to Jan-Erik Rediger [:janerik] from comment #15)

To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.

I got a lot of crashes by watching Youtube in the browser (sometimes in the background with the Video Background Play Fix addon).

(In reply to Valentin Gosu [:valentin] (he/him) from comment #16)

(In reply to Jan-Erik Rediger [:janerik] from comment #15)

To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.

I got a lot of crashes by watching Youtube in the browser (sometimes in the background with the Video Background Play Fix addon).

Thanks! That's indeed a more likely use case I hadn't considered.

Update on "Why didn't we notice this crash before?"

Because we didn't crash the full browser before!
I took Fenix 93.0.0-beta.4 for a test run in the emulator. After 9 minutes or so the logcat dumped this on me:

           libglean_ffi  E  ffi_support::error: Caught a panic calling rust code: "Bug: next_index pointed at non-free list entry (or end of list)"
                         E  glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): Bug: next_index pointed at non-free list entry (or end of list)
                         E  ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""
                         E  glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"
                         E  ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""

This gets repeated every time Glean tries to record anything, or really do anything.
That means that on older versions, since at least May 2020, we've likely been missing some data whenever the browser runs into this bug.
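The repeated PoisonError lines follow from the first panic: in Rust, a panic while a Mutex is held poisons it, and every later lock() returns Err(PoisonError). A minimal sketch of that cascade (assumed shape, not the actual Glean locking code):

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        let state = Arc::new(Mutex::new(0u32));

        // Simulate the first panic (the handle-map bug) happening while the lock is held.
        let poisoner = Arc::clone(&state);
        let _ = thread::spawn(move || {
            let _guard = poisoner.lock().unwrap();
            panic!("Bug: next_index pointed at non-free list entry (or end of list)");
        })
        .join();

        // Every later attempt to use the shared state now fails with PoisonError,
        // which is what each subsequent Glean call logs above.
        match state.lock() {
            Ok(_) => println!("unreachable in this sketch"),
            Err(e) => eprintln!("Glean-style failure: {}", e),
        }
    }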

It's also clearer to me now why:
We're using a dispatcher, running these tasks in another thread.
Glean panics, crashing that thread, but this does not affect the main application beyond that.
With the move to ship Glean through GeckoView, Rust panics get translated into aborts, ultimately hitting MOZ_CRASH and taking down the full browser.
That's a change in behavior we hadn't considered.
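A small sketch of that behavior change, under the assumption that the difference comes down to how panics are handled: with the default panic = "unwind" and a catching dispatcher, a panic only kills the worker thread; built so that panics abort (or with an aborting panic hook, as the RustMozCrash frame in the stack above shows for Gecko), the same panic takes down the whole process as a MOZ_CRASH.

    use std::thread;

    fn main() {
        let worker = thread::spawn(|| {
            // Stand-in for a Glean task running on the dispatcher thread.
            panic!("Bug: next_index pointed at non-free list entry (or end of list)");
        });

        // panic = "unwind": join() reports the panic as an Err and the app keeps running.
        // panic = "abort" (or an aborting panic hook): we never get here; the process is gone.
        match worker.join() {
            Err(_) => eprintln!("worker thread panicked, but the app keeps running"),
            Ok(_) => {}
        }
    }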

Glean should not panic under normal operation, only for actual bugs. So I do think the new behavior is right; it's just unfortunate we didn't have enough time to catch it in Nightly.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Please nominate this for Beta approval when you get a chance.

Flags: needinfo?(jrediger)

It seems I can't use the approval-mozilla-release flag here at the moment, so I'm going the manual way again.

Beta/Release Uplift Approval Request

  • User impact if declined: top crasher for Fenix on Nightly & Beta
  • Is this code covered by automated tests?: Yes, tests are in the Glean repository
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: -
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): New Glean version, with automated tests in the Glean repository. Additionally manually tested in local builds.
  • String changes made/needed:
Flags: needinfo?(jrediger) → needinfo?(ryanvm)

Note: Bug 1732988 also requires the Glean version upgrade uplift. That's covered by this request.

Comment on attachment 9245218 [details]
Bug 1733757 - Update to Glean v42.0.1. r?Dexter

Beta/Release Uplift Approval Request

  • User impact if declined: top crasher for Fenix on Nightly & Beta
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): New Glean version, with automated tests in the Glean repository. Additionally manually tested in local builds.
  • String changes made/needed:
Attachment #9245218 - Flags: approval-mozilla-beta?

We have flag support now, so I figured we could do it properly this time.

Flags: needinfo?(ryanvm)

Comment on attachment 9245218 [details]
Bug 1733757 - Update to Glean v42.0.1. r?Dexter

Approved for Desktop 94.0b5 & Fenix 94.0.0-beta.3.

Attachment #9245218 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

No new crashes on Beta or Nightly since this landed \m/

Status: RESOLVED → VERIFIED
See Also: → 1732988