Crash in [@ ffi_support::handle_map::ConcurrentHandleMap<T>::insert]
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(firefox-esr91 unaffected, firefox93 unaffected, firefox94+ verified, firefox95+ verified)
People
(Reporter: valentin, Assigned: janerik)
References
Details
(Keywords: crash)
Crash Data
Attachments
(2 files, 2 obsolete files)
- 42 bytes, text/x-github-pull-request
- 48 bytes, text/x-phabricator-request (RyanVM: approval-mozilla-beta+)
Crash report: https://crash-stats.mozilla.org/report/index/5e834290-9eb1-4d57-bad7-f90750211002
MOZ_CRASH Reason: Bug: next_index pointed at non-free list entry (or end of list)
Top 10 frames of crashing thread:
0 libxul.so RustMozCrash mozglue/static/rust/wrappers.cpp:18
1 libxul.so mozglue_static::panic_hook mozglue/static/rust/lib.rs:91
2 libxul.so core::ops::function::Fn::call library/core/src/ops/function.rs:70
3 libxul.so std::panicking::rust_panic_with_hook library/std/src/panicking.rs:626
4 libxul.so std::panicking::begin_panic::{{closure}} library/std/src/panicking.rs:542
5 libxul.so std::sys_common::backtrace::__rust_end_short_backtrace library/std/src/sys_common/backtrace.rs:141
6 libxul.so std::panicking::begin_panic library/std/src/panicking.rs:541
7 libxul.so ffi_support::handle_map::ConcurrentHandleMap<T>::insert third_party/rust/ffi-support/src/handle_map.rs:762
8 libxul.so glean_labeled_counter_metric_get third_party/rust/glean-ffi/src/labeled.rs:77
9 libjnidispatch.so libjnidispatch.so@0x10154
Comment 1•3 years ago
I'm not able to reproduce in the emulator right now.
Talked to rpapa.
We're now at 1000+ crashes in crash-stats.
We landed the switch to GeckoView-provided Glean on 2021-09-30: https://github.com/mozilla-mobile/android-components/pull/11062
This got into Fenix shortly after.
Code-wise this panic comes from ffi-support: https://github.com/mozilla/ffi-support/blob/0fdc22a8dfe3731be5fd39b311e4e4885219e26c/src/handle_map.rs#L409
This shouldn't trigger under normal behavior.
rpapa started a document to also look into recent test crashes: https://docs.google.com/document/d/11dLgCTIQ30edwsOCavnjL4a0Du5ksrFukNi7LaUfuiA/edit#
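For context on the panic message: a handle map of this kind keeps its vacant slots chained together in a free list, and `insert` pops the head of that list. The following is a minimal Rust sketch of that idea — an illustration of the invariant the panic message checks, not the actual ffi-support implementation:

```rust
// Minimal free-list-backed handle map sketch (illustrative, not ffi-support's code).
enum Entry<T> {
    Occupied(T),
    // Index of the next free slot, or None at the end of the free list.
    Free(Option<usize>),
}

struct HandleMap<T> {
    entries: Vec<Entry<T>>,
    first_free: Option<usize>,
}

impl<T> HandleMap<T> {
    fn new() -> Self {
        HandleMap { entries: Vec::new(), first_free: None }
    }

    fn insert(&mut self, value: T) -> usize {
        match self.first_free {
            Some(index) => {
                // The invariant behind the crash report: the head of the free
                // list must point at a Free entry. If the map's internal state
                // is corrupted, this is where it panics.
                let next = match &self.entries[index] {
                    Entry::Free(next) => *next,
                    Entry::Occupied(_) => {
                        panic!("Bug: next_index pointed at non-free list entry (or end of list)")
                    }
                };
                self.first_free = next;
                self.entries[index] = Entry::Occupied(value);
                index
            }
            None => {
                // No free slot: grow the backing storage. The real map
                // enforces capacity limits rather than growing unboundedly.
                self.entries.push(Entry::Occupied(value));
                self.entries.len() - 1
            }
        }
    }

    fn remove(&mut self, index: usize) {
        // Freed slots are pushed onto the front of the free list for reuse.
        self.entries[index] = Entry::Free(self.first_free);
        self.first_free = Some(index);
    }
}

fn main() {
    let mut map = HandleMap::new();
    let a = map.insert("first");
    let _b = map.insert("second");
    map.remove(a);
    // The next insert reuses the freed slot.
    let c = map.insert("third");
    assert_eq!(a, c);
}
```

Under normal operation the free-list head always points at a `Free` entry, which is why this panic indicates either state corruption or the map being pushed past its design limits.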
Comment 2•3 years ago
From the latest logcat:
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String okhttp3.mockwebserver.RecordedRequest.getPath()' on a null object reference
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent(BaselinePingTest.kt:92)
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent$default(BaselinePingTest.kt:83)
at org.mozilla.samples.glean.pings.BaselinePingTest.validateBaselinePing(BaselinePingTest.kt:125)
Comment 3•3 years ago
(In reply to Aaron Train [:aaronmt] from comment #2)
From latest log-cat
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String okhttp3.mockwebserver.RecordedRequest.getPath()' on a null object reference
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent(BaselinePingTest.kt:92)
at org.mozilla.samples.glean.pings.BaselinePingTest.waitForPingContent$default(BaselinePingTest.kt:83)
at org.mozilla.samples.glean.pings.BaselinePingTest.validateBaselinePing(BaselinePingTest.kt:125)
That looks like a test failure though (the test not receiving a ping when it expects one).
If there's no gecko crash recorded in that test run I suppose it's a different issue.
Comment 4•3 years ago
I tried creating a test on Fenix and running that on a physical device, but so far that didn't trigger a crash.
Comment 7•3 years ago
QA was able to reliably hit this, see https://github.com/mozilla-mobile/fenix/issues/21767
Comment 8•3 years ago
This is huge! I reproduced it on my test phone now, after waiting just 9 minutes.
That might help me drill down into what's happening. Thanks!
Comment 9•3 years ago
Comment 10•3 years ago
badboy merged PR #1823: "Bug 1733757 - Store created label metrics to avoid rapidly reconstructing them all …" in 6c3be03.
--
Keeping it open to land the m-c update after a release.
Comment 11•3 years ago
This includes a bugfix preventing a crash on Fenix,
a bugfix preventing a panic when trying to flush ping-lifetime data after shutdown,
as well as an API change to make glean::persist_ping_lifetime_data async.
Comment 12•3 years ago
Pushed by jrediger@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cec837611f11 Update to Glean v42.0.1. r=Dexter
Comment 13•3 years ago
Comment 14•3 years ago
Comment 15•3 years ago
Using this bug for noting some of the context.
Some of it is captured in the extensive commit message in Glean, too.
Timeline:
- 2020-05-26: We landed a commit to remove finalizers from our Kotlin classes.
- Reason: Kotlin finalizers are unreliable, caused crashes and largely wouldn't be run under normal use anyway, except when the app quits.
- 2020-05-27: Glean v26.0.0 was released, containing above fix.
- 2021-09-30: We landed a PR in a-c to switch from libglean.so to a geckoview-provided library.
- 2021-10-02: This issue is filed, showing a rapid increase in crashes reported from Fenix Nightly.
- 2021-10-04: Due to the weekend I only get to take a look on Monday. I'm able to identify where the crash is happening, but unable to reproduce it or understand why it's happening.
- 2021-10-05: Trying to add a desperate test to Fenix exercising the crashing code path, and to run it on physical devices in our Firebase account. No luck triggering a crash.
- 2021-10-06: I pick up a loaner device, hoping to reproduce the crash (previously I was only using the Android Emulator).
- 2021-10-07: QA is able to reproduce the crash and provides simple steps to reproduce.
- Soon after I'm able to reproduce the issue on my loaner phone, at the end of my work day.
- 2021-10-08: Picking up this work again, instrumenting the relevant code paths with more logging. While doing so I notice the size limitations of the library we're using.
- This starts to give me an idea of what's going on.
- Continuous testing allows me to track it down to the root cause: rapid re-creation of metric objects.
- Rest of the day is spent on writing a fix and testing it out. By the end of the day my PR is up for review.
- 2021-10-11: I land this work, do a Glean release and push it into mozilla-central right after.
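Given the root cause (rapid re-creation of metric objects, each allocating a fresh FFI handle), the shape of the fix can be sketched as below. This is a hypothetical Rust illustration — `LabeledMetric`, the `u64` handles, and the counter are all made up for the example and are not the real Glean API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch: cache the submetric created for each label, so calling
// get() with the same label repeatedly hands back the same handle instead of
// allocating a new entry in the FFI handle map every time.
struct LabeledMetric {
    seen: Mutex<HashMap<String, u64>>,
    next_handle: Mutex<u64>,
}

impl LabeledMetric {
    fn new() -> Self {
        LabeledMetric {
            seen: Mutex::new(HashMap::new()),
            next_handle: Mutex::new(0),
        }
    }

    fn get(&self, label: &str) -> u64 {
        let mut seen = self.seen.lock().unwrap();
        // Only construct (and allocate a handle for) a label the first time we
        // see it; afterwards the cached handle is returned.
        *seen.entry(label.to_string()).or_insert_with(|| {
            let mut next = self.next_handle.lock().unwrap();
            *next += 1;
            *next
        })
    }
}

fn main() {
    let metric = LabeledMetric::new();
    let first = metric.get("clicks");
    // Repeated lookups for the same label no longer grow the handle map.
    assert_eq!(metric.get("clicks"), first);
    assert_ne!(metric.get("scrolls"), first);
}
```

With a cache like this, the number of live handles is bounded by the number of distinct labels rather than by how often web content triggers a recording, which is what kept the handle map within its limits.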
Which channels are affected?
Fenix Nightly & Beta. Beta since the merge day last week.
We're planning to uplift the changes, which will fix Beta before it goes into Release.
Why didn't we notice this crash before?
That's unclear at the moment.
I think with the move to having Glean in GeckoView we're only now getting these crashes through the Gecko crash reporter.
Before, I assume we would have gotten it in Sentry, but I can't find anything relevant.
I'm reaching out to Android folks to see if they have an idea.
How frequent is it really?
I don't know for sure.
The rapid increase in crashes after my work was merged made this urgent in my eyes,
but I'm not sure about overall crash numbers for Fenix.
By raw numbers of affected users it's low compared to total users,
but this again might be biased by not getting data from everyone.
What's the impact on other metrics?
Given I don't know how many are really affected it's hard to say.
To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.
It all depends on the website content and whether that triggers certain metric recordings.
Comment 16•3 years ago
(In reply to Jan-Erik Rediger [:janerik] from comment #15)
To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.
I got a lot of crashes by watching YouTube in the browser (sometimes in the background with the Video Background Play Fix addon).
Comment 17•3 years ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #16)
(In reply to Jan-Erik Rediger [:janerik] from comment #15)
To trigger it requires a significant use of the browser.
The test case requires uninterrupted use.
From my understanding of the code path that triggers it, it probably can happen if the browser is never fully quit, but used again and again.
I got a lot of crashes by watching Youtube in the browser (sometimes in the background with the Video Background Play Fix addon).
Thanks! That's indeed a more likely use case I hadn't considered.
Comment 18•3 years ago
Update on "Why didn't we notice this crash before?"
Because we didn't crash the full browser before!
I took Fenix 93.0.0-beta.4 for a test run in the emulator. After 9 minutes or so the logcat dumped this on me:
libglean_ffi E ffi_support::error: Caught a panic calling rust code: "Bug: next_index pointed at non-free list entry (or end of list)"
E glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): Bug: next_index pointed at non-free list entry (or end of list)
E ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""
E glean_ffi::handlemap_ext: Glean failed (ErrorCode(-1)): called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"
E ffi_support::error: Caught a panic calling rust code: "called `Result::unwrap()` on an `Err` value: \"PoisonError { inner: .. }\""
This gets repeated every time Glean tries to record anything, or really do anything.
That means on old versions, since at least May 2020, we're likely missing some data whenever the browser runs into this bug.
It's also clearer to me why:
We're using a dispatcher, running these tasks in another thread.
Glean panics, crashing that thread, but this does not affect the main application beyond that.
With the move to ship Glean through GeckoView, Rust panics get translated into aborts, ultimately hitting MOZ_CRASH and taking down the full browser.
That's a change in behavior we haven't considered.
Glean should not panic under normal operation, but only for actual bugs. So I do think it's the right behavior now, just unfortunate we didn't have enough time to catch it in Nightly.
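The behavior difference can be illustrated with `std::panic::catch_unwind`. This sketch only assumes the wrapping behavior described above (ffi-support catching panics at the FFI boundary); it is not the literal library code:

```rust
use std::panic;

// Stand-in for the real internal bug that used to panic inside Glean.
fn record_metric() {
    panic!("Bug: next_index pointed at non-free list entry (or end of list)");
}

fn main() {
    // With unwinding enabled (the default), a catch_unwind wrapper at the FFI
    // boundary contains the panic: only this call fails, it can be logged and
    // turned into an error code, and the application keeps running.
    let result = panic::catch_unwind(record_metric);
    assert!(result.is_err());

    // When the crate is instead built with `panic = "abort"` in the profile
    // (as Gecko does for its Rust code), no unwinding happens: catch_unwind
    // can intercept nothing and the whole process aborts, which is what
    // surfaces as MOZ_CRASH in the crash reports above.
}
```

So the same underlying panic that was previously an invisible logged error in the standalone libglean.so becomes a full browser crash in the GeckoView-provided build.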
Comment 19•3 years ago
bugherder
Comment 20•3 years ago
Please nominate this for Beta approval when you get a chance.
Comment 21•3 years ago
It seems I can't use the approval-mozilla-release flag here at the moment, so again going the manual way.
Beta/Release Uplift Approval Request
- User impact if declined: top crasher for Fenix on Nightly & Beta
- Is this code covered by automated tests?: Yes, tests are in the Glean repository
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce: -
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): New Glean version, with automated tests in the Glean repository. Additionally manually tested in local builds.
- String changes made/needed:
Comment 22•3 years ago
Note: Bug 1732988 also requires the Glean version upgrade uplift. That's covered by this request.
Comment 23•3 years ago
Comment on attachment 9245218 [details]
Bug 1733757 - Update to Glean v42.0.1. r?Dexter
Beta/Release Uplift Approval Request
- User impact if declined: top crasher for Fenix on Nightly & Beta
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): New Glean version, with automated tests in the Glean repository. Additionally manually tested in local builds.
- String changes made/needed:
Comment 24•3 years ago
We have flag support now, so I figured we can do it properly this time.
Comment 25•3 years ago
Comment on attachment 9245218 [details]
Bug 1733757 - Update to Glean v42.0.1. r?Dexter
Approved for Desktop 94.0b5 & Fenix 94.0.0-beta.3.
Comment 26•3 years ago
bugherder uplift
Comment 27•3 years ago
No new crashes on Beta or Nightly since this landed \m/
Description