Android startup crash in [@ libmegazord.so][@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ]
Categories
(Application Services :: General, defect)
Tracking
(firefox130 wontfix, firefox131+ wontfix, firefox132+ wontfix, firefox133+ wontfix)
People
(Reporter: aryx, Unassigned)
References
Details
(Keywords: crash)
Crash Data
This signature existed before but got more frequent with Firefox for Android 129.0 and later. It's a startup crash and Android 8.1 and 9.0 are the most affected versions.
900 crashes for Fenix 130.0 + 130.0.1
Jeff, could you investigate or redirect this request?
bp-1d362ef9-ef5b-44a2-af5b-0e1bb0240919
Frame Module Signature Source Trust
Ø 0 libmegazord.so libmegazord.so@0x2c9256 context
Ø 1 libmegazord.so libmegazord.so@0x725fae scan
Ø 2 libmegazord.so libmegazord.so@0x2baccd scan
Ø 3 libmegazord.so libmegazord.so@0x2d18ff scan
Ø 4 libjnidispatch.so libjnidispatch.so@0x1302a scan
Ø 5 libart.so libart.so@0x2b4746 scan
Ø 6 libjnidispatch.so libjnidispatch.so@0x11c12 scan
Ø 7 libjnidispatch.so libjnidispatch.so@0x1a512 scan
Ø 8 libmegazord.so libmegazord.so@0x2d1863 scan
Ø 9 libjnidispatch.so libjnidispatch.so@0x1c11a scan
Ø 10 libjnidispatch.so libjnidispatch.so@0x1bf3a scan
Ø 11 libjnidispatch.so libjnidispatch.so@0x12242 scan
Ø 12 libjnidispatch.so libjnidispatch.so@0x63d2 scan
Ø 13 libart.so libart.so@0x2b3cae scan
Ø 14 libjnidispatch.so libjnidispatch.so@0x1a4ee scan
Ø 15 dalvik-main space (deleted) dalvik-main space (deleted)@0x57848e scan
Comment 1•6 months ago
We've got symbols for 131 now. Example:
https://crash-stats.mozilla.org/report/index/58e35f58-9fc3-4450-a096-b00460240919
Comment 2•6 months ago
The bug is marked as tracked for firefox131 (beta) and tracked for firefox132 (nightly). We have limited time to fix this, the soft freeze is in 6 days. However, the bug still isn't assigned.
:boek, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 3•6 months ago
Something seems very strange about that crash dump. Am I reading it right that the stack is 500 items deep? It lists ffi_logins_rust_future_free_void multiple times, but that function should never be called since we don't use UniFFI async yet, and I'm pretty sure there's no way to call it recursively.
OTOH, the function second from the top is logins.uniffi_logins_fn_func_check_canary, which is a function called at startup. The very top of the stack is https://github.com/mozilla/uniffi-rs/blob/0ecafdc06799205caf1432b93787a9c1f810a168/uniffi_core/src/ffi/rustcalls.rs#L169, which is assigning to an out pointer. If something is wrong in the code, that could definitely cause a segfault.
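For readers unfamiliar with the pattern, here is a minimal sketch of an out-status call wrapper, to illustrate why an assignment through an out pointer is a plausible segfault site. This is not the uniffi_core code: CallStatus, the two status constants, and call_with_out_status are simplified, hypothetical names chosen for this sketch.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical, simplified status struct shared with the foreign caller.
/// The real UniFFI type carries more than a code.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct CallStatus {
    pub code: i8,
}

// Status values chosen for this sketch.
pub const CALL_SUCCESS: i8 = 0;
pub const CALL_UNEXPECTED_ERROR: i8 = 2;

/// Runs `callback`, catches any panic, and reports the outcome through
/// `out_status` instead of unwinding into the caller.
///
/// # Safety
/// `out_status` must point to valid, writable memory for a `CallStatus`.
pub unsafe fn call_with_out_status<T: Default>(
    out_status: *mut CallStatus,
    callback: impl FnOnce() -> T,
) -> T {
    match panic::catch_unwind(AssertUnwindSafe(callback)) {
        Ok(value) => {
            // Write through the caller-supplied pointer: this is the kind of
            // assignment that faults if the pointer is dangling or null.
            unsafe { (*out_status).code = CALL_SUCCESS };
            value
        }
        Err(_) => {
            unsafe { (*out_status).code = CALL_UNEXPECTED_ERROR };
            T::default()
        }
    }
}
```

If the foreign side hands in a null or dangling out_status, either write is undefined behaviour, and in practice it would usually surface as a SIGSEGV on exactly this kind of line.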
Comment 4•6 months ago
I'm never sure how much priority to put on these crashes. Part of me wants to say it's a UniFFI bug that we should be spending a lot of time investigating; the other part of me wants to say that the numbers are relatively low and it's likely caused by a hardware bug.
(Also, sorry for setting the assignee, I didn't mean to do that).
Comment 5•6 months ago
I'm inclined to wait and see how things evolve once 131 rides to Release with proper symbols available. We don't actually know right now if that uniffi signature comprises the majority of these crashes or not.
Comment 6•6 months ago
The Bugbug bot thinks this bug should belong to the 'Fenix::Crash Reporting' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Comment 7•6 months ago
Because it's a startup crash, I think we should probably prioritize a time-boxed investigation with the goal of at least ruling out a few possible causes. :bdk, any thoughts on who would be best to help with this?
Comment 8•6 months ago
I could take a look, but I'm not even sure what the first step would be to investigate. Maybe we could get someone with crash reporter experience to help me understand what's happening with the crash report stack. Are all those nested calls to ffi_logins_rust_future_free_void real?
Comment 9•5 months ago
Today's A-S nightly includes the fix for bug 1921532, so once that makes it into shipping Fenix nightly builds, we may see some changes in the stack traces that will help here.
Comment 10•5 months ago
This is a reminder regarding comment #2!
The bug is marked as tracked for firefox131 (release), tracked for firefox132 (beta) and tracked for firefox133 (nightly). We have limited time to fix this, the soft freeze is in 14 days. However, the bug still isn't assigned.
Comment 11•5 months ago
This is a reminder regarding comment #2!
The bug is marked as tracked for firefox131 (release), tracked for firefox132 (beta) and tracked for firefox133 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.
Comment 12•5 months ago
The severity field is not set for this bug.
:amejia, could you have a look please?
For more information, please visit BugBot documentation.
Comment 13•5 months ago
We recently made some changes to get inline symbols in our crash reports. I think these reports may be the same bug: https://crash-stats.mozilla.org/report/index/a3cc5628-f709-4243-972f-699500241014, https://crash-stats.mozilla.org/report/index/ec77355d-ba28-4112-99b1-2e1b90241016. Both look like they're happening inside the check_canary function.
The crashes seem to be happening inside the deallocation code, which to me means it could be a UniFFI double-free bug or maybe it's a system issue. I keep looking at the size of those stacks and being very surprised.
...and I just noticed one more thing: all of these stacks seem to have _Unwind_GetTextRelBase in them. I believe this means that we're trying to unwind the stack because of an exception or something.
To speculate even more, maybe the Rust code is calling into Kotlin and Kotlin threw an exception that UniFFI is somehow not catching, and we're now trying to unwind across the FFI call. I could easily see that causing issues. Here's our code for that. I wonder if Exception is not broad enough; maybe we should be catching Throwable or something. Also, maybe the e.toString() call is throwing.
Comment 14•5 months ago
https://crash-stats.mozilla.org/report/index/58ebb896-4279-405a-a27c-b94250241024 seems to have more of a useful-looking stack. Does that shed any light?
Comment 15•5 months ago
That one seems like it's definitely a memory allocation error. It looks like a different stack than the others though; I don't see any calls to _Unwind_GetTextRelBase. I see std::panic::catch_unwind in that stack, but that's a different call: it's what we call to prepare for a possible unwinding.
Can we split that stack into a separate issue? I believe the presence of alloc::alloc::handle_alloc_error indicates it's an allocation error. I also don't think there's much we can do in this situation; maybe we could just ignore those crashes.
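To illustrate the distinction drawn above, here is a small sketch with hypothetical helper names (run_guarded and try_make_buffer are not from the codebase). catch_unwind only intercepts panics. With std's default configuration, handle_alloc_error prints a message and aborts the process rather than unwinding, so a catch_unwind frame on the stack would not recover from it. Code that wants to survive an out-of-memory condition has to use a fallible allocation API such as Vec::try_reserve_exact.

```rust
use std::collections::TryReserveError;
use std::panic;

/// Runs `f` and turns a panic into `None`. This covers unwinding panics only;
/// it does not intercept a process abort.
pub fn run_guarded<T>(f: impl FnOnce() -> T + panic::UnwindSafe) -> Option<T> {
    panic::catch_unwind(f).ok()
}

/// Allocates a zeroed buffer without going through the infallible allocation
/// path: a failed reservation is returned as an error to the caller.
pub fn try_make_buffer(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf = Vec::new();
    buf.try_reserve_exact(len)?;
    // Capacity is already reserved, so this does not reallocate.
    buf.resize(len, 0);
    Ok(buf)
}
```

Whether adopting fallible allocation on the affected path is worthwhile is a separate question; the sketch only shows why the catch_unwind frame in that stack is unrelated to the allocation failure itself.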