Open Bug 1919825 Opened 6 months ago Updated 3 months ago

Android startup crash in [@ libmegazord.so][@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ]

Categories

(Application Services :: General, defect)

Unspecified
Android
defect

Tracking

(firefox130 wontfix, firefox131+ wontfix, firefox132+ wontfix, firefox133+ wontfix)

Tracking Status
firefox130 --- wontfix
firefox131 + wontfix
firefox132 + wontfix
firefox133 + wontfix

People

(Reporter: aryx, Unassigned)

Details

(Keywords: crash)

Crash Data

This signature existed before but got more frequent with Firefox for Android 129.0 and later. It's a startup crash and Android 8.1 and 9.0 are the most affected versions.
900 crashes for Fenix 130.0 + 130.0.1

Jeff, could you investigate or redirect this request?

bp-1d362ef9-ef5b-44a2-af5b-0e1bb0240919

Frame	Module	Signature	Source	Trust
Ø 0 	libmegazord.so	libmegazord.so@0x2c9256		context
Ø 1 	libmegazord.so	libmegazord.so@0x725fae		scan
Ø 2 	libmegazord.so	libmegazord.so@0x2baccd		scan
Ø 3 	libmegazord.so	libmegazord.so@0x2d18ff		scan
Ø 4 	libjnidispatch.so	libjnidispatch.so@0x1302a		scan
Ø 5 	libart.so	libart.so@0x2b4746		scan
Ø 6 	libjnidispatch.so	libjnidispatch.so@0x11c12		scan
Ø 7 	libjnidispatch.so	libjnidispatch.so@0x1a512		scan
Ø 8 	libmegazord.so	libmegazord.so@0x2d1863		scan
Ø 9 	libjnidispatch.so	libjnidispatch.so@0x1c11a		scan
Ø 10 	libjnidispatch.so	libjnidispatch.so@0x1bf3a		scan
Ø 11 	libjnidispatch.so	libjnidispatch.so@0x12242		scan
Ø 12 	libjnidispatch.so	libjnidispatch.so@0x63d2		scan
Ø 13 	libart.so	libart.so@0x2b3cae		scan
Ø 14 	libjnidispatch.so	libjnidispatch.so@0x1a4ee		scan
Ø 15 	dalvik-main space (deleted)	dalvik-main space (deleted)@0x57848e		scan
Flags: needinfo?(jboek)
See Also: → 1917677, 1889982
Crash Signature: [@ libmegazord.so] → [@ libmegazord.so ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ]
Flags: needinfo?(bdeankawamura)

The bug is marked as tracked for firefox131 (beta) and tracked for firefox132 (nightly). We have limited time to fix this, the soft freeze is in 6 days. However, the bug still isn't assigned.

:boek, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(jboek)

Something seems very strange about that crash dump. Am I reading it right that the stack is 500 items deep? It lists ffi_logins_rust_future_free_void multiple times but that function should never be called since we don't use UniFFI async yet and I'm pretty sure there's no way to recursively call it.

OTOH, the function 2nd to the top is logins.uniffi_logins_fn_func_check_canary, which is a function called at startup. The very top of the stack is https://github.com/mozilla/uniffi-rs/blob/0ecafdc06799205caf1432b93787a9c1f810a168/uniffi_core/src/ffi/rustcalls.rs#L169, which is assigning to an out pointer. If something is wrong in the code, that could definitely cause a segfault.
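To make the out-pointer concern concrete, here is a minimal sketch of the general pattern. `CallStatus` and `write_status` are hypothetical names invented for illustration; this is not the code at the rustcalls.rs link above, only an assumption about the general shape of "write a status through a caller-supplied pointer".

```rust
/// Hypothetical status struct (an assumption for illustration, not a UniFFI type).
#[repr(C)]
#[derive(Debug, Default, PartialEq)]
pub struct CallStatus {
    pub code: i8,
}

/// Writes `code` through a caller-supplied out pointer.
/// Returns false without writing if the pointer is null.
///
/// # Safety
/// `out` must be null or point to a valid, writable, aligned `CallStatus`.
pub unsafe fn write_status(out: *mut CallStatus, code: i8) -> bool {
    if out.is_null() {
        return false;
    }
    // A null check is all we can do here. A dangling or otherwise invalid
    // non-null pointer cannot be detected, and writing through it is
    // undefined behavior that commonly surfaces as a segfault.
    unsafe { (*out).code = code };
    true
}
```

The point of the sketch is that the callee has to trust whatever pointer the foreign side passed in, so a fault on the assignment line can originate on either side of the FFI boundary.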

Assignee: nobody → jboek
Flags: needinfo?(bdeankawamura)

I'm never sure how much priority to put on these crashes. Part of me wants to say it's a UniFFI bug that we should be spending a lot of time investigating; the other part wants to say that the numbers are relatively low and it's likely caused by a hardware bug.

(Also, sorry for setting the assignee, I didn't mean to do that).

Assignee: jboek → nobody

I'm inclined to wait and see how things evolve once 131 rides to Release with proper symbols available. We don't actually know right now if that uniffi signature comprises the majority of these crashes or not.

The Bugbug bot thinks this bug should belong to the 'Fenix::Crash Reporting' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: General → Crash Reporting

Because it's a startup crash, I think we should probably prioritize a time-boxed investigation, with the goal of at least ruling out a few possible causes. :bdk, any thoughts on who would be best to help with this?

Component: Crash Reporting → General
Flags: needinfo?(jboek) → needinfo?(bdeankawamura)

I could take a look, but I'm not even sure what the first step would be to investigate. Maybe we could get someone with crash reporter experience to help me understand what's happening with the crash report stack. Are all those nested calls to ffi_logins_rust_future_free_void real?

Flags: needinfo?(bdeankawamura)

Today's A-S nightly includes the fix for bug 1921532, so once that makes it into shipping Fenix nightly builds, we may see some changes in the stack traces that will help here.

This is a reminder regarding comment #2!

The bug is marked as tracked for firefox131 (release), tracked for firefox132 (beta) and tracked for firefox133 (nightly). We have limited time to fix this, the soft freeze is in 14 days. However, the bug still isn't assigned.

Summary: Android startup crash in [@ libmegazord.so] → Android startup crash in [@ libmegazord.so][@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ]

This is a reminder regarding comment #2!

The bug is marked as tracked for firefox131 (release), tracked for firefox132 (beta) and tracked for firefox133 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.

The severity field is not set for this bug.
:amejia, could you have a look please?

Flags: needinfo?(amejiamarmol)
Product: Fenix → Application Services

We recently made some changes to get inline symbols in our crash reports. I think these reports may be the same bug: https://crash-stats.mozilla.org/report/index/a3cc5628-f709-4243-972f-699500241014, https://crash-stats.mozilla.org/report/index/ec77355d-ba28-4112-99b1-2e1b90241016. Both look like they're happening inside the check_canary function.

The crashes seem to be happening inside the deallocation code, which to me means it could be a UniFFI double-free bug or maybe it's a system issue. I keep looking at the size of those stacks and being very surprised.
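For anyone following along, this is roughly how a double free can arise when a buffer is passed across an FFI boundary as raw parts. `RawBuffer`, `from_vec`, and `into_vec` are hypothetical names for a simplified sketch; it is not UniFFI's `RustBuffer` implementation, just the general technique as I understand it.

```rust
use std::mem::ManuallyDrop;

/// Hypothetical raw buffer handed across an FFI boundary (illustration only).
#[repr(C)]
pub struct RawBuffer {
    pub capacity: usize,
    pub len: usize,
    pub data: *mut u8,
}

impl RawBuffer {
    /// Splits `v` into raw parts without freeing it.
    /// Whoever holds the returned struct now owns the allocation.
    pub fn from_vec(v: Vec<u8>) -> Self {
        let mut v = ManuallyDrop::new(v);
        RawBuffer {
            capacity: v.capacity(),
            len: v.len(),
            data: v.as_mut_ptr(),
        }
    }

    /// Rebuilds the Vec so that Rust frees the allocation when it is dropped.
    ///
    /// # Safety
    /// The parts must come from `from_vec` and must not have been reclaimed
    /// already. The foreign side only holds a bitwise copy of this struct, so
    /// nothing stops it from passing the same parts back twice; the second
    /// reclaim would free the same allocation again (a double free).
    pub unsafe fn into_vec(self) -> Vec<u8> {
        unsafe { Vec::from_raw_parts(self.data, self.len, self.capacity) }
    }
}
```

Within Rust, taking `self` by value prevents reclaiming twice, but that guarantee does not extend to copies held on the other side of the boundary, which is why a crash in deallocation code is at least consistent with a double free.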

...and I just noticed one more thing: all of these stacks seem to have _Unwind_GetTextRelBase in them. I believe this means that we're trying to unwind the stack because of an exception or something.

To speculate even more, maybe the Rust code is calling into Kotlin and Kotlin threw an exception that UniFFI is somehow not catching and we're now trying to unwind across the FFI call. I could easily see that causing issues. Here's our code for that. I wonder if Exception is not broad enough, maybe we should be catching Throwable or something. Also, maybe the e.toString() call is throwing.

Crash Signature: [@ libmegazord.so ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ] → [@ libmegazord.so ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ] [@ libc.so | <.plt ELF section in libmegazord.so> | uniffi_core::ffi::rustbuffer::RustBuffer::destroy_into_vec ]
Crash Signature: [@ libmegazord.so ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ] [@ libc.so | <.plt ELF section in libmegazord.so> | uniffi_core::ffi::rustbuffer::RustBuffer::destroy_into_vec ] → [@ libmegazord.so ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ] [@ libc.so | <.plt ELF section in libmegazord.so> | std::sys::pal::unix::alloc::<T>::dealloc ] [@ libc.so | <.plt ELF section in libmegazord.so> | uniffi_core::ffi::rustbuf…
Crash Signature: [@ libmegazord.so ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ] [@ libc.so | <.plt ELF section in libmegazord.so> | std::sys::pal::unix::alloc::<T>::dealloc ] [@ libc.so | <.plt ELF section in libmegazord.so> | uniffi_core::ffi::rustbuf… → [@ libmegazord.so ] [@ alloc::raw_vec::handle_error ] [@ uniffi_core::ffi::rustcalls::rust_call_with_out_status ] [@ libc.so | <.plt ELF section in libmegazord.so> | std::sys::pal::unix::alloc::<T>::dealloc ] [@ libc.so | <.plt ELF section in libmegaz…

https://crash-stats.mozilla.org/report/index/58ebb896-4279-405a-a27c-b94250241024 seems to have a more useful-looking stack. Does that shed any light?

Flags: needinfo?(bdeankawamura)

That one seems like it's definitely a memory allocation error. It looks like a different stack than the others, though: I don't see any calls to _Unwind_GetTextRelBase. I see std::panic::catch_unwind in that stack, but that's a different call -- it's what we call to prepare for a possible unwinding.
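To show what "prepare for a possible unwinding" means, here is a minimal sketch of wrapping a call in `std::panic::catch_unwind` and turning a panic into a status code. `call_guarded` and the status constants are hypothetical names and values chosen for the example, not UniFFI's.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical status codes (assumed values for illustration).
pub const STATUS_OK: i8 = 0;
pub const STATUS_PANIC: i8 = 2;

/// Runs `f` inside `catch_unwind`. Seeing `catch_unwind` on a stack only
/// means a guard like this is in place, not that a panic has happened.
pub fn call_guarded<T>(f: impl FnOnce() -> T) -> (i8, Option<T>) {
    match catch_unwind(AssertUnwindSafe(f)) {
        Ok(value) => (STATUS_OK, Some(value)),
        // The panic payload is dropped here; a real wrapper would likely
        // try to extract a message from it.
        Err(_) => (STATUS_PANIC, None),
    }
}
```

Note that `catch_unwind` only intercepts unwinding panics. It does nothing for a process abort, so it would not be expected to help with an allocation failure that aborts.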

Can we split that stack into a separate issue? I believe the presence of alloc::alloc::handle_alloc_error indicates it's an allocation error. I also don't think there's much we can do in this situation; maybe we could just ignore those crashes.
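For context on why there is little to do about these: the ordinary infallible allocation paths end up in `handle_alloc_error` when the allocator fails, which aborts by default. The sketch below shows the fallible alternative from the standard library, `Vec::try_reserve_exact`. `try_alloc_bytes` is a hypothetical helper, and I'm not suggesting the code in that stack could easily be switched to it.

```rust
use std::collections::TryReserveError;

/// Tries to allocate a byte buffer with room for `n` bytes, reporting
/// failure as an error instead of going through the aborting path.
pub fn try_alloc_bytes(n: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut v = Vec::new();
    v.try_reserve_exact(n)?;
    Ok(v)
}
```

Calling `try_alloc_bytes(usize::MAX)` returns an error because the requested capacity exceeds what a `Vec` can represent, while a small request such as `try_alloc_bytes(16)` succeeds under normal conditions.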

Flags: needinfo?(bdeankawamura)