Crash in [@ free | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree]
Categories: Toolkit :: Places, defect, P2
People: Reporter: gsvelto, Unassigned
This bug is for crash report bp-6ad332d7-e35a-4b6f-832d-a5a3d0200427.
Top 10 frames of crashing thread:
0 libmozglue.dylib free memory/build/malloc_decls.h:54
1 XUL core::ptr::real_drop_in_place toolkit/library/rust/shared/lib.rs:263
2 XUL <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree toolkit/components/places/bookmark_sync/src/store.rs
3 XUL bookmark_sync::merger::MergeTask::merge third_party/rust/dogear/src/store.rs:64
4 XUL <bookmark_sync::merger::MergeTask as moz_task::Task>::run toolkit/components/places/bookmark_sync/src/merger.rs:217
5 XUL moz_task::TaskRunnable::allocate::Run xpcom/rust/moz_task/src/lib.rs:132
6 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
7 XUL mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:332
8 XUL MessageLoop::Run ipc/chromium/src/base/message_loop.cc:290
9 XUL nsThread::ThreadFunc xpcom/threads/nsThread.cpp:444
This appears to be a macOS-specific use-after-free. From what I can tell, most if not all of the crashes happen in a runnable, so this is most likely some sort of race where we're accessing a dead object. Looking at the raw crashes, there are often multiple registers holding the poison pattern, which suggests we're loading multiple fields from said object. One interesting tidbit is that there are often two registers holding two separate values looking like this:
reg1: 0xe5e5e5e5e5e00000
reg2: 0x000000000005e5e5
It's as if we're loading bit-fields or a string from the object.
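Incidentally, the two values line up exactly with a single poisoned 64-bit word split into a 44-bit high part and a 20-bit low part. A quick sanity check in Rust (the masks are my guess at the split, not taken from any actual code):

fn main() {
    const POISON: u64 = 0xe5e5_e5e5_e5e5_e5e5; // mozjemalloc's free poison byte, repeated
    const LOW_20_BITS: u64 = 0xF_FFFF;

    // Splitting one poisoned word into a 44-bit high part and a 20-bit
    // low part reproduces the two register values quoted above.
    let reg1 = POISON & !LOW_20_BITS;
    let reg2 = POISON & LOW_20_BITS;
    assert_eq!(reg1, 0xe5e5_e5e5_e5e0_0000);
    assert_eq!(reg2, 0x0000_0000_0005_e5e5);
    println!("reg1 = {reg1:#x}, reg2 = {reg2:#x}");
}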
There's nothing really useful in the comments; quite a few users mentioned this happened either when Firefox was in the background or when they weren't interacting with it because they were watching a video. Could this be triggered by some background activity?
Comment 1•4 years ago
toolkit/components/places/bookmark_sync/src/ is code related to Firefox Sync, so it is highly likely that it would be running in the background.
Comment 2•3 years ago
Hmm, we've seen very similar crashes in bug 1583047 and bug 1605052 for bookmark sync specifically. Bug 1593734 seemed vaguely related, in that it also mentioned moz_task, though it was for rkv and not Dogear. That would make sense, since moz_task is all about scheduling runnables to run on a background thread. I asked Nathan for some help over in bug 1583047, comment 1, and his explanation made perfect sense, though it didn't seem like that's what was happening.
Here's how that code works now. The TaskRunnable wraps a task, which is just a trait object that implements the Task trait. We enforce the Send + Sync bound on it, so that it's safe to use from another thread. The wrapper runnable gets dispatched to a target thread, calls the boxed task's run method, and ping-pongs the same runnable back to the main thread, where it then calls the task's done method.
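Here's a minimal sketch of that pattern, using plain std threads and a channel in place of XPCOM dispatch (the trait shape is simplified; the real Task trait in xpcom/rust/moz_task differs in its details):

use std::sync::mpsc;
use std::thread;

// Simplified stand-in for the Task trait described above. The Send +
// Sync bound is what makes it safe to hand the boxed task to another
// thread.
trait Task: Send + Sync {
    fn run(&self);  // called on the target (background) thread
    fn done(&self); // called back on the originating thread
}

struct MergeTask;

impl Task for MergeTask {
    fn run(&self) { println!("merging on the background thread"); }
    fn done(&self) { println!("finishing up on the original thread"); }
}

fn main() {
    let task: Box<dyn Task> = Box::new(MergeTask);
    let (tx, rx) = mpsc::channel();
    // "Dispatch" the boxed task to a target thread, call run() there,
    // then ping-pong it back so done() runs on the originating thread.
    thread::spawn(move || {
        task.run();
        tx.send(task).unwrap();
    });
    rx.recv().unwrap().done();
}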
There's also the ThreadPtrHolder type, which is basically an analogue of nsMainThreadPtrHolder from C++. It wraps a Rust XPCOM object in a thread-safe reference-counted wrapper that calls NS_ProxyReleaseISupports to decrement the refcount on the original thread where the holder was created. We use it in Rust code to hold references to JS-implemented XPCOM interfaces in our tasks.
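A rough analogue of that idea in plain Rust, with the XPCOM refcounting machinery replaced by a channel back to the creating thread (all names here are made up for illustration):

use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical analogue of ThreadPtrHolder: remember a way back to the
// creating ("home") thread, and on Drop ship the inner value home so
// its destructor runs there, not on whatever thread dropped the holder.
struct HomeThreadPtr<T: Send + 'static> {
    inner: Option<T>,
    home: Sender<T>,
}

impl<T: Send + 'static> Drop for HomeThreadPtr<T> {
    fn drop(&mut self) {
        if let Some(value) = self.inner.take() {
            // If the home thread is already gone, the value is dropped
            // right here instead; that's the kind of hazard the real
            // code has to guard against.
            let _ = self.home.send(value);
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    let holder = HomeThreadPtr { inner: Some(String::from("xpcom object")), home: tx };
    // Drop the holder on another thread...
    thread::spawn(move || drop(holder)).join().unwrap();
    // ...and the value comes back here to be destroyed on this thread.
    drop(rx.recv().unwrap());
}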
Gabriele, would you mind auditing those two types to make sure they're sound? I'm kind of out of my element when it comes to the intersection of XPCOM (and Rust XPCOM's magic, too), threading, and memory safety; it would be so valuable to have your expertise here!
Comment 3•3 years ago
(In reply to :Lina Cambridge from comment #2)
> Here's how that code works now.
Oh, and here's how bookmark_sync uses it. It makes a MergeTask (which has thread-safe members) on the main thread, and schedules it to run on the Places connection's async thread. On the async thread, we increment the refcount of the mozIStorageConnection (which I think should be safe, since it uses atomic refcounting and its methods expect to be used from multiple threads), and create our Store on the stack; this is what implements fetch_local_tree and the other methods.
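For clarity, here's roughly what that flow looks like, with stand-in types (the real code lives in toolkit/components/places/bookmark_sync; the names and signatures here are illustrative only):

use std::sync::Arc;
use std::thread;

struct StorageConnection; // stand-in for mozIStorageConnection

// The Store borrows the connection and lives on the async thread's
// stack, so the connection must stay alive for the whole merge.
struct Store<'conn> {
    db: &'conn StorageConnection,
}

impl<'conn> Store<'conn> {
    fn fetch_local_tree(&self) { /* reads via self.db */ }
}

// Created on the main thread; Arc plays the role of the atomic
// refcount we take on the connection.
struct MergeTask {
    db: Arc<StorageConnection>,
}

impl MergeTask {
    // Runs on the Places connection's async thread.
    fn run(&self) {
        let store = Store { db: &self.db };
        store.fetch_local_tree();
    }
}

fn main() {
    let task = MergeTask { db: Arc::new(StorageConnection) };
    thread::spawn(move || task.run()).join().unwrap();
}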
Could the async thread getting torn down out from underneath us cause this? (Though, why would the thread get torn down? I thought we only did that when the connection is closed, which, for Places, should be when the browser closes, not just when it's hanging out in the background.)
Comment 4•3 years ago
Worrying crash signatures, but the stacks here seem to be in back-end stuff (triggered by UI or sync). Maybe sec-moderate if this isn't triggerable from web content.
Reporter
Comment 5•3 years ago
I dug into the code and analyzed a few more minidumps in detail, but I still can't find what's going wrong. The two bits of code in comment 2 seem OK and, more importantly, they don't contain any macOS-specific code, yet this crash is happening only on macOS. I suspect the problem is at a higher level; otherwise we'd see it on other platforms too.
Now, https://crash-stats.mozilla.org/report/index/1acf494f-904c-4de9-bb2f-d9cb10200504 has a better stack trace where it's obvious that the object we're accessing is held by a local variable of <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree. The question is: which one? I also suspect it might be related to the storage connection, because I don't see anything else in that method that could be deleted from under us. And again, I don't see macOS-specific code anywhere related to the bookmarks, so we might be hitting some macOS-specific corner case with the storage connection. One thing worth noting is that, apart from the crashing thread, all the other threads seem to be idling, including the main one.
Marco, does anything look suspicious here (see comment 2 and comment 3 for more details)? Is there anything here that could cause the storage connection to disappear from under this code while it runs?
Comment 6•3 years ago
Connection objects can of course be destroyed if nothing keeps them alive, but if this is about Places, the only point where it would be released is on shutdown, or if the database is corrupt on startup (but then I suspect we'd not even reach this point).
I know there has been some work to shut down sooner, but I don't think it affected connections... Could this crash happen on shutdown?
Reporter
Comment 7•3 years ago
Unfortunately not; this seems to be happening during idle periods, not during shutdown.
Comment 8•3 years ago
There's also bp-d1994625-403a-4aa0-8b1b-942ad0200504, which crashes inside Store::apply, and bp-282f27d9-8c5e-42e6-9a00-c18160200506, which gets a bit further and crashes inside Store::remote_row_to_item with an OOM; both are on Windows. Could it be the same underlying issue? Maybe we're panicking because we hit an OOM, and doing a double-drop while cleaning up? Though, I think we compile with panic = "abort", so probably not... Or maybe our OOM hook is doing something fishy?
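To make the double-drop theory concrete: destructors only run during a panic if the panic unwinds. A small demonstration of the mechanism (this is generic Rust behavior; whether Firefox actually builds with panic = "abort" is the assumption being questioned above):

// With panic = "unwind" (the Rust default), destructors still run while
// the stack unwinds; that's the window in which buggy unsafe code could
// end up dropping a value twice. With panic = "abort" the process dies
// on the spot and no destructors run at all.
struct Loud(&'static str);

impl Drop for Loud {
    fn drop(&mut self) {
        eprintln!("dropping {}", self.0);
    }
}

fn main() {
    let result = std::panic::catch_unwind(|| {
        let _tree = Loud("local tree");
        panic!("simulated OOM panic"); // _tree is dropped during unwind
    });
    assert!(result.is_err());
}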
Reporter
Comment 9•3 years ago
On macOS we often don't detect OOM crashes correctly - typically when we're killed by the OOM killer - so it's hard to know if these crashes are happening near OOM scenarios or not. That being said, I think it's more likely that this is just a race where some other thread pulled the rug out from under us.
Reporter
Comment 10•3 years ago
Added another signature.
Reporter
Comment 11•3 years ago
Here's yet another signature.
Reporter
Comment 12•3 years ago
This could be another signature for this bug, or another symptom of the same problem, but I'm not sure. The interesting bit in this one is that it manifests itself in three different ways with the same stack:
- A general protection fault caused by accessing the poison pattern at 0xe5e5e5e5e5e5e5e5
- A NULL pointer dereference with offset 8
- An illegal instruction
Lina, does this give you more clues about this issue? Assuming that our hunch is correct and this is a race, what could be accessing an object that's being processed by dogear::merge::accumulate? I'm assuming that there's some sort of callback involved and that it's read from an object, which would lead to the illegal instruction crash (because the address is bogus).
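For context on why a dead callback would show up as an illegal instruction rather than a plain bad read: a Rust trait object reference is a two-word fat pointer, and if the vtable word is loaded from poisoned memory, the indirect call jumps through garbage. A tiny sanity check of the layout (illustrative only; not code from the crash):

use std::mem::size_of;

#[allow(dead_code)]
trait Callback {
    fn call(&self);
}

fn main() {
    // A &dyn Callback is a fat pointer: one word for the data pointer
    // and one for the vtable pointer. If the object holding it is freed
    // and poisoned, both words read back as 0xe5e5..., and calling
    // through the "vtable" jumps to a bogus address: an illegal
    // instruction or a general protection fault, depending on what's
    // there.
    assert_eq!(size_of::<&dyn Callback>(), 2 * size_of::<usize>());
}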
Last but not least, we've got this other suspicious signature, which I also suspect belongs here:
[@ <dogear::tree::Tree as core::convert::TryFrom<dogear::tree::Builder>>::try_from]
This one is an actual panic; however, it fires for a bunch of different reasons: index out of bounds, assertion failures, and even unreachable code being reached. This again sounds like an access to an object that's corrupt or maybe is being modified concurrently.
Comment 13•3 years ago
Interesting! Unfortunately, it doesn't give me any more clues. The only thing I can think of for accumulate that's allocated in Gecko is an AbortSignal, which is allocated for each operation here and passed here. Everything else (merged nodes, completion ops) is allocated inside of Dogear, in local variables.
The other thing I can think of for the other Store operations is logging and recording telemetry by the merger as it works, which dispatches runnables back to the main thread. That's the only thing besides the Places connection itself that comes from the outside.
Side note: I just saw bug 1644021 fly by, and I'm wondering if they could be related at all? The setup is different (SendBatchRunnable is holding a RefPtr created from a StaticRefPtr, which we've asserted is non-null, and StreamingTelemetryDelegate wraps a JNI object), but the dispatching back to the main thread, and holding a strong reference to something that's been deallocated (which, in the case of bug 1644021, should never be deallocated) and has the poison pattern, raised my eyebrows.
Comment 17•3 years ago
(In reply to :Lina Cambridge from comment #13)
> Side note: I just saw bug 1644021 fly by, and I'm wondering if they could be related at all? The setup is different (SendBatchRunnable is holding a RefPtr created from a StaticRefPtr, which we've asserted is non-null, and StreamingTelemetryDelegate wraps a JNI object), but the dispatching back to the main thread, and holding a strong reference to something that's been deallocated (which, in the case of bug 1644021, should never be deallocated) and has the poison pattern, raised my eyebrows.
So SendBatchRunnable turned out to be using thread-unsafe refcounting. I don't think that's the problem here, because TaskRunnable uses atomic refcounting, which is defined here. The other members of TaskRunnable are thread-safe too: &'static str and AtomicBool, Box<dyn Task + Send + Sync> (thread-safe from Rust's perspective), and RefPtr<nsIThread> (which is also declared to use thread-safe refcounting).
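Putting those members together, the struct is shaped roughly like this (the field names are my guesses from the description above; the real definition is in xpcom/rust/moz_task/src/lib.rs):

use std::sync::atomic::AtomicBool;

// Every field is thread-safe, matching the list above. RefPtr and
// nsIThread are XPCOM types, so they're only hinted at in a comment.
#[allow(dead_code)]
pub struct TaskRunnable {
    name: &'static str,                // Send + Sync
    task: Box<dyn Task + Send + Sync>, // thread-safe by its bounds
    has_run: AtomicBool,               // atomic flag
    // original_thread: RefPtr<nsIThread>, // declared thread-safe refcounting
}

pub trait Task {
    fn run(&self);
    fn done(&self);
}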
Reporter
Comment 18•3 years ago
After fixing bug 1665411 it occurred to me that this might be just another instance of that problem. We'll have to wait until it gets uplifted to be sure, but I'm pretty confident this bug was caused by that issue; all the clues point to it:
- macOS-only crash in platform-independent code
- use-after-free
- sound locking, yet dead objects appear to "bleed" out of the critical sections
Reporter
Comment 19•3 years ago
Just found another signature. Hopefully when I fix (if I fix?) bug 1676343 this will go away.
Comment 20•7 months ago
No recent reports from any modern versions.