Closed Bug 1635013 Opened 4 years ago Closed 7 months ago

Crash in [@ free | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree]

Categories

(Toolkit :: Places, defect, P2)

Unspecified
macOS
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: gsvelto, Unassigned)

References

Details

(4 keywords)

Crash Data

This bug is for crash report bp-6ad332d7-e35a-4b6f-832d-a5a3d0200427.

Top 10 frames of crashing thread:

0 libmozglue.dylib free memory/build/malloc_decls.h:54
1 XUL core::ptr::real_drop_in_place toolkit/library/rust/shared/lib.rs:263
2 XUL <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree toolkit/components/places/bookmark_sync/src/store.rs
3 XUL bookmark_sync::merger::MergeTask::merge third_party/rust/dogear/src/store.rs:64
4 XUL <bookmark_sync::merger::MergeTask as moz_task::Task>::run toolkit/components/places/bookmark_sync/src/merger.rs:217
5 XUL moz_task::TaskRunnable::allocate::Run xpcom/rust/moz_task/src/lib.rs:132
6 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
7 XUL mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:332
8 XUL MessageLoop::Run ipc/chromium/src/base/message_loop.cc:290
9 XUL nsThread::ThreadFunc xpcom/threads/nsThread.cpp:444

This appears to be a macOS-specfic use-after-free access. From what I can tell most if not all of the crashes happen in a runnable so this is most likely some sort of race where we're accessing a dead object. Looking at the raw crashes there's often multiple registers holding the poison pattern which suggests we're loading multiple fields from said object. One interesting tidbit is that there's often two registers holding two separate bits looking like this:

reg1: 0xe5e5e5e5e5e00000
reg2: 0x000000000005e5e5

It's as if we're loading bit-fields or a string from the object.

There's nothing really useful in the comments, quite a few users mentioned this happened either when Firefox was in the background or they weren't interacting with it because they were watching a video. Could this be triggered by some background activity?

toolkit/components/places/bookmark_sync/src/ is code related to Firefox sync, so it is highly likely that it would be running in the background.

Flags: needinfo?(markh)
Flags: needinfo?(lina)
Flags: needinfo?(markh)
Flags: needinfo?(markh)
Flags: needinfo?(markh)

Hmm, we've seen very similar crashes in bug 1583047 and bug 1605052 for bookmark sync specifically. Bug 1593734 seemed vaguely related, in that it also mentioned moz_task, though it was for rkv and not Dogear. That would make sense, since moz_task is all about scheduling runnables to run on a background thread. I asked Nathan for some help over in bug 1583047, comment 1, and his explanation made perfect sense, though it didn't seem like that's what was happening.

Here's how that code works now. The TaskRunnable wraps a task, which is just a trait object that implements the Task trait. We enforce the Send + Sync bound on it, so that it's safe to use from another thread. The wrapper runnable gets dispatched to a target thread, calls the boxed task's run method, and ping-pongs the same runnable back to the main thread, where it then calls the task's done method.

There's also the ThreadPtrHolder type, which is basically an analogue of nsMainThreadPtrHolder from C++. It wraps a Rust XPCOM object in a thread-safe reference counted wrapper that calls NS_ProxyReleaseISupports to decrement the refcount on the original thread where the holder was created. We use it in Rust code to hold references to JS-implemented XPCOM interfaces in our tasks.

Gabriele, would you mind auditing those two types to make sure they're sound? I'm kind of out of my element when it comes to the intersection of XPCOM (and Rust XPCOM's magic, too), threading, and memory safety, it would be so valuable to have your expertise here!

Flags: needinfo?(lina) → needinfo?(gsvelto)

(In reply to :Lina Cambridge from comment #2)

Here's how that code works now.

Oh, and here's how bookmark_sync uses it. It makes a MergeTask (which has thread-safe members) on the main thread, and schedules it to run on the Places connection's async thread. On the async thread, we increment the refcount of the mozIStorageConnection (which I think should be safe, since it uses atomic refcounting and its methods expect to be used from multiple threads), and create our Store on the stack—this is what implements the fetch_local_tree and other methods.

Could the async thread getting torn down out from underneath us cause this? (Though, how come the thread gets torn down...I thought we only did that when the connection is closed, which, for Places, should be when the browser closes...not just when it's hanging out in the background).

Worrying crash signatures, but the stacks here seem to be in back-end stuff (triggered by UI or sync). Maybe sec-moderate if this isn't triggerable from web content.

I dug into the code and analyzed a few more minidumps in detail but I still can't find what's going wrong. The two bits of code in comment 2 seem OK and more importantly they don't contain any macOS-specific code but this crash is happening only on macOS. I suspect the problem is at a higher level otherwise we'd see it on other platforms too.

Now, https://crash-stats.mozilla.org/report/index/1acf494f-904c-4de9-bb2f-d9cb10200504 has a better stack trace where it's obvious that the object we're accessing is held by a local variable of <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree. Now the question is: which one? I also suspect it might be related to the storage connection because I don't see anything else in that method which could be deleted from under us. And again I don't see macOS-specific code anywhere related to the bookmarks, so we might be hitting some macOS-specific corner-case with the storage connection. One thing worth of attention is that apart from the crashing thread all the other threads seem to be idling, including the main one.

Marco, does anything look suspicious here (see comment 2 and comment 3 for more details)? Is there anything here that could cause the storage connection to disappear from under this code while it runs?

Flags: needinfo?(gsvelto) → needinfo?(mak)

Connection objects can of course be destroyed if nothing keeps them alive, but if this is about Places, the only point where it would be released is on shutdown, or if the database is corrupt on startup (but then I suspect we'd not even reach this point).
I know there has been some work to shutdown sooner, but I don't think it affected connections... May this crash happen on shutdown?

Flags: needinfo?(mak)

Unfortunately not, this seems to be happening during idle periods but not during shutdown.

There's also bp-d1994625-403a-4aa0-8b1b-942ad0200504 that crashes inside Store::apply, and bp-282f27d9-8c5e-42e6-9a00-c18160200506 inside Store::remote_row_to_item, that gets a bit further with an OOM; both on Windows. Could it be the same underlying issue—maybe we're panicking because we hit an OOM, and doing a double-drop cleaning up? Though, I think we compile with panic = "abort", so probably not... Or maybe our OOM hook is doing something fishy?

On macOS we often don't detect OOM crashes correctly - typically when we're killed by the OOM killer - so it's hard to know if these crashes are happening near OOM scenarios or not. That being said I think it's more likely that this is just a race where some other thread pulled the rug from under us.

Priority: -- → P2
Severity: -- → S3

Added another signature.

Crash Signature: [@ free | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ Allocator<T>::free | core::… → [@ Allocator<T>::free | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ arena_dalloc | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] […

Here's yet another signature.

Crash Signature: dogear::store::Store>::fetch_local_tree] [@ free | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ free | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] → dogear::store::Store>::fetch_local_tree] [@ free | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ free | core::ptr::real_drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr…

This could be another signature for this or another symptom of the same problem but I'm not sure:

[@ dogear::merge::accumulate]

The interesting bit in this one is that it manifest itself in three different ways with the same stack:

  • A general protection fault caused by accessing the poison pattern at 0xe5e5e5e5e5e5e5e5e5e5e5e5e5e5e5e5
  • A NULL pointer dereference with offset 8
  • An illegal instruction

Lina does this give you more clues about this issue? Assuming that our hunch is correct and this is a race, what could be accessing an object that's being processed by dogear::merge::accumulate? I'm assuming that there's some sort of callback involved and that it's read from an object which would lead to the illegal instruction crash (because the address is bogus).

Last but not least we've got this other suspicious signature which I also suspect belongs here:

[@ <dogear::tree::Tree as core::convert::TryFrom<dogear::tree::Builder>>::try_from]

This one is an actual panic, however there's a bunch of different reasons: index out of bounds, assertions failing and even unreachable code being reached. This again sounds like an access to an object that's corrupt or maybe is being modified concurrently.

Flags: needinfo?(lina)

Interesting! Unfortunately, it doesn't give me any more clues. The only thing I can think of for accumulate that's allocated in Gecko is an AbortSignal, which is allocated for each operation here and passed here. Everything else—merged nodes, completion ops—is allocated inside of Dogear, in local variables.

The other thing I can think of for the other Store operations is logging and recording telemetry by the merger as it works, which dispatches runnables back to the main thread. That's the only thing besides the Places connection itself that comes from the outside.

Side note, I just saw bug 1644021 fly by, and I'm wondering if they could be related at all? The setup is different (SendBatchRunnable is holding a RefPtr created from a StaticRefPtr, which we've asserted is non-null, and StreamingTelemetryDelegate wraps a JNI object), but the dispatching back to the main thread and holding a strong reference to something that's been deallocated (which, in the case of bug 1644021, should never be deallocated) and has the poison pattern raised my eyebrows.

Flags: needinfo?(lina)
Crash Signature: core::ptr::drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] → core::ptr::drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver ]…
Crash Signature: core::ptr::drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver ]… → core::ptr::drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver] [@ dogear::tree::BuilderEntry::parents_by]

(In reply to :Lina Cambridge from comment #13)

Side note, I just saw bug 1644021 fly by, and I'm wondering if they could be related at all? The setup is different (SendBatchRunnable is holding a RefPtr created from a StaticRefPtr, which we've asserted is non-null, and StreamingTelemetryDelegate wraps a JNI object), but the dispatching back to the main thread and holding a strong reference to something that's been deallocated (which, in the case of bug 1644021, should never be deallocated) and has the poison pattern raised my eyebrows.

So SendBatchRunnable turned out to be using thread-unsafe refcounting. I don't think that's the problem here, because TaskRunnable uses atomic refcounting, which is defined here. The other members of TaskRunnable are thread-safe too—&'static str and AtomicBool, Box<dyn Task + Send + Sync> (thread-safe from Rust's perspective), and RefPtr<nsIThread> (which is also declared to use thread-safe refcounting).

After fixing bug 1665411 it occurred to me that this might be just another instance of that problem. We'll have to wait until it gets uplifted to be sure but I'm pretty confident this bug was caused by that issue, all the clues point to it:

  • macOS-only crash in platform-independent code
  • use-after-free
  • sound locking yet dead objects appear to "bleed" out of the critical sections
See Also: → 1665411
Depends on: 1676343

Just found another signature. Hopefully when I fix (if I fix?) bug 1676343 this will go away.

Crash Signature: core::ptr::drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver] [@ dogear::tree::BuilderEntry::parents_by] → core::ptr::drop_in_place | <bookmark_sync::store::Store as dogear::store::Store>::fetch_local_tree] [@ core::ptr::real_drop_in_place | dogear::store::Store::merge_with_driver] [@ dogear::tree::BuilderEntry::parents_by] [@ dogear::tree::MergedNode::loc…

No recent reports from any modern versions.

Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → WORKSFORME
Group: firefox-core-security
You need to log in before you can comment on or make changes to this bug.