Closed Bug 1654335 Opened 5 years ago Closed 3 years ago

Use-after-free crashes in timer code at [@ <name omitted> | std::__1::__sift_up<T>]

Categories

(Core :: XPCOM, defect)

x86_64
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr78 - wontfix
firefox79 --- wontfix
firefox80 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 + wontfix
firefox84 + wontfix
firefox85 + wontfix
firefox86 + wontfix

People

(Reporter: gsvelto, Unassigned)

References

Details

(Keywords: crash, csectype-uaf, sec-moderate, Whiteboard: [adv-esr78.5-])

Crash Data

This bug is for crash report bp-9eaa9de3-4856-4693-a241-5f5a90200721.

Top 10 frames of crashing thread:

0 XUL <name omitted> xpcom/threads/TimerThread.h:105
1 XUL void std::__1::__sift_up<bool  /builds/worker/fetches/clang/include/c++/v1/algorithm:4835
2 XUL TimerThread::AddTimerInternal xpcom/threads/TimerThread.cpp:638
3 XUL XUL@0x455bcf 
4 XUL TimerThread::AddTimer xpcom/threads/TimerThread.cpp:518
5 XUL nsTimerImpl::InitHighResolutionWithCallback xpcom/threads/nsTimerImpl.cpp:372
6 XUL <name omitted> xpcom/threads/nsTimerImpl.cpp:417
7 XUL NS_NewTimerWithCallback xpcom/threads/nsTimerImpl.cpp:93
8 XUL NS_DispatchToThreadQueue xpcom/threads/nsThreadUtils.cpp:412
9 XUL nsIScreenManager::COMTypeInfo<nsIScreenManager, void>::kIID 

This is an odd crash: it seems to be macOS-specific, yet it affects a piece of code that has very few platform-specific bits. It appears to affect only older versions of macOS, specifically 10.9, 10.10 and 10.11. It seems to have appeared with 78 ESR and, for some reason, does not affect the non-ESR versions.

The crash itself is worrisome in that it's clearly a use-after-free access and the stack has been smashed (there's garbage on the stack below the second frame).

[Tracking Requested - why for this release]: This looks really bad. NI all the usual macOS suspects.

It's not surprising that this is only showing up on ESR: macOS 10.9-10.11 users were purposely migrated from release to ESR in conjunction with raising the minimum supported version to 10.12 for Fx79+.

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(mstange.moz)
Flags: needinfo?(haftandilian)

More complete stack:

mozilla::TimeStamp63Bit::operator unsigned long long() const at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h:53
 (inlined by) mozilla::TimeStamp::operator<(mozilla::TimeStamp const&) const at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h:543
 (inlined by) TimerThread::Entry::UniquePtrLessThan(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.h:105
void std::__1::__sift_up<bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), std::__1::iterator_traits<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >::difference_type) at /builds/worker/fetches/clang/bin/../include/c++/v1/algorithm:4835
void std::__1::push_heap<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&)>(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&)) at /builds/worker/fetches/clang/bin/../include/c++/v1/algorithm:4858
 (inlined by) TimerThread::AddTimerInternal(nsTimerImpl*) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:638
TimerThread::FindNextFireTimeForCurrentThread(mozilla::TimeStamp, unsigned int) at ??:?
TimerThread::AddTimer(nsTimerImpl*) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:518
nsTimerImpl::InitHighResolutionWithCallback(nsITimerCallback*, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:372
<broken frame>
nsTimerImpl::InitWithCallback(nsITimerCallback*, unsigned int, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:356
 (inlined by) nsTimer::InitWithCallback(nsITimerCallback*, unsigned int, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.h:235
 (inlined by) NS_NewTimerWithCallback(nsITimer**, nsITimerCallback*, unsigned int, unsigned int, nsIEventTarget*) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:93
NS_DispatchToThreadQueue(already_AddRefed<nsIRunnable>&&, unsigned int, nsIThread*, mozilla::EventQueuePriority) at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:412

I don't know much yet.

We're crashing when accessing a field at the beginning of a TimeStamp value, and the TimeStamp is the first field of an Entry object. Each Entry is separately heap-allocated using MakeUnique, and the UniquePtrs are stored in an nsTArray. All of the UniquePtrs in the mTimers array should be non-null.
The crash addresses are mostly 0x0, but there are also some 0x9s, some 0x8s, and some 0xffffffffe5e5e508.
So the contents of the nsTArray must have gone bad: the UniquePtrs are pointing to bad addresses.

But why? And why only on old versions of macOS? And only on ESR78... (edit: never mind, the ESR part is explained in comment 1.)

Flags: needinfo?(mstange.moz)

Here's a somewhat similar crash from Fennec 68:
https://crash-stats.mozilla.org/report/index/654251b0-8ba8-4e4a-b6c4-977dc0200719

I am trying to find crashes from Windows and Linux that have "TimerThread" in their stack frames but I don't know how to search for that.

(In reply to Markus Stange [:mstange] from comment #5)

I am trying to find crashes from Windows and Linux that have "TimerThread" in their stack frames but I don't know how to search for that.

protosignature contains

The crashes that are apparently crashing on the 0x0 address aren't really crashing there; they're crashing on the poison pattern 0xe5e5e5e5e5e5e5e5, but because of bug 1493342 they're being reported as NULL. So in most cases this is a UAF access. If you open up the "Raw Data and Minidumps" tab in those reports you'll often find that the rcx register contains the poison pattern and we're likely crashing when accessing it.

To look for crashes that contain TimerThread code (or any other function) in the stack you can use the "proto signature" field. It contains the function names of the first 10 frames on the stack of every crash IIRC:

https://crash-stats.mozilla.org/search/?proto_signature=~TimerThread

There are at least three more signatures that are certainly related. I'll add them.

Crash Signature: [@ <name omitted> | std::__1::__sift_up<T>] → [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal]

Thanks. Those other similar signatures are all mac-only, too.

I don't know where to go from here.

Maybe there's an unsynchronized access to mTimers from a different thread? I see mutexes being used consistently in TimerThread, so that seems unlikely.
Even more unlikely ideas: maybe __sift_up or push/pop_heap on macOS has a bug where it accesses memory outside of the range we give it? Or maybe a compiler bug miscompiles __sift_up or push/pop_heap so that it accesses outside of that range?

I'll open a minidump and look at the stack. Since one of the frames is corrupted, it's possible that the real cause of the crash lies there: some object or variable on the stack gets corrupted, and then we use it as an index into the array, which leads to the UAF access.

Byron, as somebody who has looked at some similar-sounding timer issues in the past, does this look like any existing issues we've had before? Thanks.

Group: dom-core-security
Flags: needinfo?(docfaraday)
Summary: Crash in [@ <name omitted> | std::__1::__sift_up<T>] → Use-after-free crashes in timer code at [@ <name omitted> | std::__1::__sift_up<T>]

I'm not sure if this should be sec-high or sec-moderate.

Keywords: sec-high

(In reply to Nathan Froyd [:froydnj] from comment #14)

The signature is almost exactly like https://bugzilla.mozilla.org/show_bug.cgi?id=1594016

The stacks there also show signs of smashing after the second frame.

There are no useful user comments in the reports, but URLs from these three sites come up more than a dozen times:

https://www.sfgate.com/
https://torontosun.com
https://www.mysanantonio.com/

I don't have the right hardware but if we could find someone who does we might try to repro.

I'm running ESR 78 on 10.15.6 now. I've clicked around on sfgate.com for 5 minutes now, no crash so far.

We've had races in that array before, but we went to considerable effort to lock it down. I do not see any place where we access that array (or any element within it) without the appropriate protection. The possibilities I can think of are:

  1. Misuse of the mTimers array somehow. Things like iterator invalidation.
  2. Random corruption; some unrelated code has stomped on mTimers' memory, due to a UAF bug or similar.
  3. Stack corruption.

Is it possible the MakeUnique is failing here, causing an empty UniquePtr to be pushed?

https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/xpcom/threads/TimerThread.cpp#638

(In reply to Byron Campen [:bwc] from comment #19)

Is it possible the MakeUnique is failing here, causing an empty UniquePtr to be pushed?

https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/xpcom/threads/TimerThread.cpp#638

No, MakeUnique allocates infallibly, so we should crash if it fails.

Not sure if this is related, but this looks super fishy to me:

https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/mozglue/misc/TimeStamp.h#46-53

Seems to me that we can have a case where none of the following are true: a < b, b < a, b == a (because a and b both have the same mTimeStamp, but different values in mUsedCanonicalNow). I don't know if push_heap/pop_heap are going to care about this, but still yuck.

Also, I seem to have missed the fact that we converted over to using a heap for this timer stuff; stl heaps are not necessarily stable, and I think we still have code that expects two timers scheduled with the same delay on the same thread to fire in the order they were scheduled.

It is also interesting that all of these problems originate with ChromeUtils::IdleDispatch, and not with other uses of the timer API (AFAICT). Might be a hint.

Flags: needinfo?(docfaraday)

(In reply to Markus Stange [:mstange] from comment #17)

I'm running ESR 78 on 10.15.6 now. I've clicked around on sfgate.com for 5 minutes now, no crash so far.

This seems to be happening only on 10.9, 10.10 and 10.11. Given the platform-specific nature of this crash we probably have to look for something that might cause issues up the stack, or before we got to the point of the crash. However, there's not much in the way of platform-specific code in the first few frames: we grab the lock, get a timestamp (which uses mach_absolute_time()) and do some logging. None of those seem like they might be causing issues.

I noticed https://torontosun.com/news has 4 crashes (in various different articles), which may help in reproing.

Flags: needinfo?(spohl.mozilla.bugs)

Seeing a bunch of crashes in 77, but a lot fewer in 78. Did the signature migrate? Did we fix something?

Keywords: sec-moderate

The crash mostly affects macOS 10.9-10.11 users, who were migrated to ESR78 (where this is still happening with high frequency).

Looking at the "crash data" chart, the number of these crashes increased dramatically between 2020-07-05 and 2020-07-08. Does anyone have an explanation for that?

Markus Stange just CCed me on this bug. And yes, it's weird and interesting. But the crash stacks provide no macOS-specific clues, so my expertise and tools are unlikely to help here.

As it happens, I have a working macOS 10.10.5 partition. I've been testing with the URLs from comment #16, but so far I haven't seen any crashes.

Also, given the number of crashes for at least the std::__1::__sift_up<T> signature, it's surprising it doesn't show up in the list of Mac topcrashes. Has it been deliberately kept out of that list because this is a security bug?

(In reply to Steven Michaud [:smichaud] (Retired) from comment #29)

Has it been deliberately kept out of that list because this is a security bug?

I don't think that's something we have set up.

I vaguely remember that at least some Firefox code sets the stack size. (My memory is too vague to be able to find it again.) Is it possible that we're just running out of stack space, and that changing that setting (or settings) would make a difference here?

I can't explain why we'd be more likely to run out of stack space on older versions of macOS, though.

I just looked at the assembly code for std::__1::__sift_up<T> (in today's mozilla-central nightly), and the only external method it calls directly is free().

Could this be a jemalloc bug? Does jemalloc behave differently on macOS versions older than 10.12?

assembly code for std::__1::__sift_up<T>

Its implementation for TimerThread::Entry.

Actually, I'm not sure the Mozilla codebase is still using jemalloc:

https://bugzilla.mozilla.org/show_bug.cgi?id=1363992

(In reply to Steven Michaud [:smichaud] (Retired) from comment #34)

Actually, I'm not sure the Mozilla codebase is still using jemalloc:

We're using jemalloc 3.

Well, I guess more that we use "mozjemalloc", which is a fork of jemalloc 3.

I just looked at the assembly code for std::__1::__sift_up<T> (in today's mozilla-central nightly), and the only external method it calls directly is free().

Likewise for std::__1::__pop_heap<T>

Testing with a recent mozilla-central nightly on macOS 10.15.6 and OS X 10.10.5 (and a HookCase hook library that hooks free(), std::__1::__sift_up<T> and std::__1::__pop_heap<T>), it appears that free() is never called from either of those TimerThread::Entry template methods in XUL, at least during ordinary circumstances. That's probably as it should be.

I wonder about non-ordinary circumstances, but I haven't yet figured out how to test for them.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #38)

Testing with a recent mozilla-central nightly on macOS 10.15.6 and OS X 10.10.5 (and a HookCase hook library that hooks free(), std::__1::__sift_up<T> and std::__1::__pop_heap<T>), it appears that free() is never called from either of those TimerThread::Entry template methods in XUL, at least during ordinary circumstances. That's probably as it should be.

Could it be that the issue isn't with the std::__1::__sift_up<T> and std::__1::__pop_heap<T> calls themselves, but rather with something that gets called earlier and corrupts the stack?

Yes, it could. Whatever the problem, though, it isn't easy to reproduce (as I found testing on OS X 10.10.5). At some point I'll try digging deeper into Mozilla's stl heap code (since that seems likely to be the source of the trouble). But I currently know nothing about it, and it will be a while before I have the time. I'll also keep comment #21 in mind.

I'm starting to see crashes in the nightly channel on macOS 10.15; see this one for example. This is worrisome: I don't think we can write this off as an issue with older versions of macOS, since it appears to affect newer ones as well.

I've started digging further into this, and am puzzled that std::__1::__sift_up<T> and std::__1::__pop_heap<T> (where T == TimerThread::Entry) ever call free(). After all, std::push_heap() and std::pop_heap() are intended not to add or subtract items from their "heap" -- just to change its order.

In my tests (comment #38) I never found free() actually being called from either of these two template methods. But you'd think it shouldn't even be possible. Each call to free() comes just after a call to RefPtr<nsTimerImpl>::~RefPtr() on the same pointer (in %rdi).

As best I can tell, free() (and RefPtr<nsTimerImpl>::~RefPtr()) would no longer be called if we could get the clang compiler to treat a TimerThread::Entry object as "large" rather than "small".

In ~/.mozbuild/clang/include/c++/any, the __move() method for struct _LIBCPP_TEMPLATE_VIS _SmallHandler calls __destroy(), but the __move() method for struct _LIBCPP_TEMPLATE_VIS _LargeHandler doesn't. Any suggestions for how to fool the clang compiler into treating a TimerThread::Entry object as "large" instead of "small"?

I actually just hit this crash. bp-b7a07adc-5b27-4ec8-aef6-2335d0200814

I had this article open from a news site: https://www.wweek.com/news/2020/08/13/united-states-postal-service-confirmed-it-has-removed-mailboxes-in-portland-and-eugene/?fbclid=IwAR2oZrYxdHdSAAKiqwmmpWsZi3pBjjroWXEWhHg7x73yBNbGEFhRSJ3qjow

I'm also four and a half minutes into a Firefox build on my machine while watching a video in the background, so my machine is under heavy load, if that matters.

Hmm it just crashed again on the same page: bp-b7c75792-bf05-41f6-9745-391d80200814

(In reply to Andrew McCreight [:mccr8] from comment #45)

Hmm it just crashed again on the same page: bp-b7c75792-bf05-41f6-9745-391d80200814

Was your machine still under load? Could this be a race of some kind (though I'd be surprised if it is given that this code seems to be properly guarded by mutexes).

(In reply to Gabriele Svelto [:gsvelto] from comment #46)

Was your machine still under load? Could this be a race of some kind (though I'd be surprised if it is given that this code seems to be properly guarded by mutexes).

Yes, my build was still going.

Andrew, do you have any non-standard settings?

I'm now doing a Firefox build, and have been playing with your URL from comment #44 in both FF 79.0 and today's mozilla-central nightly. No crashes. I'm running macOS 10.15.6 build 19G73 on a Mid-2015 15" MacBook Pro with 16GB of RAM.

I'm on Firefox 81. I have Fission enabled (set fission.autostart to true on a Nightly build). I'm not running any kind of ad blocker which might qualify as "non-standard". I'm on macOS 10.15.5 on a 2019 MacBookPro with 64GB of RAM. (I didn't realize until just now how much RAM I have on this computer...)

I tried again, with a build going, and it didn't crash, so maybe that was just a fluke... twice.

As of the 2020-08-12-03-44-18 mozilla-central nightly, almost all non-exported symbols are stripped from XUL (including the ones I've been working with). Anyone know what's up with that?

Maybe glandium knows.

Flags: needinfo?(mh+mozilla)

bug 1647780 did that. You can download the dSYMs or the breakpad symbols from the symbol server.

Flags: needinfo?(mh+mozilla)

bug 1647780 did that. You can download the dSYMs or the breakpad symbols from the symbol server.

Thanks for letting me know.

I take it that operations on TimerThread::mTimers from Mozilla code are properly protected by locks. Can the same be said for operations on mTimers using std::push_heap() and std::pop_heap()? They do, after all, take place on different threads.

Ugh. Never mind, it seems that I was led on a wild goose chase by llvm-objdump. While I was looking at the disassembly, it blindly ignored the dSYM file I had provided, so NS_NewLocalFileWithCFURL is just a big blob of code which can contain anything and everything. Scratch my previous comments; they're most likely wrong.

I just found an interesting stack trace, much more complete than most of those we've seen. Unlike most of our crashes, this seems to be one in the main (firefox) process, not the content process:

https://crash-stats.mozilla.org/signature/?signature=nsTArray_base%3CT%3E%3A%3ASwapArrayElements%3CT%3E%20%7C%20mozilla%3A%3Adetail%3A%3AMutexImpl%3A%3A~MutexImpl%20%7C%20XUL%400x9b8ecf%20%7C%20mozilla%3A%3Adetail%3A%3AMutexImpl%3A%3Aunlock%20%7C%20MessageLoop%3A%3APostTask%20%7C%20PR_GetCurrentThread%20%7C%20%3Cname%20omitted%3E%20%7C%20%3Cname%20omitted%3E%20%7C%20std%3A%3A__1%3A%3A__sift_up%3CT%3E&date=%3E%3D2020-08-10T15%3A39%3A00.000Z&date=%3C2020-08-17T15%3A39%3A00.000Z&_sort=-date

    0  XUL  nsTArrayInfallibleAllocator::ResultTypeProxy nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>::SwapArrayElements<nsTArrayInfallibleAllocator, nsTArrayInfallibleAllocator>(nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>&, unsigned long, unsigned long)  xpcom/ds/nsTArray-inl.h:416  context
    1  libmozglue.dylib  mozilla::detail::MutexImpl::~MutexImpl()  mozglue/misc/Mutex_posix.cpp:111  scan
    2  XUL  XUL@0x9b8ecf   scan
    3  libmozglue.dylib  mozilla::detail::MutexImpl::unlock()  mozglue/misc/Mutex_posix.cpp:178  scan
    4  XUL  MessageLoop::PostTask(already_AddRefed<nsIRunnable>)  ipc/chromium/src/base/message_loop.cc:346  scan
    5  libnss3.dylib  PR_GetCurrentThread  nsprpub/pr/src/pthreads/ptthread.c:640  scan
    6  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    7  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    8  XUL  void std::__1::__sift_up<bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), std::__1::iterator_traits<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >::difference_type)  /builds/worker/fetches/clang/include/c++/v1/algorithm:4835  scan
    9  XUL  XUL@0x9fb87f   scan
    10  XUL  nsTHashtable<nsBaseHashtableET<nsDepCharHashKey, mozilla::ipc::ChannelCountReporter::ChannelCounts> >::s_MatchEntry(PLDHashEntryHdr const*, void const*)  xpcom/ds/nsTHashtable.h:494  scan
    11  XUL  XUL@0x3f295f   scan
    12  XUL  <name omitted>  xpcom/ds/PLDHashTable.cpp:567  scan
    13  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    14  XUL  .str.33.llvm.9700260797439940092   scan
    15  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    16  libmozglue.dylib  mozilla::detail::MutexImpl::lock()  mozglue/misc/Mutex_posix.cpp:156  scan
    17  XUL  mozilla::ipc::(anonymous namespace)::ActivateAndCleanupIPCStream(mozilla::ipc::IPCStream&, bool, bool)  ipc/glue/IPCStreamUtils.cpp:222  scan
    18  XUL  mozilla::ipc::AutoIPCStream::~AutoIPCStream()  ipc/glue/IPCStreamUtils.cpp:0  scan
    19  XUL  detail::ProxyReleaseEvent<mozilla::ipc::HoldIPCStream>::Run()  xpcom/threads/nsProxyRelease.h:34  scan
    20  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1211  scan
    21  XUL  nsTimerImpl::InitCommon(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int, nsTimerImpl::Callback&&)  xpcom/threads/nsTimerImpl.cpp:310  scan
    22  XUL  .str.82.llvm.13599766915509620700   scan
    23  XUL  XUL@0x29b4abf   scan
    24  XUL  PLDHashTable::Remove(void const*)  xpcom/ds/PLDHashTable.cpp:594  scan
    25  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    26  XUL  NS_ProcessNextEvent(nsIThread*, bool)  xpcom/threads/nsThreadUtils.cpp:501  scan
    27  XUL  .str.379.llvm.611105640701535871   scan
    28  XUL  mozilla::dom::ContentParent::Observe(nsISupports*, char const*, char16_t const*)  dom/ipc/ContentParent.cpp:2909  scan
    29  XUL  NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)  xpcom/base/nsISupportsImpl.cpp:22  scan
    30  XUL  FifoWatcher::kPrefName   scan
    31  XUL  nsTHashtable<nsObserverList>::s_MatchEntry(PLDHashEntryHdr const*, void const*)  xpcom/ds/nsTHashtable.h:494  scan
    32  XUL  nsMaybeWeakPtrArray<nsIObserver>::RemoveWeakElement(nsIObserver*)  xpcom/base/nsMaybeWeakPtr.h:95  scan
    33  XUL  .str.7.llvm.6184990430567596100   scan
    34  libmozglue.dylib  free  memory/build/malloc_decls.h:54  scan
    35  libmozglue.dylib  free  memory/build/malloc_decls.h:54  scan
    36  XUL  .str.379.llvm.611105640701535871   scan
    37  XUL  .str.379.llvm.611105640701535871   scan
    38  XUL  .str.379.llvm.611105640701535871   scan
    39  XUL  .str.379.llvm.611105640701535871   scan
    40  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    41  XUL  .str.379.llvm.611105640701535871   scan
    42  XUL  nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*)  xpcom/ds/nsObserverList.cpp:65  scan
    43  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    44  XUL  nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*)  xpcom/ds/nsObserverService.cpp:288  scan
    45  XUL  mozilla::AppWindow::Destroy()  xpfe/appshell/AppWindow.cpp:669  scan
    46  XUL  .str.379.llvm.611105640701535871   scan
    47  XUL  .str.19.llvm.2081317860475493682   scan
    48  firefox  firefox@0x2ce2   scan
    49  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    50  XUL  nsXREDirProvider::DoShutdown()  toolkit/xre/nsXREDirProvider.cpp:1039  scan
    51  XUL  ScopedXPCOMStartup::~ScopedXPCOMStartup()  toolkit/xre/nsAppRunner.cpp:1273  scan
    52  XUL  XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)  toolkit/xre/nsAppRunner.cpp:4829  scan
    53  firefox  firefox@0x2ce2   scan
    54  XUL  XRE_main(int, char**, mozilla::BootstrapConfig const&)  toolkit/xre/nsAppRunner.cpp:4866  scan
    55  QD  GCC_except_table125   scan
    56  libsystem_c.dylib  __tolower   scan
    57  firefox  firefox@0x1a83   scan
    58  firefox  firefox@0x2cea   scan
    59  firefox  firefox@0x2ce2   scan

Comment #58's crash seems to be happening during shutdown. I suspect that's not true, though, of most of the other crashes (the content process ones).

Gabriele, you were on the right track. As best I can tell, the std::__1::__sift_up<T> crashes don't happen there.

Here's a crash report from the 20200811091731 mozilla-central nightly (the last one that still has non-external symbols in XUL):

https://crash-stats.mozilla.org/report/index/4b428e5b-aaec-4666-b41c-ce1260200811

The "Raw Data and Minidumps" tab says the module offset (in XUL) is 0x4a3110. When I do nm -pam XUL | less and then search on "sift_up", I find that this offset is past the end of std::__1::__sift_up<T>. In fact it's the address of nsTArray_Impl<RefPtr<mozilla::MozPromise<CopyableTArray<bool>, bool, false>::ThenValueBase>, nsTArrayInfallibleAllocator>::Clear().

The "function offset" (0xa0) does fit inside this function, but I'm not sure it's correct -- 0x4a3110 + 0xa0 == 0x4a31b0, which isn't at an instruction boundary.

In any case, something is seriously wrong with the symbolication code used by https://crash-stats.mozilla.org/.

In any case, something is seriously wrong with the symbolication code used by https://crash-stats.mozilla.org/.

Or maybe with the code that creates minidumps, or with the copy of dump_syms currently used by Mozilla to generate symbol files. Has either of those changed in the last few months? Say around 2020-07-05, when the number of this bug's crashes in the "crash data" chart started increasing dramatically, as I mentioned in comment #28?

We changed the symbol dumping tool in bug 1654845, but that landed 2020-07-25, so way outside your target date.

Crash reports for std::__1::__sift_up<T> do exist, in much smaller numbers, for dates earlier than 2020-07-05. Their crash stacks are much higher quality. But for some reason only a vanishingly small number happened in mozilla-central (alpha) builds (the only ones with symbols unstripped):

https://crash-stats.mozilla.org/search/?signature=~sift_up&date=%3E%3D2020-02-04T00%3A00%3A00.000Z&date=%3C2020-07-04T00%3A00%3A00.000Z&_facets=signature&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

I did check up on one of them, which happened in the 20200221095110 build:
https://crash-stats.mozilla.org/report/index/40552d7a-3a18-4c12-9d79-8ff730200328#tab-details

For this one, the "module offset" is inside std::__1::__sift_up<T>, at a location that makes sense (just after movq 0x8(%rax), %rax, which dereferences %rax).

At the end of all this, I think we'll find out that we need to back out the patch for bug 1647780. Yes, Mozilla's symbol handling code seems to make non-external symbols redundant. But without them we have no way to check that the symbol handling code is working correctly.

(Following up comment #60 and comment #61)

Oops, looks like I messed up, too. There are two mozilla-central nightlies for 2020-08-11: 20200811091731 and 20200811214738. I thought I had a copy of the former installed locally, but it turns out I had a copy of the latter. Once I installed 20200811091731, the problem I reported went away. On it, the offset 0x4a3110 is well within std::__1::__sift_up<T> (where T == TimerThread::Entry).

Sigh. Sorry for the confusion. Back to square one.

The definition of __sift_up() in <algorithm> (in ~/.mozbuild/clang/include/c++/v1/algorithm) is as follows:

    template <class _Compare, class _RandomAccessIterator>
    void
    __sift_up(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp,
              typename iterator_traits<_RandomAccessIterator>::difference_type __len)
    {
        typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;
        if (__len > 1)
        {
            __len = (__len - 2) / 2;
            _RandomAccessIterator __ptr = __first + __len;
            if (__comp(*__ptr, *--__last))
            {
                value_type __t(_VSTD::move(*__last));
                do
                {
                    *__last = _VSTD::move(*__ptr);
                    __last = __ptr;
                    if (__len == 0)
                        break;
                    __len = (__len - 1) / 2;
                    __ptr = __first + __len;
                } while (__comp(*__ptr, __t));
                *__last = _VSTD::move(__t);
            }
        }
    }

Crash reports indicate the crashes take place at line 4835, which is

            if (__comp(*__ptr, *--__last))

Thinking it might be one way to trigger this bug's crashes, I used a HookCase hook library to increase the value of __len I passed to the original function. This did trigger crashes (if __len got large enough), but the signature was [@ InvalidArrayIndex_CRASH | std::__1::__sift_up<T> ]. This seems to rule out that an incorrect value for __len is being passed to __sift_up().

Flags: needinfo?(haftandilian)

(In response to comment #21)

I used a HookCase hook library to hook TimerThread::UniquePtrLessThan() and make it always return false or always return true when running in the content process. Neither causes crashes (though making it always return false prevents any content from being displayed). I think this rules out that the problem reported in comment #21 triggers/causes this bug's crashes.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #58)

I just found an interesting stack trace, much more complete than most of those we've seen. Unlike most of our crashes, this seems to be one in the main (firefox) process, not the content process:

Apart from the first frame, all the others were found via stack scanning (see the Trust column), so they might be wrong.

Could someone CC me on bug 1660297? I currently don't have permission to view it.

Done.

I don't know if it is related, but bug 1660300 is about a shutdown hang in a method that creates a new timer.

After considerable digging around, I've managed to write a HookCase hook library that logs a stack trace whenever an nsTimerImpl object that has been manipulated by std::__1::__sift_up<T> or std::__1::__pop_heap<T> (i.e. one that has been stored in the TimerThread::mTimers array) is destroyed (when nsTimerImpl::~nsTimerImpl() is called on it). My intent is to find out what kinds of nsTimerImpl objects can end up in this array, and to learn something about their life cycles. That could be useful in deciphering non-reproducible crash bugs like the ones reported here.

But I see many different kinds of stack traces, and am having trouble learning anything useful from them. Does anyone have anything to say about the life cycles of nsTimerImpl objects stored in the TimerThread::mTimers array?

Not all nsTimerImpl objects end up in the TimerThread::mTimers array, but a lot of them do.

I suspect they're all related to this bug.

I suspect many of them are related to this bug.

Crash Signature: [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] → [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] [@ TimerThread::Entry::UniquePtrLessThan ]

Adding media bugs that appear related to the see-also list. These bugs have different signatures, but they share the sharp spike in rates on ESR78, the macOS specificity, and the involvement of dangling pointers (in some cases the pointers are reliably poisoned).

See Also: → 1660368, 1584956

Bryce, could you CC me on those bugs?

Flags: needinfo?(bvandyk)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #79)

Bryce, could you CC me on those bugs?

Done.

Flags: needinfo?(bvandyk)

Thanks!

The first thing that stands out is that the crashes for bug 1660368 and bug 1584956 started increasing substantially on or just after 2020-07-05, just like this bug. I still have no idea as to why.

(Following up comment #81)

Likewise for bug 1659938.

It looks like the early July increase in crashes is just an artifact -- that's when Mac ESR crashes stopped being throttled. See bug 1659938 comment 6.

Or possibly it's due to a large increase in the number of Mac ESR users, since that's when Mozilla de-supported OS X versions older than 10.12. Either way, though, the increase in crashes doesn't seem to have been caused by any contemporaneous change on the ESR branch.

I'm unfamiliar with the sampling used for the crash stats, so if anyone is able to point me to an explanation I'd appreciate it.

My current understanding is that ESR, unlike other channels, is not throttled. Assuming N crashes on release and N crashes on ESR, more reports will show up for ESR even though the same number occurred. This means that with the mass migration to ESR, various bugs appear to show a significant increase in crash numbers.

If that is the case, then the see-also media bugs that I assumed were related based on the increased crash rates and the heavy macOS skew may not be related after all. I.e. there is no common underlying bug; the reporting change just bumped all the numbers at the same time.

I'm going to leave the see-alsos in place while I make sure the above is correct, but if it is then it sounds like they should be removed.

Will, do we throttle crashes coming from ESR channels differently than release? See comment 85 for more details.

Flags: needinfo?(willkg)

Firefox release is sampled at 10%. All other apps/channels are unthrottled, as are reports submitted manually from about:crashes.

https://github.com/mozilla-services/antenna/blob/main/antenna/throttler.py#L389

Thanks!

Flags: needinfo?(willkg)

Because of throttling and a variety of other issues with the crash report data, I highly discourage people from doing analysis on counts of crash reports. The crash ping data in Telemetry can be normalized by usage numbers and is better for comparing counts. I discussed this here: https://bluesock.org/~willkg/blog/mozilla/crash_pings_crash_reports.html

I've been banging my head against this bug for a while, but it's now defeated me, at least for the time being. I'm going to put it aside, and maybe come back later with a fresh mind.

I discovered two interesting things along the way:

  1. I can trigger crashes in friends of push_heap and pop_heap by calling UniquePtr<Entry>::~UniquePtr or UniquePtr<Entry>::release() (in a hook library) on one or both of the parameters passed to Entry::UniquePtrLessThan(). These crashes all happen at address 0x8. But I haven't been able to figure out a way to make this happen "in real life".

  2. I noticed that TimerThread::mMonitor is released in TimerThread::PostTimerEvent() for a call to nsIEventTarget::Dispatch(). Aha! So maybe this is where thread contention can happen. But then (in my hook library) I forced all calls to TimerEvent::RemoveTimer() (on threads other than the "timer thread") to happen during calls to nsIEventTarget::Dispatch() (from TimerThread::PostTimerEvent()) on the timer thread. I didn't see a single crash.

FYI we keep finding more and more instances of crash reports that look like this one:

  • apparently impossible to explain
  • macOS-only
  • stack appears to have been corrupted above the crashing frame

Bug 1478581 is one of them, bug 1658719 is another. I also found more crash signatures exhibiting the same behavior; see the first three in this search. I think we might be dealing with a specific problem.

The crash location often appears to be near a point where we make an allocation (nsTArray is often involved) so I wonder if it might be related to the memory allocator.

I forgot something: the other thing that all the bugs I mentioned above seem to have in common is that we're accessing some dead object. The poison pattern comes up often, either in the crash address, in an index that led to it, or on the stack.

Gabriele, could you CC me on bug 1478581 and bug 1658719?

Flags: needinfo?(gsvelto)

Done!

Flags: needinfo?(gsvelto)
See Also: → 1665411
Depends on: 1676343
Whiteboard: [adv-esr78.5-]

Looking at the results on ESR, I'm not sure we've made any real impact on this crash :(

Flags: needinfo?(gsvelto)

I've checked the other bugs and yes, this isn't fixed yet. What surprises me is that the volume didn't go down (or did it?), because on nightly/beta it did go down, at least for older macOS versions. Anyway, I'll have to do more work on it next week.

Flags: needinfo?(gsvelto)

Gabriele, since this is currently tracked for 85 Beta, I thought of bringing this back to your attention for any planned follow-up.

Flags: needinfo?(gsvelto)

Unfortunately this needs a thorough investigation in order to find a proper fix and I haven't had the time to do it yet.

Flags: needinfo?(gsvelto)

It is high on my priority list though, I'll try to fix this as soon as I put out the other fires in crash reporting & friends.

I spent some time poring over these crashes again and something suddenly jumped out: there are no crashes for macOS 11. I went through all the other crashes under bug 1676343 and there are no macOS 11 crashes under any of the signatures. The only crash report I could find for macOS 11 has a different stack and isn't a UAF.

So this is very, very likely to be a bug in macOS 10.15 and older, either in the POSIX thread library where mutexes are implemented or in the kernel-level mutexes. Given that the memory allocator is involved (we're allocating/deallocating objects at the spot where we crash), it's worth mentioning that the locks we use there pass different options to the macOS kernel than the pthread mutexes do, so it could also be a bug in how those two types of kernel-level mutexes interact.

Did a signature change, or your change to locking on macOS, "fix" this? Crash rate plummeted.

Flags: needinfo?(gsvelto)

Looks like a signature change.

Crash Signature: [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] [@ TimerThread::Entry::UniquePtrLessThan ] → [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] [@ TimerThread::Entry::UniquePtrLessThan] [@ <name omitted> | TimerThread::AddTimerInternal]
Flags: needinfo?(gsvelto)

It looks like there is only one crash from versions more recent than ESR78 (which we aren't maintaining anymore): a single crash from ESR102. Unless this starts spiking again, I'm closing this for now.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
Group: dom-core-security
You need to log in before you can comment on or make changes to this bug.