Use-after-free crashes in timer code at [@ <name omitted> | std::__1::__sift_up<T>]
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: gsvelto, Unassigned)
References
Details
(Keywords: crash, csectype-uaf, sec-moderate, Whiteboard: [adv-esr78.5-])
Crash Data
This bug is for crash report bp-9eaa9de3-4856-4693-a241-5f5a90200721.
Top 10 frames of crashing thread:
0 XUL <name omitted> xpcom/threads/TimerThread.h:105
1 XUL void std::__1::__sift_up<bool /builds/worker/fetches/clang/include/c++/v1/algorithm:4835
2 XUL TimerThread::AddTimerInternal xpcom/threads/TimerThread.cpp:638
3 XUL XUL@0x455bcf
4 XUL TimerThread::AddTimer xpcom/threads/TimerThread.cpp:518
5 XUL nsTimerImpl::InitHighResolutionWithCallback xpcom/threads/nsTimerImpl.cpp:372
6 XUL <name omitted> xpcom/threads/nsTimerImpl.cpp:417
7 XUL NS_NewTimerWithCallback xpcom/threads/nsTimerImpl.cpp:93
8 XUL NS_DispatchToThreadQueue xpcom/threads/nsThreadUtils.cpp:412
9 XUL nsIScreenManager::COMTypeInfo<nsIScreenManager, void>::kIID
This is an odd crash in that it seems to be macOS-specific but it affects a piece of code that doesn't have many platform-specific bits. It seems to affect only older versions of macOS, specifically 10.9, 10.10 and 10.11. It seems to have appeared with 78ESR and for some reason does not seem to affect the non-ESR versions.
The crash itself is worrisome in that it's clearly a use-after-free access and the stack has been smashed (there's garbage on the stack below the second frame).
Comment 1•5 years ago
|
||
[Tracking Requested - why for this release]: This looks really bad. NI all the usual macOS suspects.
It's not surprising that this is only showing up on ESR because macOS 10.9-10.11 users were purposely migrated from release to ESR in conjuction with increasing the minimum supported version to 10.12 for Fx79+.
Comment 2•5 years ago
|
||
More complete stack:
mozilla::TimeStamp63Bit::operator unsigned long long() const at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h:53
(inlined by) mozilla::TimeStamp::operator<(mozilla::TimeStamp const&) const at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h:543
(inlined by) TimerThread::Entry::UniquePtrLessThan(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.h:105
void std::__1::__sift_up<bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), std::__1::iterator_traits<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >::difference_type) at /builds/worker/fetches/clang/bin/../include/c++/v1/algorithm:4835
void std::__1::push_heap<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&)>(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&)) at /builds/worker/fetches/clang/bin/../include/c++/v1/algorithm:4858
(inlined by) TimerThread::AddTimerInternal(nsTimerImpl*) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:638
TimerThread::FindNextFireTimeForCurrentThread(mozilla::TimeStamp, unsigned int) at ??:?
TimerThread::AddTimer(nsTimerImpl*) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:518
nsTimerImpl::InitHighResolutionWithCallback(nsITimerCallback*, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:372
<broken frame>
nsTimerImpl::InitWithCallback(nsITimerCallback*, unsigned int, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:356
(inlined by) nsTimer::InitWithCallback(nsITimerCallback*, unsigned int, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.h:235
(inlined by) NS_NewTimerWithCallback(nsITimer**, nsITimerCallback*, unsigned int, unsigned int, nsIEventTarget*) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:93
NS_DispatchToThreadQueue(already_AddRefed<nsIRunnable>&&, unsigned int, nsIThread*, mozilla::EventQueuePriority) at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:412
Comment 3•5 years ago
•
|
||
I don't know much yet.
We're crashing when we're accessing a field at the beginning of a TimeStamp value, and the TimeStamp is the first field of an Entry object. Each Entry is separately heap-allocated using MakeUnique, the UniquePtrs are stored in an nsTArray. All of the UniquePtrs in the mTimers array should be non-null.
The crash addresses are mostly 0x0, but there are also some 0x9s in there, some 0x8, and some 0xffffffffe5e5e508.
So that must mean that the contents of the nsTArray have gone bad. The UniquePtrs are pointing to bad addresses.
But why? And why only on old versions of macOS? And only on ESR78... (edit: never mind, the ESR part is explained in comment 1.)
Comment 4•5 years ago
|
||
Well, we do have crashes from modern macOS versions as well, on non-ESR:
https://crash-stats.mozilla.org/report/index/18bb690f-6e6e-4888-bc9a-765130200715
https://crash-stats.mozilla.org/report/index/58fefd5c-cf43-490b-b5c7-e899f0200708
[...]
Comment 5•5 years ago
|
||
Here's a somewhat similar crash from Fennec 68:
https://crash-stats.mozilla.org/report/index/654251b0-8ba8-4e4a-b6c4-977dc0200719
I am trying to find crashes from Windows and Linux that have "TimerThread" in their stack frames but I don't know how to search for that.
Comment 6•5 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #5)
I am trying to find crashes from Windows and Linux that have "TimerThread" in their stack frames but I don't know how to search for that.
protosignature contains
| Reporter | ||
Comment 7•5 years ago
|
||
The crashes that are apparently crashing on the 0x0 address aren't really crashing there, they're crashing on the poison pattern 0xe5e5e5e5e5e5e5e5 but because of bug 1493342 they're being reported as NULL. So in most cases this is an UAF access. If you open up the "Raw Data and Minidumps" field in those reports you'll often find that the rcx register contains the poison pattern and we're likely crashing when accessing it.
To look for crashes that contain TimerThread code (or any other function) in the stack you can use the "proto signature" field. It contains the function names of the first 10 frames on the stack of every crash IIRC:
https://crash-stats.mozilla.org/search/?proto_signature=~TimerThread
| Reporter | ||
Comment 8•5 years ago
|
||
There's at least three more signatures that are certainly related. I'll add them.
| Reporter | ||
Updated•5 years ago
|
Comment 9•5 years ago
|
||
Thanks. Those other similar signatures are all mac-only, too.
I don't know where to go from here.
Comment 10•5 years ago
|
||
Maybe there's an unsynchronized access to mTimers from a different thread? I see mutexes being used in TimerThread consistently so that seems unlikely.
Even more unlikely ideas: Maybe __sift_up or push/pop_heap on macOS has a bug where it accesses outside of the range we give it? Maybe there's a compiler bug that miscompiles __sift_up or push/pop_heap so that it accesses outside of the range we give it?
| Reporter | ||
Comment 11•5 years ago
|
||
I'll open a minidump and look at the stack. Since one of the frames it's corrupted it's possible that the real cause of the crash is there: some object / variable on the stack gets corrupted and then we use it as an index in the array which leads to the UAF access.
Comment 12•5 years ago
|
||
Byron, as somebody who has looked at some similar ish sounding timer issues in the past, does this look like any existing issues we've had before? Thanks.
Comment 13•5 years ago
|
||
I'm not sure if this should be sec-high or sec-moderate.
Comment 14•5 years ago
|
||
The signature is almost exactly like https://bugzilla.mozilla.org/show_bug.cgi?id=1594016
| Reporter | ||
Comment 15•5 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #14)
The signature is almost exactly like https://bugzilla.mozilla.org/show_bug.cgi?id=1594016
The stacks there also show signs of smashing after the second frame.
| Reporter | ||
Comment 16•5 years ago
•
|
||
There's no useful user comments in the reports but URLs from these three sites come up more than a dozen times:
https://www.sfgate.com/
https://torontosun.com
https://www.mysanantonio.com/
I don't have the right hardware but if we could find someone who does we might try to repro.
Comment 17•5 years ago
|
||
I'm running ESR 78 on 10.15.6 now. I've clicked around on sfgate.com for 5 minutes now, no crash so far.
Comment 18•5 years ago
|
||
We've had races in that array before, but we went to considerable effort to lock it down. I do not see any place where we access that array (or any element within it) without the appropriate protection. The possibilities I can think of are:
- Misuse of the mTimers array somehow. Things like iterator invalidation.
- Random corruption; some unrelated code has stomped on mTimers' memory, due to a UAF bug or similar.
- Stack corruption.
Comment 19•5 years ago
|
||
Is it possible the MakeUnique is failing here, causing an empty UniquePtr to be pushed?
Comment 20•5 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #19)
Is it possible the MakeUnique is failing here, causing an empty UniquePtr to be pushed?
No, MakeUnique allocates infallibly, so we should crash if it fails.
Comment 21•5 years ago
•
|
||
Not sure if this is related, but this looks super fishy to me:
Seems to me that we can have a case where none of the following are true: a < b, b < a, b == a (because a and b both have the same mTimeStamp, but different values in mUsedCanonicalNow). I don't know if push_heap/pop_heap are going to care about this, but still yuck.
Comment 22•5 years ago
|
||
Also, I seem to have missed the fact that we converted over to using a heap for this timer stuff; stl heaps are not necessarily stable, and I think we still have code that expects two timers scheduled with the same delay on the same thread to fire in the order they were scheduled.
Comment 23•5 years ago
|
||
It is also interesting that all of these problems originate with ChromeUtils::IdleDispatch, and not with other uses of the timer API (AFAICT). Might be a hint.
| Reporter | ||
Comment 24•5 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #17)
I'm running ESR 78 on 10.15.6 now. I've clicked around on sfgate.com for 5 minutes now, no crash so far.
This seems to be happening only on 10.9, 10.10 and 10.11. Given the platform-specific nature of this crash we probably have to look for something that might cause issues up the stack, or before we got to the point of the crash. However there's not much in the way of platform-specific code in the first few frames: we grab the lock, get a timestamp which uses mach_absolute_time() and do some logging. None of those seem like they might be causing issues.
Comment 25•5 years ago
|
||
I noticed https://torontosun.com/news has 4 crashes (in various different articles), which may help in reproing.
Updated•5 years ago
|
Comment 26•5 years ago
|
||
Seeing a bunch of crashes in 77, but a lot fewer in 78. Did the signature migrate? Did we fix something?
Comment 27•5 years ago
|
||
The crash mostly affects macOS 10.9-10.11 users, which were migrated to ESR78 (where this is still happening with high frequency).
Comment 28•5 years ago
•
|
||
Looking at the "crash data" chart, the number of these crashes increased dramatically between 7-5-2020 and 7-8-2020. Does anyone have an explanation for that?
Markus Stange just CCed me on this bug. And yes, it's weird and interesting. But the crash stacks provide no macOS-specific clues, so my expertise and tools are unlikely to help here.
As it happens, I have a working macOS 10.10.5 partition. I've been testing with the URLs from comment #16, but so far I haven't seen any crashes.
Comment 29•5 years ago
|
||
Also, given the number of crashes for at least the std::__1::__sift_up<T> signature, it's surprising it doesn't show up in the list of Mac topcrashes. Has it been deliberately kept out of that list because this is a security bug?
Comment 30•5 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #29)
Has it been deliberately kept out of that list because this is a security bug?
I don't think that's something we have set up.
Comment 31•5 years ago
|
||
I vaguely remember that at least some Firefox code sets the stack size. (My memory is too vague to be able to find it again.) Is it possible that we're just running out of stack space, and that changing that setting (or settings) would make a difference here?
I can't explain why we'd be more likely to run out of stack space on older versions of macOS, though.
Comment 32•5 years ago
|
||
I just looked at the assembly code for std::__1::__sift_up<T> (in today's mozilla-central nightly), and the only external method it calls directly is free().
Could this be a jemalloc bug? Does jemalloc behave differently on macOS versions older than 10.12?
Comment 33•5 years ago
|
||
assembly code for std::__1::__sift_up<T>
Its implementation for TimerThread::Entry.
Comment 34•5 years ago
|
||
Actually, I'm not sure the Mozilla codebase is still using jemalloc:
Comment 35•5 years ago
•
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #34)
Actually, I'm not sure the Mozilla codebase is still using jemalloc:
We're using jemalloc 3.
Comment 36•5 years ago
•
|
||
Well, I guess more that we use "mozjemalloc", which is a fork of jemalloc 3.
Comment 37•5 years ago
|
||
I just looked at the assembly code for std::__1::__sift_up<T> (in today's mozilla-central nightly), and the only external method it calls directly is free().
Likewise for std::__1::__pop_heap<T>
Comment 38•5 years ago
|
||
Testing with a recent mozilla-central nightly on macOS 10.15.6 and OS X 10.10.5 (and a HookCase hook library that hooks free(), std::__1::__sift_up<T> and std::__1::__pop_heap<T>), it appears that free() is never called from either of those TimerThread::Entry template methods in XUL, at least during ordinary circumstances. That's probably as it should be.
I wonder about non-ordinary circumstances, but I haven't yet figured out how to test for them.
| Reporter | ||
Comment 39•5 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #38)
Testing with a recent mozilla-central nightly on macOS 10.15.6 and OS X 10.10.5 (and a HookCase hook library that hooks free(), std::__1::__sift_up<T> and std::__1::__pop_heap<T>), it appears that free() is never called from either of those TimerThread::Entry template methods in XUL, at least during ordinary circumstances. That's probably as it should be.
Could it be that the issue isn't with the std::__1::__sift_up<T> and std::__1::__pop_heap<T> calls themselves, but rather with something that gets called earlier and corrupts the stack?
Comment 40•5 years ago
|
||
Yes, it could. Whatever the problem, though, it isn't easy to reproduce (as I found testing on OS X 10.10.5). At some point I'll try digging deeper into Mozilla's stl heap code (since that seems likely to be the source of the trouble). But I currently know nothing about it, and it will be a while before I have the time. I'll also keep comment #21 in mind.
| Reporter | ||
Comment 41•5 years ago
|
||
I'm starting to see crashes in the nightly channel on macOS 10.15, see this one for example. This is worrisome, I don't think we can rule it out as an issue with older versions of macOS, it appears to affect new ones as well.
Comment 42•5 years ago
|
||
I've started digging further into this, and am puzzled that std::__1::__sift_up<T> and std::__1::__pop_heap<T> (where T == TimerThread::Entry) ever call free(). After all, std::push_heap() and std::pop_heap() are intended not to add or subtract items from their "heap" -- just to change its order.
In my tests (comment #38) I never found free() actually being called from either of these two template methods. But you'd think it shouldn't even be possible. Each call to free() comes just after a call to RefPtr<nsTimerImpl>::~RefPtr() on the same pointer (in %rdi).
Comment 43•5 years ago
|
||
As best I can tell, free() (and RefPtr<nsTimerImpl>::~RefPtr()) would no longer be called if we could get the clang compiler to treat a TimerThread::Entry object as "large" rather than "small".
In ~/.mozbuild/clang/include/c++/any, the __move() method for struct _LIBCPP_TEMPLATE_VIS _SmallHandler calls __destroy(), but the __move() method for struct _LIBCPP_TEMPLATE_VIS _LargeHandler doesn't. Any suggestions for how to fool the clang compiler into treating a TimerThread::Entry object as "large" instead of "small"?
Comment 44•5 years ago
|
||
I actually just hit this crash. bp-b7a07adc-5b27-4ec8-aef6-2335d0200814
I had this article open from a news site: https://www.wweek.com/news/2020/08/13/united-states-postal-service-confirmed-it-has-removed-mailboxes-in-portland-and-eugene/?fbclid=IwAR2oZrYxdHdSAAKiqwmmpWsZi3pBjjroWXEWhHg7x73yBNbGEFhRSJ3qjow
I'm also 4 and a half minutes into a Firefox build on my machine while watching a video in the background my machine is under heavy load, if that matters.
Comment 45•5 years ago
|
||
Hmm it just crashed again on the same page: bp-b7c75792-bf05-41f6-9745-391d80200814
| Reporter | ||
Comment 46•5 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #45)
Hmm it just crashed again on the same page: bp-b7c75792-bf05-41f6-9745-391d80200814
Was your machine still under load? Could this be a race of some kind (though I'd be surprised if it is given that this code seems to be properly guarded by mutexes).
Updated•5 years ago
|
Comment 47•5 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #46)
Was your machine still under load? Could this be a race of some kind (though I'd be surprised if it is given that this code seems to be properly guarded by mutexes).
Yes, my build was still going.
Comment 48•5 years ago
|
||
Andrew, do you have any non-standard settings?
I'm now doing a Firefox build, and have been playing with your URL from comment #44 in both FF 79.0 and today's mozilla-central nightly. No crashes. I'm running macOS 10.15.6 build 19G73 on a Mid-2015 15" MacBook Pro with 16GB of RAM.
Comment 49•5 years ago
|
||
I'm on Firefox 81. I have Fission enabled (set fission.autostart to true on a Nightly build). I'm not running any kind of ad blocker which might qualify as "non-standard". I'm on macOS 10.15.5 on a 2019 MacBookPro with 64GB of RAM. (I didn't realize until just now how much RAM I have on this computer...)
Comment 50•5 years ago
|
||
I tried again, with a build going, and it didn't crash, so maybe that was just a fluke... twice.
Comment 51•5 years ago
|
||
As of the 2020-08-12-03-44-18 mozilla-central nightly, almost all non-exported symbols are stripped from XUL (including the ones I've been working with). Anyone know what's up with that?
Comment 53•5 years ago
|
||
bug 1647780 did that. You can download the dSYMs or the breakpad symbols from the symbol server.
Comment 54•5 years ago
•
|
||
bug 1647780 did that. You can download the dSYMs or the breakpad symbols from the symbol server.
Thanks for letting me know.
I take it that operations on TimerThread::mTimers from Mozilla code are properly protected by locks. Can the same be said for operations on mTimers using std::push_heap() and std::pop_heap()? They do, after all, take place on different threads.
| Comment hidden (obsolete) |
| Comment hidden (obsolete) |
| Reporter | ||
Comment 57•5 years ago
|
||
Ugh. Nevermind, it seems that I was lead on wild goose chase by llvm-objdump. It seems that while I was looking at the disassembly it blindly ignored the dsym file I had provided so NS_NewLocalFileWithCFURL is just a big blob of code which can contain anything and everything. Scratch my previous comments, they're most likely wrong.
Comment 58•5 years ago
|
||
I just found an interesting stack trace, much more complete than most of those we've seen. Unlike most of our crashes, this seems to be one in the main (firefox) process, not the content process:
0 XUL nsTArrayInfallibleAllocator::ResultTypeProxy nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>::SwapArrayElements<nsTArrayInfallibleAllocator, nsTArrayInfallibleAllocator>(nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>&, unsigned long, unsigned long) xpcom/ds/nsTArray-inl.h:416 context
1 libmozglue.dylib mozilla::detail::MutexImpl::~MutexImpl() mozglue/misc/Mutex_posix.cpp:111 scan
2 XUL XUL@0x9b8ecf scan
3 libmozglue.dylib mozilla::detail::MutexImpl::unlock() mozglue/misc/Mutex_posix.cpp:178 scan
4 XUL MessageLoop::PostTask(already_AddRefed<nsIRunnable>) ipc/chromium/src/base/message_loop.cc:346 scan
5 libnss3.dylib PR_GetCurrentThread nsprpub/pr/src/pthreads/ptthread.c:640 scan
6 libmozglue.dylib <name omitted> memory/build/malloc_decls.h:51 scan
7 libmozglue.dylib <name omitted> memory/build/malloc_decls.h:51 scan
8 XUL void std::__1::__sift_up<bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), std::__1::iterator_traits<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >::difference_type) /builds/worker/fetches/clang/include/c++/v1/algorithm:4835 scan
9 XUL XUL@0x9fb87f scan
10 XUL nsTHashtable<nsBaseHashtableET<nsDepCharHashKey, mozilla::ipc::ChannelCountReporter::ChannelCounts> >::s_MatchEntry(PLDHashEntryHdr const*, void const*) xpcom/ds/nsTHashtable.h:494 scan
11 XUL XUL@0x3f295f scan
12 XUL <name omitted> xpcom/ds/PLDHashTable.cpp:567 scan
13 libmozglue.dylib <name omitted> memory/build/malloc_decls.h:51 scan
14 XUL .str.33.llvm.9700260797439940092 scan
15 libmozglue.dylib <name omitted> memory/build/malloc_decls.h:51 scan
16 libmozglue.dylib mozilla::detail::MutexImpl::lock() mozglue/misc/Mutex_posix.cpp:156 scan
17 XUL mozilla::ipc::(anonymous namespace)::ActivateAndCleanupIPCStream(mozilla::ipc::IPCStream&, bool, bool) ipc/glue/IPCStreamUtils.cpp:222 scan
18 XUL mozilla::ipc::AutoIPCStream::~AutoIPCStream() ipc/glue/IPCStreamUtils.cpp:0 scan
19 XUL detail::ProxyReleaseEvent<mozilla::ipc::HoldIPCStream>::Run() xpcom/threads/nsProxyRelease.h:34 scan
20 XUL nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1211 scan
21 XUL nsTimerImpl::InitCommon(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int, nsTimerImpl::Callback&&) xpcom/threads/nsTimerImpl.cpp:310 scan
22 XUL .str.82.llvm.13599766915509620700 scan
23 XUL XUL@0x29b4abf scan
24 XUL PLDHashTable::Remove(void const*) xpcom/ds/PLDHashTable.cpp:594 scan
25 XUL nsXREDirProvider::DoStartup()::kCrashed scan
26 XUL NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:501 scan
27 XUL .str.379.llvm.611105640701535871 scan
28 XUL mozilla::dom::ContentParent::Observe(nsISupports*, char const*, char16_t const*) dom/ipc/ContentParent.cpp:2909 scan
29 XUL NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*) xpcom/base/nsISupportsImpl.cpp:22 scan
30 XUL FifoWatcher::kPrefName scan
31 XUL nsTHashtable<nsObserverList>::s_MatchEntry(PLDHashEntryHdr const*, void const*) xpcom/ds/nsTHashtable.h:494 scan
32 XUL nsMaybeWeakPtrArray<nsIObserver>::RemoveWeakElement(nsIObserver*) xpcom/base/nsMaybeWeakPtr.h:95 scan
33 XUL .str.7.llvm.6184990430567596100 scan
34 libmozglue.dylib free memory/build/malloc_decls.h:54 scan
35 libmozglue.dylib free memory/build/malloc_decls.h:54 scan
36 XUL .str.379.llvm.611105640701535871 scan
37 XUL .str.379.llvm.611105640701535871 scan
38 XUL .str.379.llvm.611105640701535871 scan
39 XUL .str.379.llvm.611105640701535871 scan
40 XUL nsXREDirProvider::DoStartup()::kCrashed scan
41 XUL .str.379.llvm.611105640701535871 scan
42 XUL nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsObserverList.cpp:65 scan
43 XUL nsXREDirProvider::DoStartup()::kCrashed scan
44 XUL nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) xpcom/ds/nsObserverService.cpp:288 scan
45 XUL mozilla::AppWindow::Destroy() xpfe/appshell/AppWindow.cpp:669 scan
46 XUL .str.379.llvm.611105640701535871 scan
47 XUL .str.19.llvm.2081317860475493682 scan
48 firefox firefox@0x2ce2 scan
49 XUL nsXREDirProvider::DoStartup()::kCrashed scan
50 XUL nsXREDirProvider::DoShutdown() toolkit/xre/nsXREDirProvider.cpp:1039 scan
51 XUL ScopedXPCOMStartup::~ScopedXPCOMStartup() toolkit/xre/nsAppRunner.cpp:1273 scan
52 XUL XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4829 scan
53 firefox firefox@0x2ce2 scan
54 XUL XRE_main(int, char**, mozilla::BootstrapConfig const&) toolkit/xre/nsAppRunner.cpp:4866 scan
55 QD GCC_except_table125 scan
56 libsystem_c.dylib __tolower scan
57 firefox firefox@0x1a83 scan
58 firefox firefox@0x2cea scan
59 firefox firefox@0x2ce2 scan
Comment 59•5 years ago
|
||
Comment #58's crash seems to be happening during shutdown. I suspect that's not true, though, of most of the other crashes (the content process ones).
Comment 60•5 years ago
|
||
| wrong | ||
Gabriele, you were on the right track. As best I can tell, the std::__1::__sift_up<T> crashes don't happen there.
Here's a crash report from the 20200811091731 mozilla-central nightly (the last one that still has non-external symbols in XUL):
https://crash-stats.mozilla.org/report/index/4b428e5b-aaec-4666-b41c-ce1260200811
The "Raw Data and Minidumps" tab says the module offset (in XUL) is 0x4a3110. When I do nm -pam XUL | less and then search on "sift_up", I find that this offset is past the end of std::__1::__sift_up<T>. In fact it's the address of nsTArray_Impl<RefPtr<mozilla::MozPromise<CopyableTArray<bool>, bool, false>::ThenValueBase>, nsTArrayInfallibleAllocator>::Clear().
The "function offset" (0xa0) does fit inside this function, but I'm not sure it's correct -- 0x4a3110 + 0xa0 == 0x4a31b0, which isn't at an instruction boundary.
In any case, something is seriously wrong with the symbolication code used by https://crash-stats.mozilla.org/.
Comment 61•5 years ago
|
||
| wrong | ||
In any case, something is seriously wrong with the symbolication code used by https://crash-stats.mozilla.org/.
Or maybe with the code that creates minidumps, or with the copy of dump_syms currently used by Mozilla to create minidumps. Has either of those changed in the last few months? Say around 7-05-2020, when the number of this bug's crashes in the "crash data" chart started increasing dramatically, as I mentioned in comment #28?
Comment 62•5 years ago
|
||
We changed the symbol dumping tool in bug 1654845, but that landed 2020-07-25, so way outside your target date.
Comment 63•5 years ago
•
|
||
Crash reports for std::__1::__sift_up<T> do exist, in much smaller numbers, for dates earlier than 2020-07-05. Their crash stacks are much higher quality. But for some reason only a vanishingly small number happened in mozilla-central (alpha) builds (the only ones with symbols unstripped):
I did check up on one of them, which happened in the 20200221095110 build:
https://crash-stats.mozilla.org/report/index/40552d7a-3a18-4c12-9d79-8ff730200328#tab-details
For this one, the "module offset" is inside std::__1::__sift_up<T>, at a location that makes sense (just after movq 0x8(%rax), %rax, which derefences %rax).
Comment 64•5 years ago
|
||
At the end of all this, I think we'll find out that we need to back out the patch for bug 1647780. Yes, Mozilla's symbol handling code seems to make non-external symbols redundant. But without them we have no way to check that the symbol handling code is working correctly.
Comment 65•5 years ago
|
||
(Following up comment #60 and comment #61)
Oops, looks like I messed up, too. There are two mozilla-central nightlies for 2020-08-11: 20200811091731 and 20200811214738. I thought I had a copy of the former installed locally, but it turns out I had a copy of the latter. Once I installed 20200811091731, the problem I reported went away. On it, the offset 0x4a3110 is well within std::__1::__sift_up<T> (where T == TimerThread::Entry).
Sigh. Sorry for the confusion. Back to square one.
Comment 66•5 years ago
|
||
The definition of __sift_up() in <algorithm> (in ~/.mozbuild/clang/include/c++/v1/algorithm) is as follows:
template <class _Compare, class _RandomAccessIterator>
void
__sift_up(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp,
typename iterator_traits<_RandomAccessIterator>::difference_type __len)
{
typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;
if (__len > 1)
{
__len = (__len - 2) / 2;
_RandomAccessIterator __ptr = __first + __len;
if (__comp(*__ptr, *--__last))
{
value_type __t(_VSTD::move(*__last));
do
{
*__last = _VSTD::move(*__ptr);
__last = __ptr;
if (__len == 0)
break;
__len = (__len - 1) / 2;
__ptr = __first + __len;
} while (__comp(*__ptr, __t));
*__last = _VSTD::move(__t);
}
}
}
Crash reports indicate the crashes take place at line 4835, which is
if (__comp(*__ptr, *--__last))
Thinking it might be one way to trigger this bug's crashes, I used a HookCase hook library to increase the value of __len I passed to the original function. This did trigger crashes (if __len got large enough), but the signature was [@ InvalidArrayIndex_CRASH | std::__1::__sift_up<T> ]. This seems to rule out that an incorrect value for __len is being passed to __sift_up().
Updated•5 years ago
|
Comment 67•5 years ago
•
|
||
(In response to comment #21)
I used a HookCase hook library to hook TimerThread::UniquePtrLessThan() and make it always return false or always return true when running in the content process. Neither causes crashes (though making it always return false prevents any content from being displayed). I think this rules out that the problem reported in comment #21 triggers/causes this bug's crashes.
| Reporter | ||
Comment 68•5 years ago
|
||
(In reply to Steven Michaud [:smichaud] (Retired) from comment #58)
I just found an interesting stack trace, much more complete than most of those we've seen. Unlike most of our crashes, this seems to be one in the main (firefox) process, not the content process:
Apart from the first frame all the others were found via stack-scanning (see the Trust column) so they might be wrong.
Comment 70•5 years ago
|
||
Could someone CC me on bug 1660297? I currently don't have permission to view it.
Comment 71•5 years ago
|
||
Done.
Comment 72•5 years ago
|
||
I don't know if it is related, but bug 1660300 is about a shutdown hang in a method that creates a new timer.
Comment 73•5 years ago
•
|
||
After considerable digging around, I've managed to write a HookCase hook library that logs a stack trace whenever an nsTimerImpl object that has been manipulated by std::__1::__sift_up<T> or std::__1::__pop_heap<T> (that has been stored in the nsTimerThread::mTimers array) is destroyed (when nsTimerImpl::~nsTimerImpl() is called on it). My intent is to find out what kinds of nsTimerImpl objects can end up in this array, and to learn something about their life cycles. That can be useful in deciphering non-reproducible crash bugs, like the ones reported here.
But I see many different kinds of stack traces, and am having trouble learning anything useful about them. Does anyone have something to say on the life cycles of nsTimerImpl objects stored in the nsTimerThread::mTimers array?
Not all nsTimerImpl objects end up in the nsTimerThread::mTimers array. But a lot of them do.
Comment 74•5 years ago
|
||
Randell pointed out to me today that we have another signature that's UAF-looking and rising quickly in frequency for 10.9-10.11 ESR78 users:
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=objc_msgSend%20%7C%20_NSAccessibilityUIElementForSpecifier&version=78.2.0esr&date=%3C2020-09-02T16%3A44%3A06%2B00%3A00&date=%3E%3D2020-08-26T16%3A44%3A06%2B00%3A00
Comment 75•5 years ago
|
||
There are lots of crash stacks that have "sift_up" or "pop_heap" somewhere in the "proto signature":
I suspect they're all related to this bug.
Comment 76•5 years ago
|
||
I suspect they're all related to this bug.
I suspect many of them are related to this bug.
Comment 77•5 years ago
|
||
Updated•5 years ago
|
Adding media bugs that appear related to see also. These bugs have different signatures, but share the sharp spike in rates at ESR78, MacOS specificity, and involve dangling pointers (in some cases they are reliably poisoned).
(In reply to Steven Michaud [:smichaud] (Retired) from comment #79)
Bryce, could you CC me on those bugs?
Done.
Comment 81•5 years ago
•
|
||
Thanks!
The first thing that stands out is that the crashes for bug 1660368 and bug 1584956 started increasing substantially on or just after 2020-07-05, just like this bug. I still have no idea as to why.
Comment 82•5 years ago
|
||
(Following up comment #81)
Likewise for bug 1659938.
Comment 83•5 years ago
|
||
It looks like the early July increase in crashes is just an artifact -- that's when Mac ESR crashes stopped being throttled. See bug 1659938 comment 6.
Comment 84•5 years ago
|
||
It looks like the early July increase in crashes is just an artifact -- that's when Mac ESR crashes stopped being throttled. See bug 1659938 comment 6.
Or possibly it's due to a large increase in the number of Mac ESR users, since that's when Mozilla de-supported OS X versions older than 10.12. Either way, though, the increase in crashes doesn't seem to have been caused by any contemporaneous change on the ESR branch.
I'm unfamiliar with the sampling used for the crash stats, so if anyone is able to point me to an explanation I'd appreciate it.
My current understanding is that ESR is not throttled unlike other channels. Assuming N failures on release and N failures on ESR, more crashes will be reported for ESR despite the same number occurring. This means that with the mass migration to ESR various bugs appear to have a significant increase in num crashes.
If that is the case, then the see also media bugs that I assumed are related based on the increased crash rates and MacOS heavy skew may not be related. I.e. there is not a common underlying bug -- it's just the reporting bumping all the numbers at the same time.
I'm going to leave the see alsos while I make sure the above is correct, but if it is then it sounds like they should be removed.
| Reporter | ||
Comment 86•5 years ago
|
||
Will, do we throttle crashes coming from ESR channels differently than release? See comment 85 for more details.
Comment 87•5 years ago
|
||
Firefox release is sampled at 10%. All other apps/channels are unthrottled, as are reports submitted manually from about:crashes.
https://github.com/mozilla-services/antenna/blob/main/antenna/throttler.py#L389
Comment 89•5 years ago
|
||
Because of throttling and a variety of other issues with the crash report data, I highly discourage people from analysis on counts of crash reports. The crash ping data in Telemetry can be normalized by usage numbers and will be better for comparing counts. I did some discussion of this here: https://bluesock.org/~willkg/blog/mozilla/crash_pings_crash_reports.html
Comment 90•5 years ago
•
|
||
I've been banging my head against this bug for a while, but it's now defeated me, at least for the time being. I'm going to put it aside, and maybe come back later with a fresh mind.
I discovered two interesting things along the way:
-
I can trigger crashes in friends of
push_heapandpop_heapby callingUniquePtr<Entry>::~UniquePtrorUniquePtr<Entry>::release()(in a hook library) on one or both of the parameters passed in toEntry::UniquePtrLessThan(). These all happen at address0x8. But I haven't been able to figure out a way to make this happen "in real life". -
I noticed that
TimerThread::mMonitoris released inTimerThread::PostTimerEvent()for a call tonsIEventTarget::Dispatch(). Aha! So maybe this is where thread contention can happen. But then (in my hook library) I forced all calls toTimerEvent::RemoveTimer()(on threads other than the "timer thread") to happen during calls tonsIEventTarget::Dispatch()(fromTimerThread::PostTimerEvent()) on the timer thread. I didn't see a single crash.
| Reporter | ||
Comment 91•5 years ago
|
||
FYI we keep finding more and more instances of crash reports that look like this one:
- apparently impossible to explain
- macOS-only
- stack appears to have been corrupted above the crashing frame
Bug 1478581 is one of them, bug 1658719 is another. I also found more crash signatures exhibiting the same behavior, see the first three ones in this search. I think we might be dealing with a specific problem.
The crash location often appears to be near a point where we make an allocation (nsTArray is often involved) so I wonder if it might be related to the memory allocator.
| Reporter | ||
Comment 92•5 years ago
|
||
I forgot something: the other thing that all the bugs I mentioned above seem to have in common is that we're accessing some dead object, the poison pattern comes up often either in the crash address, an index that lead to it or appears on the stack.
Comment 93•5 years ago
|
||
Gabriele, could you CC me on bug 1478581 and bug 1658719?
Updated•5 years ago
|
Updated•5 years ago
|
Comment 95•5 years ago
|
||
Looking at the results on ESR, I'm not sure we've made any real impact on this crash :(
| Reporter | ||
Comment 96•5 years ago
|
||
I've checked the other bugs and yes, this isn't fixed yet. What surprises me is that the volume didn't go down (or did it?) because on nightly/beta it did go down, at least for older macOS versions. Anyway I'll have to do more work on it next week.
Updated•5 years ago
|
Updated•5 years ago
|
Comment 97•5 years ago
|
||
Gabriele, since this is currently tracked for 85 Beta, I thought of bringing this back to your attention for any planned follow-up.
| Reporter | ||
Comment 98•5 years ago
|
||
Unfortunately this needs a thorough investigation in order to find a proper fix and I haven't had the time to do it yet.
| Reporter | ||
Comment 99•5 years ago
|
||
It is high on my priority list though, I'll try to fix this as soon as I put out the other fires in crash reporting & friends.
| Reporter | ||
Comment 100•5 years ago
|
||
I spent some time poring over these crashes again and there's something that suddenly jumped out: there are no crashes for macOS 11. I went through all the other crashes under bug 1676343 and there are no macOS 11 crashes under any of the signatures. The only crash report I could find for macOS 11 has a different stack and isn't an UAF.
So this is very, very likely to be a bug in macOS 10.15 and older. Either in the POSIX thread library where mutexes are implemented or in the kernel-level mutexes. Given that the memory allocator is involved - because we're allocating/deallocating objects on the spot where we crash - it might be worth mentioning that the locks we use there pass different options to the macOS kernel compared to the pthread mutexes; so it might as well be a bug in how those two types of kernel-level mutexes interact.
Updated•5 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 101•3 years ago
|
||
Did a signature change, or your change to locking on macOS, "fix" this? Crash rate plummeted.
| Reporter | ||
Comment 102•3 years ago
|
||
Looks like a signature change
Comment 103•3 years ago
|
||
It looks like there is only one crash from more recent versions than ESR78, which we aren't maintaining anymore (one crash from ESR102). Unless this starts spiking again, closing for now.
Updated•2 years ago
|
Description
•