Closed Bug 1654335 Opened 5 years ago Closed 3 years ago

Use-after-free crashes in timer code at [@ <name omitted> | std::__1::__sift_up<T>]

Categories

(Core :: XPCOM, defect)

x86_64
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr78 - wontfix
firefox79 --- wontfix
firefox80 --- wontfix
firefox81 --- wontfix
firefox82 --- wontfix
firefox83 + wontfix
firefox84 + wontfix
firefox85 + wontfix
firefox86 + wontfix

People

(Reporter: gsvelto, Unassigned)

References

Details

(Keywords: crash, csectype-uaf, sec-moderate, Whiteboard: [adv-esr78.5-])

Crash Data

This bug is for crash report bp-9eaa9de3-4856-4693-a241-5f5a90200721.

Top 10 frames of crashing thread:

0 XUL <name omitted> xpcom/threads/TimerThread.h:105
1 XUL void std::__1::__sift_up<bool  /builds/worker/fetches/clang/include/c++/v1/algorithm:4835
2 XUL TimerThread::AddTimerInternal xpcom/threads/TimerThread.cpp:638
3 XUL XUL@0x455bcf 
4 XUL TimerThread::AddTimer xpcom/threads/TimerThread.cpp:518
5 XUL nsTimerImpl::InitHighResolutionWithCallback xpcom/threads/nsTimerImpl.cpp:372
6 XUL <name omitted> xpcom/threads/nsTimerImpl.cpp:417
7 XUL NS_NewTimerWithCallback xpcom/threads/nsTimerImpl.cpp:93
8 XUL NS_DispatchToThreadQueue xpcom/threads/nsThreadUtils.cpp:412
9 XUL nsIScreenManager::COMTypeInfo<nsIScreenManager, void>::kIID 

This is an odd crash: it seems to be macOS-specific, yet it affects a piece of code that has very few platform-specific bits. It appears to affect only older versions of macOS, specifically 10.9, 10.10 and 10.11. It seems to have appeared with 78 ESR and, for some reason, does not affect the non-ESR versions.

The crash itself is worrisome in that it's clearly a use-after-free access and the stack has been smashed (there's garbage on the stack below the second frame).

[Tracking Requested - why for this release]: This looks really bad. NI all the usual macOS suspects.

It's not surprising that this is only showing up on ESR: macOS 10.9-10.11 users were purposely migrated from release to ESR in conjunction with raising the minimum supported version to 10.12 for Fx79+.

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(mstange.moz)
Flags: needinfo?(haftandilian)

More complete stack:

mozilla::TimeStamp63Bit::operator unsigned long long() const at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h:53
 (inlined by) mozilla::TimeStamp::operator<(mozilla::TimeStamp const&) const at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h:543
 (inlined by) TimerThread::Entry::UniquePtrLessThan(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.h:105
void std::__1::__sift_up<bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), std::__1::iterator_traits<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >::difference_type) at /builds/worker/fetches/clang/bin/../include/c++/v1/algorithm:4835
void std::__1::push_heap<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&)>(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&)) at /builds/worker/fetches/clang/bin/../include/c++/v1/algorithm:4858
 (inlined by) TimerThread::AddTimerInternal(nsTimerImpl*) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:638
TimerThread::FindNextFireTimeForCurrentThread(mozilla::TimeStamp, unsigned int) at ??:?
TimerThread::AddTimer(nsTimerImpl*) at /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:518
nsTimerImpl::InitHighResolutionWithCallback(nsITimerCallback*, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:372
<broken frame>
nsTimerImpl::InitWithCallback(nsITimerCallback*, unsigned int, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:356
 (inlined by) nsTimer::InitWithCallback(nsITimerCallback*, unsigned int, unsigned int) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.h:235
 (inlined by) NS_NewTimerWithCallback(nsITimer**, nsITimerCallback*, unsigned int, unsigned int, nsIEventTarget*) at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:93
NS_DispatchToThreadQueue(already_AddRefed<nsIRunnable>&&, unsigned int, nsIThread*, mozilla::EventQueuePriority) at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:412

I don't know much yet.

We're crashing when accessing a field at the beginning of a TimeStamp value, and the TimeStamp is the first field of an Entry object. Each Entry is separately heap-allocated using MakeUnique, and the UniquePtrs are stored in an nsTArray. All of the UniquePtrs in the mTimers array should be non-null.
The crash addresses are mostly 0x0, but there are also some 0x9s, some 0x8s, and some 0xffffffffe5e5e508.
So the contents of the nsTArray must have gone bad: the UniquePtrs are pointing to bad addresses.

But why? And why only on old versions of macOS? And only on ESR78... (edit: never mind, the ESR part is explained in comment 1.)

Flags: needinfo?(mstange.moz)

Here's a somewhat similar crash from Fennec 68:
https://crash-stats.mozilla.org/report/index/654251b0-8ba8-4e4a-b6c4-977dc0200719

I am trying to find crashes from Windows and Linux that have "TimerThread" in their stack frames but I don't know how to search for that.

(In reply to Markus Stange [:mstange] from comment #5)

I am trying to find crashes from Windows and Linux that have "TimerThread" in their stack frames but I don't know how to search for that.

protosignature contains

The crashes that are apparently crashing on the 0x0 address aren't really crashing there; they're crashing on the poison pattern 0xe5e5e5e5e5e5e5e5, but because of bug 1493342 they're being reported as NULL. So in most cases this is a UAF access. If you open up the "Raw Data and Minidumps" tab in those reports you'll often find that the rcx register contains the poison pattern and we're likely crashing when accessing it.

To look for crashes that contain TimerThread code (or any other function) in the stack you can use the "proto signature" field. It contains the function names of the first 10 frames on the stack of every crash IIRC:

https://crash-stats.mozilla.org/search/?proto_signature=~TimerThread

There are at least three more signatures that are certainly related. I'll add them.

Crash Signature: [@ <name omitted> | std::__1::__sift_up<T>] → [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal]

Thanks. Those other similar signatures are all mac-only, too.

I don't know where to go from here.

Maybe there's an unsynchronized access to mTimers from a different thread? I see mutexes being used consistently in TimerThread, so that seems unlikely.
Even more unlikely ideas: maybe __sift_up or push/pop_heap on macOS has a bug where it accesses memory outside of the range we give it? Or maybe a compiler bug miscompiles __sift_up or push/pop_heap so that it accesses outside of that range?

I'll open a minidump and look at the stack. Since one of the frames is corrupted, it's possible that the real cause of the crash lies there: some object or variable on the stack gets corrupted, and then we use it as an index into the array, which leads to the UAF access.

Byron, as somebody who has looked at some similar-sounding timer issues in the past, does this look like any existing issues we've had before? Thanks.

Group: dom-core-security
Flags: needinfo?(docfaraday)
Summary: Crash in [@ <name omitted> | std::__1::__sift_up<T>] → Use-after-free crashes in timer code at [@ <name omitted> | std::__1::__sift_up<T>]

I'm not sure if this should be sec-high or sec-moderate.

Keywords: sec-high

(In reply to Nathan Froyd [:froydnj] from comment #14)

The signature is almost exactly like https://bugzilla.mozilla.org/show_bug.cgi?id=1594016

The stacks there also show signs of smashing after the second frame.

There are no useful user comments in the reports, but URLs from these three sites come up more than a dozen times:

https://www.sfgate.com/
https://torontosun.com
https://www.mysanantonio.com/

I don't have the right hardware but if we could find someone who does we might try to repro.

I'm running ESR 78 on 10.15.6 now. I've clicked around on sfgate.com for 5 minutes now, no crash so far.

We've had races in that array before, but we went to considerable effort to lock it down. I do not see any place where we access that array (or any element within it) without the appropriate protection. The possibilities I can think of are:

  1. Misuse of the mTimers array somehow. Things like iterator invalidation.
  2. Random corruption; some unrelated code has stomped on mTimers' memory, due to a UAF bug or similar.
  3. Stack corruption.

Is it possible the MakeUnique is failing here, causing an empty UniquePtr to be pushed?

https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/xpcom/threads/TimerThread.cpp#638

(In reply to Byron Campen [:bwc] from comment #19)

Is it possible the MakeUnique is failing here, causing an empty UniquePtr to be pushed?

https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/xpcom/threads/TimerThread.cpp#638

No, MakeUnique allocates infallibly, so we should crash if it fails.

Not sure if this is related, but this looks super fishy to me:

https://searchfox.org/mozilla-central/rev/3b6958c26049c1e27b2790a43154caaba9f6dd4a/mozglue/misc/TimeStamp.h#46-53

Seems to me that we can have a case where none of the following are true: a < b, b < a, b == a (because a and b both have the same mTimeStamp, but different values in mUsedCanonicalNow). I don't know if push_heap/pop_heap are going to care about this, but still yuck.

Also, I seem to have missed the fact that we converted over to using a heap for this timer stuff; stl heaps are not necessarily stable, and I think we still have code that expects two timers scheduled with the same delay on the same thread to fire in the order they were scheduled.

It is also interesting that all of these problems originate with ChromeUtils::IdleDispatch, and not with other uses of the timer API (AFAICT). Might be a hint.

Flags: needinfo?(docfaraday)

(In reply to Markus Stange [:mstange] from comment #17)

I'm running ESR 78 on 10.15.6 now. I've clicked around on sfgate.com for 5 minutes now, no crash so far.

This seems to be happening only on 10.9, 10.10 and 10.11. Given the platform-specific nature of this crash we probably have to look for something that might cause issues up the stack, or before we got to the point of the crash. However, there's not much in the way of platform-specific code in the first few frames: we grab the lock, get a timestamp (which uses mach_absolute_time()) and do some logging. None of those seem like they might be causing issues.

I noticed https://torontosun.com/news has 4 crashes (in various different articles), which may help in reproing.

Flags: needinfo?(spohl.mozilla.bugs)

Seeing a bunch of crashes in 77, but a lot fewer in 78. Did the signature migrate? Did we fix something?

Keywords: sec-moderate

The crash mostly affects macOS 10.9-10.11 users, who were migrated to ESR78 (where this is still happening with high frequency).

Looking at the "crash data" chart, the number of these crashes increased dramatically between 2020-07-05 and 2020-07-08. Does anyone have an explanation for that?

Markus Stange just CCed me on this bug. And yes, it's weird and interesting. But the crash stacks provide no macOS-specific clues, so my expertise and tools are unlikely to help here.

As it happens, I have a working macOS 10.10.5 partition. I've been testing with the URLs from comment #16, but so far I haven't seen any crashes.

Also, given the number of crashes for at least the std::__1::__sift_up<T> signature, it's surprising it doesn't show up in the list of Mac topcrashes. Has it been deliberately kept out of that list because this is a security bug?

(In reply to Steven Michaud [:smichaud] (Retired) from comment #29)

Has it been deliberately kept out of that list because this is a security bug?

I don't think that's something we have set up.

I vaguely remember that at least some Firefox code sets the stack size. (My memory is too vague to be able to find it again.) Is it possible that we're just running out of stack space, and that changing that setting (or settings) would make a difference here?

I can't explain why we'd be more likely to run out of stack space on older versions of macOS, though.

I just looked at the assembly code for std::__1::__sift_up<T> (in today's mozilla-central nightly), and the only external method it calls directly is free().

Could this be a jemalloc bug? Does jemalloc behave differently on macOS versions older than 10.12?

assembly code for std::__1::__sift_up<T>

Its implementation for TimerThread::Entry.

Actually, I'm not sure the Mozilla codebase is still using jemalloc:

https://bugzilla.mozilla.org/show_bug.cgi?id=1363992

(In reply to Steven Michaud [:smichaud] (Retired) from comment #34)

Actually, I'm not sure the Mozilla codebase is still using jemalloc:

We're using jemalloc 3.

Well, I guess more that we use "mozjemalloc", which is a fork of jemalloc 3.

I just looked at the assembly code for std::__1::__sift_up<T> (in today's mozilla-central nightly), and the only external method it calls directly is free().

Likewise for std::__1::__pop_heap<T>

Testing with a recent mozilla-central nightly on macOS 10.15.6 and OS X 10.10.5 (and a HookCase hook library that hooks free(), std::__1::__sift_up<T> and std::__1::__pop_heap<T>), it appears that free() is never called from either of those TimerThread::Entry template methods in XUL, at least during ordinary circumstances. That's probably as it should be.

I wonder about non-ordinary circumstances, but I haven't yet figured out how to test for them.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #38)

Testing with a recent mozilla-central nightly on macOS 10.15.6 and OS X 10.10.5 (and a HookCase hook library that hooks free(), std::__1::__sift_up<T> and std::__1::__pop_heap<T>), it appears that free() is never called from either of those TimerThread::Entry template methods in XUL, at least during ordinary circumstances. That's probably as it should be.

Could it be that the issue isn't with the std::__1::__sift_up<T> and std::__1::__pop_heap<T> calls themselves, but rather with something that gets called earlier and corrupts the stack?

Yes, it could. Whatever the problem, though, it isn't easy to reproduce (as I found testing on OS X 10.10.5). At some point I'll try digging deeper into Mozilla's stl heap code (since that seems likely to be the source of the trouble). But I currently know nothing about it, and it will be a while before I have the time. I'll also keep comment #21 in mind.

I'm starting to see crashes in the nightly channel on macOS 10.15; see this one for example. This is worrisome: I don't think we can write this off as an issue with older versions of macOS, since it appears to affect newer ones as well.

I've started digging further into this, and am puzzled that std::__1::__sift_up<T> and std::__1::__pop_heap<T> (where T == TimerThread::Entry) ever call free(). After all, std::push_heap() and std::pop_heap() are intended not to add or subtract items from their "heap" -- just to change its order.

In my tests (comment #38) I never found free() actually being called from either of these two template methods. But you'd think it shouldn't even be possible. Each call to free() comes just after a call to RefPtr<nsTimerImpl>::~RefPtr() on the same pointer (in %rdi).

As best I can tell, free() (and RefPtr<nsTimerImpl>::~RefPtr()) would no longer be called if we could get the clang compiler to treat a TimerThread::Entry object as "large" rather than "small".

In ~/.mozbuild/clang/include/c++/any, the __move() method for struct _LIBCPP_TEMPLATE_VIS _SmallHandler calls __destroy(), but the __move() method for struct _LIBCPP_TEMPLATE_VIS _LargeHandler doesn't. Any suggestions for how to fool the clang compiler into treating a TimerThread::Entry object as "large" instead of "small"?

I actually just hit this crash. bp-b7a07adc-5b27-4ec8-aef6-2335d0200814

I had this article open from a news site: https://www.wweek.com/news/2020/08/13/united-states-postal-service-confirmed-it-has-removed-mailboxes-in-portland-and-eugene/?fbclid=IwAR2oZrYxdHdSAAKiqwmmpWsZi3pBjjroWXEWhHg7x73yBNbGEFhRSJ3qjow

I'm also four and a half minutes into a Firefox build on my machine while watching a video in the background, so my machine is under heavy load, if that matters.

Hmm it just crashed again on the same page: bp-b7c75792-bf05-41f6-9745-391d80200814

(In reply to Andrew McCreight [:mccr8] from comment #45)

Hmm it just crashed again on the same page: bp-b7c75792-bf05-41f6-9745-391d80200814

Was your machine still under load? Could this be a race of some kind (though I'd be surprised if it is given that this code seems to be properly guarded by mutexes).

(In reply to Gabriele Svelto [:gsvelto] from comment #46)

Was your machine still under load? Could this be a race of some kind (though I'd be surprised if it is given that this code seems to be properly guarded by mutexes).

Yes, my build was still going.

Andrew, do you have any non-standard settings?

I'm now doing a Firefox build, and have been playing with your URL from comment #44 in both FF 79.0 and today's mozilla-central nightly. No crashes. I'm running macOS 10.15.6 build 19G73 on a Mid-2015 15" MacBook Pro with 16GB of RAM.

I'm on Firefox 81. I have Fission enabled (set fission.autostart to true on a Nightly build). I'm not running any kind of ad blocker which might qualify as "non-standard". I'm on macOS 10.15.5 on a 2019 MacBookPro with 64GB of RAM. (I didn't realize until just now how much RAM I have on this computer...)

I tried again, with a build going, and it didn't crash, so maybe that was just a fluke... twice.

As of the 2020-08-12-03-44-18 mozilla-central nightly, almost all non-exported symbols are stripped from XUL (including the ones I've been working with). Anyone know what's up with that?

Maybe glandium knows.

Flags: needinfo?(mh+mozilla)

bug 1647780 did that. You can download the dSYMs or the breakpad symbols from the symbol server.

Flags: needinfo?(mh+mozilla)

bug 1647780 did that. You can download the dSYMs or the breakpad symbols from the symbol server.

Thanks for letting me know.

I take it that operations on TimerThread::mTimers from Mozilla code are properly protected by locks. Can the same be said for operations on mTimers using std::push_heap() and std::pop_heap()? They do, after all, take place on different threads.

Ugh. Never mind, it seems that I was led on a wild goose chase by llvm-objdump. While I was looking at the disassembly, it blindly ignored the dSYM file I had provided, so NS_NewLocalFileWithCFURL is just a big blob of code which can contain anything and everything. Scratch my previous comments; they're most likely wrong.

I just found an interesting stack trace, much more complete than most of those we've seen. Unlike most of our crashes, this seems to be one in the main (firefox) process, not the content process:

https://crash-stats.mozilla.org/signature/?signature=nsTArray_base%3CT%3E%3A%3ASwapArrayElements%3CT%3E%20%7C%20mozilla%3A%3Adetail%3A%3AMutexImpl%3A%3A~MutexImpl%20%7C%20XUL%400x9b8ecf%20%7C%20mozilla%3A%3Adetail%3A%3AMutexImpl%3A%3Aunlock%20%7C%20MessageLoop%3A%3APostTask%20%7C%20PR_GetCurrentThread%20%7C%20%3Cname%20omitted%3E%20%7C%20%3Cname%20omitted%3E%20%7C%20std%3A%3A__1%3A%3A__sift_up%3CT%3E&date=%3E%3D2020-08-10T15%3A39%3A00.000Z&date=%3C2020-08-17T15%3A39%3A00.000Z&_sort=-date

    0  XUL  nsTArrayInfallibleAllocator::ResultTypeProxy nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>::SwapArrayElements<nsTArrayInfallibleAllocator, nsTArrayInfallibleAllocator>(nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_RelocateUsingMemutils>&, unsigned long, unsigned long)  xpcom/ds/nsTArray-inl.h:416  context
    1  libmozglue.dylib  mozilla::detail::MutexImpl::~MutexImpl()  mozglue/misc/Mutex_posix.cpp:111  scan
    2  XUL  XUL@0x9b8ecf   scan
    3  libmozglue.dylib  mozilla::detail::MutexImpl::unlock()  mozglue/misc/Mutex_posix.cpp:178  scan
    4  XUL  MessageLoop::PostTask(already_AddRefed<nsIRunnable>)  ipc/chromium/src/base/message_loop.cc:346  scan
    5  libnss3.dylib  PR_GetCurrentThread  nsprpub/pr/src/pthreads/ptthread.c:640  scan
    6  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    7  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    8  XUL  void std::__1::__sift_up<bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >(mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > >, bool (*&)(mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&), std::__1::iterator_traits<mozilla::ArrayIterator<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> >&, nsTArray<mozilla::UniquePtr<TimerThread::Entry, mozilla::DefaultDelete<TimerThread::Entry> > > > >::difference_type)  /builds/worker/fetches/clang/include/c++/v1/algorithm:4835  scan
    9  XUL  XUL@0x9fb87f   scan
    10  XUL  nsTHashtable<nsBaseHashtableET<nsDepCharHashKey, mozilla::ipc::ChannelCountReporter::ChannelCounts> >::s_MatchEntry(PLDHashEntryHdr const*, void const*)  xpcom/ds/nsTHashtable.h:494  scan
    11  XUL  XUL@0x3f295f   scan
    12  XUL  <name omitted>  xpcom/ds/PLDHashTable.cpp:567  scan
    13  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    14  XUL  .str.33.llvm.9700260797439940092   scan
    15  libmozglue.dylib  <name omitted>  memory/build/malloc_decls.h:51  scan
    16  libmozglue.dylib  mozilla::detail::MutexImpl::lock()  mozglue/misc/Mutex_posix.cpp:156  scan
    17  XUL  mozilla::ipc::(anonymous namespace)::ActivateAndCleanupIPCStream(mozilla::ipc::IPCStream&, bool, bool)  ipc/glue/IPCStreamUtils.cpp:222  scan
    18  XUL  mozilla::ipc::AutoIPCStream::~AutoIPCStream()  ipc/glue/IPCStreamUtils.cpp:0  scan
    19  XUL  detail::ProxyReleaseEvent<mozilla::ipc::HoldIPCStream>::Run()  xpcom/threads/nsProxyRelease.h:34  scan
    20  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1211  scan
    21  XUL  nsTimerImpl::InitCommon(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&, unsigned int, nsTimerImpl::Callback&&)  xpcom/threads/nsTimerImpl.cpp:310  scan
    22  XUL  .str.82.llvm.13599766915509620700   scan
    23  XUL  XUL@0x29b4abf   scan
    24  XUL  PLDHashTable::Remove(void const*)  xpcom/ds/PLDHashTable.cpp:594  scan
    25  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    26  XUL  NS_ProcessNextEvent(nsIThread*, bool)  xpcom/threads/nsThreadUtils.cpp:501  scan
    27  XUL  .str.379.llvm.611105640701535871   scan
    28  XUL  mozilla::dom::ContentParent::Observe(nsISupports*, char const*, char16_t const*)  dom/ipc/ContentParent.cpp:2909  scan
    29  XUL  NS_TableDrivenQI(void*, nsID const&, void**, QITableEntry const*)  xpcom/base/nsISupportsImpl.cpp:22  scan
    30  XUL  FifoWatcher::kPrefName   scan
    31  XUL  nsTHashtable<nsObserverList>::s_MatchEntry(PLDHashEntryHdr const*, void const*)  xpcom/ds/nsTHashtable.h:494  scan
    32  XUL  nsMaybeWeakPtrArray<nsIObserver>::RemoveWeakElement(nsIObserver*)  xpcom/base/nsMaybeWeakPtr.h:95  scan
    33  XUL  .str.7.llvm.6184990430567596100   scan
    34  libmozglue.dylib  free  memory/build/malloc_decls.h:54  scan
    35  libmozglue.dylib  free  memory/build/malloc_decls.h:54  scan
    36  XUL  .str.379.llvm.611105640701535871   scan
    37  XUL  .str.379.llvm.611105640701535871   scan
    38  XUL  .str.379.llvm.611105640701535871   scan
    39  XUL  .str.379.llvm.611105640701535871   scan
    40  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    41  XUL  .str.379.llvm.611105640701535871   scan
    42  XUL  nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*)  xpcom/ds/nsObserverList.cpp:65  scan
    43  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    44  XUL  nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*)  xpcom/ds/nsObserverService.cpp:288  scan
    45  XUL  mozilla::AppWindow::Destroy()  xpfe/appshell/AppWindow.cpp:669  scan
    46  XUL  .str.379.llvm.611105640701535871   scan
    47  XUL  .str.19.llvm.2081317860475493682   scan
    48  firefox  firefox@0x2ce2   scan
    49  XUL  nsXREDirProvider::DoStartup()::kCrashed   scan
    50  XUL  nsXREDirProvider::DoShutdown()  toolkit/xre/nsXREDirProvider.cpp:1039  scan
    51  XUL  ScopedXPCOMStartup::~ScopedXPCOMStartup()  toolkit/xre/nsAppRunner.cpp:1273  scan
    52  XUL  XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)  toolkit/xre/nsAppRunner.cpp:4829  scan
    53  firefox  firefox@0x2ce2   scan
    54  XUL  XRE_main(int, char**, mozilla::BootstrapConfig const&)  toolkit/xre/nsAppRunner.cpp:4866  scan
    55  QD  GCC_except_table125   scan
    56  libsystem_c.dylib  __tolower   scan
    57  firefox  firefox@0x1a83   scan
    58  firefox  firefox@0x2cea   scan
    59  firefox  firefox@0x2ce2   scan

Comment #58's crash seems to be happening during shutdown. I suspect that's not true, though, of most of the other crashes (the content process ones).

Gabriele, you were on the right track. As best I can tell, the std::__1::__sift_up<T> crashes don't happen there.

Here's a crash report from the 20200811091731 mozilla-central nightly (the last one that still has non-external symbols in XUL):

https://crash-stats.mozilla.org/report/index/4b428e5b-aaec-4666-b41c-ce1260200811

The "Raw Data and Minidumps" tab says the module offset (in XUL) is 0x4a3110. When I do nm -pam XUL | less and then search on "sift_up", I find that this offset is past the end of std::__1::__sift_up<T>. In fact it's the address of nsTArray_Impl<RefPtr<mozilla::MozPromise<CopyableTArray<bool>, bool, false>::ThenValueBase>, nsTArrayInfallibleAllocator>::Clear().

The "function offset" (0xa0) does fit inside this function, but I'm not sure it's correct -- 0x4a3110 + 0xa0 == 0x4a31b0, which isn't at an instruction boundary.

In any case, something is seriously wrong with the symbolication code used by https://crash-stats.mozilla.org/.

In any case, something is seriously wrong with the symbolication code used by https://crash-stats.mozilla.org/.

Or maybe with the code that creates minidumps, or with the copy of dump_syms currently used by Mozilla to generate symbol files. Has either of those changed in the last few months? Say around 2020-07-05, when the number of this bug's crashes in the "crash data" chart started increasing dramatically, as I mentioned in comment #28?

We changed the symbol dumping tool in bug 1654845, but that landed 2020-07-25, so way outside your target date.

Crash reports for std::__1::__sift_up<T> do exist, in much smaller numbers, for dates earlier than 2020-07-05. Their crash stacks are much higher quality. But for some reason only a vanishingly small number happened in mozilla-central (alpha) builds (the only ones with symbols unstripped):

https://crash-stats.mozilla.org/search/?signature=~sift_up&date=%3E%3D2020-02-04T00%3A00%3A00.000Z&date=%3C2020-07-04T00%3A00%3A00.000Z&_facets=signature&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

I did check up on one of them, which happened in the 20200221095110 build:
https://crash-stats.mozilla.org/report/index/40552d7a-3a18-4c12-9d79-8ff730200328#tab-details

For this one, the "module offset" is inside std::__1::__sift_up<T>, at a location that makes sense (just after movq 0x8(%rax), %rax, which dereferences %rax).

At the end of all this, I think we'll find out that we need to back out the patch for bug 1647780. Yes, Mozilla's symbol handling code seems to make non-external symbols redundant. But without them we have no way to check that the symbol handling code is working correctly.

(Following up comment #60 and comment #61)

Oops, looks like I messed up, too. There are two mozilla-central nightlies for 2020-08-11: 20200811091731 and 20200811214738. I thought I had a copy of the former installed locally, but it turns out I had a copy of the latter. Once I installed 20200811091731, the problem I reported went away. On it, the offset 0x4a3110 is well within std::__1::__sift_up<T> (where T == TimerThread::Entry).

Sigh. Sorry for the confusion. Back to square one.

The definition of __sift_up() in <algorithm> (in ~/.mozbuild/clang/include/c++/v1/algorithm) is as follows:

    template <class _Compare, class _RandomAccessIterator>
    void
    __sift_up(_RandomAccessIterator __first, _RandomAccessIterator __last, _Compare __comp,
              typename iterator_traits<_RandomAccessIterator>::difference_type __len)
    {
        typedef typename iterator_traits<_RandomAccessIterator>::value_type value_type;
        if (__len > 1)
        {
            __len = (__len - 2) / 2;
            _RandomAccessIterator __ptr = __first + __len;
            if (__comp(*__ptr, *--__last))
            {
                value_type __t(_VSTD::move(*__last));
                do
                {
                    *__last = _VSTD::move(*__ptr);
                    __last = __ptr;
                    if (__len == 0)
                        break;
                    __len = (__len - 1) / 2;
                    __ptr = __first + __len;
                } while (__comp(*__ptr, __t));
                *__last = _VSTD::move(__t);
            }
        }
    }

Crash reports indicate the crashes take place at line 4835, which is

            if (__comp(*__ptr, *--__last))

Thinking it might be one way to trigger this bug's crashes, I used a HookCase hook library to increase the value of __len I passed to the original function. This did trigger crashes (if __len got large enough), but the signature was [@ InvalidArrayIndex_CRASH | std::__1::__sift_up<T> ]. This seems to rule out that an incorrect value for __len is being passed to __sift_up().

Flags: needinfo?(haftandilian)

(In response to comment #21)

I used a HookCase hook library to hook TimerThread::UniquePtrLessThan() and make it always return false or always return true when running in the content process. Neither causes crashes (though making it always return false prevents any content from being displayed). I think this rules out that the problem reported in comment #21 triggers/causes this bug's crashes.

(In reply to Steven Michaud [:smichaud] (Retired) from comment #58)

I just found an interesting stack trace, much more complete than most of those we've seen. Unlike most of our crashes, this seems to be one in the main (firefox) process, not the content process:

Apart from the first frame, all the others were found via stack scanning (see the Trust column), so they might be wrong.

Could someone CC me on bug 1660297? I currently don't have permission to view it.

Done.

I don't know if it is related, but bug 1660300 is about a shutdown hang in a method that creates a new timer.

After considerable digging around, I've managed to write a HookCase hook library that logs a stack trace whenever an nsTimerImpl object that has been manipulated by std::__1::__sift_up<T> or std::__1::__pop_heap<T> (i.e. one that has been stored in the TimerThread::mTimers array) is destroyed (when nsTimerImpl::~nsTimerImpl() is called on it). My intent is to find out what kinds of nsTimerImpl objects can end up in this array, and to learn something about their life cycles. That could be useful in deciphering non-reproducible crash bugs like the ones reported here.

But I see many different kinds of stack traces, and am having trouble learning anything useful from them. Does anyone have anything to say about the life cycles of nsTimerImpl objects stored in the TimerThread::mTimers array?

Not all nsTimerImpl objects end up in the TimerThread::mTimers array, but a lot of them do.

I suspect they're all related to this bug.

I suspect many of them are related to this bug.

Crash Signature: [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] → [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] [@ TimerThread::Entry::UniquePtrLessThan ]

Adding media bugs that appear related to the see-also list. These bugs have different signatures, but they share the sharp spike in rates on ESR78, the macOS specificity, and the involvement of dangling pointers (in some cases the pointers are reliably poisoned).

See Also: → 1660368, 1584956

Bryce, could you CC me on those bugs?

Flags: needinfo?(bvandyk)

(In reply to Steven Michaud [:smichaud] (Retired) from comment #79)

Bryce, could you CC me on those bugs?

Done.

Flags: needinfo?(bvandyk)

Thanks!

The first thing that stands out is that the crashes for bug 1660368 and bug 1584956 started increasing substantially on or just after 2020-07-05, just like this bug. I still have no idea as to why.

(Following up comment #81)

Likewise for bug 1659938.

It looks like the early July increase in crashes is just an artifact -- that's when Mac ESR crashes stopped being throttled. See bug 1659938 comment 6.

Or possibly it's due to a large increase in the number of Mac ESR users, since that's when Mozilla de-supported OS X versions older than 10.12. Either way, though, the increase in crashes doesn't seem to have been caused by any contemporaneous change on the ESR branch.

I'm unfamiliar with the sampling used for the crash stats, so if anyone is able to point me to an explanation I'd appreciate it.

My current understanding is that ESR, unlike other channels, is not throttled. Assuming N crashes on release and N crashes on ESR, more reports will show up for ESR even though the same number occurred. This means that with the mass migration to ESR, various bugs appear to show a significant increase in crash numbers.

If that is the case, then the see-also media bugs that I assumed were related based on the increased crash rates and the heavy macOS skew may not be related after all. I.e. there is no common underlying bug; the reporting change just bumped all the numbers at the same time.

I'm going to leave the see-alsos in place while I make sure the above is correct, but if it is then it sounds like they should be removed.

Will, do we throttle crashes coming from ESR channels differently than release? See comment 85 for more details.

Flags: needinfo?(willkg)

Firefox release is sampled at 10%. All other apps/channels are unthrottled, as are reports submitted manually from about:crashes.

https://github.com/mozilla-services/antenna/blob/main/antenna/throttler.py#L389

Thanks!

Flags: needinfo?(willkg)

Because of throttling and a variety of other issues with the crash report data, I highly discourage people from doing analysis on counts of crash reports. The crash ping data in Telemetry can be normalized by usage numbers and is better for comparing counts. I discussed this here: https://bluesock.org/~willkg/blog/mozilla/crash_pings_crash_reports.html

I've been banging my head against this bug for a while, but it's now defeated me, at least for the time being. I'm going to put it aside, and maybe come back later with a fresh mind.

I discovered two interesting things along the way:

  1. I can trigger crashes in friends of push_heap and pop_heap by calling UniquePtr<Entry>::~UniquePtr or UniquePtr<Entry>::release() (in a hook library) on one or both of the parameters passed to Entry::UniquePtrLessThan(). These crashes all happen at address 0x8. But I haven't been able to figure out a way to make this happen "in real life".

  2. I noticed that TimerThread::mMonitor is released in TimerThread::PostTimerEvent() for a call to nsIEventTarget::Dispatch(). Aha! So maybe this is where thread contention can happen. But then (in my hook library) I forced all calls to TimerEvent::RemoveTimer() (on threads other than the "timer thread") to happen during calls to nsIEventTarget::Dispatch() (from TimerThread::PostTimerEvent()) on the timer thread. I didn't see a single crash.

FYI we keep finding more and more instances of crash reports that look like this one:

  • apparently impossible to explain
  • macOS-only
  • stack appears to have been corrupted above the crashing frame

Bug 1478581 is one of them, bug 1658719 is another. I also found more crash signatures exhibiting the same behavior; see the first three in this search. I think we might be dealing with a specific problem.

The crash location often appears to be near a point where we make an allocation (nsTArray is often involved) so I wonder if it might be related to the memory allocator.

I forgot something: the other thing that all the bugs I mentioned above seem to have in common is that we're accessing some dead object. The poison pattern comes up often, either in the crash address, in an index that led to it, or on the stack.

Gabriele, could you CC me on bug 1478581 and bug 1658719?

Flags: needinfo?(gsvelto)

Done!

Flags: needinfo?(gsvelto)
See Also: → 1665411
Depends on: 1676343
Whiteboard: [adv-esr78.5-]

Looking at the results on ESR, I'm not sure we've made any real impact on this crash :(

Flags: needinfo?(gsvelto)

I've checked the other bugs and yes, this isn't fixed yet. What surprises me is that the volume didn't go down (or did it?), because on nightly/beta it did go down, at least for older macOS versions. Anyway, I'll have to do more work on it next week.

Flags: needinfo?(gsvelto)

Gabriele, since this is currently tracked for 85 Beta, I thought of bringing this back to your attention for any planned follow-up.

Flags: needinfo?(gsvelto)

Unfortunately this needs a thorough investigation in order to find a proper fix and I haven't had the time to do it yet.

Flags: needinfo?(gsvelto)

It is high on my priority list though, I'll try to fix this as soon as I put out the other fires in crash reporting & friends.

I spent some time poring over these crashes again and something suddenly jumped out: there are no crashes for macOS 11. I went through all the other crashes under bug 1676343 and there are no macOS 11 crashes under any of the signatures. The only crash report I could find for macOS 11 has a different stack and isn't a UAF.

So this is very, very likely to be a bug in macOS 10.15 and older, either in the POSIX thread library where mutexes are implemented or in the kernel-level mutexes. Given that the memory allocator is involved (we're allocating/deallocating objects at the spot where we crash), it's worth mentioning that the locks we use there pass different options to the macOS kernel than the pthread mutexes do, so it could also be a bug in how those two types of kernel-level mutexes interact.

Did a signature change, or your change to locking on macOS, "fix" this? Crash rate plummeted.

Flags: needinfo?(gsvelto)

Looks like a signature change.

Crash Signature: [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] [@ TimerThread::Entry::UniquePtrLessThan ] → [@ <name omitted> | std::__1::__sift_up<T>] [@ std::__1::__pop_heap<T>] [@ std::__1::__sift_up<T>] [@ TimerThread::AddTimerInternal] [@ TimerThread::Entry::UniquePtrLessThan] [@ <name omitted> | TimerThread::AddTimerInternal]
Flags: needinfo?(gsvelto)

It looks like there is only one crash from versions more recent than ESR78 (which we aren't maintaining anymore): a single crash from ESR102. Unless this starts spiking again, I'm closing this for now.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
Group: dom-core-security
You need to log in before you can comment on or make changes to this bug.