Closed Bug 1594016 Opened 5 years ago Closed 11 months ago

Crash in [@ TimerThread::Entry::UniquePtrLessThan]

Categories

(Core :: XPCOM, defect)

71 Branch
Desktop
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox70 --- fix-optional
firefox71 --- wontfix
firefox72 --- ?

People

(Reporter: marcia, Unassigned)

Details

(Keywords: crash, regression, regressionwindow-wanted, Whiteboard: qa-not-actionable)

Crash Data

This bug is for crash report bp-b608c509-3575-4f2e-af8e-db6f10191105.

Seen while looking at release crash stats: https://bit.ly/32kAt8M. #4 top macOS crash on release. Seen in previous releases. Some comments mention crashing while using Yahoo mail.

Fairly high percentage of 10.11 users seen in correlations:

(77.98% in signature vs 00.49% overall) reason = EXC_BAD_ACCESS / EXC_I386_GPFLT
(94.64% in signature vs 00.71% overall) Module "AppleGVA" = true [97.75% vs 24.11% if platform_pretty_version = OS X 10.11]
(77.98% in signature vs 07.45% overall) address = 0x0
(91.67% in signature vs 00.75% overall) Module "AudioCodecs" = true [94.38% vs 24.48% if platform_pretty_version = OS X 10.11]

Top 10 frames of crashing thread:

0 XUL TimerThread::Entry::UniquePtrLessThan xpcom/threads/TimerThread.h:105
1 XUL void std::__1::__sift_up<bool  /builds/worker/fetches/clang/include/c++/v1/algorithm:4778
2 XUL TimerThread::AddTimer xpcom/threads/TimerThread.cpp:527
3 XUL XUL@0x1a115f 
4 XUL nsTimerImpl::InitHighResolutionWithCallback xpcom/threads/nsTimerImpl.cpp:372
5 libmozglue.dylib libmozglue.dylib@0xd4ff 
6 XUL NS_NewTimerWithCallback xpcom/threads/nsTimerImpl.cpp:92
7 XUL NS_DispatchToThreadQueue xpcom/threads/nsThreadUtils.cpp:397
8 XUL mozilla::dom::ChromeUtils::IdleDispatch dom/base/ChromeUtils.cpp:396
9 XUL mozilla::dom::ChromeUtils_Binding::idleDispatch dom/bindings/ChromeUtilsBinding.cpp:4825

This kind of reminds me of bug 1498610, although in this case we're looking at null pointers. I'm not sure how it's possible to have a null timer in the global list unless we somehow have two threads operating on the timer list. The stacks are a bit sketchy but it almost looks like this is a re-entrant AddTimer call.

Nathan can you help move this in the right direction?

Flags: needinfo?(nfroyd)

The stacks on all of these crashes are garbage. I don't understand why the stackwalker resorts to stack scanning (which is apparently quite unreliable, reliably); we should have all the information we need in the symbols file... I don't believe that we have reentrant AddTimer calls, because the stacks aren't reliable, AddTimer doesn't call anything that would reentrantly call AddTimer, and AddTimer locks (non-reentrantly, although I suppose that the implementation is not required to detect that case, so we could be running into it...) anyway.

Anyway, I don't understand how we get to this situation. The only times we ever touch the mTimers array, we own a lock, so the thread in question has exclusive access. We only ever add non-nullptr UniquePtr<Entry> elements to mTimers, and we never null out the elements of the array without also removing those same elements from the array.

We could be accessing an empty array, somehow; an empty array's elements are represented by the null pointer, and trying to index off of that is going to exhibit the symptoms here. But nsTArray's bounds checking should catch this case: any index would be >= 0 (nsTArray indices are unsigned), which would be greater than the empty array's length, which would crash. Unless the compiler is somehow smart enough to completely remove our bounds checks, which would be really scary...and which I would also expect to happen on other platforms besides Macs.

This could be a bug in libc++: that would explain the Mac-only-ness of this bug (except that we also use libc++ on Android and we see a very few crashes on other platforms...). But I don't see any change between the libc++ 8.0 and the libc++ 9.0 implementations, and I don't see anything to suggest that libc++ might have fixed bugs in its heap implementation recently.

It's not compiler related, I don't think: we updated to clang 9 on 25 September 2019, which landed in 71. That change wouldn't be present on 69 or 70. I landed bug 1579870, which tweaked bits of clang, on 17 September 2019, which is also in 71. That change correlates almost perfectly with the time that crashes started increasing, but that increase in crashes is on release, not on 71.

It's possible that TimeStamp on Mac is somehow busted? But that wouldn't explain how we get null entries in the mTimers array (see above); it would just mean that the timers would go off in very peculiar orders.

CPU info is all over the map, so we're probably not looking at a CPU bug. Ditto for OS versions, although whether the OS versions is a symptom of people not being in a rush to upgrade or an indication that something very fundamental got fixed in 10.13+ or so, it's hard to tell.

In short, I am kind of out of ideas.

Flags: needinfo?(nfroyd)

The volume of crashes is low in our late betas, I am marking it as wontfix for 71 as we have shipped the last beta.

Considering the "regressionwindow-wanted" tag, I could try to find the regression if some steps to reproduce are provided.
Please NI me if some working STR are obtained. Thanks.

QA Whiteboard: [qa-regression-triage]
Whiteboard: qa-not-actionable

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: critical → S3

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.