Closed Bug 1259699 Opened 8 years ago Closed 8 years ago

spike in JS-related EXCEPTION_STACK_OVERFLOW crashes on nightly

Categories

(Core :: JavaScript: GC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox46 --- fixed
firefox47 --- fixed
firefox48 + fixed
firefox-esr45 --- fixed

People

(Reporter: dbaron, Assigned: jandem)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(1 file)

There's been a huge spike in crashes on nightly with EXCEPTION_STACK_OVERFLOW in JS-related signatures:

Massive spike on March 24 nightly in _chkstk:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-03-01&reason=EXCEPTION_STACK_OVERFLOW&signature=_chkstk&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations

Massive spike on March 23 and 24 nightlies in DispatchToTracer<>:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-03-01&reason=EXCEPTION_STACK_OVERFLOW&signature=DispatchToTracer%3CT%3E&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations

Spikes on March 19 and March 24 in UnmarkGrayTracer::onChild:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-03-01&reason=EXCEPTION_STACK_OVERFLOW&signature=UnmarkGrayTracer%3A%3AonChild&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations

Spike on March 24 in js::TraceRange<T>:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-03-01&reason=EXCEPTION_STACK_OVERFLOW&signature=js%3A%3ATraceRange%3CT%3E&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations

These spikes represent a significant portion of the crashes on nightly.
Hmm, looks like these crashes are all win64, so this must be fallout from bug 1257234. Jan, maybe our safe stack limits need to be adjusted now that we can actually exhaust the stack on win64 (instead of using at most half of it)? That's just a guess though.
Blocks: 1257234
Flags: needinfo?(jdemooij)
Ted, do you know why none of these reports have stack traces? I only see a single frame, so it's impossible to figure out if and where we're overrecursing.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #1)
> Jan, maybe our safe stack limits need to be adjusted now that
> we can actually exhaust the stack on win64 (instead of using at most half of
> it)? That's just a guess though.

Maybe, but since most (all?) of these crashes are in GC code I wonder if we're missing an overrecursion check there somewhere.

We backported bug 1257234, so the simplest and safest short-term fix is probably to subtract 100 KB or so from our stack quota on Win64.
Flags: needinfo?(ted)
I think unfortunately the _chkstk routine used here doesn't have any unwind info. :-/ I can probably get a plausible stack out of some of these dumps though.
Flags: needinfo?(ted)
I managed to load one of these dumps in Visual Studio and got a good stack trace there.

* We have a ton of UnmarkGrayTracer calls on the stack, using up pretty much all of our stack space. I filed bug 1259730 to make this non-recursive.

* UnmarkGrayTracer::onChild has an overrecursion check, but after that we push some frames that are quite large (I saw a 3+ KB frame for instance). This seems to be a common issue with MSVC PGO builds unfortunately.

* I think the simplest fix is to subtract more than 20 KB on Win64, to account for these large PGO stack frames.
Attached patch PatchSplinter Review
Bump the 20 KB buffer to 40 KB (Win32) or 80 KB (Win64).
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8734745 - Flags: review?(ted)
Crash with signature UnmarkGrayTracer::onChild appeared the 2016-03-26 on 47.0a2. It's #9 in topcrash and represents 337 crashes.
We backed out bug 1257234 on beta for test failures. I'm wondering why that happened for beta but not for nightly/aurora (successfully - since it stopped us from hitting this crash in beta 6) 
Are the nightly/aurora tests for win64 the same, here?  Or do we treat them differently?
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #7)
> We backed out bug 1257234 on beta for test failures. I'm wondering why that
> happened for beta but not for nightly/aurora (successfully - since it
> stopped us from hitting this crash in beta 6) 
> Are the nightly/aurora tests for win64 the same, here?  Or do we treat them
> differently?

It's hard to say, but release/beta builds are slightly different from aurora/nightly builds (IIRC nightly/aurora builds have frame pointers enabled for instance), so that likely affects the code we generate - just enough to push us over the stack limit.
Comment on attachment 8734745 [details] [diff] [review]
Patch

Review of attachment 8734745 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me on the mechanics, though I have no idea if this is the appropriate amount to subtract.
Attachment #8734745 - Flags: review+
I can't really comment on the absolute numbers, but based on the anecdotal evidence of the stack frame sizes in bug 1259796 I think doubling on 64-bit is reasonable.
https://hg.mozilla.org/mozilla-central/rev/72c2846e2d16
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Jan, this was mentioned as a top crasher on Aurora 47. Should we consider uplifting this fix to Aurora? Thanks!
Flags: needinfo?(jdemooij)
Attachment #8734745 - Flags: review?(ted) → review+
Comment on attachment 8734745 [details] [diff] [review]
Patch

We need this small change to fix crashes on Aurora and to uplift bug 1257234 to Beta and ESR45.

Approval Request Comment
[Feature/regressing bug #]: Bug 1257234.
[User impact if declined]: Crashes.
[Describe test coverage new/current, TreeHerder]: This is hard to test well. Top crashes and intermittents have stopped after this landed though.
[Risks and why]: Low risk. 
[String/UUID change made/needed]: None.
Flags: needinfo?(jdemooij)
Attachment #8734745 - Flags: approval-mozilla-esr45?
Attachment #8734745 - Flags: approval-mozilla-beta?
Attachment #8734745 - Flags: approval-mozilla-aurora?
Comment on attachment 8734745 [details] [diff] [review]
Patch

Crash fix that seems to have fixed the crashes in Nightly, Aurora47+, ESR45
Attachment #8734745 - Flags: approval-mozilla-esr45?
Attachment #8734745 - Flags: approval-mozilla-esr45+
Attachment #8734745 - Flags: approval-mozilla-aurora?
Attachment #8734745 - Flags: approval-mozilla-aurora+
Crash Signature: [@ _chkstk] [@ DispatchToTracer<T>] [@ UnmarkGrayTracer::onChild] [@ js::TraceRange<T>] → [@ _chkstk] [@ DispatchToTracer<T>] [@ UnmarkGrayTracer::onChild] [@ js::TraceRange<T>] [@ JS::CallbackTracer::onShapeEdge ]
Crash Signature: [@ _chkstk] [@ DispatchToTracer<T>] [@ UnmarkGrayTracer::onChild] [@ js::TraceRange<T>] [@ JS::CallbackTracer::onShapeEdge ] → [@ _chkstk] [@ DispatchToTracer<T>] [@ UnmarkGrayTracer::onChild] [@ js::TraceRange<T>] [@ JS::CallbackTracer::onShapeEdge ] [@ js::ObjectGroup::traceChildren ]
Blocks: 1203467
Comment on attachment 8734745 [details] [diff] [review]
Patch

Fix for topcrash, verified on nightly, let's uplift it for beta 9.
Attachment #8734745 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This does not apply cleanly to beta:
$ hg graft -er 21c01308022d                                                                                                                                
grafting 336149:21c01308022d "Bug 1259699 - Adjust Windows stack limits to account for large PGO stack frames. r=bholley a=ritu"
merging js/xpconnect/src/XPCJSRuntime.cpp
warning: conflicts while merging js/xpconnect/src/XPCJSRuntime.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #2)
> Ted, do you know why none of these reports have stack traces? I only see a
> single frame, so it's impossible to figure out if and where we're
> overrecursing.

I finally spent the time to track this down. Turns out the Breakpad processor code will refuse to read a memory region that's >1MB:
https://chromium.googlesource.com/breakpad/breakpad/+/fee47f4638072e9b049ffa745bbb8aa5be986698/src/processor/minidump.cc#1183

This is pretty silly. I will get that fixed upstream. Bumping that to 2MB in my local Breakpad tree gives me a useful stack from that crash.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #22)
> I finally spent the time to track this down. Turns out the Breakpad
> processor code will refuse to read a memory region that's >1MB:

Thanks for tracking that down and fixing it.
You need to log in before you can comment on or make changes to this bug.