Closed Bug 1144649 Opened 5 years ago Closed 5 years ago

Large OOMs in CC with 38+

Categories

(Core :: XPCOM, defect, critical)

Version: 38 Branch
Hardware: x86 Windows 7
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED FIXED
mozilla40
Tracking Status
firefox37 --- unaffected
firefox38 + fixed
firefox39 --- fixed
firefox40 --- fixed

People

(Reporter: cshields, Assigned: mccr8)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-26b5a9fd-30c8-4c3a-8daf-1fc872150318.
=============================================================

In Fx38 (dev edition) I've had multiple crashes recently all around NS_ABORT_OOM(unsigned int) through various js functions.  This bug from socorro is one, here are 2 others:

https://crash-stats.mozilla.com/report/index/66b45a47-48eb-4a90-89b3-0d7c22150312

https://crash-stats.mozilla.com/report/index/3a6c8740-3183-46c6-a120-579622150309
Component: XUL → Memory Allocator
[Tracking Requested - why for this release]:

Those large OOMs together (see Crash Signature field) make up 7.6% of all crashes in early 38.0b1 data.
Crash Signature: [@ OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim] → [@ OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim] [@ OOM | large | NS_ABORT_OOM(unsigned int) | JSObject::markChildren(JSTracer*)] [@ OOM | large | NS_ABORT_OOM(unsigned int) | js::TraceChildren(JSTracer*, void*, JSGCTraceKind)] [@ O…
Summary: crash in OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim → Large OOMs in JS code with 38+
Forgot to say: it happened (like many others) while looking at a video, big black box all over the screen; something was corrupted in memory. Fx39: https://crash-stats.mozilla.com/report/index/da76deb0-f7a5-4495-abc8-b953c2150403
Tracking this topcrash for 38.
Naveed, could you find someone to help us with this bug? Thanks
Flags: needinfo?(nihsanullah)
These stacks are totally trashed. There is absolutely nothing we can even look at here without a regression range or STR.

I think the best clue here is "it happened (like many others) while looking at a video, big black box all over the screen". Did any major new features land in gfx in the time frame and platform where this started spiking?
I might be able to dig some better call sites out of these dumps. I'll try to see what I can do tomorrow.
Flags: needinfo?(dmajor)
(In reply to Terrence Cole [:terrence] from comment #6)
> These stacks are totally trashed. There is absolutely nothing we can even
> look at here without a regression range or STR.

I tried to find a regression range, here's the first build IDs on Nightly that the various signatures seem to have been seen with (according to the graphs feature in the report/list linked in the Crash Signature field above):
…| NoteJSChildTracerShim - 20150206030205
…| JSObject::markChildren(JSTracer*) - 20150208030206
…| js::TraceChildren(JSTracer*, void*, JSGCTraceKind) - 20150213020456
…| js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) - 20150208030206

With that, it looks like something landed on Feb 5th or Feb 7th probably that caused this issue to appear.
Assignee: nobody → terrence
Flags: needinfo?(nihsanullah)
Crazy amounts of optimization here. I had to use several debuggers to sort it out.

The real stack of calls (at least in bp-2ebd7afb-a088-4e66-bbb7-c9c1b2150402) is:
NS_ABORT_OOM
PLDHashTable::Add
PL_DHashTableAdd
CCGraph::AddNodeToMap
CCGraphBuilder::AddNode
CCGraphBuilder::NoteChild
CCGraphBuilder::NoteJSObject
[stuff I haven't deciphered]
JSObject::markChildren

The allocation sizes are often a bit under 4M or 8M. Is this just a huge CC graph?
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #9)
> Crazy amounts of optimization here. I had to use several debuggers to sort
> it out.
>
> The real stack of calls (at least in
> bp-2ebd7afb-a088-4e66-bbb7-c9c1b2150402) is:
> NS_ABORT_OOM
> PLDHashTable::Add
> PL_DHashTableAdd
> CCGraph::AddNodeToMap
> CCGraphBuilder::AddNode
> CCGraphBuilder::NoteChild
> CCGraphBuilder::NoteJSObject
> [stuff I haven't deciphered]
> JSObject::markChildren

Wow! Impressive work!

> The allocation sizes are often a bit under 4M or 8M. Is this just a huge CC
> graph?

Yes, that is also how I would interpret that stack.

I guess the next step would be to figure out what landed Feb 5 or Feb 7 that would cause the CC graph size to explode.
When I read that CC is involved, I always think "let's include :mccr8 in this bug". ;-)
Well, on Feb 11 we made that very hash table add infallible, in bug 1131901.  That doesn't quite line up, but it is in the ballpark.  If that's really the regression, then it is only causing us to crash a little earlier than we would otherwise.
Component: Memory Allocator → XPCOM
Summary: Large OOMs in JS code with 38+ → Large OOMs in CC with 38+
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Well, on Feb 11 we made that very hash table add infallible, in bug 1131901.
> That doesn't quite line up, but it is in the ball park.  If that's really
> the regression, then it is only just causing us to crash a little earlier
> than we would otherwise.

Even then I think we need to see whether we can do something here. This is 8.5% of all crashes in 38.0b2 right now, and our crash rates have been regressing in every release since and including 36. The crashes with those signatures in sum even trump the "OOM|small" signature (allocations <256K failing), which by itself is at roughly the same level in 38 beta as it was in 35 beta.

I'm also not confident in us having just shifted crashes from other signatures to those, but I'll accept that premise if you can show me data telling that story in a good way.
Crash Signature: , JSGCTraceKind)] [@ OOM | large | NS_ABORT_OOM(unsigned int) | js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) ] → , JSGCTraceKind)] [@ OOM | large | NS_ABORT_OOM(unsigned int) | js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) ] [@ OOM | large | NS_ABORT_OOM(unsigned int) | CCGraphBuilder::NoteXPCOMChild(nsISupports*)]
Allocations over 1 meg really should be expected to fail, and the larger the size, the more we should care. Large contiguous regions are hard to find on Windows. The users who fail at the 4M and 8M sizes likely could have kept going for a while before reaching hopeless OOM|small territory.

So I think this is worth doing something about. Either by gracefully handling failure or by playing tricks to keep the individual allocations smaller.
I understand what you are saying, but the problem is that this is a giant hash table.  I'm not sure how to break that into smaller pieces.  "Gracefully handling failure" would mean not running the cycle collector.  I suppose we could have some fallback mode that uses a tree instead of a hash table, but that would introduce a lot of complexity.

It is possible that on Feb 5 or 7 something landed that caused us to leak windows, which would make the CC graph larger.  We track this with the GHOST_WINDOWS telemetry, but I guess that looks about the same, though it is hard to tell.
(In reply to Andrew McCreight [:mccr8] from comment #15)
> I understand what you are saying, but the problem is that this is a giant
> hash table.  I'm not sure how to break that into smaller pieces. 
> "Gracefully handling failure" would mean not running the cycle collector.  I
> suppose we could have some fallback mode that uses a tree instead of a hash
> table, but that would introduce a lot of complexity.

Maybe we could make (template?) a variant of PLDHashTable that uses SegmentedVector or similar underneath, so growing the hashtable would make lots of smaller allocations rather than one large one?
Yeah the backing store for the table is just a giant array, so it ought to be possible to segment it. Not sure if I would want to rush such a thing to beta, but it could definitely be useful in the long term.
I ran some more supersearches and convinced myself that the first bad nightly was indeed 20150206030205. That makes the regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=34a66aaaca81&tochange=7c5f187b65bf

These definitely stand out:
	2fcef6b54be7	Nicholas Nethercote — Bug 1050035 (part 5) - Make CCGraphBuilder::AddNode() infallible. r=mccr8.
	2be07829fefc	Nicholas Nethercote — Bug 1050035 (part 4) - Make PL_DHashTableAdd() infallible by default, and add a fallible alternative. r=froydnj.

I don't see a lot of context in there for why AddNode became fallible. Andrew do you remember?
Flags: needinfo?(continuation)
Rather, why it became *in*fallible.
Oh, heh, turns out bug 1131901 was just a reincarnation of bug 1050035. Well I guess now we have an explanation for the date range at least.
(In reply to David Major [:dmajor] from comment #18)
> I don't see a lot of context in there for why AddNode became fallible.
> Andrew do you remember?

If the cycle collector can't run, you are going to enter into an incomprehensible death spiral.  So the reasoning was that it was better to crash.  But given that this is happening so often, maybe it is better to leave the user to take their chances.
Assignee: terrence → continuation
Flags: needinfo?(continuation)
Depends on: 1131901
This makes us do a null check when we add something new to the graph,
but I wouldn't think that's too bad, given that we're doing a hash
table add anyways.

Crashing here is apparently fairly common. This restores the old behavior, so we at least
don't crash immediately, but instead enter a slow downward spiral of leaking.

This improves on the old behavior in that we only try and fail to grow the hash table once,
rather than on every add.  I think khuey reported that the browser got very slow, because
you go through the very slowest path of the allocator over and over.

try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d05c0fa7339e
Attachment #8590530 - Flags: review?(bugs)
Attachment #8590530 - Flags: review?(bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/2e7778275e6d
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Andrew, can we have an uplift request to aurora & beta? Thanks
Flags: needinfo?(continuation)
I spun off bug 1153865 for the discussion about segmented allocations.
Blocks: 1131901
No longer depends on: 1131901
Flags: needinfo?(continuation)
Comment on attachment 8590530 [details] [diff] [review]
Make CCGraph::AddNodeToMap fallible again.

Approval Request Comment
[Feature/regressing bug #]: bug 1131901
[User impact if declined]: crashes (see comment 13)
[Describe test coverage new/current, TreeHerder]: there's not really any test coverage
[Risks and why]: this mostly just reverts to our old behavior, though it is a little different
[String/UUID change made/needed]: none
Attachment #8590530 - Flags: approval-mozilla-beta?
Attachment #8590530 - Flags: approval-mozilla-aurora?
Comment on attachment 8590530 [details] [diff] [review]
Make CCGraph::AddNodeToMap fallible again.

Should be in 38 beta 4.
Attachment #8590530 - Flags: approval-mozilla-beta?
Attachment #8590530 - Flags: approval-mozilla-beta+
Attachment #8590530 - Flags: approval-mozilla-aurora?
Attachment #8590530 - Flags: approval-mozilla-aurora+