Closed
Bug 1144649
Opened 7 years ago
Closed 7 years ago
Large OOMs in CC with 38+
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla40
Release | Tracking | Status
---|---|---
firefox37 | --- | unaffected
firefox38 | + | fixed
firefox39 | --- | fixed
firefox40 | --- | fixed
People
(Reporter: cshields, Assigned: mccr8)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
2.83 KB, patch
smaug: review+
Sylvestre: approval-mozilla-aurora+
Sylvestre: approval-mozilla-beta+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is report bp-26b5a9fd-30c8-4c3a-8daf-1fc872150318.
=============================================================
In Fx38 (dev edition) I've had multiple crashes recently, all around NS_ABORT_OOM(unsigned int) through various JS functions. This bug from Socorro is one; here are two others:
https://crash-stats.mozilla.com/report/index/66b45a47-48eb-4a90-89b3-0d7c22150312
https://crash-stats.mozilla.com/report/index/3a6c8740-3183-46c6-a120-579622150309
Updated•7 years ago
Component: XUL → Memory Allocator
Comment 1•7 years ago
[Tracking Requested - why for this release]: Those large OOMs together (see Crash Signature field) make up 7.6% of all crashes in early 38.0b1 data.
Crash Signature: [@ OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim] → [@ OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim]
[@ OOM | large | NS_ABORT_OOM(unsigned int) | JSObject::markChildren(JSTracer*)]
[@ OOM | large | NS_ABORT_OOM(unsigned int) | js::TraceChildren(JSTracer*, void*, JSGCTraceKind)]
[@ O…
status-firefox37:
--- → unaffected
status-firefox38:
--- → affected
status-firefox39:
--- → affected
status-firefox40:
--- → affected
tracking-firefox38:
--- → ?
Summary: crash in OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim → Large OOMs in JS code with 38+
Comment 2•7 years ago
Same crash here with Fx39 https://crash-stats.mozilla.com/report/index/da76deb0-f7a5-4495-abc8-b953c2150403
Comment 3•7 years ago
Forgot to say: it happened (like many others) while looking at a video, with a big black box all over the screen, as if something was corrupted in memory. Fx39: https://crash-stats.mozilla.com/report/index/da76deb0-f7a5-4495-abc8-b953c2150403
Tracking this topcrash for 38.
Comment 5•7 years ago
Naveed, could you find someone to help us with this bug? Thanks
Flags: needinfo?(nihsanullah)
Comment 6•7 years ago
These stacks are totally trashed. There is absolutely nothing we can even look at here without a regression range or STR. I think the best clue here is "it happened (like many others) while while looking at a video, big black box all over the screen". Did any major new features land in gfx in the time frame and platform where this started spiking?
I might be able to dig some better call sites out of these dumps. I'll try to see what I can do tomorrow.
Flags: needinfo?(dmajor)
Comment 8•7 years ago
(In reply to Terrence Cole [:terrence] from comment #6)
> These stacks are totally trashed. There is absolutely nothing we can even
> look at here without a regression range or STR.

I tried to find a regression range. Here are the first build IDs on Nightly that the various signatures seem to have been seen with (according to the graphs feature in the report/list linked in the Crash Signature field above):

…| NoteJSChildTracerShim - 20150206030205
…| JSObject::markChildren(JSTracer*) - 20150208030206
…| js::TraceChildren(JSTracer*, void*, JSGCTraceKind) - 20150213020456
…| js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) - 20150208030206

With that, it looks like something that landed on Feb 5th or Feb 7th probably caused this issue to appear.
Updated•7 years ago
Assignee: nobody → terrence
Flags: needinfo?(nihsanullah)
Comment 9•7 years ago
Crazy amounts of optimization here. I had to use several debuggers to sort it out.

The real stack of calls (at least in bp-2ebd7afb-a088-4e66-bbb7-c9c1b2150402) is:
NS_ABORT_OOM
PLDHashTable::Add
PL_DHashTableAdd
CCGraph::AddNodeToMap
CCGraphBuilder::AddNode
CCGraphBuilder::NoteChild
CCGraphBuilder::NoteJSObject
[stuff I haven't deciphered]
JSObject::markChildren

The allocation sizes are often a bit under 4M or 8M. Is this just a huge CC graph?
Flags: needinfo?(dmajor)
Comment 10•7 years ago
(In reply to David Major [:dmajor] from comment #9)
> Crazy amounts of optimization here. I had to use several debuggers to sort it out.
>
> The real stack of calls (at least in bp-2ebd7afb-a088-4e66-bbb7-c9c1b2150402) is:
> NS_ABORT_OOM
> PLDHashTable::Add
> PL_DHashTableAdd
> CCGraph::AddNodeToMap
> CCGraphBuilder::AddNode
> CCGraphBuilder::NoteChild
> CCGraphBuilder::NoteJSObject
> [stuff I haven't deciphered]
> JSObject::markChildren

Wow! Impressive work!

> The allocation sizes are often a bit under 4M or 8M. Is this just a huge CC graph?

Yes, that is also how I would interpret that stack. I guess the next step would be to figure out what landed Feb 5 or Feb 7 that would cause the CC graph size to explode.
Comment 11•7 years ago
When I read that CC is involved, I always think "let's include :mccr8 in this bug". ;-)
Assignee
Comment 12•7 years ago
Well, on Feb 11 we made that very hash table add infallible, in bug 1131901. That doesn't quite line up, but it is in the ballpark. If that's really the regression, then it is only causing us to crash a little earlier than we would otherwise.
Assignee
Updated•7 years ago
Component: Memory Allocator → XPCOM
Summary: Large OOMs in JS code with 38+ → Large OOMs in CC with 38+
Comment 13•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Well, on Feb 11 we made that very hash table add infallible, in bug 1131901.
> That doesn't quite line up, but it is in the ballpark. If that's really the
> regression, then it is only causing us to crash a little earlier than we
> would otherwise.

Even then, I think we need to see what we can do here. This is 8.5% of all crashes in 38.0b2 right now, and our crash rates have been regressing in every release since and including 36. The crashes with those signatures in sum even trump the "OOM | small" signature (allocations <256K failing), which by itself is pretty much at the same level in 38 beta as it was in 35 beta. I'm also not confident that we have just shifted crashes from other signatures to these, but I'll accept that premise if you can show me data telling that story in a good way.
Updated•7 years ago
Crash Signature: , JSGCTraceKind)]
[@ OOM | large | NS_ABORT_OOM(unsigned int) | js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) ] → , JSGCTraceKind)]
[@ OOM | large | NS_ABORT_OOM(unsigned int) | js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) ]
[@ OOM | large | NS_ABORT_OOM(unsigned int) | CCGraphBuilder::NoteXPCOMChild(nsISupports*)]
Comment 14•7 years ago
Allocations over 1 meg really should be expected to fail, and the larger the size, the more we should care. Large contiguous regions are hard to find on Windows. The users who fail at the 4M and 8M sizes likely could have kept going for a while before reaching hopeless OOM|small territory. So I think this is worth doing something about. Either by gracefully handling failure or by playing tricks to keep the individual allocations smaller.
Assignee
Comment 15•7 years ago
I understand what you are saying, but the problem is that this is a giant hash table. I'm not sure how to break that into smaller pieces. "Gracefully handling failure" would mean not running the cycle collector. I suppose we could have some fallback mode that uses a tree instead of a hash table, but that would introduce a lot of complexity. It is possible that on Feb 5 or 7 something landed that caused us to leak windows, which would make the CC graph larger. We track this with the GHOST_WINDOWS telemetry, but I guess that looks about the same, though it is hard to tell.
Comment 16•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #15)
> I understand what you are saying, but the problem is that this is a giant
> hash table. I'm not sure how to break that into smaller pieces.
> "Gracefully handling failure" would mean not running the cycle collector. I
> suppose we could have some fallback mode that uses a tree instead of a hash
> table, but that would introduce a lot of complexity.

Maybe we could make (template?) a variant of PLDHashTable that uses SegmentedVector or similar underneath, so growing the hashtable would make lots of smaller allocations rather than one large one?
Comment 17•7 years ago
Yeah the backing store for the table is just a giant array, so it ought to be possible to segment it. Not sure if I would want to rush such a thing to beta, but it could definitely be useful in the long term.
Comment 18•7 years ago
I ran some more supersearches and convinced myself that the first bad nightly was indeed 20150206030205. That makes the regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=34a66aaaca81&tochange=7c5f187b65bf

These definitely stand out:
2fcef6b54be7 Nicholas Nethercote — Bug 1050035 (part 5) - Make CCGraphBuilder::AddNode() infallible. r=mccr8.
2be07829fefc Nicholas Nethercote — Bug 1050035 (part 4) - Make PL_DHashTableAdd() infallible by default, and add a fallible alternative. r=froydnj.

I don't see a lot of context in there for why AddNode became fallible. Andrew, do you remember?
Flags: needinfo?(continuation)
Comment 19•7 years ago
Rather, why it became *in*fallible.
Comment 20•7 years ago
Oh, heh, turns out bug 1131901 was just a reincarnation of bug 1050035. Well I guess now we have an explanation for the date range at least.
Assignee
Comment 21•7 years ago
(In reply to David Major [:dmajor] from comment #18)
> I don't see a lot of context in there for why AddNode became fallible.
> Andrew do you remember?

If the cycle collector can't run, you are going to enter an incomprehensible death spiral. So the reasoning was that it was better to crash. But given that this is happening so often, maybe it is better to let the user take their chances.
Assignee: terrence → continuation
Flags: needinfo?(continuation)
Assignee
Comment 22•7 years ago
This makes us do a null check when we add something new to the graph, but I wouldn't think that's too bad, given that we're doing a hash table add anyways.

Crashing here is apparently fairly common. This restores the old behavior, so we at least don't crash immediately, but instead enter a slow downward spiral of leaking. This improves on the old behavior in that we only try and fail to grow the hash table once, rather than on every add. khuey, I think, reported that the browser got very slow, because you are going through the very slowest path of the allocator over and over.

try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d05c0fa7339e
Attachment #8590530 -
Flags: review?(bugs)
Updated•7 years ago
Attachment #8590530 -
Flags: review?(bugs) → review+
Assignee
Updated•7 years ago
Keywords: checkin-needed
Comment 23•7 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e7778275e6d
Keywords: checkin-needed
Comment 24•7 years ago
https://hg.mozilla.org/mozilla-central/rev/2e7778275e6d
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Comment 25•7 years ago
Andrew, can we have an uplift request to aurora & beta? Thanks
Flags: needinfo?(continuation)
Comment 26•7 years ago
I spun off bug 1153865 for the discussion about segmented allocations.
Assignee
Updated•7 years ago
Assignee
Comment 27•7 years ago
Comment on attachment 8590530 [details] [diff] [review]
Make CCGraph::AddNodeToMap fallible again.

Approval Request Comment
[Feature/regressing bug #]: bug 1131901
[User impact if declined]: crashes (see comment 13)
[Describe test coverage new/current, TreeHerder]: there's not really any test coverage
[Risks and why]: this mostly just reverts to our old behavior, though it is a little different
[String/UUID change made/needed]: none
Attachment #8590530 -
Flags: approval-mozilla-beta?
Attachment #8590530 -
Flags: approval-mozilla-aurora?
Comment 28•7 years ago
Comment on attachment 8590530 [details] [diff] [review]
Make CCGraph::AddNodeToMap fallible again.

Should be in 38 beta 4.
Attachment #8590530 -
Flags: approval-mozilla-beta?
Attachment #8590530 -
Flags: approval-mozilla-beta+
Attachment #8590530 -
Flags: approval-mozilla-aurora?
Attachment #8590530 -
Flags: approval-mozilla-aurora+