Closed Bug 1144649 Opened 5 years ago Closed 5 years ago

Large OOMs in CC with 38+

Categories

(Core :: XPCOM, defect, critical)

Version: 38 Branch
Hardware: x86 Windows 7
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED FIXED
mozilla40
Tracking Status
firefox37 --- unaffected
firefox38 + fixed
firefox39 --- fixed
firefox40 --- fixed

People

(Reporter: cshields, Assigned: mccr8)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-26b5a9fd-30c8-4c3a-8daf-1fc872150318.
=============================================================

In Fx38 (dev edition) I've had multiple crashes recently all around NS_ABORT_OOM(unsigned int) through various js functions.  This bug from socorro is one, here are 2 others:

https://crash-stats.mozilla.com/report/index/66b45a47-48eb-4a90-89b3-0d7c22150312

https://crash-stats.mozilla.com/report/index/3a6c8740-3183-46c6-a120-579622150309
Component: XUL → Memory Allocator
[Tracking Requested - why for this release]:

Those large OOMs together (see Crash Signature field) make up 7.6% of all crashes in early 38.0b1 data.
Crash Signature: [@ OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim] → [@ OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim] [@ OOM | large | NS_ABORT_OOM(unsigned int) | JSObject::markChildren(JSTracer*)] [@ OOM | large | NS_ABORT_OOM(unsigned int) | js::TraceChildren(JSTracer*, void*, JSGCTraceKind)] [@ O…
Summary: crash in OOM | large | NS_ABORT_OOM(unsigned int) | NoteJSChildTracerShim → Large OOMs in JS code with 38+
Forgot to say: it happened (like many others) while looking at a video, big black box all over the screen; something was corrupted in memory. Fx39: https://crash-stats.mozilla.com/report/index/da76deb0-f7a5-4495-abc8-b953c2150403
Tracking this topcrash for 38.
Naveed, could you find someone to help us with this bug? Thanks
Flags: needinfo?(nihsanullah)
These stacks are totally trashed. There is absolutely nothing we can even look at here without a regression range or STR.

I think the best clue here is "it happened (like many others) while looking at a video, big black box all over the screen". Did any major new features land in gfx in the time frame and platform where this started spiking?
I might be able to dig some better call sites out of these dumps. I'll try to see what I can do tomorrow.
Flags: needinfo?(dmajor)
(In reply to Terrence Cole [:terrence] from comment #6)
> These stacks are totally trashed. There is absolutely nothing we can even
> look at here without a regression range or STR.

I tried to find a regression range, here's the first build IDs on Nightly that the various signatures seem to have been seen with (according to the graphs feature in the report/list linked in the Crash Signature field above):
…| NoteJSChildTracerShim - 20150206030205
…| JSObject::markChildren(JSTracer*) - 20150208030206
…| js::TraceChildren(JSTracer*, void*, JSGCTraceKind) - 20150213020456
…| js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) - 20150208030206

With that, it looks like something landed on Feb 5th or Feb 7th probably that caused this issue to appear.
Assignee: nobody → terrence
Flags: needinfo?(nihsanullah)
Crazy amounts of optimization here. I had to use several debuggers to sort it out.

The real stack of calls (at least in bp-2ebd7afb-a088-4e66-bbb7-c9c1b2150402) is:
NS_ABORT_OOM
PLDHashTable::Add
PL_DHashTableAdd
CCGraph::AddNodeToMap
CCGraphBuilder::AddNode
CCGraphBuilder::NoteChild
CCGraphBuilder::NoteJSObject
[stuff I haven't deciphered]
JSObject::markChildren

The allocation sizes are often a bit under 4M or 8M. Is this just a huge CC graph?
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #9)
> Crazy amounts of optimization here. I had to use several debuggers to sort
> it out.
>
> The real stack of calls (at least in
> bp-2ebd7afb-a088-4e66-bbb7-c9c1b2150402) is:
> NS_ABORT_OOM
> PLDHashTable::Add
> PL_DHashTableAdd
> CCGraph::AddNodeToMap
> CCGraphBuilder::AddNode
> CCGraphBuilder::NoteChild
> CCGraphBuilder::NoteJSObject
> [stuff I haven't deciphered]
> JSObject::markChildren

Wow! Impressive work!

> The allocation sizes are often a bit under 4M or 8M. Is this just a huge CC
> graph?

Yes, that is also how I would interpret that stack.

I guess the next step would be to figure out what landed Feb 5 or Feb 7 that would cause the CC graph size to explode.
When I read that CC is involved, I always think "let's include :mccr8 in this bug". ;-)
Well, on Feb 11 we made that very hash table add infallible, in bug 1131901.  That doesn't quite line up, but it is in the ballpark.  If that's really the regression, then it is only causing us to crash a little earlier than we would otherwise.
Component: Memory Allocator → XPCOM
Summary: Large OOMs in JS code with 38+ → Large OOMs in CC with 38+
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Well, on Feb 11 we made that very hash table add infallible, in bug 1131901.
> That doesn't quite line up, but it is in the ball park.  If that's really
> the regression, then it is only just causing us to crash a little earlier
> than we would otherwise.

Even then I think we need to see whether we can do something here. This is 8.5% of all crashes in 38.0b2 right now, and our crash rates have been regressing in every release since and including 36. The crashes with those signatures in sum even trump the "OOM|small" signature (allocations <256K failing), which by itself is at roughly the same level in 38 beta as it was in 35 beta.

I'm also not confident in us having just shifted crashes from other signatures to those, but I'll accept that premise if you can show me data telling that story in a good way.
Crash Signature: , JSGCTraceKind)] [@ OOM | large | NS_ABORT_OOM(unsigned int) | js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) ] → , JSGCTraceKind)] [@ OOM | large | NS_ABORT_OOM(unsigned int) | js::gc::MarkUnbarriered<JSObject>(JSTracer*, JSObject**, char const*) ] [@ OOM | large | NS_ABORT_OOM(unsigned int) | CCGraphBuilder::NoteXPCOMChild(nsISupports*)]
Allocations over 1 meg really should be expected to fail, and the larger the size, the more we should care. Large contiguous regions are hard to find on Windows. The users who fail at the 4M and 8M sizes likely could have kept going for a while before reaching hopeless OOM|small territory.

So I think this is worth doing something about. Either by gracefully handling failure or by playing tricks to keep the individual allocations smaller.
I understand what you are saying, but the problem is that this is a giant hash table.  I'm not sure how to break that into smaller pieces.  "Gracefully handling failure" would mean not running the cycle collector.  I suppose we could have some fallback mode that uses a tree instead of a hash table, but that would introduce a lot of complexity.

It is possible that on Feb 5 or 7 something landed that caused us to leak windows, which would make the CC graph larger.  We track this with the GHOST_WINDOWS telemetry, but I guess that looks about the same, though it is hard to tell.
(In reply to Andrew McCreight [:mccr8] from comment #15)
> I understand what you are saying, but the problem is that this is a giant
> hash table.  I'm not sure how to break that into smaller pieces. 
> "Gracefully handling failure" would mean not running the cycle collector.  I
> suppose we could have some fallback mode that uses a tree instead of a hash
> table, but that would introduce a lot of complexity.

Maybe we could make (template?) a variant of PLDHashTable that uses SegmentedVector or similar underneath, so growing the hashtable would make lots of smaller allocations rather than one large one?
Yeah the backing store for the table is just a giant array, so it ought to be possible to segment it. Not sure if I would want to rush such a thing to beta, but it could definitely be useful in the long term.
I ran some more supersearches and convinced myself that the first bad nightly was indeed 20150206030205. That makes the regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=34a66aaaca81&tochange=7c5f187b65bf

These definitely stand out:
	2fcef6b54be7	Nicholas Nethercote — Bug 1050035 (part 5) - Make CCGraphBuilder::AddNode() infallible. r=mccr8.
	2be07829fefc	Nicholas Nethercote — Bug 1050035 (part 4) - Make PL_DHashTableAdd() infallible by default, and add a fallible alternative. r=froydnj.

I don't see a lot of context in there for why AddNode became fallible. Andrew do you remember?
Flags: needinfo?(continuation)
Rather, why it became *in*fallible.
Oh, heh, turns out bug 1131901 was just a reincarnation of bug 1050035. Well I guess now we have an explanation for the date range at least.
(In reply to David Major [:dmajor] from comment #18)
> I don't see a lot of context in there for why AddNode became fallible.
> Andrew do you remember?

If the cycle collector can't run, you are going to enter into an incomprehensible death spiral.  So the reasoning was that it was better to crash.  But given that this is happening so often, maybe it is better to leave the user to take their chances.
Assignee: terrence → continuation
Flags: needinfo?(continuation)
Depends on: 1131901
This makes us do a null check when we add something new to the graph,
but I wouldn't think that's too bad, given that we're doing a hash
table add anyways.

Crashing here is apparently fairly common. This restores the old behavior, so we at least
don't crash immediately, but instead enter a slow downward spiral of leaking.

This improves on the old behavior in that we only try and fail to grow the hash table once,
rather than on every add.  I think khuey reported that the browser got very slow, because
you go through the very slowest path of the allocator over and over.

try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d05c0fa7339e
Attachment #8590530 - Flags: review?(bugs)
Attachment #8590530 - Flags: review?(bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/2e7778275e6d
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Andrew, can we have an uplift request to aurora & beta? Thanks
Flags: needinfo?(continuation)
I spun off bug 1153865 for the discussion about segmented allocations.
Blocks: 1131901
No longer depends on: 1131901
Flags: needinfo?(continuation)
Comment on attachment 8590530 [details] [diff] [review]
Make CCGraph::AddNodeToMap fallible again.

Approval Request Comment
[Feature/regressing bug #]: bug 1131901
[User impact if declined]: crashes (see comment 13)
[Describe test coverage new/current, TreeHerder]: there's not really any test coverage
[Risks and why]: this mostly just reverts to our old behavior, though it is a little different
[String/UUID change made/needed]: none
Attachment #8590530 - Flags: approval-mozilla-beta?
Attachment #8590530 - Flags: approval-mozilla-aurora?
Comment on attachment 8590530 [details] [diff] [review]
Make CCGraph::AddNodeToMap fallible again.

Should be in 38 beta 4.
Attachment #8590530 - Flags: approval-mozilla-beta?
Attachment #8590530 - Flags: approval-mozilla-beta+
Attachment #8590530 - Flags: approval-mozilla-aurora?
Attachment #8590530 - Flags: approval-mozilla-aurora+