Closed Bug 937997 Opened 11 years ago Closed 11 years ago

Trunk trees closed due to virtual address space fragmentation on Win 7 debug mochitest-BC (and M2?)

Categories

(Firefox :: General, defect)

x86
Windows 7
defect
Not set
blocker

Tracking

()

RESOLVED FIXED
Firefox 28

People

(Reporter: philor, Unassigned)

References

(Depends on 4 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [MemShrink])

Attachments

(11 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #932781 +++

When bug 920978 starts with "uncaught exception - NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIZipReader.open] at chrome://mochikit/content/chrome-harness.js:271" that means we're OOM.

When the stuff we're piling onto bug 935419 says "Ran out of memory while building cycle collector graph" that pretty clearly means we're OOM.

Yesterday, we backed out bug 936143 and bug 933882 and the rest of the stuff in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ae6f2151610f for ASan OOM failures, though there were also non-ASan OOM failures while it was in.

Today it relanded in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=9f3212effb9f, seemed sort of okayish, and we merged it to mozilla-central (though not yet on to fx-team and b2g-inbound).

We're hitting lots of both bug 920978 starting with OOM nsIZipReader.open, and bug 935419 again.

We can say yet again that it's all shu's fault, back him out, and go back to frequently hitting those two OOM failures, but maybe not so frequently that we have to notice, or we can fix the underlying fact that we're constantly on the edge of OOM even when we don't go over.

mozilla-inbound, mozilla-central and fx-team are closed.
Win7 b-c retriggers running in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&tochange=9f3212effb9f&fromchange=475bd77c3400 to maybe see just how much we should try to scapegoat shu for being the cause of all our problems, though we've certainly had both of the comment 0 signs of OOM while he was out as well as while he's in.
I was thinking of disabling the assertion in bug 935419, but I hadn't gotten around to it yet.  It does seem to have kicked up in frequency today for some reason.

I just pushed a patch to try with some debug messages that maybe will help figure out what is going on (hopefully it will compile): https://tbpl.mozilla.org/?tree=Try&rev=50d39c93fc73
The thing is, the assertion tells us that we are OOM, right? And without the assertion, we are OOM, we just don't know it?
Could fragmentation be killing us here?
Flags: needinfo?(continuation)
This is very frustrating for me, because I don't feel like there's much I can do to remedy the situation.

For our memory people who are debugging, you can back out https://hg.mozilla.org/integration/mozilla-inbound/rev/c6981912ff87 to trigger OOMs even faster. That patch was papering over the guaranteed ASan OOMs.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> Could fragmentation be killing us here?

I suppose.  The CC is failing to grow a hash table, and the hash table is a big array.  I'm not sure how we could figure out what is fragmentation or not.

(In reply to Shu-yu Guo [:shu] from comment #6)
> This is very frustrating for me, because I don't feel like there's much I
> can do to remedy the situation.

Well, M2 almost certainly can't be the fault of your patch, or at least I assume there's no debugger stuff in it.
Flags: needinfo?(continuation)
> I suppose.  The CC is failing to grow a hash table, and the hash table is a
> big array.  I'm not sure how we could figure out what is fragmentation or
> not.

During the last round of OOMs, the CC hash table allocations that were failing were only 8 MB, i.e. pretty small.  I don't know if that's the case this time around.  Instrumented try runs would be illuminating!
So we changed how pldhashtable works, right? Do we end up requiring larger contiguous memory
areas or something?
Flags: needinfo?(n.nethercote)
> So we changed how pldhashtable works, right? Do we end up requiring larger
> contiguous memory areas or something?

Nope.  The only change of note was that if we tried to grow a table (because it reached 75% full) and failed, previously we would allow it to fill up to 100% full.  I briefly changed that to just fail at the 75% OOM mark (bug 927705) but then shortly after I changed it again to allow up to 97% full (bug 933074).  I chose 97% because performance drops drastically if you get too close to 100%.

Don't focus on the pldhash failure.  Focus on all the stuff that happens earlier that causes us to reach the point that a small allocation can fail.  That's how we fixed the last round of TBPL OOMs.
Flags: needinfo?(n.nethercote)
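For illustration, here is a minimal C++ sketch of the growth policy described above; the names, entry size, and structure are assumptions for the sketch, not the real PLDHashTable code:

  // Sketch of the load-factor policy: try to double the table at 75% full; if that
  // allocation fails, keep inserting into the existing storage up to a ~97% cap
  // (bug 933074) instead of failing immediately (bug 927705) or at 100% (older behavior).
  #include <cstddef>
  #include <cstdlib>

  struct Table {
    void** entries;      // placeholder entry storage; real entries are larger
    size_t capacity;     // always a power of two in the real table
    size_t entryCount;
  };

  bool EnsureRoom(Table& t) {
    if (t.entryCount * 4 < t.capacity * 3) {
      return true;                                   // below 75% full, nothing to do
    }
    size_t newCapacity = t.capacity * 2;
    void** bigger = static_cast<void**>(calloc(newCapacity, sizeof(void*)));
    if (bigger) {
      // Real code rehashes every live entry into the new storage; elided here.
      free(t.entries);
      t.entries = bigger;
      t.capacity = newCapacity;
      return true;
    }
    // Growth failed (OOM): tolerate up to ~97% full before reporting failure,
    // since probe performance collapses as the table approaches 100%.
    return t.entryCount * 100 < t.capacity * 97;
  }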
So last time (Bug 932781) we had a list of bugs that we needed to fix in order to reopen the trunk trees. I guess this time it's this bug and maybe also bug 935419.
I have a local build which dumps about:memory for all devtools tests, since that's where the GCs disappeared from.

What path should I dump the about:memory .json.gz's to so that if I push this to try, those files can be downloaded?
01:12:18     INFO -  out of memory with graph entry count 507904

That seems way too low to be hitting the pldhash cap.  So we must be actually running out of memory (presumably out of virtual address space)?
Depends on: 932898
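As a rough sanity check on why ~508K graph entries can still mean a multi-megabyte contiguous allocation, here is a small worked example; the slot size and load factor are assumptions, not the real cycle collector constants:

  // Hypothetical sizing: ~508K graph entries, grown at ~75% load, lands on a
  // 2^20-slot table; at an assumed 8 bytes per slot on a 32-bit build that is a
  // single ~8 MB contiguous allocation, consistent with the "only 8 MB" failing
  // allocations mentioned for the previous round of OOMs.
  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t entries = 507904;        // graph entry count from the log above
    const uint64_t capacity = 1u << 20;     // next power of two above entries / 0.75
    const uint64_t slotBytes = 8;           // assumption for a 32-bit build
    std::printf("table allocation ~= %llu MB\n",
                static_cast<unsigned long long>((capacity * slotBytes) >> 20));
    return 0;
  }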
On a browser-chrome run on my local machine with about:memory dumping for devtools tests, the highest RSS I saw is about 660 MB. That doesn't seem high enough to warrant going OOM.
Attached patch memdump-bc.patch (obsolete) — Splinter Review
Dump about:memory to /tmp for every bc test.
Attached patch memdump-bc.patch — Splinter Review
Dump about:memory to /tmp for every test in bc.
Attachment #831417 - Attachment is obsolete: true
Remove Cu.forceGC() from debugger tests to hit OOM faster.
Here is the list of browser-chrome tests from a local linux64 debug run, sorted by RSS. The numbers correspond to the index in the manifest. The errors are logs that weren't dumped correctly and had truncated JSON for some reason.

Now it's starting to look a little more like actual OOM territory. I will upload the .json.gz's for the top 3.
This is browser/devtools/shadereditor/test/browser_webgl-actor-test-09.js
This is browser/devtools/shadereditor/test/browser_webgl-actor-test-08.js
This is browser/devtools/shadereditor/test/browser_webgl-actor-test-14.js
It looks like the best lead so far is the WebGL tests: a large amount of heap-unclassified. What is that? Leaked shaders? Contexts?
Those are really high heap-unclassified numbers...is there any way you can build with DMD and get a DMD report right after each of those tests?  Maybe include the memory reporters in bug 915940 in those builds to try and understand what's using the graphics memory?
(In reply to Nathan Froyd (:froydnj) from comment #23)
> Those are really high heap-unclassified numbers...is there any way you can
> build with DMD and get a DMD report right after each of those tests?  Maybe
> include the memory reporters in bug 915940 in those builds to try and
> understand what's using the graphics memory?

I am falling asleep, so I won't be able to get to this for 8 hours.

Also, I don't know what DMD is or how to dump it from chrome JS. If you point me to instructions to how to build with it + how to dump it, I can collect that info.
(In reply to Shu-yu Guo [:shu] from comment #24)
> (In reply to Nathan Froyd (:froydnj) from comment #23)
> > Those are really high heap-unclassified numbers...is there any way you can
> > build with DMD and get a DMD report right after each of those tests?  Maybe
> > include the memory reporters in bug 915940 in those builds to try and
> > understand what's using the graphics memory?
> 
> I am falling asleep, so I won't be able to get to this for 8 hours.
> 
> Also, I don't know what DMD is or how to dump it from chrome JS. If you
> point me to instructions to how to build with it + how to dump it, I can
> collect that info.

Instructions for building, running, and analyzing with DMD can be found here: https://wiki.mozilla.org/Performance/MemShrink/DMD
Attached file sorted-timeline
Timeline of rss change
I haven't been able to convince my computer to run WebGL tests yet, and running tests locally suffers from bug 938163, which clutters logs, slows things down, and causes all sorts of other problems.  DMD isn't turning up any interesting stacks, either.  The largest, most interesting one has been:

Unreported: 4 blocks in stack trace record 1 of 4,216
 9,584,640 bytes (9,584,640 requested / 0 slop)
 1.53% of the heap (1.53% cumulative);  14.22% of unreported (14.22% cumulative)
 Allocated at
   replace_malloc (/home/froydnj/src/mozilla-central-official/memory/replace/dmd/DMD.cpp:1227) 0x7ffeafcf4260
   mozilla::gfx::SurfaceToPackedBGRA(mozilla::gfx::SourceSurface*) (/opt/build/froydnj/build-mc/content/canvas/src/../../../dist/include/mozilla/gfx/DataSurfaceHelpers.h:50) 0x7ffeabc26abe
   mozilla::dom::CanvasRenderingContext2D::GetImageBuffer(unsigned char**, int*) (/home/froydnj/src/mozilla-central-official/content/canvas/src/CanvasRenderingContext2D.cpp:1078) 0x7ffeabc240e7
   mozilla::dom::CanvasRenderingContext2D::GetInputStream(char const*, char16_t const*, nsIInputStream**) (/home/froydnj/src/mozilla-central-official/content/canvas/src/CanvasRenderingContext2D.cpp:1090) 0x7ffeabc1ccea
   ~nsPromiseFlatString (/opt/build/froydnj/build-mc/content/canvas/src/../../../dist/include/nsTPromiseFlatString.h:61) 0x7ffeabc28bb9
   ~nsCOMPtr (/opt/build/froydnj/build-mc/content/canvas/src/../../../dist/include/nsCOMPtr.h:469) 0x7ffeabc28d25
   mozilla::dom::HTMLCanvasElement::ExtractData(nsAString_internal&, nsAString_internal const&, nsIInputStream**) (/home/froydnj/src/mozilla-central-official/content/html/content/src/HTMLCanvasElement.cpp:395) 0x7ffeabc7a553
   mozilla::dom::HTMLCanvasElement::ToDataURLImpl(JSContext*, nsAString_internal const&, JS::Value const&, nsAString_internal&) (/home/froydnj/src/mozilla-central-official/content/html/content/src/HTMLCanvasElement.cpp:469) 0x7ffeabc7aeca
   toDataURL (/opt/build/froydnj/build-mc/dom/bindings/HTMLCanvasElementBinding.cpp:253) 0x7ffeac3ed6d0
   genericMethod (/opt/build/froydnj/build-mc/dom/bindings/HTMLCanvasElementBinding.cpp:606) 0x7ffeac3f06a1
   CallJSNative (/home/froydnj/src/mozilla-central-official/js/src/jscntxtinlines.h:220) 0x7ffeacd5933b
   Interpret (/home/froydnj/src/mozilla-central-official/js/src/vm/Interpreter.cpp:2505) 0x7ffeacd4c325
   RunScript (/home/froydnj/src/mozilla-central-official/js/src/vm/Interpreter.cpp:420) 0x7ffeacd58f97
   SendToGenerator (/home/froydnj/src/mozilla-central-official/js/src/jsiter.cpp:1654) 0x7ffeacc8f435
   NativeMethod<js::LegacyGeneratorObject, legacy_generator_next> (/home/froydnj/src/mozilla-central-official/js/src/jsiter.cpp:1813) 0x7ffeacc8fa00
   CallJSNative (/home/froydnj/src/mozilla-central-official/js/src/jscntxtinlines.h:220) 0x7ffeacd5933b
   js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) (/home/froydnj/src/mozilla-central-official/js/src/vm/Interpreter.cpp:513) 0x7ffeacd5b4ed
   js::DirectProxyHandler::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) (/home/froydnj/src/mozilla-central-official/js/src/jsproxy.cpp:468) 0x7ffeacccb518
   js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) (/home/froydnj/src/mozilla-central-official/js/src/jswrapper.cpp:457) 0x7ffeacd17ecf
   js::Proxy::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) (/home/froydnj/src/mozilla-central-official/js/src/jsproxy.cpp:2657) 0x7ffeaccd2ef9
   proxy_Call (/home/froydnj/src/mozilla-central-official/js/src/jsproxy.cpp:3065) 0x7ffeaccd2fef
   CallJSNative (/home/froydnj/src/mozilla-central-official/js/src/jscntxtinlines.h:220) 0x7ffeacd5947d
   Interpret (/home/froydnj/src/mozilla-central-official/js/src/vm/Interpreter.cpp:2505) 0x7ffeacd4c325
   RunScript (/home/froydnj/src/mozilla-central-official/js/src/vm/Interpreter.cpp:420) 0x7ffeacd58f97

which appears about halfway through the test run.  Possibly a leak, not sure.  (Apologies for the bogus frames in there, I'm not quite sure what the problem is there...I have a suspicion that the linker is screwing with line information, need to go track that down.)

bug 922094 also could have some memory reporter love applied to it, we seem to be allocating quite a number of proto/iface caches.

None of these look significant enough to cause a leak.  I have to go put together some interview questions, but I'll return to coercing my machine into running WebGL tests later this afternoon.
The about:memory dumps look to be hovering at around 660MB, not nearly enough to cause issues.
Thanks to whoever retriggered my try push a bunch of times.  There were no M2 failures, which is odd.  There are a total of 3 failures on BC, all in chrome://mochitests/content/browser/toolkit/mozapps/extensions/test/browser/browser_discovery.js
  out of memory with graph entry count 126976
  out of memory with graph entry count 126976
  out of memory with graph entry count 507904

Those first two are particularly small.  The CC had succeeded many times prior to that with hundreds of thousands more things in the graph, so it does sound like a general problem with running out of memory, rather than the cycle collector in particular using a bizarrely huge amount of memory.

I'm still confused by why Win7 is showing problems while WinXP isn't.
Depends on: 938016
I'm currently waiting on an instrumented linux64 debug browser-chrome run that will hopefully dump DMD reports for the WebGL tests.
Depends on: 938310
Depends on: 938311
I filed bug 938310 and bug 938311 for the memory usage of the shader editor and tilt tests.  From shu's logs, these are the only BC tests that go above 1 GB of RSS, aside from a few tests after the shadereditor where the memory remains high.

The one consistent CC OOM we're hitting in BC is in toolkit/mozapps/extensions/test/browser/browser_discovery.js, which is not right after either of those tests.  It is near the end of the test sequence, and it is a little ways after the tilt tests, but it isn't obvious why we would consistently fail here.  According to shu's logs, RSS is only around 650 MB then, which isn't particularly high, so we're probably failing due to accumulated fragmentation or something.

bsmedberg said that with a minidump we could examine how bad the fragmentation is.
philor points out that the shadereditor tests aren't run on Linux (bug 931344), which would explain why we wouldn't see failures there, if that's a cause.

Also this:

[1:14pm] philor: "browser_tilt_gl08.js | Skipping tilt_gl08 because WebGL isn't supported on this hardware."
If you are talking about virtual memory fragmentation, do not forget about the (recently disabled) gpu-committed reporter, or simply GPU committed memory. GPU allocations on Windows take out a reservation in the calling process's virtual memory in case another program requires the GPU to page its memory out.

Since the discussion here has been about WebGL and other graphics tests, I think there is a high chance these allocations are also helping lead to this OOM (on Win7/8 at least; WinXP did things a little differently).
These failures are only happening on Win7, not XP.
Depends on: 938350
Try run with all Tilt and Shadereditor tests disabled:
  https://tbpl.mozilla.org/?tree=Try&rev=f99d4854b126
You can find the about:memory and DMD dumps for the top 3 browser-chrome tests sorted by RSS, starting w/ the shadereditor tests, here: http://rfrn.org/~shu/dmddump.tar

DMD of these 3 tests (browser_webgl-actor-test-10.js and others) shows that 70+% of the heap is unreported, and that most of it comes from the *GL driver* code. Here's an excerpt:

Unreported: 216 blocks in stack trace record 8 of 5,307
 10,616,832 bytes (10,468,224 requested / 148,608 slop)
 1.39% of the heap (73.40% cumulative);  1.51% of unreported (79.65% cumulative)
 Allocated at
   replace_memalign (/home/shu/moz/inbound/memory/replace/dmd/DMD.cpp:1302) 0x7f5c7dba913b
   replace_posix_memalign (/home/shu/moz/inbound/obj-x86_64-unknown-linux-gnu/memory/replace/dmd/../../../dist/include/replace_malloc.h:119) 0x7f5c7dba90a7
   _mesa_align_malloc (/usr/lib/libdricore9.2.2.so.1) 0x7f5c559fde21
   _mesa_vector4f_alloc (/usr/lib/libdricore9.2.2.so.1) 0x7f5c55a67930
   ??? (/usr/lib/libdricore9.2.2.so.1) 0x7f5c55a9ccfd
   _tnl_install_pipeline (/usr/lib/libdricore9.2.2.so.1) 0x7f5c55a9013b
   _tnl_CreateContext (/usr/lib/libdricore9.2.2.so.1) 0x7f5c55a8fc4c
   ??? (/usr/lib/xorg/modules/dri/i965_dri.so) 0x7f5c55f39142
   ??? (/usr/lib/xorg/modules/dri/i965_dri.so) 0x7f5c55f53bac
   ??? (/usr/lib/xorg/modules/dri/i965_dri.so) 0x7f5c55f3fb16
   ??? (/usr/lib/xorg/modules/dri/i965_dri.so) 0x7f5c55fd38b0
   ??? (/usr/lib/xorg/modules/dri/i965_dri.so) 0x7f5c55fd3a55
   ??? (/usr/lib/libGL.so.1) 0x7f5c70d0f877
   ??? (/usr/lib/libGL.so.1) 0x7f5c70cea233
   glXCreateNewContext (/usr/lib/libGL.so.1) 0x7f5c70cea4aa
   mozilla::gl::GLXLibrary::xCreateNewContext(_XDisplay*, __GLXFBConfigRec*, int, __GLXcontextRec*, int) (/home/shu/moz/inbound/gfx/gl/GLContextProviderGLX.cpp:591) 0x7f5c79515c76
   mozilla::gl::GLContextGLX::CreateGLContext(mozilla::gfx::SurfaceCaps const&, mozilla::gl::GLContextGLX*, bool, _XDisplay*, unsigned long, __GLXFBConfigRec*, bool, mozilla::gl::GLXLibrary::LibraryType, gfxXlibSurface*) (/home/shu/moz/inbound/gfx/gl/GLContextProviderGLX.cpp:798) 0x7f5c79516b3f
   mozilla::gl::CreateOffscreenPixmapContext(nsIntSize const&, mozilla::gl::GLXLibrary::LibraryType) (/home/shu/moz/inbound/gfx/gl/GLContextProviderGLX.cpp:1401) 0x7f5c795168e8
   mozilla::gl::GLContextProviderGLX::CreateOffscreen(nsIntSize const&, mozilla::gfx::SurfaceCaps const&, mozilla::gl::ContextFlags) (/home/shu/moz/inbound/gfx/gl/GLContextProviderGLX.cpp:1417) 0x7f5c7951659c
   nsRefPtr<mozilla::gl::GLContext>::assign_assuming_AddRef(mozilla::gl::GLContext*) (/home/shu/moz/inbound/obj-x86_64-unknown-linux-gnu/content/canvas/src/../../../dist/include/nsAutoPtr.h:872) 0x7f5c7837268f
   mozilla::dom::HTMLCanvasElement::UpdateContext(JSContext*, JS::Handle<JS::Value>) (/home/shu/moz/inbound/content/html/content/src/HTMLCanvasElement.cpp:802) 0x7f5c783f1811
   mozilla::dom::HTMLCanvasElement::GetContext(JSContext*, nsAString_internal const&, JS::Handle<JS::Value>, mozilla::ErrorResult&) (/home/shu/moz/inbound/content/html/content/src/HTMLCanvasElement.cpp:714) 0x7f5c783f3331
   mozilla::dom::HTMLCanvasElementBinding::getContext(JSContext*, JS::Handle<JSObject*>, mozilla::dom::HTMLCanvasElement*, JSJitMethodCallArgs const&) (/home/shu/moz/inbound/obj-x86_64-unknown-linux-gnu/dom/bindings/./HTMLCanvasElementBinding.cpp:203) 0x7f5c790956b2
   mozilla::dom::HTMLCanvasElementBinding::genericMethod(JSContext*, unsigned int, JS::Value*) (/home/shu/moz/inbound/obj-x86_64-unknown-linux-gnu/dom/bindings/./HTMLCanvasElementBinding.cpp:605) 0x7f5c79094c7c
Note that the above is on Linux, using the open source Intel drivers. If Win7's handling of this allocates more memory and it goes unreported, it is likely our culprit.
Depends on: 938411
New try run that hopefully actually compiles that disables the two test suites:
  https://tbpl.mozilla.org/?tree=Try&rev=defffd947938
Another patch that serializes an about:memory dump into the TBPL log so we can get some insight into memory on Windows when it fails:
  https://tbpl.mozilla.org/?tree=Try&rev=67343bf0c70f
The goal here would be to confirm that there is high heap-unclassified, like we see on Linux.

dmajor is working on getting things running on an actual Windows test slave, so we can get a minidump to analyze, or attach a debugger or something, and try to figure out what is going on.

Kyle is looking at making us clear out memory used by WebGL more aggressively in bug 938411, which might help avoid the OOMs.
Depends on: 923614
Depends on: 920978
Since Andrew's try push (https://tbpl.mozilla.org/?tree=Try&rev=defffd947938) showed that disabling the WebGL-related devtools tests doesn't fix the OOM, I'm going to rerun my instrumented build locally with those tests disabled as well, to see if there's anything new.
(In reply to Shu-yu Guo [:shu] from comment #37)
> Note that the above is on Linux, using the open source Intel drivers. If
> Win7's handling of this allocates more memory and it goes unreported, it is
> likely our culprit.

Win7 machines do use some amount of extra memory that goes unreported, very strongly dependent on which GPU drivers they're running.
Does it really make sense to keep the tree closed when we're pretty certain that backing out bug 933882 would make the problem go away? I actually think bug 933882 is really important and we need it to land soon. But I don't think it makes sense to hold up everyone else's work on that.

As far as I can tell, the only purpose served by keeping the tree closed is to keep up the pressure to ensure that leaks are fixed. But bug 933882 will have to land somehow eventually. That should serve as sufficient motivation in my opinion.
(In reply to Bill McCloskey (:billm) from comment #41)
> Does it really make sense to keep the tree closed when we're pretty certain
> that backing out bug 933882 would make the problem go away? I actually think
> bug 933882 is really important and we need it to land soon. But I don't
> think it makes sense to hold up everyone else's work on that.
> 
> As far as I can tell, the only purpose served by keeping the tree closed is
> to keep up the pressure to ensure that leaks are fixed. But bug 933882 will
> have to land somehow eventually. That should serve as sufficient motivation
> in my opinion.

That won't fix M2 though.
M2 isn't too badly off (see https://tbpl.mozilla.org/?tree=Try&rev=50d39c93fc73).  Note also that nobody has done any investigation of M2.  Its failures are rare enough that it may be difficult to investigate without landing some diagnostic code on an open tree.
As long as someone is willing to continue looking at this if we reopen so I can reland bug 933882. I don't think I can fix whatever the problem is by myself.
With all shadereditor and tilt tests disabled, the maximum RSS reached on linux64 debug is 822947840 bytes (about 785 MB). That doesn't seem that high; I'm kind of out of ideas.
Since it looks to be almost ready, let's see if Kyle's patch fixes the problem. If so, great. If not, then I think we should back out.
Try run that prints out RSS after every bc test: https://tbpl.mozilla.org/?tree=Try&rev=852720fd512c
My patch is unfortunately not almost ready.  Perhaps we should proceed with the backout.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #48)
> My patch is unfortunately not almost ready.  Perhaps we should proceed with
> the backout.

ok taking as sheriff-on-duty and working on the backout
I am very much uncomfortable with reopening - as philor so excellently phrased it when this issue first occurred, Shu's landing was merely the breeze that sent us over the OOM cliff on the edge of which we were already teetering. 

Until we resolve the root cause, every new intermittent hang/timeout is going to either be blamed on this issue (and thus potentially ignored, even if unrelated), or else result in another tree closure - in which case giving ourselves a false sense of security by reopening gains us little.

At the least, we should wait until: 
* The high heap-unclassified in bug 938310 is correctly identified (and a followup bug filed to add the necessary about:memory reporters).
* Either bug 938411 or similar fixed to improve the high RSS found in bug 938310 and dupes.
* We land something similar to parts of https://hg.mozilla.org/try/rev/50d39c93fc73#l1.13 so we assert in debug builds on OOM rather than having to remember which random browser-chrome tests fail when we reach OOM conditions (a la comment 0).
Note: To anyone trying to debug this, if using inbound, use 15c617927012 (pre comment 50 backout) rather than tip to increase the likelihood of reproducing.

dmajor - any luck using the machine borrowed from releng?
Flags: needinfo?(dmajor)
(In reply to Nathan Froyd (:froydnj) from comment #25)
> Instructions for building, running, and analyzing with DMD can be found
> here: https://wiki.mozilla.org/Performance/MemShrink/DMD

Just a note on this: for anyone trying to build a Windows 7 debug build with DMD, it will currently fail with a build error, but bug 938526 (and the patch in there) is supposed to fix this.
No longer depends on: 938310
Depends on: 938587
I did a ton of retriggers for M2 and BC on inbound tip to see where we're at.

(In reply to Ed Morley [:edmorley UTC+1] from comment #51)
> I am very much uncomfortable with reopening - as philor so excellently
> phrased it when this issue first occurred, Shu's landing was merely the
> breeze that sent us over the OOM cliff on the edge of which we were already
> teetering. 
> 
> Until we resolve the root cause, every new intermittent hang/timeout is
> going to either be blamed on this issue (and thus potentially ignored, even
> if unrelated), or else result in another tree closure - in which case giving
> ourselves a false sense of security by reopening gains us little.
> 
> At the least, we should wait until: 
> * The high heap-unclassified in bug 938310 is correctly identified (and a
> followup bug filed to add the necessary about:memory reporters).

Comment 36 identified the source of high heap-unclassified for the WebGL devtools tests.  I agree it would be good to have a followup bug about it.  We still don't really know what the situation looks like on Windows, in terms of heap-unclassified.  Windows could be better or worse.  (Relatedly, I filed bug 938574 to try to come up with a way to get about:memory dumps from TBPL without having to run things locally like shu was forced to do.)

> * Either bug 938411 or similar fixed to improve the high RSS found in bug
> 938310 and dupes.

Disabling the WebGL developer tools tests didn't seem to help much.  In addition, the failures in bug 938016 aren't happening anywhere near those tests.  (I need to spend some time looking at where the other failures are happening.)  So, while it might be nice to improve somehow, I don't think the WebGL devtools tests have too much effect on the health of the tree, and I removed the dependency.

> * We land something similar to parts of
> https://hg.mozilla.org/try/rev/50d39c93fc73#l1.13 so we assert in debug
> builds on OOM rather than having to remember which random browser-chrome
> tests fail when we reach OOM conditions (a la comment 0).

For that particular assert, we already do that:
  http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsCycleCollector.cpp#2157
It is just a matter of whether we assert immediately or later on.

The other common failure is bug 920978.  It looks like we're failing in the test harness.  I'm not sure what can be done to make it clearer what is failing there.  I filed bug 938581 to make the test fail immediately rather than timing out, though I think for our purposes here that doesn't matter too much, as TBPL makes it clear the nsIZipReader open is the underlying problem.  I filed bug 938587 for making the error message more useful.  It isn't clear to me what to do there, but maybe somebody who understands the code can figure something out.
No longer depends on: 938587
Depends on: 938587
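For reference, a minimal sketch of the "assert on OOM in debug builds" pattern being discussed, illustrating the "assert immediately or later on" distinction; this is purely illustrative and not the actual nsCycleCollector.cpp code:

  #include <cassert>
  #include <cstddef>
  #include <cstdlib>

  // Two places to surface an allocation failure: loudly at the allocation site
  // (so the orange points straight at the OOM), or via a recorded flag that a
  // later assertion checks, which is what "assert immediately or later on" means.
  static bool gRanOutOfMemory = false;

  void* AllocGraphStorage(size_t bytes, bool assertImmediately) {
    void* p = std::malloc(bytes);
    if (!p) {
      if (assertImmediately) {
        assert(false && "out of memory while building cycle collector graph");
      }
      gRanOutOfMemory = true;   // surfaced by a later assertion instead
    }
    return p;
  }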
Bug 923614 may be related, but in my extensive retriggering campaign, I haven't seen it more than once or twice.

My retriggering on tip hit bug 920978 once on bc, but that's the only symptom from this latest round of problems that I've seen.
No longer depends on: 923614
So, can we reopen now?
No - we still haven't found or fixed the root cause as I understand it
(In reply to comment #57)
> No - we still haven't found or fixed the root cause as I understand it

I'm not sure if I understand this correctly, but did the backouts help?  If so, what is the justification of keeping the tree closed before we investigate the source of the OOMs with the backed out patches applied?
Well, personally, I don't really see the value in holding the trees closed for longer.  I'll continue investigating Mochitest memory usage, but shu's patch altered how GC and the debugger interact, so it isn't too surprising that BC (the test suite that deals with the debugger) started running out of memory.  It didn't seem to affect anything else.

The only reason the assertion in bug 935419 is happening now is that I landed a patch to produce that assertion.  We're better off there than we were a month ago, as it was permaorange when it initially landed.
1 and 3 are fixed, and there's no evidence that the high RSS from the WebGL devtools is involved with any of the failures we've seen.
(by "fixed", I mean "have been addressed")
(Comment 60 was directed at comment 58, got mid-aired by comment 59)

Are we absolutely certain the issues in the first two paragraphs of comment 51 are not the case? This is now the second OOM tree closure in ~1 week and I find it unlikely that we're not just going to have a 3rd in a few days if we decide to plough on regardless.

The onus shouldn't be on people having to justify why to keep a broken tree closed, but on the others to justify why to insist on reopening it regardless.
(In reply to comment #63)
> (Comment 60 was directed at comment 58, got mid-aired by comment 59)
> 
> Are we absolutely certain the issues in the first two paragraphs of comment 51
> are not the case? This is now the second OOM tree closure in ~1 week and I find
> it unlikely that we're not just going to have a 3rd in a few days if we decide
> to plough on regardless.
> 
> The onus shouldn't be on people having to justify why to keep a broken tree
> closed, but on the others to justify why to insist on reopening it regardless.

What's your definition of a broken tree?  One that has bugs?  If yes, then we can never reopen the tree!  Otherwise, can we reopen now and close it again when the next symptom of the OOM shows up?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #64)
> What's your definition of a broken tree?  One that has bugs?  If yes, then
> we can never reopen the tree!  Otherwise, can we reopen now and close it
> again when the next symptom of the OOM shows up?

I would like to think us above straw man arguments please.

I don't see anything in this bug so far that implies we've fixed this OOM - so it's not a case of waiting for the next one to show up. If that's not the case, please may someone summarise how the concerns in comment 0 and comment 51 are no longer the case and why?
Depends on: 938612
s/fixed this OOM/fixed our-dangerously-near-OOM state/
Depends on: 938682
It would be nice if we came up with criteria for when to reopen the tree here, then.
How do we know which tests are dangerously near OOM?  Is a memory high-water-mark reported for most tests?
I filed bug 938682 for making OOMs easier to distinguish from other oranges.
(In reply to Jesse Ruderman from comment #68)
> How do we know which tests are dangerously near OOM?  Is a memory
> high-water-mark reported for most tests?

It isn't clear any of these failures have anything to do with high water marks.  Bug 920978 and bug 938016 happen 20 minutes after the WebGL devtools tests, which are the points of highest memory usage in BC, judging by Linux64.

Interestingly, browser_dragdrop.js (bug 920978) happens right after browser_discovery.js (bug 938016), so something is clearly going wrong during those tests.  Why don't we see anything one test earlier or later?
(In reply to Ed Morley [:edmorley UTC+0] from comment #52)
> Note: To anyone trying to debug this, if using inbound, use 15c617927012
> (pre comment 50 backout) rather than tip to increase the likelihood of
> reproducing.
> 
> dmajor - any luck using the machine borrowed from releng?

No hits yet. I've only been able to do about 4 runs on the loaner so far -- my overnight runs got tripped up by a debugger issue. I am using Andrew's try run from https://tbpl.mozilla.org/?rev=50d39c93fc73&tree=Try.
Flags: needinfo?(dmajor)
> just a note on this. For anyone trying to build a Windows7 Debug Build with
> DMD this will fail with a build failure but Bug 938526 (and the patch in
> there) is supposed to fix this

Even with that patch applied, I don't think DMD will work on Windows.  Bug 819839 is open about that.  I suspect it wouldn't require many changes to get it working, but I'm not certain.
Depends on: 915940
FWIW I knocked together this script to help me visualize how the resident memory is changing as tests run in one of the logs philor pointed me to:

http://people.mozilla.org/~jwatt2/resident.html

The memory steps down to 7/8% of its peak at various points, which seems to correspond with the points in the log when there are additional "+++++++ RESIDENT" lines at the /start/ of a test load (in addition to the "+++++++ RESIDENT" lines that all tests get when they /finish/). (Why don't we dump out the resident memory at the start of all test loads?) Anyway, the thing that I found interesting is that even for later tests the resident memory is still dropping down to 7% of peak, which seems to indicate that some tests are just using a lot of memory rather than there actually being leaks...at least for this particular test run.
Since this seems to be Windows only, I wonder if this is caused by some odd Anti-Virus software interaction.
I doubt the test slaves have anti-virus installed.  My current theory is that the problem is Win7-only because it is 32-bit and has the GPU committed memory mentioned in comment 33, which is not present on XP, and the combination of the two is causing heap fragmentation, which is making large-ish allocations of a few megabytes fail towards the end of the run.
If the problem is really heap fragmentation, I'm not sure how we can fix it, besides splitting up bc.  Maybe forcing more GCs or something.
(In reply to Jonathan Watt [:jwatt] from comment #73)
> FWIW I knocked together this script to help me visualize how the resident
> memory is changing as tests run in one of the logs philor pointed me to:
> 
> http://people.mozilla.org/~jwatt2/resident.html
> 
> The memory steps down to 7/8% of its peak at various points, which seems to
> correspond with the points in the log when there are additional "+++++++
> RESIDENT" lines at the /start/ of a test load (in addition to the "+++++++
> RESIDENT" lines that all tests get when they /finish/). (Why don't we dump
> out the resident memory at the start of all test loads?) Anyway, the thing
> that I found interesting is that even for later tests the resident memory is
> still dropping down to 7% of peak, which seems to indicate that some tests
> are just using a lot of memory rather than there actually being leaks...at
> least for this particular test run.

Some of those ++++++++++ RESIDENT lines are from a test that spawns a new child process, and those sudden drops shouldn't be counted. The log doesn't print the process id (and probably should, but it's a pretty dirty hack).
(In reply to Andrew McCreight [:mccr8] from comment #76)
> If the problem is really heap fragmentation, I'm not sure how we can fix it,
> besides splitting up bc.

Bug 819963

Maybe we should speed this up if we think it'd resolve the problem?
That just takes us from "we have to fix an OOM failure that we can't figure out how to fix" to "we have to fix an incomprehensible leak that only happens when we run the last one-third of browser-chrome without having run the first two-thirds," https://tbpl.mozilla.org/php/getParsedLog.php?id=30572971&tree=Cedar
If the problem is really heap fragmentation, hiding the problem on tbpl by splitting bc won't make it go away on user machines. Because after all, if it happens on bc, there's no reason it wouldn't happen to a user. Has someone looked at crash stats to identify crashes that could be related to this?
Attached file vadumps-from-bc.zip
Virtual address space maps from !vadump at various points along a BC run on a Windows 7 releng slave. The numbers in the filenames are my estimate of how far along the run is.

I haven't dug into this too deeply yet. This stuff might be more easily interpreted using bsmedberg's graphing tools.

But from what I've seen so far: at 40% through the test, there are a good number of dozens-of-MB free VA regions, and a few hundreds-of-MB regions. At the 90% mark, the largest contiguous free virtual region is 40MB, and the next largest are 7MB, 5MB, 3MB. That could certainly make it difficult to perform large allocations.

(This is only looking at virtual memory regions; there may be mapped regions that are "available" from an allocator's perspective, though presumably allocators return huge chunks if they're not needed)
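For reference, here is a sketch of how one could measure the largest contiguous free VA block from inside the process with VirtualQuery, similar in spirit to the largestContiguousVMBlock number used in the later try-push instrumentation; this is an illustration, not the actual instrumentation code:

  #include <windows.h>
  #include <cstdio>

  // Walk the user-mode address space and track the largest MEM_FREE region.
  int main() {
    SIZE_T largestFree = 0;
    MEMORY_BASIC_INFORMATION info;
    char* addr = nullptr;
    while (VirtualQuery(addr, &info, sizeof(info)) == sizeof(info)) {
      if (info.State == MEM_FREE && info.RegionSize > largestFree) {
        largestFree = info.RegionSize;
      }
      char* next = static_cast<char*>(info.BaseAddress) + info.RegionSize;
      if (next <= addr) {
        break;                 // wrapped around the top of the address space
      }
      addr = next;
    }
    std::printf("largest contiguous free VA block: %lu MB\n",
                static_cast<unsigned long>(largestFree >> 20));
    return 0;
  }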
> At the 90% mark, the largest contiguous free virtual region is 40MB, and the
> next largest are 7MB, 5MB, 3MB.

This is smelling a lot like bug 859955.
(In reply to Mike Hommey [:glandium] from comment #80)
> If the problem is really heap fragmentation, hiding the problem on tbpl by
> splitting bc won't make it go away on user machines.

User machines don't open and close multiple pages a second continuously for 80 minutes.
(In reply to Andrew McCreight [:mccr8] from comment #75)
> I doubt the test slaves have anti-virus installed.  My current theory is the
> problem is Win7 only because it is 32-bit, and the GPU committed memory
> mentioned in comment 33 that is not in XP, and the combination of the two is
> causing heap fragmentation which is make large-ish allocations of a few
> megabytes fail towards the end of the run.

Well if the slaves do not have A/V software installed and running during tests, then exactly how is the A/V integration to scan downloaded files adequately tested?
(In reply to Bill Gianopoulos [:WG9s] from comment #84)
> (In reply to Andrew McCreight [:mccr8] from comment #75)
> > I doubt the test slaves have anti-virus installed.  My current theory is the
> > problem is Win7 only because it is 32-bit, and the GPU committed memory
> > mentioned in comment 33 that is not in XP, and the combination of the two is
> > causing heap fragmentation which is make large-ish allocations of a few
> > megabytes fail towards the end of the run.
> 
> Well if the slaves do not have A/V software installed and running during
> tests, then exactly how is the A/V integration to scan downloaded files
> adequately tested?

It was never adequately tested, is no longer called (see https://mail.mozilla.org/pipermail/firefox-dev/2013-August/000848.html for why), and should be deleted.
(In reply to Andrew McCreight [:mccr8] from comment #83)
> (In reply to Mike Hommey [:glandium] from comment #80)
> > If the problem is really heap fragmentation, hiding the problem on tbpl by
> > splitting bc won't make it go away on user machines.
> 
> User machines don't open and close multiple pages a second continuously for
> 80 minutes.

Sure, but you can easily imagine this sort of thing affecting desktop or phone users over a longer time period.
(In reply to Nicholas Nethercote [:njn] from comment #82)
> > At the 90% mark, the largest contiguous free virtual region is 40MB, and the
> > next largest are 7MB, 5MB, 3MB.
> 
> This is smelling a lot like bug 859955.

bsmedberg, does anything in vadumps-from-bc.zip stick out to you? I didn't see any of the extreme "waste 15/16ths" patterns that have come up before, although there are a few regions where it alternates 1 MB used / 1 MB free back and forth for a while.
Oops - should have set needinfo for comment 87
Flags: needinfo?(benjamin)
(In reply to Nicholas Nethercote [:njn] from comment #82)
> > At the 90% mark, the largest contiguous free virtual region is 40MB, and the
> > next largest are 7MB, 5MB, 3MB.
> 
> This is smelling a lot like bug 859955.

Indeed.
Summary: Trunk trees closed due to OOMs → Trunk trees closed due to virtual address space fragmentation on Win 7 debug mochitest-BC (and M2?)
1) I am perversely super duper excited that this is hitting machines that we have direct access to. Can anyone provide details on the hardware and software setup of these machines, especially graphics card/driver/version and Windows D2D/D3D library versions?
2) I'm happy to create some charts of the vadump data as I've done in the past. Is this machine running win32? The memory info appears to stop at 2 GB.
3) Visualizations are unlikely to actually help much, though. What we really want to have is the stack at the time the "leaked" shared-memory block is mapped into our process. I've tried to do this using breakpoint logging or hooking the VirtualAlloc/MapViewOfFile functions in the past and haven't had much success, but I'm hoping dmajor will be able to pull off the necessary miracles!
Flags: needinfo?(benjamin)
Depends on: 859955
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #90)
> 1) I am perversely super duper excited that this is hitting machines that we
> have direct access to. Can anyone provide details on the hardware and
> software setup of these machines, especially graphics card/driver/version
> and Windows D2D/D3D library versions?

CCing RelEng since I guess they have all this data.
Blocks: 857427
Attached file DxDiag.txt
DxDiag from my releng loaner
The patch in bug 859955 didn't help. It should be noted that this does show that somewhat mysterious memory usage can come from GPU memory usage on some driver/device combinations. So it could still be that something like the image cache is holding on to too many images or something like that.

We do expect to see problems with this more quickly on Windows 7; as explained in bug 859955, we use twice as much address space there for cached images on the affected driver/device combinations compared to WinXP.
OK, another interrupt with a probably stupid question.  Are these Win 7 tests running on a 32-bit or 64-bit Win 7 OS?  I realize they are running 32-bit Firefox, but the overwhelming majority of users actually running Windows 7 are doing so on 64-bit hardware running a 64-bit OS.  If that is not how our tests are running, then this might need a re-think.  However, since 32-bit Firefox is what is supported, 32-bit builds are what should be tested, just on a 64-bit OS.
The Win7 test slaves are 32bit.
Whiteboard: [MemShrink]
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #95)
> The Win7 test slaves are 32bit.

Well then, by the time Windows 7 came out, the systems being sold were all 64-bit with Windows 7 pre-installed.  If we are testing on 32-bit Windows 7, then we are not really testing anything real users actually run.

This is like testing 64-bit Windows XP, yet another thing no one ever really ran.
So, it seems we have no test for what real users are actually running, which is 32-bit Firefox under a 64-bit Windows 7 operating system.  But we have the tree closed over test failures on 32-bit Firefox on 32-bit Windows 7, which very few people run???  Sorry, but it seems we are concentrating on the wrong issue here.
Oh, and the reason this is significant is that the GPU memory will subtract from a 32-bit app's available memory on a 32-bit OS, but on a 64-bit OS it will be independent.
Is the info that dmajor provided in comment 92 sufficient?
For hardware specs it is better to ask arich or Q, even though I believe dmajor can obtain that by inspecting the device manager.
Flags: needinfo?(benjamin)
(In reply to Bill Gianopoulos [:WG9s] from comment #97)
> So, it seems we have no test on what real users are running which is 32-bit
> Firefox under 64-bit Windows 7 operating system.  But we have the tree
> closed on things very few run which is test failures on Firefox 32-bit on
> 32-bit Windows 7???  Sorry, but seems we are concentrating on the wrong
> issue here.

The operating system doesn't seem relevant here: what's relevant is how much address space the process can have.  And even though (IIRC) 32-bit processes have a larger usable address space under 64-bit windows, it looks likely that we'd be hitting address space fragmentation there too.  And whether that'd be right now or some point down the road, we still have something in the software that needs to be fixed.
(In reply to Nathan Froyd (:froydnj) from comment #100)
> (In reply to Bill Gianopoulos [:WG9s] from comment #97)
> > So, it seems we have no test on what real users are running which is 32-bit
> > Firefox under 64-bit Windows 7 operating system.  But we have the tree
> > closed on things very few run which is test failures on Firefox 32-bit on
> > 32-bit Windows 7???  Sorry, but seems we are concentrating on the wrong
> > issue here.
> 
> The operating system doesn't seem relevant here: what's relevant is how much
> address space the process can have.  And even though (IIRC) 32-bit processes
> have a larger usable address space under 64-bit windows, it looks likely
> that we'd be hitting address space fragmentation there too.  And whether
> that'd be right now or some point down the road, we still have something in
> the software that needs to be fixed.

Well, you missed my point: on a 32-bit OS the memory assigned to the graphics card makes less memory available to the application.  On a 64-bit OS it would not.

That's because the graphics memory needs to be available to both the graphics adapter and your app, since the app is trying to talk directly to the GPU to accelerate graphics.
Bill:  bad stuff is happening.  It needs to be fixed.  Please stop the distraction.
But if we have the tree closed for an issue in a scenario that real users do not hit (the number of 32-bit Win 7 OS users is fairly close to zero), while we are not running tests under 64-bit Win 7, which is what most Win 7 users actually run, that is a real issue.
You can raise it somewhere else.  Please leave this bug alone.
No longer blocks: 857427
Depends on: 939137
(In reply to Hugh Nougher [:Hughman] from comment #33)
> If you talk about virtual memory fragmentation, do not forget about the
> (recently disabled) gpu-committed reporter or simply just GPU committed
> memory. GPU allocations in windows take out reservation in the calling
> process's virtual memory in case another program requires the GPU to page
> its memory out.
> 
> So here you have been talking about WebGL and other graphics tests so think
> there is a high chance these allocation are also helping lead to this OOM
> (on Win7/8 at least. WinXP did things a little differently).

Just for the record, this is only true for a select subset of drivers/devices. On most up-to-date drivers (sadly not on some stock drivers), the driver will reserve physical memory directly and not actually map it into the calling process's address space.
Flags: needinfo?(benjamin)
In the 90.txt case, the largest available block of memory is 0x2d20000 bytes (47MB) large. There does not appear to be the huge number of identical mapped-memory blocks that I saw in bug 859955, but there are a large number of PRIVATE memory blocks that may represent graphics buffers. As dmajor mentioned, there is some significant VM fragmentation (1MB free, 1MB allocated) in several large blocks.
Attached image vmmap.png
Here's a vmmap timeline of a Win7 BC run on a releng machine. It seems that the VA fragmentation is itself just a symptom of high memory usage. Here are some things I noticed:

- Heap VA usage creeps up over time and almost never shrinks. Can't tell from this data whether that's due to long-held allocations or fragmentation or what.

- Private Data seems to have a ton of 1MB and 64KB blocks going in and out, but many of them stick around and accumulate. In yesterday's !vadumps we saw that the 1MB blocks are padded by 1MB free blocks, so really they consume 2MB of VA.

- The sawtooth is Mapped File data: half of the increase comes from tests.jar files that are cleaned up at the cliff, but the other half is 58MB worth of large font files (mincho etc) that stick around. I only see each font once, so it's not the duplication that came up in bug 617266.

- Image is pretty stable except for one uptick when D3D/EGL/Nvidia modules are loaded.

- I haven't gotten an actual OOM on this machine, but I'm told they happen almost at the end. That seems to coincide with the high point on this timeline. At peak usage, the largest available VA blocks in this run were 130MB, 23MB, 13MB.

Next I'm going to try getting some stacks on those heap and private allocations.
Attached file vmmap-data.zip
Here's the full vmmap log that generated the timeline above. I forgot to mention: I missed the first minute or two of the run.
Trees were reopened just before 7PM Pacific Time. Things seem to be looking okay so far.
So we will try to fix this while the tree is open?
So, just to ask yet another stupid question.  Is it known if this is an issue with the browser running out of memory, or is it a test harness out of memory issue?
No longer depends on: 932898
(In reply to Wes Kocher (:KWierso) from comment #110)
> Trees were reopened just before 7PM Pacific Time. Things seem to be looking
> okay so far.

Please can you give some more context on this?
We generally try to keep bugzilla up to date for both reasons for closure and reasons for openings, since not everyone is on IRC in US timezones...

I can only see a sentence or two on IRC in #developers:
http://logbot.glob.com.au/?c=mozilla%23developers&s=15+Nov+2013&e=18+Nov+2013#c814030

...guessing the conversations happened in #memshrink or similar (which aren't logged ) :-(

Nathan, could you give some context - since it seems like you requested that Wes reopen the tree?
Flags: needinfo?(nfroyd)
Flags: needinfo?(kwierso)
The postmortem for this tree closure/the issues in this bug has been started at:
https://etherpad.mozilla.org/LPgqYuvFJn
(In reply to Ed Morley [:edmorley UTC+0] from comment #114)
> s/is US/in US/
> 
> I can only see a sentence or two on IRC in #developers:
> http://logbot.glob.com.au/
> ?c=mozilla%23developers&s=15+Nov+2013&e=18+Nov+2013#c814030
> 
> ...guessing the conversations happened in #memshrink or similar (which
> aren't logged ) :-(
> 
> Nathan, could you give some context - since it seems like you requested that
> Wes reopen the tree?

My understanding of the situation was that we needed two things:

- We needed M-2 and M-bc to stop being so orange;
- We needed some semblance of instrumentation so we could tell what was going on with those tests, and that could drive further fixes.

smaug's fixes satisfied the orange-ness requirement, as far as I can see.

RyanVM had requested #2 as a prereq for opening the tree, since just saying "welp, tests are green again, let's reopen!" doesn't really move us to addressing the issues.  And I don't think it would have been reasonable to try to fix all the tests on a closed tree.

So once bug 939137 had landed and those two test suites looked greener, the tree was reopened.
Flags: needinfo?(nfroyd)
That's great - thank you :-)

Which specific bugs fixed #1 - just bug 938945?
(In reply to Ed Morley [:edmorley UTC+0] from comment #117)
> That's great - thank you :-)
> 
> Which specific bugs fixed #1 - just bug 938945?

That's correct.
(In reply to Ed Morley [:edmorley UTC+0] from comment #117)
> Which specific bugs fixed #1 - just bug 938945?

The major thing was backing out shu's patch.  smaug's patch probably helped with the M2 shutdown problem.
I pushed a try build today that relanded bug 933882 *and* took out forceGCs in the devtools/debugger test harness. This resulted in a bunch of "ran out of memory in CC" failures: https://tbpl.mozilla.org/?tree=Try&rev=53b2b69cadc6

Here's what the first one looks like under the viewer: http://people.mozilla.org/~sguo/mochimem/viewer.html?url=http%3A//ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/shu@rfrn.org-53b2b69cadc6/try-win32-debug/try_win7-ix-debug_test-mochitest-browser-chrome-bm73-tests1-windows-build1369.txt.gz&

You can see largestContiguousVMBlock falling pretty low: 19 MB.

For comparison, a Win7 debug BC run with the forceGCs left in stays fairly green, even after copious retriggering: https://tbpl.mozilla.org/?tree=Try&rev=d673d64f6021

For comparison, here's what the first green BC from that try overlaid with the OOM run looks like: http://people.mozilla.org/~sguo/mochimem/viewer.html?url=http%3A//ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/shu@rfrn.org-53b2b69cadc6/try-win32-debug/try_win7-ix-debug_test-mochitest-browser-chrome-bm73-tests1-windows-build1369.txt.gz&url=http%3A//ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/shu@rfrn.org-d673d64f6021/try-win32-debug/try_win7-ix-debug_test-mochitest-browser-chrome-bm74-tests1-windows-build1019.txt.gz&

The failing run has a dip in largestContiguousVMBlock starting in the middle of the tabview bugs that isn't present in the passing one. The failing run's dip from the debugger view-variable tests is also much larger than the dip in the passing run. Do some machines have bad allocators? What's going on there?
To clarify, what's confusing to me is why the memory patterns are different way before the debugger tests even kick in.
Depends on: 941837
Tree is open.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 28
Depends on: defrag