Closed Bug 929359, opened 6 years ago, closed 6 years ago

Intermittent TEST-UNEXPECTED-FAIL | leakcheck | 69400580 bytes leaked (AsyncLatencyLogger, AsyncStatement, AtomImpl, Attr, BackstagePass, ...) | also leaked leaked 7 DOMWINDOW(s)

Categories

(Core :: General, defect, major)

x86
Windows 7

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox27 --- affected
firefox28 --- affected

People

(Reporter: cbook, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, memory-leak)

Attachments

(3 files, 2 obsolete files)

Windows 7 32-bit mozilla-inbound debug test mochitest-2 on 2013-10-21 23:39:07 PDT for push 2be5e3682c98

slave: t-w732-ix-031

https://tbpl.mozilla.org/php/getParsedLog.php?id=29476481&tree=Mozilla-Inbound

TEST-UNEXPECTED-FAIL | leakcheck | 69400580 bytes leaked (AsyncLatencyLogger, AsyncStatement, AtomImpl, Attr, BackstagePass, ...)
Return code: 1
bjacob tried to make CC OOMs fatal in bug 902922, but then Win7 M2 hit that assertion every time on shutdown.  That looks to me like a regression within the last few months, because we previously had these assertions be fatal without being permaorange.  philor theorizes that this bug is the same issue, and it does seem related.

My wild guess is going to be that bug 927705, landed to fix the CC OOMs in bug 902922 on 10/22, is somehow causing this, because that would be the most cruelly ironic.

Somebody should do some kind of annoying bisection that includes the MOZ_ASSERT patch in bug 902922, starting around 10/22, to figure out what is causing this.  I don't have a Windows machine, but I could do some try pushes to check my theory about bug 927705.
Whiteboard: [MemShrink]
philor points out that the first time this happened is on the very push for bug 927705, which adds credence to my theory.  njn, could you see if this is really a regression from your patch, by applying bjacob's patch and seeing if it makes things permaorange on Windows?  Thanks.
Flags: needinfo?(n.nethercote)
Adding a printf with the size of the hash table or graph or something in the places that bug 927705 changed to MOZ_ASSERTs might be interesting.  philor's theory is that we're not seeing the NS_ASSERTION because we're too deep in shutdown.
(In reply to Andrew McCreight [:mccr8] from comment #18)
> philor points out that the first time this happened is on the very push for
> bug 927705, which adds credence to my theory.  njn, could you see if this is
> really a regression from your patch, by applying bjacob's patch and seeing
> if it makes things permaorange on Windows?  Thanks.

Bug 902922 comment 63 suggests that the patches from bug 902922 and bug 927705 together do result in Windows permaorange.

I don't understand the test, but I'm confident that bug 927705's patch makes the world a better place.
Flags: needinfo?(n.nethercote)
One suggestion:  bug 927705's patch made it much harder for pldhash insertion to fail due to the capacity limit.  However, insertion can still fail if the entry storage allocation is too big.  The hash entries are 8 bytes on 32-bit (4 bytes for the hash value + 4 bytes for the PtrInfo pointer).
Attached patch pldhash/CC debugging patch. (obsolete) — Splinter Review
Here is a debugging patch that I just pushed to try.  On a failing run, I get
bazillions of "njn: change alloc of 8388612 bytes failed" messages.  That
request size is (1<<23)+4, which is reasonable for debug builds, because the
entry store is a power of two plus 4 bytes for some debug-only info.

Why is an 8MB allocation request failing?
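
To spell out the arithmetic behind that request size, here is a small sketch (the constant names are made up for illustration, not actual pldhash identifiers; the entry size and the debug-only suffix are the ones described in the comments above):

  // Entry-store size for the failing request, using the figures quoted above.
  const ENTRY_SIZE_32BIT = 4 + 4;     // 4-byte hash value + 4-byte PtrInfo pointer
  const capacity = 1 << 20;           // 1,048,576 entry slots at this growth step
  const DEBUG_SUFFIX = 4;             // the debug-only bytes appended to the store
  const entryStoreBytes = capacity * ENTRY_SIZE_32BIT + DEBUG_SUFFIX;
  console.log(entryStoreBytes);       // 8388612 == (1 << 23) + 4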
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Assignee: n.nethercote → nobody
Status: ASSIGNED → NEW
So, it appears that we're really close to OOM (or even hit it, benignly) at shutdown on Windows Mochitest-2.  In fact, on successful runs you can see "Ran out of memory in ScanRoots".  The threshold appears to be about 800 DOM windows;  if we have more than that at shutdown, we probably fail.  If we have fewer, we probably succeed (though there's some noise around 790--800).

And my changes in bug 927705 may have pushed us into problem territory.  Part 2b is the obvious candidate, because it changed things so that we don't overload pldhashes when we can't grow them.
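
To make the "overload" distinction concrete, here is a conceptual sketch of the policy change described above (the load factor, names, and structure are assumptions for illustration, not the actual pldhash code):

  interface Table { count: number; capacity: number; }

  // Stand-in for growing the entry store; returns false when the allocation fails.
  function tryGrow(table: Table): boolean {
    return false;  // pretend we are at the OOM point discussed in this bug
  }

  // Conceptual insert policy.  With allowOverload (the old behavior), a failed
  // grow still lets the entry in, overloading the table; without it, the add fails.
  function addEntry(table: Table, allowOverload: boolean): boolean {
    const maxLoad = Math.floor(table.capacity * 0.75);  // assumed load-factor limit
    if (table.count >= maxLoad && !tryGrow(table)) {
      if (!allowOverload) {
        return false;    // new behavior: report the failure to the caller
      }
      // old behavior: fall through and add the entry anyway
    }
    table.count++;
    return true;
  }

  console.log(addEntry({ count: 900, capacity: 1024 }, true));   // true  (table overloaded)
  console.log(addEntry({ count: 900, capacity: 1024 }, false));  // false (insert rejected)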

So I tried reverting that change and pushing to try, but I'm still getting failures most of the time.  Um.
I have a couple more ideas that might help here.  I will try them and report back later today.
Whiteboard: [MemShrink]
I tried reverting all the patches from bug 927705 and I still get frequent failures:  https://tbpl.mozilla.org/?tree=Try&rev=40a150e43dbb.  Hmm.  Let's hope that my other idea (to resurrect bug 815467) has better luck.
> I tried reverting all the patches from bug 927705 and I still get frequent
> failures:  https://tbpl.mozilla.org/?tree=Try&rev=40a150e43dbb.

Actually, those failures are the ones in bug 919856, which are slightly different from this bug's failures.  I suspect it's not a coincidence that we stopped getting those just before we started getting these.  Those failures weren't as common, though.
Resurrecting bug 815467 didn't really help:  https://tbpl.mozilla.org/?tree=Try&rev=bbae81bb572d.  Well, I got 11 failures out of 21 runs, which is a slightly better rate than I got on previous runs... could be real, could be noise.
(In reply to Nicholas Nethercote [:njn] from comment #113)
> Actually, those failures are the ones in bug 919856, which are slightly
> different from this bug's failures.  I suspect it's not a coincidence that we
> stopped getting those just before we started getting these.  Those failures
> weren't as common, though.

Ah, good catch.  Yeah, I think trading these failures back for those would be an improvement from the orange factor perspective.  That one's only a month old.
Looking through the DOMWINDOW lines, it looks like the test_Range-foo tests are responsible for a lot of them.
I tried the ideas from comment 72 (undo part 2b of bug 927705) and comment 114 (resurrect bug 815467) in tandem:
https://tbpl.mozilla.org/?tree=Try&rev=73bb40f72806

I still get 15 failures out of 21, but they're the failures from bug 919856 ("where leaked URL list includes DOMCore range tests") instead of the failures from this bug ("also leaked leaked 7 DOMWINDOW(s)").  Does that count as progress?

At this point it's clear that mochi-2's real problems predate bug 927705.  (Bug 919856's title even mentions the dom range tests...)
> At this point it's clear that mochi-2's real problems predate bug 927705. 

To steal shamelessly from philor on IRC:  we were already off the edge of the cliff, hanging in the air, legs spinning wildly, and bug 927705 just made us look down...
Bumping up severity since this is now also closing mozilla-inbound, fx-team, and mozilla-central.
Severity: normal → critical
(In reply to Nicholas Nethercote [:njn] from comment #117)
> Looking through the DOMWINDOW lines, it looks like the test_Range-foo tests
> are responsible for a lot of them.


> <khuey> Tomcat|sheriffduty: ok, so your first mission is to disable the testRange tests from m2 again

I don't think that would help.

We have leaks during both mochitest-2 and mochitest-browser-chrome that look pretty similar, at least to my naive eyes:

https://tbpl.mozilla.org/php/getParsedLog.php?id=29839989&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=29872321&tree=Mozilla-Central
Severity: critical → blocker
Summary: Intermittent TEST-UNEXPECTED-FAIL | leakcheck | 69400580 bytes leaked (AsyncLatencyLogger, AsyncStatement, AtomImpl, Attr, BackstagePass, ...) | also leaked leaked 7 DOMWINDOW(s) → Now frequent TEST-UNEXPECTED-FAIL | leakcheck | 69400580 bytes leaked (AsyncLatencyLogger, AsyncStatement, AtomImpl, Attr, BackstagePass, ...) | also leaked leaked 7 DOMWINDOW(s)
Blocks: 932781
Moved the tree closing parts of this to bug 932781.
Severity: blocker → major
Summary: Now frequent TEST-UNEXPECTED-FAIL | leakcheck | 69400580 bytes leaked (AsyncLatencyLogger, AsyncStatement, AtomImpl, Attr, BackstagePass, ...) | also leaked leaked 7 DOMWINDOW(s) → Intermittent TEST-UNEXPECTED-FAIL | leakcheck | 69400580 bytes leaked (AsyncLatencyLogger, AsyncStatement, AtomImpl, Attr, BackstagePass, ...) | also leaked leaked 7 DOMWINDOW(s)
Blocks: 875585
I don't know how useful this is, but here's a plot of the number of live DOMWindows over time.  There are a bunch of approximations in there, but you can see the number spike up near the end.
Attached file peak windows
This leaking M2 run peaks at over a thousand DOMWINDOWs live at once.  Here are the windows that are alive, with URLs truncated to the first 50 characters so that the various Range-test-iframes fall into a single category.  As you can see, 472 of these are Range-test-iframes.

roc pointed out bug 778011 comment 3, which makes it sound like these tests keep all of their iframes around until the test is over, to ease debugging: "It would be a lot more efficient to get rid of iframes we were done using.  test_Range-insertNode.html does this, and allow you to debug using a different technique:"

That would certainly cause the peak memory usage to go up...
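
A hypothetical sketch of the pattern roc describes, i.e. tearing each iframe down as soon as its subtest finishes instead of keeping them all alive until the end of the test (illustrative only, not the actual test_Range-insertNode.html code):

  // Run one subtest in a throwaway iframe and drop it immediately afterwards,
  // so the peak number of live DOMWINDOWs stays low.
  function withTestIframe(run: (win: Window) => void): void {
    const iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    try {
      run(iframe.contentWindow!);   // the subtest gets a fresh window to work in
    } finally {
      iframe.remove();              // release the window instead of keeping it around
    }
  }

The trade-off, as the quoted comment notes, is that a failed subtest's iframe is no longer around afterwards for debugging.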
Attached image peak windows, by type
This graph shows the peak number of windows in each second, broken down into a few categories.  As you can see, when we spike up above a thousand windows, most of them are range-test-iframes.  There are also a bunch of about:blanks for some reason.
Attachment #824959 - Attachment is obsolete: true
Attachment #824959 - Attachment is obsolete: false
Attachment #824933 - Attachment is obsolete: true
https://hg.mozilla.org/integration/mozilla-inbound/rev/33c9ee1b4564

FYI Ms2ger, I disabled this test again for being OOMy.
Here's the try run where Win7 debug M2 with this patch applied to m-i tip was green 11 times in a row:
  https://tbpl.mozilla.org/?tree=Try&rev=9fcd2511a906
Well, the graph looks almost identical with that test disabled, so there must be something more at work.  We still hit that 1000 window spike in the log I looked at.
Here's my log analysis script.  It is a bit touchy, and you have to manually fix up places in the TBPL log where multiple lines are mashed together by interleaved multiprocess output.
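
For reference, a minimal sketch of the kind of per-second counting such a script does (this is not the attached script; the ++DOMWINDOW/--DOMWINDOW markers are standard debug-log lines, but the timestamp handling, input path, and everything else here are assumptions):

  import { readFileSync } from "node:fs";

  // Tally the peak number of live DOMWINDOWs in each second of a debug mochitest log.
  const logPath = process.argv[2] ?? "mochitest.log";   // hypothetical input path
  const peaks = new Map<string, number>();              // hh:mm:ss -> peak live windows
  let live = 0;

  for (const line of readFileSync(logPath, "utf8").split("\n")) {
    if (line.includes("++DOMWINDOW")) {
      live++;
    } else if (line.includes("--DOMWINDOW")) {
      live--;
    } else {
      continue;
    }
    // Assume each log line carries an hh:mm:ss timestamp; bucket the peak by second.
    const ts = line.match(/\d{2}:\d{2}:\d{2}/)?.[0] ?? "unknown";
    peaks.set(ts, Math.max(peaks.get(ts) ?? 0, live));
  }

  for (const [ts, peak] of peaks) {
    console.log(`${ts}\t${peak}`);
  }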
Looks like disabling the test doesn't completely fix the shutdown leak.  It failed in about 1 out of 18 runs.  Hopefully roc's fix will save us.
(In reply to Andrew McCreight [:mccr8] from comment #132)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/33c9ee1b4564
> 
> FYI Ms2ger, I disabled this test again for being OOMy.

Reverted this after bug 933072 landed so we can get a better feel for where things stand.
https://hg.mozilla.org/integration/mozilla-inbound/rev/ba5c76b30f29
Depends on: 933072
I'll try to keep an eye on this in case it recurs.
Assignee: nobody → continuation
Blocks: 933741
Is it safe to close this bug now?
I'd like to wait a week or two with the trees open as normal, to see whether anything washes out.
At that point we should also search for any newly filed intermittent large leaks.
Fixed by bug 933072.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME