Closed Bug 728294 Opened 12 years ago Closed 12 years ago

Analyze cycle collection logs after test runs to detect leaks

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(firefox14 wontfix, firefox15 fixed, firefox16 fixed, firefox17 fixed)

RESOLVED FIXED
mozilla17
Tracking Status
firefox14 --- wontfix
firefox15 --- fixed
firefox16 --- fixed
firefox17 --- fixed

People

(Reporter: smaug, Assigned: ttaubert)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P2][Snappy:p2][qa-])

Attachments

(5 files, 14 obsolete files)

12.47 KB, patch
ted
: review+
Details | Diff | Splinter Review
988 bytes, patch
smaug
: review+
Details | Diff | Splinter Review
15.53 KB, patch
smaug
: review+
Details | Diff | Splinter Review
28.96 KB, patch
Details | Diff | Splinter Review
28.64 KB, patch
Details | Diff | Splinter Review
Bug 726346 adds a way to analyze cycle collection logs in JS.
We could use that in testing so that we could detect if there are
(runtime) memory leak regressions.
For MemCaser I have filed an internal issue:
https://github.com/whimboo/memchaser/issues/73
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [MemShrink]
Version: unspecified → Trunk
This could be useful also for Snappy, less leaks means more responsiveness.
Whiteboard: [MemShrink] → [MemShrink][Snappy]
Smaug, do you have any more specific ideas about how this would be done?  Would this be done in an add-on?
Whiteboard: [MemShrink][Snappy] → [MemShrink:P2][Snappy]
An addon should work ok. It could run CC after each test and analyze the log to see if there are zombie documents - as an example
Attached patch WIP (obsolete) — Splinter Review
I started to write a patch for this some weeks ago but didn't find the time to continue/finish it. Maybe someone is willing to pick up the work.
A good next step for the attached patch would be to track the creation of nsDocuments, assign those to the tests that created them. So we know which test leaked and can make them fail when printing the reports.
CC log knows the URI of the documents.
Assignee: nobody → ttaubert
Attachment #620127 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Blocks: 759711
Blocks: 753448
Comment on attachment 642547 [details] [diff] [review]
Part 4 - Analyze cycle collection logs on testsuite shutdown to detected leaked windows

>+const TODO_PERMANENT_LEAKS = {
>+  "chrome://mochitests/content/browser/browser/base/content/test/browser_pageInfo.js":
>+    "nsGlobalWindow chrome://browser/content/pageinfo/pageInfo.xul",
>+  "chrome://mochitests/content/browser/browser/components/places/tests/browser/browser_views_liveupdate.js":
>+    "nsGlobalWindow chrome://browser/content/bookmarks/bookmarksPanel.xul"

I disabled browser_pageInfo.js. Bug 759709 claims that browser_views_liveupdate.js leaks intermittently. If it's permanent, we should probably disable that test too.

By the way, does this roughly detect the same leaks that the previous logger did? Do you see the same amount of intermittent leaks?
(In reply to Dão Gottwald [:dao] from comment #12)
> I disabled browser_pageInfo.js.

Ok, thank you for the hint.

> Bug 759709 claims that
> browser_views_liveupdate.js leaks intermittently. If it's permanent, we
> should probably disable that test too.

Looking at all the TBPL bot comments in bug 759711 it seems permanent to me. But I'm not totally sure about that. Couldn't reproduce it locally when running the test standalone.

> By the way, does this roughly detect the same leaks that the previous logger
> did? Do you see the same amount of intermittent leaks?

I don't know for sure, that's why I didn't ask for review, yet :) I pushed to try and hope to catch some intermittent leaks so that we'll see if this produces the same results as before just without blaming tests creating the tabview window or the preloaded new tab.

https://tbpl.mozilla.org/?tree=Try&rev=ba472ba1cf43
Disabling tests? That doesn't sounds like a good idea. There has been times when some test has been
disabled and then before re-enabling it there were regressions which the test would have caught.
(But I'm not a browser/toolkit peer so if you think that won't be a problem, then ok :) )
browser_pageInfo.js doesn't cover anything important. I don't know about browser_views_liveupdate.js. Ideally whoever feels responsible for that code would care about the regression risk enough to invest in getting the test to run properly.
So, even after quite a lot of m-oth try runs there was only one intermittent leak:

TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/base/content/test/browser_sanitizeDialog.js | leaked window until shutdown [nsGlobalWindow chrome://browser/content/bookmarks/bookmarksPanel.xul]

That's actually the intermittent leak from browser/components/places/tests/browser/browser_views_liveupdate.js. I assigned it to the wrong test because my patch didn't assume that window addresses can be re-used, which I fixed now.

I actually hunted down the re-used address in the build log to confirm it's really this leak. We now know it's intermittent, too.
Patch takes now into account that window addresses can be re-used. Removed the todo leaks because there shouldn't be any permanent leaks left.
Attachment #642547 - Attachment is obsolete: true
(In reply to Tim Taubert [:ttaubert] from comment #16)
> So, even after quite a lot of m-oth try runs there was only one intermittent
> leak:

So, will your changes cause us to miss things we don't want to miss or are the current intermittent leaks false reports?
(In reply to Dão Gottwald [:dao] from comment #18)
> So, will your changes cause us to miss things we don't want to miss or are
> the current intermittent leaks false reports?

I pushed to try again because I think the current reports aren't false. I was only listening to "domwindowopened" which only fires for 'real windows' being opened.

The patch does now register observers for "{chrome,content}-document-global-created" which should cover every nsGlobalWindow instance created, including iframes and inner windows when navigating to a different page.

I'll expect more intermittent leaks to appear because some of them are not about the whole window but only an inner window.

https://tbpl.mozilla.org/?tree=Try&rev=18e61da744b2
(In reply to Tim Taubert [:ttaubert] from comment #16)
> So, even after quite a lot of m-oth try runs there was only one intermittent
> leak:
> 
> TEST-UNEXPECTED-FAIL |
> chrome://mochitests/content/browser/browser/base/content/test/
> browser_sanitizeDialog.js | leaked window until shutdown [nsGlobalWindow
> chrome://browser/content/bookmarks/bookmarksPanel.xul
Is this Bug 728426 ?
Comment on attachment 642544 [details] [diff] [review]
Part 2 - Add GetOuterWindowAddress() to nsIDOMWindowUtils

You should put IsUniversalXPConnectCapable() check to nsDOMWindowUtils::GetOuterWindowAddress
(In reply to Olli Pettay [:smaug] from comment #20)
> Is this Bug 728426 ?

Possibly? It's definitely bug 759709 which no one took care of so far. They should probably depend on each other.

(In reply to Olli Pettay [:smaug] from comment #21)
> You should put IsUniversalXPConnectCapable() check to
> nsDOMWindowUtils::GetOuterWindowAddress

Thanks, will do!
Added a IsUniversalXPConnectCapable() check at the top of GetOuterWindowAddress().
Attachment #642544 - Attachment is obsolete: true
> TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/devtools/debugger/test/browser_dbg_breakpoint-new-script.js | leaked window until shutdown [nsGlobalWindow http://example.com/browser/browser/devtools/debugger/test/browser_dbg_breakpoint-new-script.html]

> TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/base/content/test/browser_urlbarCopying.js | leaked window until shutdown [nsGlobalWindow chrome://browser/content/bookmarks/bookmarksPanel.xul]

That's browser_views_liveupdate.js again, dammit.

> TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/browser/base/content/test/browser_tab_dragdrop.js | leaked window until shutdown [nsGlobalWindow chrome://browser/content/bookmarks/bookmarksPanel.xul]

Also browser_views_liveupdate.js.


All of these bookmarksPanel.xul leaks were inner windows that re-used addresses from outer windows and that's why there were assigned to the wrong tests. That means we should also be missing some other leaks here that didn't re-use an outer window's address. Still need to figure this out but everything else looks like we have the same intermittent leaks as before just without the false positives. Hooray.
Mapping windows to addresses is hard, they're re-used. We should instead use the window's id and listen for inner windows, only. Outer windows will only be kept alive by inner windows and inner windows have URLs that we can use for debugging and fixing those leaks.
Attachment #642933 - Attachment is obsolete: true
We need the leaked url and the window id as well.
Attachment #642545 - Attachment is obsolete: true
Whiteboard: [MemShrink:P2][Snappy] → [MemShrink:P2][Snappy:p2]
Blocks: 759709
This removes the old leak analyzer that parses stdout in debug builds.
Attachment #642537 - Attachment is obsolete: true
Attachment #644882 - Flags: review?(ted.mielczarek)
We currently have {outer,inner}-window-destroyed notifications but there's no similar thing when windows are created.

The leak detection does not differentiate between outer and inner windows, that's why there's only a "global-window-created" notification and it's only fired in debug mode.

Do you think we should have {outer,inner}-window-created notifications for opt and dbg builds or would it be fine to leave it as is? I'm not sure that having a notification debug-only is a good thing :)
Attachment #643118 - Attachment is obsolete: true
Attachment #644883 - Flags: review?(bugs)
Debug info, yay!
Attachment #643120 - Attachment is obsolete: true
Attachment #644884 - Flags: review?(bugs)
Attachment #643122 - Attachment is obsolete: true
Attachment #644885 - Flags: review?(ted.mielczarek)
There is chrome/content-document-global-created
Comment on attachment 644884 [details] [diff] [review]
Part 2 - Add last opened URI and windowID to a refCounted node's description in debug mode

"chrome-document-global-created"/"content-document-global-created"
should be enough.
Subject is the window and you can get inner/outerwindow id using
the methods in DOMWindowUtils, IIRC.
Attachment #644884 - Flags: review?(bugs) → review-
Comment on attachment 644884 [details] [diff] [review]
Part 2 - Add last opened URI and windowID to a refCounted node's description in debug mode

Er, I managed to r- wrong patch.
Attachment #644884 - Flags: review- → review+
Attachment #644883 - Flags: review?(bugs) → review-
Comment on attachment 644883 [details] [diff] [review]
Part 2 - Add global-window-created notification for debug builds

(In reply to Olli Pettay [:smaug] from comment #33)
> "chrome-document-global-created"/"content-document-global-created"
> should be enough.
> Subject is the window and you can get inner/outerwindow id using
> the methods in DOMWindowUtils, IIRC.

That's easier. We don't need this patch.
Attachment #644883 - Attachment is obsolete: true
Attachment #644884 - Attachment description: Part 3 - Add last opened URI and windowID to a refCounted node's description in debug mode → Part 2 - Add last opened URI and windowID to a refCounted node's description in debug mode
Using {chrome,content}-document-global-created now.
Attachment #644885 - Attachment is obsolete: true
Attachment #644885 - Flags: review?(ted.mielczarek)
Attachment #644909 - Flags: review?(ted.mielczarek)
I'm still working on this. Trying to sort out a problem with patch 2 where it doesn't always show the right URL - it's sometimes "(null)" - which is quite tedious with the current try situation and given that all those leaks are of course intermittent...
Attachment #644882 - Flags: review?(ted.mielczarek) → review+
Comment on attachment 644909 [details] [diff] [review]
Part 3 - Analyze cycle collection logs on testsuite shutdown to detected leaked windows

Review of attachment 644909 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mochitest/browser-test.js
@@ +61,5 @@
>               getService(Ci.nsIWindowMediator);
>    this._fm = Cc["@mozilla.org/focus-manager;1"].
>               getService(Ci.nsIFocusManager);
> +  this._os = Cc["@mozilla.org/observer-service;1"].
> +             getService(Ci.nsIObserverService);

We could probably switch some of these to use Services.jsm.

@@ +212,5 @@
> +    let utils = aWindow.QueryInterface(Ci.nsIInterfaceRequestor)
> +                       .getInterface(Ci.nsIDOMWindowUtils);
> +
> +    let outerID = utils.outerWindowID;
> +    if (!this.openedWindows.hasOwnProperty(outerID))

Is hasOwnPropery overkill here, can't you just use "outerID in this.openedWindows"?

Alternately, if you wanted to be more bleeding-edge you could use Map.

::: testing/mochitest/cc-analyzer.js
@@ +1,5 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this file,
> + * You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +function CCAnalyzer() {

I don't really have the brain cycles to figure out how this CCAnalyzer works. Can you have someone who knows the CC review this bit?
Attachment #644909 - Flags: review?(ted.mielczarek) → review+
Yeah, smaug or I can probably review the CCAnalyzer.
Comment on attachment 644909 [details] [diff] [review]
Part 3 - Analyze cycle collection logs on testsuite shutdown to detected leaked windows

Just marking this for r? so it doesn't get forgotten.
Attachment #644909 - Flags: review?(continuation)
Attachment #644909 - Flags: review?(bugs)
Ok, here's a question that I wanted to ask mccr8 yesterday but I figured more people CC'ed here might answer this:

I'm having quite a hard time with part 2 of this bug. All we need to have is the windowID (which works fine) and the last opened URL as the description for the refcounted node.

Now, when running these patches on try I sometimes get a correct url and sometimes I just get "(null)" which doesn't really make sense because nsGlobalWindow::mLastOpenedURI is never set to null anywhere and we're cycle collecting even before nsGlobalWindow's destructor is called (that is able to print the correct url).

https://tbpl.mozilla.org/php/getParsedLog.php?id=13878699&tree=Try&full=1

If you search for "serial = 10148" in the build log you see the debug output we get from nsGlobalWindow's constructor and its destructor. Both show the correct URL. The refcounted node's description contains "(null)" for whatever reason:

> TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/social/test/browser/browser_workerAPI.js | leaked until shutdown [nsGlobalWindow #10148 (null)]

Another thing I don't understand is, that the window is clearly an inner window when created:

> ++DOMWINDOW == 93 (0xf0dd484) [serial = 10148] [outer = 0xb12c9b0]

When destroyed it seems like mOuterWindow is null:

> --DOMWINDOW == 3 (0xf0dd484) [serial = 10148] [outer = (nil)] [url = https://motown-dev.mozillalabs.com/social/sidebar]

although we never explicitly set it to null. If someone has any idea what's going on I'd really appreciate some help as that is the last part we need for this bug to be fixed. We can't land this without proper URLs to debug the leaks.
Depends on: 778433
Comment on attachment 644909 [details] [diff] [review]
Part 3 - Analyze cycle collection logs on testsuite shutdown to detected leaked windows

>+  runCC: function (aCounter) {
>+    let utils = window.QueryInterface(Ci.nsIInterfaceRequestor).
>+        getInterface(Ci.nsIDOMWindowUtils);
>+
>+    if (aCounter > 1) {
>+      utils.garbageCollect();
>+      setTimeout(this.runCC.bind(this, aCounter - 1), 125);
Why 125? If this needs to be async, wouldn't 0 be just fine?


>+  processLog: function () {
>+    // Process entire heap step by step in 5K chunks
>+    for (let i = 0; i < 5000; i++) {
>+      if (!this.listener.processNext(this)) {
>+        this.callback();
>+        this.clear();
>+        return;
>+      }
>+    }
>+
>+    // Next chunk on timeout.
>+    setTimeout(this.processLog.bind(this), 125);
Same here

Not all the stuff in the analyzer is needed by the patch, but they can be useful for debugging.
Attachment #644909 - Flags: review?(bugs) → review+
Attachment #644909 - Flags: review?(continuation)
As stated in comment #41, printing the mLastOpenedURI isn't reliable for some reason. The new solution is to record window.location.href on "chrome-document-global-created" and "content-document-global-created" which is implemented in part 3.

mLastOpenedURI was debug-only, so the good thing now is that we have leak detection for opt builds as well!
Attachment #644884 - Attachment is obsolete: true
Attachment #648306 - Flags: review?(bugs)
Addresses all comments and using Services.jsm now where possible. Recording the window's location on "chrome-document-global-created" and "content-document-global-created" to have reliable debug URLs.
Attachment #644909 - Attachment is obsolete: true
Attachment #648307 - Flags: review?(bugs)
After a couple of test runs I found six instances of

browser_workerAPI.js | leaked until shutdown [nsGlobalWindow #10396 https://motown-dev.mozillalabs.com/social/sidebar]

always with the right URL assigned to it which means that the patch works as expected.
Attachment #648306 - Flags: review?(bugs) → review+
Do we actually need to remove the old leak detector? It can, as far as I know, detect different kinds 
of problems. Cycle collector graph tells what ends up being traversed during CC
(so it affects especially to responsiveness), but if a leaked global window doesn't end up to
cc graph, the leak won't be noticed. (I don't know whether it is in practice possible to
get globalwindow leaks which don't show up in the cc graph)
I don't know about the different kinds of leaks it can detect and whether it's in practice possible to get globalwindow leaks which then don't show up in the cc graph. But the old leak detector causes quite some trouble with its false-positives. We'd need to introduce some kind of exclusion list for every new feature that expectedly "leaks" like bug 753448 or anything touching the sidebar that then leaks the outer window.
Ah, I see.
Comment on attachment 648307 [details] [diff] [review]
Part 3 - Analyze cycle collection logs on testsuite shutdown to detected leaked windows

I guess this could work well enough.
There are cases when we reuse innerwindow, hopefully they
don't affect to output badly.
Attachment #648307 - Flags: review?(bugs) → review+
Aurora/beta are affected by the steady stream of false positives from the old ShutdownLeakLogger. Is this something you would be happy to backport to one or both, or should I file a bug to raise the MAX_LEAK_COUNT on them, now that we know most of the leaking tests are false positives?
No longer blocks: 776553
Whiteboard: [MemShrink:P2][Snappy:p2][fixed-in-fx-team] → [MemShrink:P2][Snappy:p2]
Attached patch patch for auroraSplinter Review
Consolidated patches for Aurora.
Attached patch patch for betaSplinter Review
Consolidated patches for Beta.
Comment on attachment 649104 [details] [diff] [review]
patch for aurora

[Approval Request Comment]
Bug caused by (feature/regressing bug #): None.
User impact if declined: None.
Testing completed (on m-c, etc.): Landed on m-c.
Risk to taking this patch (and alternatives if risky): Very low risk, affects mochitest suite only.
String or UUID changes made by this patch: None.

We have quite a few false positives on Beta/Aurora. Applying those patches would certainly take some load off of the tree maintainers.
Attachment #649104 - Flags: approval-mozilla-aurora?
Comment on attachment 649105 [details] [diff] [review]
patch for beta

[Approval Request Comment]
Bug caused by (feature/regressing bug #): None.
User impact if declined: None.
Testing completed (on m-c, etc.): Landed on m-c.
Risk to taking this patch (and alternatives if risky): Very low risk, affects mochitest suite only.
String or UUID changes made by this patch: None.

We have quite a few false positives on Beta/Aurora. Applying those patches would certainly take some load off of the tree maintainers.
Attachment #649105 - Flags: approval-mozilla-beta?
Attachment #649104 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 649105 [details] [diff] [review]
patch for beta

low risk, glad it helps with tree maintenance too, approving.
Attachment #649105 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Backed out :/

https://hg.mozilla.org/releases/mozilla-aurora/rev/59cd03317e0b
https://hg.mozilla.org/releases/mozilla-beta/rev/078ece5918e2

I totally forgot about bug 728426 which turned perma-orange. We need to wait until this has approval as well.
thought it worth mentioning even if you are all more familiar then me. Currently, buildbot will append a tinderboxprint line to each unittest logs which is parsed by the tbpl web interface and provides summaries like if a test leaked or not. (note this is not for setting the appropriate result colour)

Our regex looks like(for mochitest and others): 

harnessErrorsRe = re.compile(r"TEST-UNEXPECTED-FAIL \| .* \| (Browser crashed \(minidump found\)|missing output line for total leaks!|negative leaks caught!|leaked \d+ bytes during test execution)")

if this is out of date with any new changes, I would be happy to help make sure we stay current on our end.

this code is from summarizeLog():
http://mxr.mozilla.org/build/source/buildbotcustom/steps/unittest.py#77
(In reply to Jordan Lund (:jlund) from comment #60)
> harnessErrorsRe = re.compile(r"TEST-UNEXPECTED-FAIL \| .* \| (Browser
> crashed \(minidump found\)|missing output line for total leaks!|negative
> leaks caught!|leaked \d+ bytes during test execution)")

Iirc, (the end of) this regex is related to --enable-logrefcnt.
New cc-analyzer is something different, as removed ShutdownLeakLogger was.
I don't know whether new strings should be added, but none should be removed (wrt this bug).
Depends on: 785228
Whiteboard: [MemShrink:P2][Snappy:p2] → [MemShrink:P2][Snappy:p2][qa-]
Depends on: 932898
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: