1177786 - Reduce the peak number of windows in browser chrome and devtools mochitests

Reporter

Description

•

9 years ago

Today in our intermittent orange discussion, we discussed forcing a GC/CC between every test run in order to possibly improve test isolation and reduce timeouts caused by infrequent garbage collection.

Martijn Wargers (dead)

Comment 1

•

9 years ago

Reminds me of bug 949607.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 2

•

9 years ago

For the record, this came up in conversation as we've noticed mochitest-bc runs with 500+ active DOMWINDOWs even with run-by-dir enabled. Strongly suspect this is a contributing factor to the frequent timeouts we see on the single-core AWS instances we use.

Andrew McCreight [:mccr8]

Assignee

Comment 3

•

9 years ago

We already poke the collectors and do some work on them in between every test:

http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/tests/SimpleTest/iframe-between-tests.html?force=1#14

kPokesBetweenExpensiveCollectorTriggers could be decreased so we collect more, though this will cause tests to be slower.

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2)
> For the record, this came up in conversation as we've noticed mochitest-bc
> runs with 500+ active DOMWINDOWs even with run-by-dir enabled. Strongly
> suspect this is a contributing factor to the frequent timeouts we see on the
> single-core AWS instances we use.

That sounds like a problem with those tests that should probably be fixed in those tests. Which ones are they?

Andrew McCreight [:mccr8]

Assignee

Comment 4

•

9 years ago

It is not okay to just force a GC and CC because then we're not actually testing incremental GC or CC on TreeHerder.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 5

•

9 years ago

Yes, if there's a problem here, we need to fix the GC/CC heuristics.

:Gijs (he/him)

Comment 6

•

9 years ago

(In reply to Andrew McCreight [:mccr8] from comment #3)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2)
> > For the record, this came up in conversation as we've noticed mochitest-bc
> > runs with 500+ active DOMWINDOWs even with run-by-dir enabled. Strongly
> > suspect this is a contributing factor to the frequent timeouts we see on the
> > single-core AWS instances we use.
> 
> That sounds like a problem with those tests that should probably be fixed in
> those tests. Which ones are they?

I don't think we have tests that open that many windows, but I expect we do have cases where the windows stick around for a "long" time and don't get GC'd (for whatever reason) causing the buildup.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 7

•

9 years ago

Routinely runs in the 200+ range, peaking at ~290.
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux-debug/1435708545/mozilla-central_ubuntu32_vm-debug_test-mochitest-browser-chrome-1-bm03-tests1-linux32-build66.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux-debug/1435708545/mozilla-central_ubuntu32_vm-debug_test-mochitest-browser-chrome-2-bm01-tests1-linux32-build63.txt.gz

Peaks in the 300+ range.
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-debug/1435708545/mozilla-central_ubuntu64_vm-debug_test-mochitest-browser-chrome-1-bm116-tests1-linux64-build5.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-debug/1435708545/mozilla-central_ubuntu64_vm-debug_test-mochitest-browser-chrome-3-bm123-tests1-linux64-build1.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1435708545/mozilla-central_snowleopard-debug_test-mochitest-browser-chrome-2-bm108-tests1-macosx-build199.txt.gz

Over 480.
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64-debug/1435708545/mozilla-central_win8_64-debug_test-mochitest-browser-chrome-1-bm111-tests1-windows-build60.txt.gz

Over 400.
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64-debug/1435708545/mozilla-central_win8_64-debug_test-mochitest-browser-chrome-2-bm111-tests1-windows-build78.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64-debug/1435708545/mozilla-central_win8_64-debug_test-mochitest-browser-chrome-3-bm119-tests1-windows-build121.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1435708545/mozilla-central_snowleopard-debug_test-mochitest-browser-chrome-3-bm107-tests1-macosx-build225.txt.gz

Almost 400.
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64-debug/1435708545/mozilla-central_snowleopard-debug_test-mochitest-browser-chrome-1-bm107-tests1-macosx-build208.txt.gz

Sessionstore tests tend to be particularly bad actors, FWIW. Sorry I couldn't find a 500+ instance offhand, but I only spent 5-10min going through logs.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 8

•

9 years ago

Then it occurred to me that I hadn't looked at any devtools logs. Literally the first one I opened:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-debug/1435708545/mozilla-central_ubuntu64_vm-debug_test-mochitest-devtools-chrome-1-bm51-tests1-linux64-build5.txt.gz

18:27:41     INFO -  ++DOMWINDOW == 785 (0x7f5bbb3d9000) [pid = 2273] [serial = 1495] [outer = 0x7f5bb4aed800]

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 9

•

9 years ago

An issue is of course that GC/CC triggering heuristics need to be tuned for real world usage - so don't run all the time, but while running tests we do unusual stuff, like open tons of new pages and windows all the time.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 10

•

9 years ago

(In reply to Olli Pettay [:smaug] from comment #9)
> An issue is of course that GC/CC triggering heuristics need to be tuned for
> real world usage - so don't run all the time, but while running tests we do
> unusual stuff, like open tons of new pages and windows all the time.

How does this jive with comments 4 and 5? If tests aren't a real-world scenario, then isn't trying to test incremental GC/CC nebulous?

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 11

•

9 years ago

I think we need to trigger GC/CC explicitly if there are tons of Window objects alive while testing, and in other cases rely on the incremental GC/CC.

mccr8, do you recall if we have some easy way in mochitest framework to know how many windows are alive?

Flags: needinfo?(continuation)

(not currently active) Ted Mielczarek

Comment 12

•

9 years ago

If we don't currently have that I'm sure we could figure out a way to expose nsGlobalWindow's gRefCnt via nsIDebug2.

Andrew McCreight [:mccr8]

Assignee

Comment 13

•

9 years ago

(In reply to Olli Pettay [:smaug] from comment #11)
> mccr8, do you recall if we have some easy way in mochitest framework to know
> how many windows are alive?

As Ted said, this is already tracked in the DOM. A heuristic based on the number of windows that are created, rather than that are alive, might work okay without doing weird stuff to tab hoarders.

Flags: needinfo?(continuation)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 14

•

9 years ago

I was reminded of this bug today while looking at a log with >520 active DOMWINDOWs. Can we proceed with comment 13 if everyone is satisfied with that approach?

Jonathan, given comment 12, it sounds like the platform already exposes the necessary info to the harness. Is there anybody available to look into creating such a heuristic?

Flags: needinfo?(jgriffin)

Jonathan Griffin (:jgriffin)

Comment 15

•

9 years ago

Since this heuristic would affect shipping code, I'm not sure anyone on the a-team would be in the best position to do this.

Flags: needinfo?(jgriffin)

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 16

•

9 years ago

Maybe I misunderstood comment 13. It sounded to me like the harness was going to be monitoring the number of active DOMWINDOWs and triggering a GC/CC if it went over a certain threshold.

Jonathan Griffin (:jgriffin)

Comment 17

•

9 years ago

Since comment #13 mentions tab hoarders, I thought we were talking about changing the GC/CC heuristic in general.  Andrew, can you clarify?

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Assignee

Comment 18

•

9 years ago

Sorry I haven't replied here yet. I have been looking into this off and on. I wrote some scripts that let you look at the number of windows alive at each point in handy second-by-second chart form, or figure out which tests have the most windows opened at once, or open the most windows.

Something is definitely wacky with some of these tests. Something seems to be going horribly wrong during the markup view devtools tests. This is where we hit a peak of 700 windows. If you look at the number of windows that are open, it mostly just keeps going up and up. It does drop a small amount a few times, but mostly the windows stick around until (I think) we shut down.

Here's one example.

We start the test:
09:32:09     INFO -  2921 INFO TEST-START | browser/devtools/markupview/test/browser_markupview_node_not_displayed_01.js
A few seconds later we create this window:
09:32:11     INFO -  ++DOMWINDOW == 183 (0x7fe2db5ba400) [pid = 25884] [serial = 1773] [outer = (nil)]
A few seconds later we finish the test:
09:32:13     INFO -  2942 INFO TEST-OK | browser/devtools/markupview/test/browser_markupview_node_not_displayed_01.js | took 4648ms
Then, _four minutes later_, when we're shutting down, we finally destroy that window:
09:36:17     INFO -  --DOMWINDOW == 415 (0x7fe2db5ba400) [pid = 25884] [serial = 1773] [outer = (nil)] [url = chrome://browser/content/devtools/markup-view.xhtml]

I'll try to figure out if this is because we're doing a really bad job at scheduling the GC and CC or if this is some kind of leak-until-shutdown.

Assignee: nobody → continuation

Summary: Force a CC/GC between all test runs during mochitests → Reduce the peak number of windows in browser chrome and devtools mochitests

Andrew McCreight [:mccr8]

Assignee

Comment 19

•

9 years ago

Here's a link to my mochitest log analysis script:
https://github.com/amccreight/mochitest-logs/blob/master/bloatfinder.py

It is fairly simple and you have to edit the script to get it to produce different output, but it is pretty good for getting a basic handle on where the problem areas in a run are.

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Assignee

Comment 20

•

9 years ago

Attached file example output for devtools1 — Details

Here's some example output, though it isn't super useful without the original log. Each asterisk represents 10 windows open at a given second.

Andrew McCreight [:mccr8]

Assignee

Comment 21

•

9 years ago

I tried running with AllTraces (which slows down the CC a lot) to figure out why the windows were alive, and that resulted in the number of live windows at the start of browser_markupview_events_jquery_1.6.js dropping from 543 to 119, which suggests that the problem is that we're not doing enough collection work. (Collection work is largely timer-based, so if tests run slower, we'll do more collection work per test.)

Andrew McCreight [:mccr8]

Assignee

Comment 22

•

9 years ago

It looks like we're not calling into the runNextCollectorTimer() function I added in bug 949607 in browser Mochitests. I added a call to it for every time we produce a test result (just doing it at the end of a test didn't help too much) and that reduced the number of live windows on some test from around 500 to around 280. I'll see what that looks like in a full run. Hopefully there's not too much slowdown from the additional collecting.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d666188021d0

Andrew McCreight [:mccr8]

Assignee

Comment 23

•

9 years ago

That didn't seem to reduce the number of windows on try. Maybe I should just add some manual forced collections in the few directories that are a problem.

Kris Maglione [:kmag]

Updated

•

8 years ago

Blocks: 1239884

Andrew McCreight [:mccr8]

Assignee

Comment 24

•

5 years ago

This is probably still an issue, but this bug never went anywhere.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → INCOMPLETE