Closed Bug 983948 Opened 6 years ago Closed 6 years ago

Linux debug mochitest-browser-chrome tests are taking much longer to complete & frequently timing out after excessive GCing

Categories

(DevTools :: General, defect, critical)

x86
Linux
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: KWierso, Assigned: miker)

References

(Blocks 2 open bugs)

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

Linux debug bc runs have been failing frequently (but not constantly) all day today. We've been starring most of them as intermittent and retriggering them, but we should probably figure out why they're failing.

Not sure why this is only affecting Linux debug and not Linux64 debug.

The earliest instance I see on mozilla-inbound is https://tbpl.mozilla.org/php/getParsedLog.php?id=36145293&tree=Mozilla-Inbound from Tomcat's merge from m-c back to inbound. I haven't tracked down where this actually started (and I'm retriggering the pushes prior to the merge to hopefully make sure it did come from the merge).

https://tbpl.mozilla.org/php/getParsedLog.php?id=36145293&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36150556&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36153768&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36157278&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36164105&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36168057&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36169082&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36178922&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36167205&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36177299&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36181546&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36171967&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36174847&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36175835&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36177256&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36186007&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36180387&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36176803&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36183463&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36185567&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=36186956&tree=Mozilla-Inbound
They all seem to be failing during or soon after the tests in content/browser/toolkit/mozapps/extensions/test/xpinstall/ run.
Odd, it started happening with about the same frequency on fx-team right after that same merge from m-c, but I'm not seeing it happen on m-c itself... Retriggering some stuff there.
Blocks: 983941
Bug 772484 was a recent landing that touched those tests, FWIW.
Those tests are just the victim of being the last tests in the suite: if someone takes it from being 7 minutes away from maxtime to being 1 minute over maxtime, then those are what will be running at maxtime.

You don't actually need to retrigger, so much as just look at how long the (approximately) passing runs took. Thursday morning on fx-team they were taking around 157/158 minutes, or put more accurately around 9 or 10 minutes less than the suite's maxtime. Bug 663778 landed, and took it to 30 seconds under, or 30 seconds over, maxtime.

Top Men are working on figuring out what we can do in the meantime, and what we can do to speed those tests back up, but we can live with it until bug 983941 catches a reconfig on Monday and we get our new pair of pants with a 68 inch waist.
Blocks: 663778
Component: Add-ons Manager → Developer Tools
Product: Toolkit → Firefox
Version: 28 Branch → Trunk
And lest you think the move to AWS slaves and running browser-chrome in three chunks will save you, https://tbpl.mozilla.org/php/getParsedLog.php?id=36195513&tree=Mozilla-Inbound is one of several where your chunk has exceeded its 4800 second share.
Disabling the highlighter and comparing reveals that debugging on OSX shows an average slowdown of only 0.03 seconds for tests that use the highlighter. Looks like I will need to look into this on a Linux box... wondering if Linux SVG performance could be the culprit here.
Attached file Slow Test Analysis.ods
Interestingly, the worst affected test appears to be browser/toolkit/mozapps/extensions/test/browser/test-window/browser_bug608316.js, which shows an increase of 963%. In fact, 5 out of the worst 10 offenders are test-window tests all showing at least 500% slowdown although this could be down to server load.

In bc runs where we reach the global timeout the DevTools tests are taking around 2.7 minutes longer to run and the rest are taking 5.36 minutes longer. To put this in context our tests are reaching the global timeout of 2.75 hours. I suspect that these small increases on a busy server are simply the straw that broke the camels back.

Of course, some tests are timing out so these should be split up and we can optimimize the highlighter, look at the worst offending tests etc. so I will continue working on this.
(In reply to Michael Ratcliffe [:miker] [:mratcliffe] from comment #75)
> Created attachment 8392804 [details]
> Slow Test Analysis.ods
> 
> Interestingly, the worst affected test appears to be
> browser/toolkit/mozapps/extensions/test/browser/test-window/
> browser_bug608316.js, which shows an increase of 963%. In fact, 5 out of the
> worst 10 offenders are test-window tests all showing at least 500% slowdown
> although this could be down to server load.
> 

Turns out it is just GC cleaning up old docshells and windows.

I think the best solution since the camels back is breaking would be to create a mochitest-dt testsuite and move the devtools tests into there... would anybody be opposed to us doing this?
Are we sure we haven't regressed GC?

eg There are a bunch of new linux browser-chrome intermittents similar to bug 984843, which show timeouts during excessive GCing.
(In reply to Ed Morley [:edmorley UTC+0] from comment #77)
> Are we sure we haven't regressed GC?
> 
> eg There are a bunch of new linux browser-chrome intermittents similar to
> bug 984843, which show timeouts during excessive GCing.

You are right... GC does appear to be far more aggressive even when running single tests. This wouldn't be related to the devtools changes but we should take a look.
Example newly filed linux browser-chrome timeout intermittents that seem to be caused by GC:
Bug 984843
Bug 984854
Bug 984856
Bug 985348
Bug 984666
Bug 984661
Bug 984659
Bug 984657
Bug 951477
Bug 985239
Bug 984851
Summary: Intermittent Linux debug m-bc runs failing with | command timed out: 9900 seconds elapsed, attempting to kill | During or after xpinstall tests → Linux debug mochitest-browser-chrome tests are taking much longer to complete & frequently timing out after excessive GCing
Severity: normal → critical
One thing to note is that the work to force extra GCs that happened in bug 938945 doesn't run in BC, at least in my local testing.  If we run the GC more often, they will hopefully take less time.  I have a slightly more sophisticated approach to this in bug 949607.
Pushed https://hg.mozilla.org/mozilla-central/rev/51505a8f2a44 to temporarily increase the default timeout from 30 seconds to 45 seconds, which I expect will take this from grotesque (I was surprised that my try push only got one green out of five on linux32 debug, but when I counted back on mozilla-inbound we'd only gotten one green out of the last thirty runs) to merely awful.

I think Critical severely underestimates the severity of this bug: I pushed that because we were seriously talking about doing one of closing all trunk trees, or hiding browser-chrome, or disabling wide swathes of tests, on Friday.
(In reply to Phil Ringnalda (:philor) from comment #81)
> I think Critical severely underestimates the severity of this bug: I pushed
> that because we were seriously talking about doing one of closing all trunk
> trees, or hiding browser-chrome, or disabling wide swathes of tests, on
> Friday.

Those last two ("disable/hide lots of tests") aren't really an option - if the tree should be closed the tree should be closed.
What comment 76 describes is going to happen if the GC isn't getting enough of a chance to run.  When it finally does manage to trigger, then whatever it triggers during will take a really long time because there's so much to clean up, but it will probably cause whatever test it is running during to time out.  Doing something like bug 938945, but during BC, should help.  I'm not sure what the right place would be to trigger occasional GCs in the BC test harness is, though.
60 seconds, https://tbpl.mozilla.org/?tree=Try&rev=d7b7a16a08de, doesn't smell significantly better (as I suspected from the number of timeout-threshold failures that were finishing between 30 and 40 seconds).
I took another look at this and GC is cleaning windows and docshells from previous tests causing other tests to timeout.

Would it not make sense to force GC when all windows are closed between each test? This way the GC wouldn't cause test timeouts.
Yes, you basically want to do something like this:
  https://hg.mozilla.org/mozilla-central/rev/c3d8ec0e53c7
eg every X window closes, call DOMWindowUtils.garbageCollect(), which will trigger the GC and CC.  I don't know what the proper place is for BC.
Assignee: nobody → mratcliffe
GCing between every 2 or 3 tests is not enough during tests that open lots of windows as some of them tend to get GCd during the next test causing timeouts.

This ridiculously simple patch CCs & GCs between tests so that GC is less likely to eat into actual test time.

I am not sure why tab was sometimes not defined in tabbrowser.xml but a added the check there as a "Tab is undefined" error occurred with this patch and didn't without it.

https://tbpl.mozilla.org/?tree=Try&rev=2c82f57975dd
Attachment #8398521 - Flags: review?(rcampbell)
Comment on attachment 8398521 [details] [diff] [review]
GC between each test

Review of attachment 8398521 [details] [diff] [review]:
-----------------------------------------------------------------

tabbrowser thing is surprising. Might want a browser peer to take a look.
Attachment #8398521 - Flags: review?(rcampbell)
Attachment #8398521 - Flags: review?(gijskruitbosch+bugs)
Attachment #8398521 - Flags: review+
Comment on attachment 8398521 [details] [diff] [review]
GC between each test

Review of attachment 8398521 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me to get our trees in a happier place, but please file a followup bug so we can investigate and get stacks when there's no tab. We need to figure out how that's even being called with a null tooltip node, only because we're suddenly GC'ing more...
Attachment #8398521 - Flags: review?(gijskruitbosch+bugs) → review+
Running this on cedar to see how it plays with mochitest-dt and the chunked browser-chrome suites.
https://tbpl.mozilla.org/?tree=Cedar&rev=8c337a10e8d3
(In reply to :Gijs Kruitbosch (no internet 29 Mar - 6 Apr) from comment #89)
> Comment on attachment 8398521 [details] [diff] [review]
> GC between each test
> 
> Review of attachment 8398521 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> rs=me to get our trees in a happier place, but please file a followup bug so
> we can investigate and get stacks when there's no tab. We need to figure out
> how that's even being called with a null tooltip node, only because we're
> suddenly GC'ing more...

Bug 989362 logged.
Do we know why we suddenly need to gc more?

Making us gc between each test is just wallpapering over the issue unless the increase in need to gc is expected?
Bug 991311 could be related, I guess.
Comment on attachment 8398521 [details] [diff] [review]
GC between each test

>diff --git a/browser/base/content/tabbrowser.xml b/browser/base/content/tabbrowser.xml

>       <method name="createTooltip">

>           var tab = document.tooltipNode;

>-          if (tab.localName != "tab") {
>+          if (!tab || tab.localName != "tab") {

Perhaps worth asking one of the Neil's about this, it sounds related to http://hg.mozilla.org/mozilla-central/annotate/5fa70bd90a8b/layout/xul/nsXULPopupManager.cpp#l1429.

(createTooltip is called from a tab tooltip's popupshowing handler)
(In reply to Ed Morley [:edmorley UTC+0] from comment #94)
> Do we know why we suddenly need to gc more?
> 
> Making us gc between each test is just wallpapering over the issue unless
> the increase in need to gc is expected?

This is nothing new for devtools tests.

The increase is probably because we now set dom.send_after_paint_to_content to false for devtools mochitests. We discovered that by doing this leak detection and GC were improved, I suppose because we are no longer flooded by mozafterpaint events. The down side is that GC is more likely to kick in part way through a test.
Ah I see, that's helpful thank you.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #90)
> Running this on cedar to see how it plays with mochitest-dt and the chunked
> browser-chrome suites.
> https://tbpl.mozilla.org/?tree=Cedar&rev=8c337a10e8d3

We ended up reverting this on Cedar and didn't notice any major differences in test failures. So let's drop this to save some runtime.
https://hg.mozilla.org/mozilla-central/rev/52da0fc935ce
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 31
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #101)
> https://hg.mozilla.org/mozilla-central/rev/52da0fc935ce

This was meant for bug 984930; ignore.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #100)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #90)
> > Running this on cedar to see how it plays with mochitest-dt and the chunked
> > browser-chrome suites.
> > https://tbpl.mozilla.org/?tree=Cedar&rev=8c337a10e8d3
> 
> We ended up reverting this on Cedar and didn't notice any major differences
> in test failures. So let's drop this to save some runtime.

So I'm sure I'm understanding; you're suggesting wontfixing this yeah?
Target Milestone: Firefox 31 → ---
Yeah, I'm not sure if this bug has anything actionable to be done at this point.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → WONTFIX
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.