Closed Bug 1005253 Opened 10 years ago Closed 10 years ago

4.14% tart win7 regression on april 30 from rev e4be5203a3c9

Categories: Core :: XPCOM, defect
Hardware/OS: x86, Windows 7
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
People: Reporter: jmaher; Assigned: mccr8
Keywords: perf, regression
Whiteboard: [talos_regression]

I found this in a talos alert; we have a noticeable tart regression that seems to be specific to Windows 7.

Here is a graph:
http://graphs.mozilla.org/graph.html#tests=[[293,131,25]]&sel=none&displayrange=7&datatype=running

I did some retriggers and this is really the deal:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&fromchange=2cecc0699b45&tochange=44e189f0fd67&jobname=Windows%207%2032-bit%20mozilla-inbound%20talos%20svgr

Tart is documented here:
https://wiki.mozilla.org/Buildbot/Talos/Tests#TART

It is pretty easy to push any candidate fixes to try and see if they work.  As always, if you need help or more information, please ask in the bug.
That's pretty odd, especially that it is on a single platform.  I'm surprised the cycle collector runs at all during this test.  I guess I'll have to run it locally and see what it is even doing.

I mean, in theory we could just suppress CC during tab animation, but hopefully we can avoid that.
In the tart test we do a garbage collect between *tests*, so there is a chance that this is being invoked during this test.  You can get talos locally here:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

or you can make tart a .xpi and use it as an addon.
Where in the source code does that garbage collection happen?
My theory is that the test harness is forcing a GC, but not a CC.  Without ICC, this will trigger a CC 5 seconds after the GC, but if we're done with the test in that time, there will be no CC during the test.  With ICC, we start doing CC work sooner, after a second or two, so it may start to overlap with the test run.  The solution would be to force a synchronous CC after the GC, which will reset the timer, so we will do a GC in 5 seconds, thus hopefully keeping collector activity from interfering with the test.
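For reference, a minimal sketch of what forcing a synchronous CC right after the forced GC could look like from the harness's privileged (chrome) JS, using the Components.utils helpers; exactly where this would hook into pageloader.js is an assumption on my part:

    // Sketch only: force a full GC and then a synchronous cycle collection.
    // Per the theory above, the synchronous CC should reset the CC timer so
    // that no collector work lands in the middle of the next measured run.
    // Assumes a privileged (chrome) scope, as in pageloader.js.
    var Cu = Components.utils;

    function forceFullCollection() {
      Cu.forceGC();  // synchronous garbage collection
      Cu.forceCC();  // synchronous cycle collection
    }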
here is where we do the gc:
http://hg.mozilla.org/build/talos/file/tip/talos/pageloader/chrome/pageloader.js#l412

a test is a page as defined here:
https://hg.mozilla.org/build/talos/file/b4907f0b27d3/talos/page_load_test/tart/tart.manifest

unfortunately for tart, that is a single page.

I would be happy to CC when we GC. I would want to be careful that we don't affect other tests or push our testing further away from the real world. With that said, a little thought before doing something across the board isn't too hard.
Thanks!  Unfortunately for my theory, that already does a cycle collection.
Assignee: nobody → continuation
I looked at this locally on OSX.  It does open and close a bunch of tabs, running for five minutes.  We spend only a second or two of the 5 minute run in the CC.  Maybe something terrible is happening on Win7 in particular.
On Win7 we're still spending less than 10 seconds in cycle collection across all of the Talos tests in this group, and there are no dramatic differences that I can see between Win7, Win8, and WinXP on some basic CC time metrics.
The code change (https://hg.mozilla.org/integration/mozilla-inbound/rev/ac6c395d9364) looks to be more related to network code cleanup than to cycle collection, from my naive point of view.
From the retriggers you listed in comment 0, it looks like the previous push ( https://hg.mozilla.org/integration/mozilla-inbound/rev/e4be5203a3c9 ) is the one where TART regresses from around 7.0 to around 7.4.  Am I misinterpreting that?
oh, you are right:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e4be5203a3c9

title updated.  Should we push to try with this backed out to see if that helps?
Summary: 4.14% tart win7 regression on april 30 from rev ac6c395d9364 → 4.14% tart win7 regression on april 30 from rev e4be5203a3c9
(In reply to Joel Maher (:jmaher) from comment #12)
> Should we push to try with this backed out to see if that helps?

I can do that once try is working again.
How do I see what the TART score is in my push in comment 14?  The datazilla link in the try push just hangs as far as I can tell.
(In reply to Andrew McCreight [:mccr8] from comment #15)
> How do I see what the TART score is in my push in comment 14?  The datazilla
> link in the try push just hangs as far as I can tell.

Datazilla has been very slow over the past few days, so other than waiting (up to a few minutes) and hoping, there's not much you can do.

FWIW, I got the link for your win7 tart run showing the data after about a minute: https://datazilla.mozilla.org/?start=1399422132&stop=1400026932&product=Firefox&repository=Try-Non-PGO&os=win&os_version=6.1.7601&test=tart&graph_search=fbae39857d7b&x86_64=false&project=talos

And you can also see the graph server graphs. The summary section at the bottom of tbpl (once you click the test code/char at the top) shows items of type:

- datazilla: <test-name>  <-- datazilla link
- <test-name> : "score"   <-- graph server link

The graphserver score is the one which generates the regression reports, and it is the average of all the test's subresults (TART has about 30 subresults). Graph server doesn't allow you to view individual subresults, while datazilla does.
Just click on the 's' on tbpl and the details panel below has the score; for Win7 it is 7.57.  What is your base tree for the push?  We need to compare against the revision that you pulled from for the best-case comparison.
The base revision is this: https://hg.mozilla.org/try/rev/7f5a8526b55a

I did add some printing, too, so I guess that could cause a problem.  I'll try a cleaner push and see what happens there.  7.57 still sounds like it is a regressed value.

> Just click on the 's' on tbpl and the details panel below has the score; for Win7 it is 7.57.
Hmm.  I don't see that for some reason.  When I click on the 's', I see "TinderboxPrint: mozharness_revlink: https://hg.mozilla.org/build/mozharness/rev/8b81f2f286de" and some datazilla links, but not that.  It looks like I can search in the log for "process_Request line: tart" to get the results, though.
I did a new clean push: https://tbpl.mozilla.org/?tree=Try&rev=d3b580d8dacd

This is against revision 93e03b8c127e.
with the clean push, this is posting a TART score of 7.48->7.92 (really 7.51 if you ignore the 7.92 outlier).

what branch is 93e03b8c127e on? m-c?  We have had some tart regressions lately:
http://graphs.mozilla.org/graph.html#tests=[[293,94,25]]&sel=none&displayrange=30&datatype=running

I believe you are based off m-c from May 12th, which has a value of 8.0 for tart, so if you are posting 7.5 then we are seeing a win!
Yeah, this is the revision I was comparing to:
  https://tbpl.mozilla.org/?rev=93e03b8c127e

ICC disabled: 7.92, 7.51, 7.54, 7.48
ICC enabled: 8.28, 8.36, 8.04

So this confirms that we're still seeing a regression with ICC enabled.

One change ICC makes is that we schedule CCs sooner after GCs, even when the CCs are short, so this can potentially cause us to do slightly more GCs and CCs.  I'll see if changing that helps.
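To make that concrete, here is a toy, stand-alone model (not Gecko scheduling code; the delays are just the rough numbers mentioned in this bug) of why starting CC work about 2 seconds after a GC, instead of about 5 seconds, can land collector activity inside a test that begins right after the forced GC:

    // Toy model only -- not Gecko code. Assumed numbers: the measured run
    // starts right after the forced GC and lasts ~4s; without ICC the
    // follow-up CC fires ~5s after the GC, with ICC the first CC slice
    // starts ~2s after it.
    function ccOverlapsTest(testStart, testEnd, ccStart) {
      return ccStart >= testStart && ccStart <= testEnd;
    }

    var testStart = 0, testEnd = 4;                      // seconds after the GC
    console.log(ccOverlapsTest(testStart, testEnd, 5));  // without ICC: false
    console.log(ccOverlapsTest(testStart, testEnd, 2));  // with ICC: true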
at least we are getting to the bottom of this!  Whether or not this is straightforward to fix is another story :)
I still have no idea what is happening.
Depends on: 1011137
(In reply to Andrew McCreight [:mccr8] from comment #1)
> I mean, in theory we could just suppress CC during tab animation, but
> hopefully we can avoid that.

BTW, why do you hope to be able to avoid it? There probably is an API for it already, the start and end points are well defined, the duration is short and known in advance, and UI animations (tabs or otherwise) are quite important for us to keep smooth.

Wouldn't it make sense to explicitly suppress CC when we know we're entering such a phase, rather than relying on possibly complex heuristics which hopefully end up with the same result?

I'm teasing, but only partially. I'd also prefer automatic behavior which covers all our cases, and I do know that explicit triggers could be a slippery slope, but still, I'd consider it more than a theoretical option.

Your call though (which, if it goes the explicit way, would have several more implications very soon IMO, e.g. for the customize animation and menu animations).
It would certainly be nice to suppress CC during animations, but that won't help during TART, which runs for 5 minutes, and we really don't want to suppress CC for 5 minutes at a time, in general.  I have an early patch in progress that would make us less janky during painting, which should help, but again, I think that won't help TART, which I have the impression is really more of a throughput test than a jankiness test.
The overall talos test run of tart, which repeats the same test sequence 25 times (where each sequence includes dozens of animations), takes a few minutes.

But during each test run, TART does give Firefox time to recover. Any animation which TART measures is preceded by a wait of at least 500ms - intentionally to allow CC and other cleanups to complete before the next animation starts.
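For illustration, the settle-then-measure pattern described above looks roughly like this (a sketch, not TART's actual source; startNextAnimationAndMeasure is a hypothetical helper):

    // Sketch of TART's settle delay as described above; not the real code.
    var SETTLE_MS = 500;  // give CC/GC and other cleanups a chance to finish

    function runNextMeasuredAnimation(startNextAnimationAndMeasure) {
      // Wait at least 500ms after the previous animation, then trigger
      // and record the next one.
      setTimeout(startNextAnimationAndMeasure, SETTLE_MS);
    }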

The explicit suppression I suggested on comment 24 was for UI animations in general (and specifically tabs as well), not for the TART run.

And if, following such a change, it turns out that TART's method of tab animation invocation (unrealistic, after all) clashes with this explicit suppression, then we could always change TART, or refine the suppression, whichever we think would be better.
(In reply to Avi Halachmi (:avih) from comment #26)
> But during each test run, TART does give Firefox time to recover. Any
> animation which TART measures is preceded with a wait of at least 500ms -
> intentionally to allow CC and other cleanups to complete before the next
> animation starts.

Or at the very least, I hope it's enough time to recover. If it's not, then we should change TART to wait longer.

After all, TART invokes tab animations at a rate which is not expected from users in practice, and the heuristics in place surely expect some kind of behavior pattern in order to perform as expected. If TART happens to trigger unoptimized code paths, then we should fix TART.

I'm open and listening for any suggestion or info which can improve TART.
> Any animation which TART measures is preceded with a wait of at least 500ms - intentionally to allow CC and other cleanups to complete before the next animation starts.

Ah, thanks for the explanation.  Yeah, that will be a little sensitive to changes in collector scheduling.  The collectors will run, say, every 10 seconds, so I could imagine that tweaking how they are run could make them end up running during a test run, whereas they were not before.  I'll think about how to see if that is happening.
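One way I could check for that (a sketch, assuming the javascript.options.mem.notify pref and the "garbage-collection-statistics" / "cycle-collection-statistics" observer topics it enables -- my recollection of the notification names, so verify against the tree before relying on them):

    // Sketch only: count GC/CC notifications that land inside a test window.
    Components.utils.import("resource://gre/modules/Services.jsm");

    var testRunning = false;
    var collectionsDuringTest = 0;

    var observer = {
      observe: function (subject, topic, data) {
        if (testRunning) {
          collectionsDuringTest++;
          dump("Collector ran during the test: " + topic + "\n");
        }
      }
    };

    // Topic names assumed, as noted above.
    Services.obs.addObserver(observer, "garbage-collection-statistics", false);
    Services.obs.addObserver(observer, "cycle-collection-statistics", false);

    // Flip testRunning around each measured animation, then inspect
    // collectionsDuringTest at the end of the run.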
If my reading of the results of a TART run on try is right, bug 1011137 should fix this.

Essentially, ICC was running a CC two seconds earlier.  I suspect this caused us to CC during the TART test, when we weren't before.  My patch makes the behavior more like non-ICC: instead of running a CC sooner, we run a GC sooner, but only if the ICC is long, basically.  I think that's an improvement, even ignoring the TART regression issue.

If I make some future changes to CC scheduling that mess with TART again, I'll try coming up with a more robust fix, but I think it is okay for now.
(In reply to Andrew McCreight [:mccr8] from comment #29)
> ... but I think it is okay for now.

Many thanks.

And what about the CC/GC hints idea during UI animations? We currently don't have tons of them (probably not more than 10), and it would remove the dependency on heuristic refinements.

Not just for TART, but more importantly, for the users.

The heuristics would still be useful; they'd just have better knowledge of when we'd really prefer not to trigger GC/CC.
It looks like there was a pretty clear improvement when bug 1011137 landed, but there's no message on tree-management about it:

http://graphs.mozilla.org/graph.html#tests=[[293,131,25]]&sel=none&displayrange=7&datatype=running

Maybe the regression from OMTC confused things.
Yeah, there are enough data points to show the improvement is sustained.  I would expect an improvement email; maybe there is an issue with the regression emailer script :)

We are almost back to 100% now, probably just a slight bit off.  I am inclined to mark this as fixed.
Fixed by bug 1011137.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
See Also: → 1017055