Closed Bug 1270849 Opened 8 years ago Closed 8 years ago

2.29% tpaint (windows7-32) regression on push 2d50de084439 (Wed May 4 2016)

Categories

(Core :: JavaScript Engine: JIT, defect)

47 Branch
defect
Not set
normal

Tracking


RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Talos has detected a Firefox performance regression from push 2d50de084439. As the author of one of the patches included in that push, we need your help to address this regression.

This is a list of all known regressions and improvements related to the push:
https://treeherder.mozilla.org/perf.html#/alerts?id=1105

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:
https://wiki.mozilla.org/Buildbot/Talos/Tests#tpaint

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win32 -u none -t other[Windows 7] --rebuild 5  # add "mozharness: --spsProfile" to generate profile data

(we suggest --rebuild 5 to be more confident in the results)

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e [path]/firefox -a tpaint


Making a decision:
As the patch author, we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
I did a lot of retriggers, and a compare view shows this is the only regression:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=581279cd287e&newProject=mozilla-inbound&newRevision=2d50de084439&framework=1&filter=tpaint
^ Note: I filtered on tpaint; the other data points are noise because a separate regression landed on the push immediately prior to this one.

:terrence, this is fairly small, but still a regression.  Can you help us understand this regression a bit more?
Flags: needinfo?(terrence)
Component: Untriaged → JavaScript Engine: JIT
Product: Firefox → Core
(In reply to Joel Maher (:jmaher) from comment #1)
> :terrence, this is fairly small, but still a regression.  Can you help us
> understand this regression a bit more?

This test does not appear to hit any of the limits I adjusted either before or after the patch. How certain are we that this checkin is actually to blame?
Flags: needinfo?(terrence)
I have pretty high confidence for this specific test, check out this graph:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-inbound,f63c69ead538d4f6c277a7303b9a7d44fd1b441b,1%5D&zoom=1462376575093.872,1462422166727.0046,266.06257782549665,303.2395190721973

I did a lot of retriggers and it seems to be consistent. Possibly we can push to try, and then push a backout, just to see if there is a difference?
Looking closer at tpaint, this test is absurdly short and does not appear to do a GC before starting the timing. Mild changes to allocation behavior on startup are going to have tremendous effects on the results of these tests, independent of any actual change to our throughput. I'm pretty skeptical now that this is measuring anything meaningful, particularly for the GC, given how little memory it allocates.
1) Should we force the test to wait on startup, GC, and then commence the actual test? That might be better overall.
2) Given the explanation above, would you want to mark this as RESOLVED:WONTFIX?

Let me know your thoughts on the two items above.
(In reply to Joel Maher (:jmaher) from comment #5)
> 1) should we force the test to wait on startup, gc, then commence the actual
> test?  That might be better overall.

The goal is not to reduce the numbers of the test. The goal is to make it better, where possible, for the users.

I don't see how making the test wait more improves anything for users, hence this is not a solution.

IMO we should accept the regression unless Terrence wants to do something about it.
I think that doing a GC before the test would probably make the numbers more stable, which arguably would let this test be more meaningful, which might make optimizing here better for our users.
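The "GC before timing" idea can be sketched outside the Talos harness. This is a hypothetical Python illustration (the `timed_run` helper and the workload are invented for this sketch, not part of tpaint): collecting before the timed region keeps garbage left over from startup from being collected mid-measurement.

```python
import gc
import time

def timed_run(workload, gc_before=True):
    """Time a workload, optionally forcing a collection first.

    Hypothetical sketch of the idea discussed in this bug: settle the
    heap before timing so leftover allocations from an earlier phase
    don't add noise inside the measured region.
    """
    if gc_before:
        gc.collect()  # pay the collection cost outside the timed window
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start

# Stand-in workload: a burst of short-lived allocations, analogous to
# the small amount of work tpaint actually does.
elapsed = timed_run(lambda: [object() for _ in range(10_000)])
```

The trade-off, as noted below, is that forcing a GC makes the numbers more stable but no longer reflects what a user's first paint actually experiences.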

On the other hand, it's not that clear that optimizing our time on a blank page actually improves our time on more complicated workloads. I'd go so far as to say that it's actively harmful to have this test at all. Let's say there's some sort of cache we could add to a document to make rendering every subsequent element faster, for example, a Generational Garbage Collector. This test is going to penalize that sort of global optimization.
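A toy sketch of that cache example (hypothetical names, not Gecko code): the one-time build cost lands entirely on the first paint, which is all a tpaint-style microbenchmark ever measures, while the amortized payoff over later paints goes unseen.

```python
class RenderCache:
    """Hypothetical per-document cache: expensive to populate once,
    cheap on every later use (stand-in for nursery setup, JIT warm-up,
    or similar global optimizations)."""

    def __init__(self):
        self._cache = {}
        self.misses = 0  # how many times the full build cost was paid

    def render(self, element):
        if element not in self._cache:
            self.misses += 1
            # One-time expensive work; later calls are cheap lookups.
            self._cache[element] = "layout(%s)" % element
        return self._cache[element]

cache = RenderCache()
cache.render("div")       # a microbenchmark times only this cold path
for _ in range(1000):
    cache.render("div")   # the steady state the benchmark never sees
```

A benchmark that charges only the cold path will report this optimization as a regression even though users win on every subsequent paint.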

This is a general problem with any "micro" benchmark, of which this is one. They can be helpful at investigating the performance of specific scenarios, and can be useful for driving decisions when taken with a broad context and knowledge of the limitations of the numbers. Having a default policy of "*** Please let us know your plans by Monday, or the offending patch(es) will be backed out! ***" on a Friday, for a 2% regression on a microbenchmark makes me hate everything and want to take up subsistence farming in the Andes.
Oh, and:
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Thanks for the info here. I am open to having alternate policies for minor regressions or micro-style benchmarks which test a small function. Maybe move the threshold to 5% for those instead of having this fire at a 2.39% threshold?

We don't want anyone moving to the Andes (although, actually, the idea sounds tempting, unrelated to Mozilla). Sorry for the stress.
(In reply to Terrence Cole [:terrence] from comment #7)
> On the other hand, it's  not that clear that optimizing our time on a blank
> page actually improves our time on more complicated workloads. I'd go so far
> as to say that it's actively harmful to have this test at all. Let's say
> there's some sort of cache we could add to a document to make rendering
> every subsequent element faster, for example, a Generational Garbage
> Collector. This test is going to penalize that sort of global optimization.

That's a good point, and TBH, I, too, think this specific test is possibly overrated and possibly too micro.

That being said, the goal we keep in front of us is to catch regressions which affect users. It's better to have false positives than not to be able to tell at all.

As long as the developer who pushed the patch knows why the numbers changed, we typically accept their opinion on whether or not the diff is real (in terms of users being affected), and even if it is, whether or not we should still take it anyway.

The wording about the backout, etc., might be a bit harsh (suggestions welcome), but we have not yet forced anyone's hand, and as long as devs are responsive and care about the impact, that's likely not going to change.
One person's micro-benchmark can be another person's macro-benchmark, depending on what you're working on. With respect to GC, at least, this benchmark is not large enough to give a useful signal. Maybe that's not true for graphics, which probably cares how fast we can paint nothing. It would be nice to just isolate it from GC.
> Would be nice to just isolate it from GC.

Not sure how we can do it. OTOH, if it happens again, you can always sign it off as irrelevant ;)