Closed Bug 1134303 Opened 9 years ago Closed 9 years ago

20% MacOS* tsvgx regression on Mozilla-Inbound (v.38) on February 17, 2015 from push 4a21032847e0

Categories

(Core :: JavaScript Engine, defect)

38 Branch
x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mishravikas, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36

Steps to reproduce:


Talos has detected a Firefox performance regression from your commit 4a21032847e0 in bug 1131759.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=4a21032847e0&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tsvg.2C_tsvgx

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p macosx64 -u none -t svgr  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tsvgx

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page oulines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Status: UNCONFIRMED → NEW
Ever confirmed: true
:shu,  can you take a look at this- we had done a series of retriggers and for the tsvgx results your push had 3/4 results >325 whereas previous pushes rarely had a value >325.
Flags: needinfo?(shu)
Summary: 20% MacOS* tsvgx regression on Mozilla-Inbound on February 17, 2015 from push 4a21032847e0 → 20% MacOS* tsvgx regression on Mozilla-Inbound (v.38) on February 17, 2015 from push 4a21032847e0
Hi Joel. I don't really know how to look into this. Nor do I know how a 1-liner fix to the JS engine can trigger SVG regressions.
Flags: needinfo?(shu)
Joel, on linux and windows this graph is very stable over the past 3 months (without a regression on Feb 17th), while on OS X it seems very noisy and possibly with a regression.

But TBH, I'm not sure I can see a meaningful change at the graph around Feb 17th for OS X: http://graphs.mozilla.org/graph.html#tests=%5B%5B281,63,24%5D,%5B281,64,24%5D,%5B281,63,25%5D,%5B281,63,33%5D,%5B281,63,31%5D%5D&sel=1423152458053.4358,1424288387834,36.11105459707733,406.48142496744777&displayrange=90&datatype=geo

And if there is one, it might be due to the specific noise pattern we see on OS X for this test. Not necessarily, but possibly.

Should the supposedly offending patch affect different platforms differently?
(In reply to Avi Halachmi (:avih) from comment #3)
> Joel, on linux and windows this graph is very stable over the past 3 months
> (without a regression on Feb 17th), while on OS X it seems very noisy and
> possibly with a regression.
> 
> But TBH, I'm not sure I can see a meaningful change at the graph around Feb
> 17th for OS X:
> http://graphs.mozilla.org/graph.html#tests=%5B%5B281,63,24%5D,%5B281,64,
> 24%5D,%5B281,63,25%5D,%5B281,63,33%5D,%5B281,63,31%5D%5D&sel=1423152458053.
> 4358,1424288387834,36.11105459707733,406.
> 48142496744777&displayrange=90&datatype=geo
> 
> And if there is one, it might be due to the specific noise pattern we see on
> OS X for this test. Not necessarily, but possibly.
> 
> Should the supposedly offending patch affect different platforms differently?

My patch fixes a data race by making the field of a JS engine-internal data structure mozilla::Atomic. AFAIU, mozilla::Atomic is implemented via libc++ <atomic> on both OS X and Linux. (libc++ atomic, of course, may differ on OS X and Linux, but that's such a widely used thing...) I'm unsure about Windows.
first push is a baseline, second push is testing a backout of the two patches in the revision suspected in this.
Joel, could you also try backing out bug 1131759 but not bug 1133530?

If these get backed out, there is nothing actionable for me. I cannot try to fix these another way: these are minimal fixes. Suggestions on how to proceed in that case?
https://tbpl.mozilla.org/?tree=Try&rev=e2a76c9b8574

using relaxed instead of release-acquire for bug 1131759
(In reply to Joel Maher (:jmaher) from comment #5)
> https://tbpl.mozilla.org/?tree=Try&rev=36610eaaf033

TBPL ended up not running T(s). I re-pushed what I had in comment 10 at https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f2decfeb9c5
Please don't back out bug 1133530, by the way. If the backout of bug 1131759 doesn't fix it, I would rather not be backed out due the noisiness of the OS X SVG-ASAP and the completely unrelated code in bug 1133530. I would rather have someone familiar with SVG code investigate.
so the regression is reduced greatly and it starts with the backout, but could also be related to the push right afterwards:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=73715117edd4.

eventually I got the correct try server syntax and have a push with both patches backed out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d34d446bf5d2

* I understand that is not ideal, but I wanted to see if the push was the problem.

The data coming in is showing data <325 consistently.  

This is a large regression, so I would like to be a bit more careful about landing this.  If we can reduce this a bit more it would be easier to justify accepting a regression for a new feature/bug fix.  The call is ultimately in your hands though.
(In reply to Joel Maher (:jmaher) from comment #14)
> The data coming in is showing data <325 consistently.  

While that statement might be correct, in itself it's not an indication of ~20% regression.

IMO it would be better to assess the change by stating the average and stddev before and after.

If the average changed by ~20% and stddev is meaningfully lower than 20%, then it's indeed a clear indication for a regression. However, other combinations of these metrics will need further assessment IMO.
Is there existing data to help identify which of the http://hg.mozilla.org/build/talos/file/default/talos/page_load_test/svgx tests regressed?
My own try pushes I forgot to push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee1d6a245a44
backout 1131759: 343, 313, 260, 297, 260, 313, 332, 315, 330, 277, 330, 345, 312, 340, 247, 317, 253

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6916b80fa326
backout 1133530: 339, 324, 322, 283, 338, 314, 282, 281, 338, 339, 338, 323, 322, 322, 339, 326, 328

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3313b7c6f9c8
backout both: 318, 312, 312, 315, 313, 264, 318, 313, 260, 298, 260, 313, 262, 312, 262, 298, 260, 313

So indeed, backing out both seems to reduce the regression the most. Bug 1133530 is a minimal fix for a JS crasher though, so I don't really view backing it out as an option.
s/forgot to push/forgot to comment
looking at the raw test files:
https://treeherder.allizom.org/perf.html#/graphs?timerange=1209600&seriesList=[["mozilla-inbound","11f6e2252086eb1cd1df48f8c3f48ed188fa8614"],["mozilla-inbound","29e209f7a9f70362d45f1ec8f5942af2c231c9fd"]]

composite-scale.svg looks like a culprit and 	tsvgx composite-scale-rotate.svg is suspicious of changing its pattern, but it isn't clear that is affected.

Keeping a js crasher landed is a good thing!
The composite-scale* tests have absolutely no script in them. Perhaps this is a regression in the test harness?
The harness hasn't changed for at least a week, buildbot changes a few times a week, but it would be odd to have just tsvgx on OSX regress 20%.  Also if a backout resolves this a lot of the way, then most likely this is really related to the patch.
"Related" beyond my ken. Like, icache issues?
this regression is fully fixed, if we are content with the backout lets close this, otherwise we can leave this open to track changes when we reland the patch or a variation of it.  Keep in mind that we will be uplifting on Monday, so most likely we will close this out and open a new bug if/when the time comes.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Joel Maher (:jmaher) from comment #21)
> The harness hasn't changed for at least a week...

When I said "Perhaps this is a regression in the test harness?" I meant *of* the test harness.
Keywords: perf, regression
Whiteboard: [talos_regression]
You need to log in before you can comment on or make changes to this bug.