Closed Bug 1153879 Opened 5 years ago Closed 5 years ago

2.6-48% Linux*/Win* tpaint/tsvgr_opacity/tp5o_scroll/tp5o/tscroll-ASAP regression on Mozilla-Inbound-Non-PGO (v. 40) on April 13, 2015 from push 1b8f36a7ee32

Categories

(Core :: Widget, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
firefox40 --- affected

People

(Reporter: vaibhav1994, Unassigned)

References

(Blocks 1 open bug)

Details

Talos has detected a Firefox performance regression from your commit 1b8f36a7ee32 in bug 930793.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=1b8f36a7ee32&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux,win64,win32 -u none -t tp5o  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tp5o

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Wednesday, or the offending patch will be backed out! ***

Our wiki page oulines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:smaug, could you look into this regression? It has caused pretty serious regressions on http://alertmanager.allizom.org:8080/alerts.html?rev=1b8f36a7ee32&showAll=1&testIndex=0&platIndex=0 especially on SVG, Opacity Row Major
Flags: needinfo?(bugs)
Bah, again. I had pushed to try and didn't see talos regressions.

But changes to the timing are expected, but not 48% :/ Well, actually, it might, depending on the
test. Thinking..
I'll backout Bug 930793 again and investigate then the issues.
Wait with the backout.

Here's what I see from the "toggle view" link:

1. All windows platforms improved responsiveness by 50% and practically didn't regress anything.

This is actually pretty great and the original goal of bug 930793 - it improves responsiveness and pays a price of sometimes load pages a bit slower - this is by design.

2. on linux 32/64 there are several regressions with the biggest being 50% on svg opacity on linux 64 and 15% on linux 32. The rest are sub 4%.

3. OS X is completely not mentioned.



Vaibhav:
- Are we expecting any more data to show up in the coming days? or is this everything which bug 930793 caused?

- was OS X completely unaffected?

smaug: for windows this looks like pure win to me. Can this patch be enabled only for selected platforms?
Flags: needinfo?(vaibhavmagarwal)
(In reply to Avi Halachmi (:avih) from comment #4)
> Wait with the backout.
Backed out already.

> 1. All windows platforms improved responsiveness by 50% and practically
> didn't regress anything.
> 
> This is actually pretty great and the original goal of bug 930793 - it
> improves responsiveness and pays a price of sometimes load pages a bit
> slower - this is by design.
Indeed.



> 
> 2. on linux 32/64 there are several regressions with the biggest being 50%
> on svg opacity on linux 64 and 15% on linux 32. The rest are sub 4%.
I need to check again how those tests are written. If they run the test right after load event, the
regression is expected.

> smaug: for windows this looks like pure win to me. Can this patch be enabled
> only for selected platforms?
I don't want to do that (would be horrible to have different event loop handling on different platforms),
and this is especially needed on linux where responsiveness is horrible whenever something
goes wrong and favor perf mode is enabled for good.
Flags: needinfo?(bugs)
> Vaibhav:
> - Are we expecting any more data to show up in the coming days? or is this
> everything which bug 930793 caused?

Yes, all regressions for this particular bug has been shown for inbound-non-pgo branch

> 
> - was OS X completely unaffected?
> 

Looks like OSX was completely unaffected, as I cannot see regression on alert_manager.
Flags: needinfo?(vaibhavmagarwal)
Blocks: 930793
No longer depends on: 930793
(In reply to Vaibhav (:vaibhav1994) from comment #6)
> Looks like OSX was completely unaffected, as I cannot see regression on
> alert_manager.

Joel, I think it's unlikely (though not impossible) that a patch which caused some 50% regressions on linux and some 50% improvements on windows had zero effect on OS X...

It's also somewhat unlikely that Linux has mostly regressions and Windows has mostly improvements.

Could you please recheck what alert manager did on this case and whether or not you still think that OS X was completely unaffected by this patch, windows mostly improved and linux mostly regressed?


smaug is doing some try pushes so hopefully we'll be able to get isolated test results and use compare talos on them.
Avi, there are no alerts on the newsgroup for osx since March 28th:
https://groups.google.com/forum/#!searchin/mozilla.dev.tree-alerts/regression$20osx|sort:date

this could mean that graph server didn't detect any regressions or improvements.
did some retriggers on those builds- it will help with compare-talos, etc.
(In reply to Olli Pettay [:smaug] from comment #9)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=6014011f6a49
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=2476f877f012

Compare talos: https://compare-talos.paas.mozilla.org/?oldBranch=Try&oldRevs=6014011f6a49&newBranch=Try&newRev=2476f877f012&submit=true

Here's what I see (+ == improvement, - == regression):

a11yr opt:
- win7: -3%

dromaeo dom:
- linux 64: +3.5%
- win7    : -4.4%

sessionrestore opt:
- linux 32/64: +2.5%

sessionrestore_no_auto_restore opt:
- linux 32/64: +5%

tart opt:
- linux 32/64: +3.2% / +2.2%

tp5o opt:
- linux 32: -5%
- linux 64: -6%
- osx 10.8: -50% but very very noisy

tp5o_scroll opt:
- linux 32/64: -4.3%

tpaint opt:
- linux 32/64: -4.4%
- win8 64: -2%

tresize opt:
- linux 32: +2.4%
- linux 64: +4.5%
- win XP: +4%

ts_paint opt:
- linux 32: +3.7%
- linux 64: +3.3%
- win8 64: -2.3%

tscrollx opt:
linux 32: -2.7%
win XP: + 6.8%

tsvgr_opacity opt:
- linux 32/64: -3.4%


overall: some regressions, some improvements, and nothing major.

Joel, correct me if I'm wrong, but I don't see any > 5% regressions on Linux.

If these results are valid, then:
- I think we should land it.
- I think we should revisit the numbers which alerts-manager produces. If it was very wrong here, how do we know it's not very wrong elsewhere too?
Flags: needinfo?(jmaher)
Lets look at the original regression:
http://alertmanager.allizom.org:8080/alerts.html?rev=1b8f36a7ee32&showAll=1

svg opacity on linux 64 was 49%:
http://graphs.mozilla.org/graph.html#tests=[[225,131,35]]&sel=1428790087928.5173,1428974805169.8965,85.71428571428578,377.1428571428571&displayrange=30&datatype=geo

look at the history of that graph for svg opacity.  You will see a clear 49% regression when it landed and a corresponding improvement when it was backed out.  Then around april 16th (OMTC landed - bug 1155649 tracks it) the test went into insane mode and is not worth tracking (this is the same for linux 32 which showed a 14% regression).
Flags: needinfo?(jmaher)
(In reply to Olli Pettay [:smaug] from comment #9)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=6014011f6a49
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=2476f877f012

Are these builds before or after OMTC was enabled on linux?
Flags: needinfo?(bugs)
that would be are these using a base after April 16th - looking at the push date these are April 14th!  So this is not OMTC related- now I am baffled why try isnot showing this but the landing and backouts were crystal clear.
smaug, how confident are you that one of your try pushes is essentially m-c with the patch that landed, and the other try push completely negates the other patch?
(In reply to Avi Halachmi (:avih) from comment #13)
> Are these builds before or after OMTC was enabled on linux?
no idea.

(In reply to Avi Halachmi (:avih) from comment #15)
> smaug, how confident are you that one of your try pushes is essentially m-c
> with the patch that landed, and the other try push completely negates the
> other patch?
they are definitely not m-c but m-i, 
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6014011f6a49 has just one newline change, and https://treeherder.mozilla.org/#/jobs?repo=try&revision=2476f877f012 has then the actual patch if you look at the most recent changeset.
Flags: needinfo?(bugs)
Are there any next steps here?
The issue was that the try pushes never showed anything close to what this bug reports (48% regressions), and therefore smaug had nothing to work with - on one hand you say it's 50% regression, OTOH try pushes show no more than 5%, and we didn't understand why (I still don't).

I suggest to do try pushes again using current m-c as a base, and hopefully get clearer indication of the perf impact of this patch. If it's closer to the 5% values we've seen by the previous try pushes, then I think we should land the patch as is.

In general, if we report that a patch caused 50% regressions, but try pushes only show 5%, what should a developer do??!?
Flags: needinfo?(jmaher)
it appears these have been resolved either through a backout or a fix.  Regarding the question about pushing to try not showing the same issues we see live.

This I don't know about.  The best bet would be to use the revision right before the one which shows the regression as a baseline, then we are dealing with the same revision set.  It was not misreported numbers from alert manager, in fact linux64 svg opacity really did have a 48.5% regression and continued to show it as a sustained issue.
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
(In reply to Joel Maher (:jmaher) from comment #19)
> it appears these have been resolved either through a backout or a fix. 
> Regarding the question about pushing to try not showing the same issues we
> see live.

Of course it's been resolved - the patch has been backed out, due to this bug.

But our goal is not fix regressions by backing out the offending patches, but rather by landing the patches and keeping the performance.

> It was not misreported numbers from alert
> manager, in fact linux64 svg opacity really did have a 48.5% regression and
> continued to show it as a sustained issue.

I'm not saying it was misreported, but do we understand where does the discrepancy between the reported numbers and the try pushes numbers come from? and if yes, how can developers avoid it and be able to reproduce the regressions magnitudes which we report and expect them to fix?

Otherwise, if the reported numbers cannot be reproduced via try, devs are kept in the dark without real tools to assess potential fixes. This is not a good place to be.
You need to log in before you can comment on or make changes to this bug.