Closed Bug 1150559 Opened 9 years ago Closed 9 years ago

16-22% Linux* cart/tart regression on Mozilla-Inbound-Non-PGO (v. 40) on April 01, 2015 from push b4efb3f1976f

Categories

(Testing :: Talos, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: vaibhav1994, Unassigned)

References

Details

Attachments

(1 file)

Talos has detected a Firefox performance regression from your commit b4efb3f1976f in bug 674779.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=b4efb3f1976f&showAll=1

On the page above you can see the Talos alert for each affected platform, as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#TART.2FCART

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux -u none -t svgr  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a cart

Making a decision:
Since you are the patch author, we need your feedback to help us handle this regression.

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:Yoric, could you look into this regression which occurs on linux: http://alertmanager.allizom.org:8080/alerts.html?rev=b4efb3f1976f&showAll=1&testIndex=0&platIndex=0
Flags: needinfo?(dteller)
So we regress CART and TART and improve WebGL Terrain?
Avih, can you tell me how grave these regressions are?
Flags: needinfo?(dteller) → needinfo?(avihpit)
Blocks: 1150582
No longer blocks: 1138995
Summary: 16-22% Linux* cart/tart regression on Mozilla-Inbound-Non-PGO (v. 39) on April 01, 2015 from push b4efb3f1976f → 16-22% Linux* cart/tart regression on Mozilla-Inbound-Non-PGO (v. 40) on April 01, 2015 from push b4efb3f1976f
Component: Talos → JavaScript Engine
Product: Testing → Core
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #2)
> So we regress CART and TART and improve WebGL Terrain?
> Avih, can you tell me how grave these regressions are?

Looks quite bad, TBH. All the tests regressed meaningfully on mozilla-central between 17b24c5f7773 (old) and 35046df9df1f (new).

Overall comparison:
https://compare-talos.paas.mozilla.org/?oldBranch=Firefox&oldRevs=17b24c5f7773&newBranch=Firefox&newRev=35046df9df1f&submit=true

CART on linux:
https://compare-talos.paas.mozilla.org/breakdown.htm?test_name=cart&oldRevs=17b24c5f7773&newRev=35046df9df1f&oldBranch=Firefox-Non-PGO&newBranch=Firefox-Non-PGO&os_name=linux&os_version=Ubuntu%2012.04&processor=x86

TART on Linux:
https://compare-talos.paas.mozilla.org/breakdown.htm?test_name=tart&oldRevs=17b24c5f7773&newRev=35046df9df1f&oldBranch=Firefox-Non-PGO&newBranch=Firefox-Non-PGO&os_name=linux&os_version=Ubuntu%2012.04&processor=x86

That being said, Silk vsync timing was recently enabled by default on Linux (bug 1149391). I think it could also be the cause of these, but it's hard for me to assess the source better since the process of pointing to a specific offending changeset is a bit opaque to me.

Would it be possible to identify the first m-c changeset which contains revision b4efb3f1976f?
Flags: needinfo?(avihpit) → needinfo?(jmaher)
Joel helped me to get a more specific compare-talos view:
https://compare-talos.paas.mozilla.org/?oldBranch=Mozilla-Inbound&oldRevs=66945c5c4a40&newBranch=Mozilla-Inbound&newRev=b4efb3f1976f&submit=true

So this compares only b4efb3f1976f (from bug 674779) to the previous push on inbound (click the "details" link of a regressed test to see how all the subtests compare). It shows a breakdown similar to comment 3: overall, most subtests seem to have regressed.

Note that TART is an addon (as is the pageloader which loads and runs TART), and bug 674779 does measure the CPU usage of addons, so maybe it affected the results not by regressing Firefox performance but rather by regressing addon performance?
Well, the patch causes small slowdowns whenever we enter/exit compartments that do not belong to the same performance group (e.g. between the platform and an add-on, or between two different add-ons). I seem to remember that Talos TART executes small code very often, right? If so, that would put us on the slow path indeed.

I'm not entirely sure how we could avoid this without either 1/ losing meaningful data; or 2/ opening a hornet's nest (bug 1134591). It might be possible for the watcher mechanism to self-shutdown, or to automatically stop monitoring a given add-on if it finds out that it is causing too much disruption.
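
To illustrate the self-throttling idea above, here is a minimal sketch; it is not the actual AddonWatcher code, and the names (recordDisruption, stopMonitoring) and the 500ms budget are assumptions made up for this example:

```js
// Hypothetical sketch, not the real AddonWatcher API: stop monitoring an
// add-on once the slowdown attributed to it exceeds a budget, so the
// per-compartment-transition cost stops being paid for that add-on.
const DISRUPTION_BUDGET_MS = 500;   // assumed cumulative budget per add-on
const disruption = new Map();       // addonId -> accumulated main-thread ms

function stopMonitoring(addonId) {
  // Placeholder: a real implementation would tell the platform to stop
  // tracking this add-on's performance group.
  console.log(`No longer monitoring ${addonId}`);
}

function recordDisruption(addonId, durationMs) {
  const total = (disruption.get(addonId) || 0) + durationMs;
  disruption.set(addonId, total);
  if (total > DISRUPTION_BUDGET_MS) {
    disruption.delete(addonId);
    stopMonitoring(addonId);
  }
}
```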
Flags: needinfo?(avihpit)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #6)
> ... I seem to remember that
> Talos TART executes small code very often, right? If so, that would put us
> on the slow path indeed.

Probably not correct enough.

TART (and CART, which works the same way) triggers animations and then grabs the frame intervals recorded during the animation.

During the animation, TART is idle, just waiting for it to finish. Once it finishes, it looks at the recorded intervals and crunches the numbers into the results it reports. The interval recording happens outside TART, inside Gecko.

What these test results appear to indicate is that the recorded intervals are now longer. In other words, it seems to have affected every frame of the animations.

TART measures the animations of opening and closing tabs, and CART measures the animation of entering and exiting Customize-mode.
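
A rough sketch of that flow may help; the function names below are placeholders invented for this illustration and do not correspond to the actual TART/CART or Gecko internals:

```js
// Illustration only: the shape of a TART/CART-style measurement. In reality
// the interval recording happens inside Gecko; the test only triggers the
// animation and reads the recorded frame intervals back once it ends.
const recordedIntervals = [];                 // ms between consecutive frames

function startRecordingIntervals() { recordedIntervals.length = 0; }
function recordFrameInterval(ms) { recordedIntervals.push(ms); }  // stands in for Gecko
function stopRecordingIntervals() { return recordedIntervals.slice(); }

async function measureAnimation(triggerAnimation) {
  startRecordingIntervals();
  await triggerAnimation();   // e.g. open/close a tab (TART) or enter/exit
                              // customize mode (CART); the test idles meanwhile
  const intervals = stopRecordingIntervals();
  const average = intervals.reduce((a, b) => a + b, 0) / intervals.length;
  return { average, max: Math.max(...intervals) };
}
```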
Flags: needinfo?(avihpit)
So I ran TART locally on Nightly. Very early during the test run, the "Slow Addon" notification shows at the bottom of the window and stays there till the test finishes.

These notifications are known to affect performance, and generally on talos we disable them. I expect that disabling this notification would address some/all of this regression.

That being said, both from this test run and during other experiments which I did, my feeling is that the "slow addon" notification might be showing earlier than a user expects it to.

It takes only a few seconds of normal activity with the addon for the notification to show (the user did install the addon because she wants it to do _some_ work).

So regardless of the talos regression, maybe we should also file a bug to decide on a possibly better criterion for displaying this notification.
Bug 1150446 fixed an issue with the "slow addon" notification showing up when it shouldn't.
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #9)
> Bug 1150446 fixed an issue with the "slow addon" notification showing up
> when it shouldn't.

The fix is to show the notice when an addon occupies the main thread for 128ms or more in a single call; before the fix it was 8ms (round(log2(7)) == 3, 4th bucket, where durations[3] is for 8ms or more).

I'd still recommend disabling it in talos, since the notice can still show up: talos uses several addons, and some of them could definitely occupy the main thread for 128ms.
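
For illustration, here is a sketch of that bucket scheme as described (the layout is inferred from the comment above, not copied from the actual watcher code): durations[i] counts single calls that held the main thread for 2^i ms or more, so bucket 3 corresponds to 8ms+ and bucket 7 to 128ms+.

```js
// Assumed layout: durations[i] counts main-thread occupations of >= 2^i ms,
// so durations[3] is the 8ms+ bucket and durations[7] is the 128ms+ bucket.
const OLD_THRESHOLD_BUCKET = 3;   // 2^3 = 8ms, before the fix
const NEW_THRESHOLD_BUCKET = 7;   // 2^7 = 128ms, after bug 1150446

function bucketFor(durationMs) {
  return Math.max(0, Math.floor(Math.log2(durationMs)));
}

function shouldShowSlowAddonNotice(durations, threshold = NEW_THRESHOLD_BUCKET) {
  // Show the notice only if some single call landed in the threshold bucket
  // or above.
  return durations.slice(threshold).some(count => count > 0);
}

// Example: one 20ms occupation lands in bucket 4; it trips the old 8ms
// threshold but not the new 128ms one.
const durations = new Array(10).fill(0);
durations[bucketFor(20)] += 1;
console.log(shouldShowSlowAddonNotice(durations, OLD_THRESHOLD_BUCKET)); // true
console.log(shouldShowSlowAddonNotice(durations, NEW_THRESHOLD_BUCKET)); // false
```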

(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #10)
> Looks like bug 1150446 fixed the issue:
> https://compare-talos.paas.mozilla.org/?oldBranch=Mozilla-
> Inbound&oldRevs=1748feaa98e2&newBranch=Mozilla-
> Inbound&newRev=52691dffa026&submit=true
> 
> Avi, can you confirm that I'm reading the results correctly?

This compares exactly before/after the 7/128 fix.

It does show varying improvements on TART/CART, but it's hard to tell from this comparison alone if it negates the original regression.

To understand whether it does negate the original regression, we'll need to compare the post-fix changeset with a changeset from before the addon CPU measurement patches (let's call it the "base" changeset).

Could you push such a "base" changeset and compare the results with these two?
Flags: needinfo?(avihpit)
Comparing inbound from just before any monitoring vs. just after landing bug 1150446 suggests that the TART/CART issue has been fixed: https://compare-talos.paas.mozilla.org/?oldBranch=Mozilla-Inbound&oldRevs=66945c5c4a40&newBranch=Mozilla-Inbound&newRev=52691dffa026&submit=true

As avih suggested, I have just pushed to Try a comparison of just before any monitoring vs. the patch causing the regression + bug 1150446: https://compare-talos.paas.mozilla.org/?oldBranch=Try&oldRevs=85716dd32c63&newBranch=Try&newRev=ca1ffffc30a9&submit=true

Waiting for the results.
So, according to the above comparison, we have indeed resolved the TART/CART issue.
There is, however, a Dromaeo CSS regression that didn't show up earlier.
Actually, the Dromaeo CSS regression did show up. Apparently, we got back some of the loss, but not all.

Testing whether the above patch also fixes the Dromaeo regression: https://compare-talos.paas.mozilla.org/?oldBranch=Try&oldRevs=85716dd32c63&newBranch=Try&newRev=efbae00a6fd3&submit=true
Looks like a victory to me.
Did the original bug get backed out? I see the April 1 landing, and on April 2 all the talos values got reset.
Flags: needinfo?(dteller)
(In reply to Joel Maher (:jmaher) from comment #17)
> Did the original bug get backed out? I see the April 1 landing, and on
> April 2 all the talos values got reset.

Bug 1150446, which landed just after the original bug, made sure that the slow add-on dialog didn't pop up by mistake, and as a side effect fixed the slowdown for most tests. The patch you landed earlier today was a Talos-specific fix that made sure this dialog never shows, which fixed a few more tests.
Flags: needinfo?(dteller)
Blocks: 1150563
s/backout/fix.
Comment on attachment 8587952 [details] [diff] [review]
Disable add-on watcher during Talos runs

Review of attachment 8587952 [details] [diff] [review]:
-----------------------------------------------------------------

The talos framework is Joel's domain.

(In reply to Joel Maher (:jmaher) from comment #23)
> with the fix in place for this, we have two new regressions that we see:
> 6.17% - win8 - tscroll-asap:
> 4.1% - osx 10.6 - tscroll-asap:

Joel, we still need to disable the notice completely, which is likely to improve stuff further.
Attachment #8587952 - Flags: review?(avihpit) → review?(jmaher)
Ah, right, the Talos fix is still very much necessary. I expect some slowdown from bug 674779, but that's around 1-2% according to my profiling.
Attachment #8587952 - Flags: review?(jmaher) → review+
Component: JavaScript Engine → Talos
Keywords: checkin-needed
Product: Core → Testing
https://hg.mozilla.org/build/talos/rev/b55d7d150c2c

Note that Joel still needs to update other things before this gets into production.
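
For context, Talos usually neutralizes a feature like this through a profile preference. A minimal sketch in user.js form, assuming the watcher is controlled by a pref along the lines of browser.addon-watch.interval with a negative value meaning "disabled"; both the pref name and the value here are assumptions, not quotes from the attached patch:

```js
// Hypothetical user.js snippet: disable the add-on watcher in Talos profiles.
// The pref name and the -1 "disabled" sentinel are assumed for illustration.
user_pref("browser.addon-watch.interval", -1);
```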
This occurs for the tscroll-ASAP test on Windows, giving a 5.53% regression: http://alertmanager.allizom.org:8080/alerts.html?rev=3d4fb6a3ab9e&showAll=1&testIndex=0&platIndex=0
(In reply to Vaibhav (:vaibhav1994) from comment #28)
> This occurs for the tscroll-ASAP test on Windows, giving a 5.53% regression:
> http://alertmanager.allizom.org:8080/alerts.
> html?rev=3d4fb6a3ab9e&showAll=1&testIndex=0&platIndex=0

Is that with the attached fix?
Flags: needinfo?(vaibhavmagarwal)
:Yoric, yes, the regression still remains on tscroll-ASAP even after landing your patch. See the graph, where the line rises and stays elevated, showing that the regression is not fixed: http://graphs.mozilla.org/graph.html#tests=[[287,64,31],[287,132,31],[287,131,31]]&sel=none&displayrange=30&datatype=geo
Flags: needinfo?(vaibhavmagarwal)
Looks like the graph doesn't contain my patch yet (or perhaps only the last few dots do). I'll wait until tomorrow to have a better picture.
So, let's recapitulate.

Commit b4efb3f1976f caused:
1. severe CART regressions on all platforms;
2. severe TART regressions on all platforms;
3. medium tscroll-ASAP regressions on Windows XP;
4. small Dromaeo-DOM regressions on Windows 7.

Regressions were expected, as the commit patches SpiderMonkey to measure lots of stuff. While there might be further optimizations that could possibly decrease the regression, I can't think of any simple-to-medium work that could provide such optimizations.

1. and 2. were poor interactions between the commit and Talos but were not actual performance issues, and were both RESOLVED FIXED as part of bug 1151466.

Now, 3. is not necessarily surprising – this is probably due to the underlying system calls being slower on Windows XP than on other platforms. I don't really want to spend too much time fixing Windows XP stuff, so I would tend to mark this point as WONTFIX.

After discussing with Avih, it seems that Dromaeo is generally considered not a good test, so I would tend to mark this point as WONTFIX.

My conclusion is that we can mark this bug as RESOLVED FIXED.

Any objection?
Flags: needinfo?(vaibhavmagarwal)
Flags: needinfo?(jmaher)
Flags: needinfo?(avihpit)
sounds good to me!
Flags: needinfo?(jmaher)
WFM.

Joel, is it possible to verify that 1/2 were indeed fixed, 3/4 are relatively minor, and that no further regressions show up?
Flags: needinfo?(avihpit)
Bug 1152930 has been filed to cover a very specific regression, which could be the cause of regressions 3 and 4 as discussed in comment 33.
The regressions from 1 and 2 (http://alertmanager.allizom.org:8080/alerts.html?rev=b4efb3f1976f&showAll=1) are resolved. All we have to do is decide on #3.

I believe #3 is more than just WinXP; I thought it was Win8 and OS X 10.6:
http://graphs.mozilla.org/graph.html#tests=[[287,132,31],[287,131,31]]&sel=none&displayrange=30&datatype=geo
Flags: needinfo?(vaibhavmagarwal)
(In reply to Joel Maher (:jmaher) from comment #37)
> The regressions from 1 and 2
> (http://alertmanager.allizom.org:8080/alerts.
> html?rev=b4efb3f1976f&showAll=1) are resolved.

Great!


> I believe #3 is more than just WinXP; I thought it was Win8 and OS X 10.6:
> http://graphs.mozilla.org/graph.html#tests=[[287,132,31],[287,131,
> 31]]&sel=none&displayrange=30&datatype=geo

Are you absolutely sure that these patches are the cause of the regression? Yoric's compare-talos from comment 32 isolates his patches, is unaffected by external noise, was retriggered several times, and doesn't show a regression on Windows 8 (32-bit) (this is the "details" page for win8 32):

https://compare-talos.paas.mozilla.org/breakdown.htm?test_name=tscrollx&oldRevs=49bee93f4cb4&newRev=f751a51e484c&oldBranch=Try-Non-PGO&newBranch=Try-Non-PGO&os_name=win&os_version=6.1.7601&processor=x86

Unfortunately, OS X doesn't show on that page.
Is there a status for the fixes here?
There were no fixes; the issue was that the try pushes never showed anything close to what this bug reports (48% regressions), and therefore smaug had nothing to work with: on one hand you say it's a 50% regression, OTOH the try pushes show no more than 5%, and we didn't understand why (I still don't).

I suggest doing try pushes again using current m-c as a base, and hopefully getting a clearer indication of the perf impact of this patch. If it's closer to the 5% values we've seen in the previous try pushes, then I think we should land the patch as is.
(In reply to Avi Halachmi (:avih) from comment #41)
> There were no fixes; the issue was that the try pushes never showed
> anything close to what this bug reports (48% regressions), and therefore
> smaug had nothing to work with: on one hand you say it's a 50% regression,
> OTOH the try pushes show no more than 5%, and we didn't understand why (I
> still don't).
> 
> I suggest doing try pushes again using current m-c as a base, and hopefully
> getting a clearer indication of the perf impact of this patch. If it's
> closer to the 5% values we've seen in the previous try pushes, then I think
> we should land the patch as is.

Ouch. Wrong bug. Ignore this.
Reading through this bug, I really don't see anything that is actionable. Can we close this out or make it actionable?
Re comment 38, I'm not sure the status is clear, i.e. what are the remaining regressions, and on which platforms?
I looked back a bit to understand the issue. I had previously stated there was a WinXP, Win8, and OS X 10.6 regression. We have stopped running OS X 10.6, so this leaves WinXP and Win8 in question. Let's look at a graph:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=%257B%2522project%2522%253A%2522mozilla-inbound%2522%252C%2522signature%2522%253A%2522ce63129ec33e983e301442a1fad4d7d744923e70%2522%252C%2522visible%2522%253Atrue%257D&series=%257B%2522project%2522%253A%2522mozilla-inbound%2522%252C%2522signature%2522%253A%25229bb3500701afc759aabc2669875a94def4449193%2522%252C%2522visible%2522%253Atrue%257D&highlightedRevisions=b4efb3f1976f&zoom=%7B%22x%22:%5B1427660838027.5862,1428218208493.596%5D,%22y%22:%5B2.476449385933254,3.4112319946289062%5D%7D

Looking at this specific revision, it is pretty clear for WinXP, but for Win8 it started roughly 1 day and 5 hours later. Both Win8 and WinXP receive a bump in the graph on April 2nd, most likely from:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=52691dffa026

Given that try didn't show a regression, there might not be much to do here.
Assignee: dteller → nobody
I don't think there is anything else we can do for the moment.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX