Closed Bug 1141252 Opened 9 years ago Closed 9 years ago

2% WinXP tp5o_scroll regression on inbound (v.39) on March 08, 2015 from push 16f240fd86de

Categories

(Core :: MathML, defect)

Hardware: x86_64
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: jmaher, Assigned: roc)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Attachments

(1 file)

Talos has detected a Firefox performance regression from your commit 16f240fd86de in bug 1139709.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=16f240fd86de&showAll=1

On the page above you can see a Talos alert for each affected platform, as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5o_scroll

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win32 -u none -t g1  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a tp5o_scroll

Making a decision:
We need your feedback, as the patch author, to help us handle this regression.
*** Please let us know your plans by Thursday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Roc, can you look into this?  This is a really small regression on WinXP only, so I would be sad if you spent a lot of time investigating it.  But if there is an obvious thing to adjust, or a good reason to accept this as a known regression, it would be good to know that as well.
Flags: needinfo?(roc)
The patches in that bug very clearly only affect pages that use MathML.

As far as I can tell, Tp5 pages do not use MathML, and if they did, those patches would speed them up.

I can only conclude that this is some kind of anomaly in the build or test process.
Flags: needinfo?(roc)
The big NSS upgrade that landed a few hours later seems like a more likely culprit.
http://hg.mozilla.org/integration/mozilla-inbound/rev/3ec78ec97624

Looking at the graph, I agree a regression occurred early morning Mar 7.
Never mind, this is about the tp5o_scroll test, not tp5 as I first thought.
Clearly NSS won't affect scroll performance.  I don't see any likely culprit
in the regression window.  I kind of suspect an external event like IT
maintenance or something caused it.

Would it be possible to trigger more tp5o_scroll test runs on a few changesets
on both sides of rev 16f240fd86de on the affected platform(s)?
Flags: needinfo?(jmaher)
doing the retriggers:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=826e9c97526b&tochange=dca901fa0641&filter-searchStr=Windows%20XP%2032-bit%20mozilla-inbound%20talos%20g1

still points to 16f240fd86de as the push that caused the increase in tp5o scroll.  Again, this is a small increase and on winxp only.  tp5o_scroll loads a webpage and scrolls it down to the end for each of the 49 pages we have in the tp5 pageset.
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #5)
> Again, this is a small increase and on winxp only.

I wouldn't consider a 2% scroll regression small.  It's major IMO.
Now I see that bug 1082249 landed together with the MathML patches.
That's probably what caused it.
Flags: needinfo?(roc)
(In reply to Mats Palmgren (:mats) from comment #6)
> (In reply to Joel Maher (:jmaher) from comment #5)
> > Again, this is a small increase and on winxp only.
> 
> I wouldn't consider a 2% scroll regression small.  It's major IMO.

What's your reference point for calling it major?

Typically 2% is the threshold below which we don't file regression bugs, because it's very close to the noise level of the test results.
The noise level is irrelevant really.  If you have results that vary in the range
95 - 105 normally and then go to the range 97 - 107, that's a ~2% regression even
though it's within the level of noise.  It's no different to having zero noise
and going from 100 to 102.
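
To put that in concrete terms (a minimal Python sketch with made-up numbers, not real Talos data), a consistent ~2% shift in the mean shows up clearly even when the two ranges overlap:

import statistics

# Hypothetical runs: the "after" set is just the "before" set shifted up by 2,
# so the ranges overlap (95-105 vs 97-107) but the mean moves by 2%.
before = [95, 97, 99, 100, 101, 103, 105]
after = [b + 2 for b in before]

shift = 100.0 * (statistics.mean(after) - statistics.mean(before)) / statistics.mean(before)
print("mean before: %.1f, mean after: %.1f, shift: %.1f%%"
      % (statistics.mean(before), statistics.mean(after), shift))
# -> mean before: 100.0, mean after: 102.0, shift: 2.0%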

I consider a 2% scroll regression major since janky scroll performance is something users dislike strongly.  And, as far as I know, we have had some problems in that area historically, so I suspect we might still have only barely acceptable performance on some hardware.  A 2% regression in that situation might push us over the edge where it's no longer acceptable to many users and they switch browsers.

As far as I know, we still care about WinXP performance since we have many users there (mostly in China, I think).
I had a look at the patch in bug 1082249 and it seems pretty clear it won't affect anything other than table parts with background-attachment:fixed. Maybe our Tp5o set contains one of those? I'll get hold of the pageset and see which part of the patch is being triggered (if any).
Flags: needinfo?(roc)
OK, I found the problem and it's a doozy.

It's in the xinhuanet.com page, which contains the following rules:

.dian { background-attachment: fixed; background-image: url(../news.xinhuanet.com/test2009/img/dian.gif); background-repeat: repeat; background-position: center center; }
.landian { background-attachment: fixed; background-image: url(../news.xinhuanet.com/test2009/img/landian.gif); background-repeat: repeat; background-position: center center; }

dian.gif is a 3x3 transparent image with a gray dot in the center. landian.gif is a 3x3 transparent image with a blue dot in the center. These classes are assigned to 3px-width or 3px-height table cells to create dotted vertical or horizontal lines dividing articles and other content from one another.

Before the fix for bug 1082249, we were rendering this page incorrectly during scrolling, although it's practically impossible to tell visually, which is probably why no one ever noticed. Chrome renders the background-attachment:fixed backgrounds correctly, but it's almost impossible to tell that they actually are attachment-fixed even when you're looking for it ... which is a good thing, since it would look terrible if you could notice it.

Since almost all the page content is in a table which will be fully invalidated by the patch in bug 1082249, I guess we've slowed down the scrolling of this page substantially.

The current Xinhuanet.com page doesn't seem to have anything like this.

I guess the question is whether there are other pages still doing something like this, and whether we should care that they will scroll slowly. I'm inclined to say we shouldn't care. I don't see a simple way to speed the xinhuanet test page back up while getting "correct" rendering.
Actually there might be...
Assignee: nobody → roc
Status: NEW → ASSIGNED
Paint flashing shows that this patch fixes the xinhuanet.com test page.
I tried earlier to get graphs with subtests of tp5o_scroll, but Treeherder is not quite there yet.

So then I tried picking two revisions and looking at their Talos run logs. I looked at the XP opt and pgo logs. Here's how the summary looks for this page:

5334d2bead3e (2015-03-12)

XP opt
> 21:17:01     INFO -  [#46] xinhuanet.com/xinhuanet.com/index.html  Cycles:12  Average:8.10  Median:8.09  stddev:0.10 (1.2%)  stddev-sans-first:0.07
> 21:17:09     INFO -  Values: 8.3  8.0  8.1  8.2  8.0  8.0  8.0  8.1  8.1  8.0  8.1  8.2

XP pgo
> 22:48:13     INFO -  [#46] xinhuanet.com/xinhuanet.com/index.html  Cycles:12  Average:5.82  Median:5.80  stddev:0.10 (1.8%)  stddev-sans-first:0.08
> 22:48:13     INFO -  Values: 6.0  5.8  5.8  5.9  5.7  5.8  5.7  5.8  5.8  5.8  6.0  5.8


43fb1f92e8d4 (2015-03-06)

XP opt
> 15:33:41     INFO -  [#46] xinhuanet.com/xinhuanet.com/index.html  Cycles:12  Average:6.84  Median:6.70  stddev:0.70 (10.4%)  stddev-sans-first:0.13
> 15:33:41     INFO -  Values: 9.0  6.8  6.7  6.7  6.5  6.7  6.6  6.4  6.5  6.6  6.8  6.7

XP pgo
> 17:20:13     INFO -  [#46] xinhuanet.com/xinhuanet.com/index.html  Cycles:12  Average:3.69  Median:3.70  stddev:0.06 (1.5%)  stddev-sans-first:0.04
> 17:20:13     INFO -  Values: 3.8  3.7  3.6  3.7  3.7  3.7  3.7  3.6  3.7  3.7  3.7  3.6

So this page indeed regressed in both pgo and non-pgo builds.
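
For reference, here's the regression those medians imply (a quick sketch using only the numbers quoted above):

# Per-build regression on the xinhuanet.com page, computed from the
# medians in the log excerpts above (old = 43fb1f92e8d4, new = 5334d2bead3e).
medians = {
    "XP opt": (6.70, 8.09),
    "XP pgo": (3.70, 5.80),
}
for build, (old, new) in medians.items():
    print("%s: %.2f -> %.2f (%+.1f%%)" % (build, old, new, 100.0 * (new - old) / old))
# -> XP opt: 6.70 -> 8.09 (+20.7%)
# -> XP pgo: 3.70 -> 5.80 (+56.8%)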

I then thought of giving compare-talos a try. I wasn't expecting it to allow sub-results of tp5o_scroll, but it did, though it doesn't show if it's for pgo or opt, I'm guessing it's both together.

The compare-talos overview of all tests between these revisions indicates that this test on XP "regressed" by ~6%, but it doesn't flag it as such since it thinks it's within the noise level (original values 3.25 +/- 1.38, new values 3.44 +/- 1.76). I'm guessing that's stddev:

http://compare-talos.mattn.ca/?oldRevs=43fb1f92e8d4&newRev=5334d2bead3e&server=graphs.mozilla.org&submit=true


"details" page for tpro-scroll on XP:
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=45906904,45905066,45914122,45911500,45923298&newTestIds=46067648,46072074&testName=tp5o_scroll&osName=Windows%20XP&server=graphs.mozilla.org

And indeed it shows that the xinhuanet page regressed ~60%; however, it also seems that all the other pages (except myspace.com) regressed ~5%.

It's possible that the overall ~5% regression is indeed noise, with the real regression being the ~60% hit on the xinhuanet page (which is one of the 50 pages this test scrolls).
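
As a rough sanity check (a back-of-the-envelope sketch that assumes the 50 pages contribute roughly equally and linearly to the suite summary, which is a simplification of how the summary is actually computed):

# A ~60% hit on one of ~50 roughly equally weighted pages corresponds to a
# small overall shift, in the same ballpark as the ~2% suite-level alert.
pages = 50
xinhuanet_hit = 0.60
overall = xinhuanet_hit / pages
print("expected overall shift: ~%.1f%%" % (100.0 * overall))
# -> expected overall shift: ~1.2%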

Sorry that I couldn't provide this data earlier, but at least we know now that compare-talos will be useful in evaluating a fix.
(In reply to Avi Halachmi (:avih) from comment #15)
> ... though it doesn't show if it's for pgo or opt, I'm guessing it's both together.

It's indeed together.

> The compare-talos overview of all tests between these revisions indicates
> that this test on XP "regressed" by ~6%, but it doesn't flag it as such
> since it thinks it's within the noise level (original values 3.25 +/- 1.38,
> new values 3.44 +/- 1.76). I'm guessing that's stddev:

Also correct; since it's both pgo and opt results combined, the noise level appears high.

However, I then tried clicking the "new site" link at the top of the page, which takes its data from Datazilla and can distinguish between pgo and opt results (the "default" site takes its data from the graph server, whose API doesn't allow this distinction).

This new site indeed indicates that both pgo and non-pgo builds regressed about 2.4% on average (and uses a red color to indicate it's conclusive), and in both cases most pages were roughly unchanged while the xinhuanet page took a ~55% hit.

That's the opt XP build: https://compare-talos.paas.mozilla.org/breakdown.htm?test_name=tp5o_scroll&oldRevs=43fb1f92e8d4&newRev=5334d2bead3e&oldBranch=Firefox-Non-PGO&newBranch=Firefox-Non-PGO&os_name=win&os_version=5.1.2600&processor=x86

And the pgo one looks similar.
No longer blocks: 1131000, 1139709
Comment on attachment 8576376 [details] [diff] [review]
Don't mark the TablePainter display item as having background-attachment:fixed content if the background-attachment:fixed image will be painted by a dedicated nsDisplayTableCellBackground

Nice!
Attachment #8576376 - Flags: review?(mats) → review+
Thanks :roc!  I see an improvement for this test.
https://hg.mozilla.org/mozilla-central/rev/4100738d662d
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39