Closed Bug 1254684 Opened 5 years ago Closed 5 years ago

speed-battle.com reports a 10% performance regression in Render and Storage performance in 46.0b1 versus 45.0, Peacekeeper reports nearly the same (9.8% perf regression)

Categories

(Core :: General, defect)

Version: 46 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox46 + wontfix
firefox47 - ---

People

(Reporter: thee.chicago.wolf, Unassigned)

References

Details

(Keywords: perf)

Attachments

(2 files)

Attached image ff-45-sp.jpg
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0 SeaMonkey/2.42
Build ID: 20160226091329

Steps to reproduce:

Visit http://www.speed-battle.com using FF 45.0 final. Repeat the test a few times to attain a high score. Install and run FF 46.0b1. Repeat the same steps as for FF 45.0.


Actual results:

On my i5-3470 with 8GB DDR 1866 and a 512MB NVIDIA GeForce 210, the average Render score has dropped 9.34% and the Storage score has dropped 10%. Calculate performance has remained about the same.


Expected results:

There should not be a drop in performance. Performance should stay the same or slightly increase with a newer Firefox version.
Attached image ff-46-sp.jpg
Some screen grabs of the difference in performance numbers.
Summary: speed-battle.com reports a 10% performance regression in Render and Storage performance regression in 46.0b1 versus 45.0 → speed-battle.com reports a 10% performance regression in Render and Storage performance in 46.0b1 versus 45.0
I'm not sure this online test can be trusted; I get highly variable results for the same version of Firefox, so results for 45 and 46 can't be compared, IMHO.
(In reply to Loic from comment #2)
> I'm not sure this online test can be trusted; I get highly variable results
> for the same version of Firefox, so results for 45 and 46 can't be compared,
> IMHO.

It's a good thermometer nonetheless. However, no one disputes Peacekeeper (http://peacekeeper.futuremark.com/) scores. Here are mine on the same hardware:

45.0: 4957
46.0b1, build3: 4505 or 9.12% lower
46.0b1, build4: 4437 or 10.49% lower

Those lower scores average out to a 9.8% perf regression, consistent with what Speed Battle is reporting. So yes, there is a perf regression in 46.
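
For reference, a minimal Python sketch (using only the scores quoted above) that reproduces those percentages:

baseline = 4957                           # 45.0 Peacekeeper score from this comment
betas = [4505, 4437]                      # 46.0b1 build3 and build4 scores
drops = [(baseline - b) / baseline * 100 for b in betas]
print([round(d, 2) for d in drops])       # [9.12, 10.49]
print(round(sum(drops) / len(drops), 1))  # 9.8 -> ~9.8% average regression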
Summary: speed-battle.com reports a 10% performance regression in Render and Storage performance in 46.0b1 versus 45.0 → speed-battle.com reports a 10% performance regression in Render and Storage performance in 46.0b1 versus 45.0, Peacekeeper reports nearly the same (9.8% perf regression)
User Agent 	Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID 	20160303134406

I did these tests on Firefox 45 release and Firefox 46 beta. On http://www.speed-battle.com I received different results for the same Firefox version. However, the average results show that FF 45 performs better than FF 46.0b1.

The http://peacekeeper.futuremark.com/ scores also show that FF 45 performs better than FF 46.0b1:
45.0: 4559
46.0b1, build 3: 4254
46.0b1, build 4: 4191
46.0b1, build 8: 4141
Status: UNCONFIRMED → NEW
Component: Untriaged → Performance Monitoring
Ever confirmed: true
Keywords: perf
Product: Firefox → Toolkit
To try and find a regression range, I went back to some older builds to test. Here's what I found:

Last good build: 20160107030235, scored 5059 on Peacekeeper
First bad build: 20160108030334, scored 4716 on Peacekeeper

Of note, build 20160101030330 scored 5249 on Peacekeeper

Hopefully with that info someone else can confirm and do some granular regression testing to figure out the bad change.
So I played with the mozregression tool (first time) but didn't get too far. My Jedi skills are lacking in this area.

However, I did manage to get the Pushlog from 1/7 to 1/8: https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2016-01-07&enddate=2016-01-08

Just glancing through, nothing seems to jump out at me.
Testing with 46.0b4 shows perf went down even further on Peacekeeper, with a score of 4141. I thought it was an error, but on a second run it scored 4199. Averaging those scores, we get 4170. Compared to my original score of 4957 on 45.0, it would appear perf has now gone down 15.877%. This is a rather alarming decrease and indicates a negative trend.
Arthur, in your beta build could you check about:support and see if you were part of the e10s experiment? We're shipping e10s to nightly/aurora and a subset of beta.
Component: Performance Monitoring → Untriaged
Flags: needinfo?(thee.chicago.wolf)
Product: Toolkit → Core
chutten, could you try to reproduce this with e10s and non-e10s, and if you can reproduce it, try to generate some comparative profiles?
Flags: needinfo?(chutten)
Latest Nightly (48.0a1)
e10s - 92.04 949.8 99.9 1141.74
!e10s - 90.02 963.76 99.9 1153.68

Latest DevEd (47.0a2)
e10s - 91.02 949.8 106.39 1147.21
!e10s - 94.16 936.23 109.23 1139.62

In other words, e10s doesn't seem to have much/any effect on the stated numbers.

Also, I noted that individual runs can range anywhere from the 700s to nearly 1200. I tried my best to pick a "representative" number, but humans aren't good at that. Regardless, I'm not sure speed-battle is a rigorous measure.
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #10)
> Latest Nightly (48.0a1)
> e10s - 92.04 949.8 99.9 1141.74
> !e10s - 90.02 963.76 99.9 1153.68
> 
> Latest DevEd (47.0a2)
> e10s - 91.02 949.8 106.39 1147.21
> !e10s - 94.16 936.23 109.23 1139.62
> 
> In other words, e10s doesn't seem to have much/any effect on the stated
> numbers.
> 
> Also, I noted that individual runs can range anywhere from the 700s to nearly
> 1200. I tried my best to pick a "representative" number, but humans aren't
> good at that. Regardless, I'm not sure speed-battle is a rigorous measure.

Peacekeeper is more representative than Speed Battle IMO.
Flags: needinfo?(thee.chicago.wolf)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> Arthur, in your beta build could you check about:support and see if you were
> part of the e10s experiment? We're shipping e10s to nightly/aurora and a
> subset of beta.

Meaning:
Multi-process staged rollout	1.0	true	e10srollout@mozilla.org

or:
Multi-process Firefox A/B Test 46.1	e10s-beta46-noapz@experiments.mozilla.org	Measure the effect of multi-process Firefox	false	1458647505079		control-addon-user
I thought the bug was about speed-battle, not peacekeeper. Maybe it's about the regression.

Sure, I'll give a quick e10s/!e10s comparison of Peacekeeper.

Of course, none of these or my earlier numbers are statistically significant in any way (no repeated trials, the hardware running the test wasn't dedicated to it (i.e., I was doing other things while running the tests), and AFAICT the test hasn't been vetted...), so they are merely illustrative.

Nightly 48.0a1 2016-03-22
e10s 3936
!e10s 3916

I think this regression is not related to e10s. There does seem to be some evidence here of a potential 45/46 regression, though I wonder how much of it can be attributed to build differences between Beta and Release. Anyone want to dig up a copy of a 45 beta build for an apples-to-apples test?
There aren't real build differences between beta and release (releases are basically just promoted beta builds). But there are differences in behavior because telemetry (and experiments) are enabled by default when you're on the beta channel.

In any case, I'm looking for a recommendation of next steps: see if you can reproduce an obvious problem (e10s-related or a standard regression) and what next steps should be. You can of course find old beta builds at https://archive.mozilla.org/pub/firefox/releases/
Flags: needinfo?(chutten)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #14)
> There aren't real build differences between beta and release (releases are
> basically just promoted beta builds). But there are differences in behavior
> because telemetry (and experiments) are enabled by default when you're on
> the beta channel.
> 
> In any case, I'm looking for a recommendation of next steps: see if you can
> reproduce an obvious problem (e10s-related or a standard regression) and
> what next steps should be. You can of course find old beta builds at
> https://archive.mozilla.org/pub/firefox/releases/

So...the below is not useful?

Last good build: 20160107030235, scored 5059 on Peacekeeper
First bad build: 20160108030334, scored 4716 on Peacekeeper
(In reply to Arthur K. from comment #15)
> So...the below is not useful?
> 
> Last good build: 20160107030235, scored 5059 on Peacekeeper
> First bad build: 20160108030334, scored 4716 on Peacekeeper

Regression ranges are always useful. Thank you very much!

Unfortunately, I'm not having such luck with Peacekeeper. In trying to nail down relevant changesets, I tried getting a regression range on the aurora channel. Three runs of a single build had scores that differed by roughly 1100 points (mozilla-aurora 2016-01-22 (41159a2) gave me results of 3768, 4590, and 4857, for instance). I can't nail down a 10% regression when the numbers vary by at least 20% without generating a statistical population of results.

At 5 minutes for each result, generating enough for statistics is going to take more time than I'd like.
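
As a rough illustration, a minimal Python sketch (the score lists below are hypothetical placeholders, not measured data) of the kind of repeated-run summary that would be needed before trusting a ~10% delta:

from statistics import mean, stdev

# Hypothetical placeholder runs; substitute real repeated Peacekeeper scores.
old_runs = [4957, 4880, 5010]
new_runs = [4505, 4437, 4600]

old_m, new_m = mean(old_runs), mean(new_runs)
delta_pct = (old_m - new_m) / old_m * 100
spread_pct = max(stdev(old_runs), stdev(new_runs)) / old_m * 100

print(f"delta: {delta_pct:.1f}%  run-to-run spread: ~{spread_pct:.1f}%")
# A delta is only believable if it is clearly larger than the run-to-run spread.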

Could you help me by telling me how you came by those builds and those scores? Were they just one-off, or were they stable across multiple runs of the benchmark? Did you build these yourself (and off of which channel?)?
Flags: needinfo?(chutten)
Flags: needinfo?(thee.chicago.wolf)
(In reply to Chris H-C :chutten from comment #16)
> (In reply to Arthur K. from comment #15)
> > So...the below is not useful?
> > 
> > Last good build: 20160107030235, scored 5059 on Peacekeeper
> > First bad build: 20160108030334, scored 4716 on Peacekeeper
> 
> Regression ranges are always useful. Thank you very much!
> 
> Unfortunately, I'm not having such luck with Peacekeeper. In trying to nail
> down relevant changesets, I tried getting a regression range on the aurora
> channel. Three runs of a single build had scores that differed by 1100
> points (mozilla-aurora 2016-01-22 (41159a2) gave me results of 3768 4590 and
> 4857, for instance). I can't nail down a 10% regression if the numbers vary
> at least 20% without generating a statistical population of results. 
> 
> At 5 minutes for each result, generating enough for statistics is going to
> take more time than I'd like.

I find that with Peacekeeper my scores are within a few percent of each other. I tend to do a clean reboot and run it a few times with no other apps open.

 
> Could you help me by telling me how you came by those builds and those
> scores? Were they just one-off, or were they stable across multiple runs of
> the benchmark? Did you build these yourself (and off of which channel?)?

The specific builds I used to find last good/first bad came from http://ftp.mozilla.org/pub/firefox/nightly/2016/01/2016-01-07-03-02-35-mozilla-central/ and http://ftp.mozilla.org/pub/firefox/nightly/2016/01/2016-01-08-03-03-34-mozilla-central/ respectively.
Flags: needinfo?(thee.chicago.wolf)
Awesome, thanks!

For me, ./mach mozregression --good 2016-01-07 --bad 2016-01-08 eventually points me to bug 1205288, which doesn't immediately look like something that ought to cause a regression (maybe checking the method early on in a blob request counts towards the bench whereas, before, it wouldn't have?).

I'm going to see if reverting that change actually has an effect.
No obvious effect. But, then, I wasn't really expecting one. For the first time, mozregression has failed me. 

The changelog from the builds you cite is here: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e0bcd16e1d4b99ba3e542149d0d41e0f60c54b5c&tochange=1424cdfc075d1b7e277be914488ac73e20d1c982

It isn't short. Nothing in particular jumps out at me as an obvious candidate.

Passing those revs into mozregression, I get (3956, 3612, 3818) for 20160107032110 and (3955, 3916, 3891) for 20160108062355

I don't think I have the appropriate hardware to test this regression. Have you used mozregression[0] before? It helps bisect down to the change that introduced a regression. If you have a build environment handy, you can try running 

./mach mozregression --good e0bcd16e1 --bad 1424cdfc075 

(those are the SHAs for the good and bad builds you supplied earlier)

[0]: http://mozilla.github.io/mozregression/
Flags: needinfo?(thee.chicago.wolf)
> I don't think I have the appropriate hardware to test this regression. Have
> you used mozregression[0] before? It helps bisect down to the change that
> introduced a regression. If you have a build environment handy, you can try
> running 
> 
> ./mach mozregression --good e0bcd16e1 --bad 1424cdfc075 
> 
> (those are the SHAs for the good and bad builds you supplied earlier)
> 
> [0]: http://mozilla.github.io/mozregression/

I tried installing the mozregression GUI but didn't come up with much. Not sure if it requires my hardware to repro, as it seems Liviu was seeing the same perf regression I am seeing. Did you also test with 45.0 and 46.0b1?
Flags: needinfo?(thee.chicago.wolf)
Based on the comments in the bug, we have not been able to confirm this perf slowdown due to the variance Chris has seen across multiple runs. It is still too early to track this for 47. Liz, just FYI, there is a claim that 46 is 10% worse in perf.
Flags: needinfo?(lhenry)
Will, I am flagging this one for your attention too. Can somebody from the Talos team take a look at the perf regression being mentioned here? There is a notable difference between 45.0 and 46.0b1. Any help here is appreciated.
Flags: needinfo?(wlachance)
All the unresolved Talos issues in Firefox 46 are tracked in bug 1233209. I don't see anything offhand that would be related to this. That said, we don't have that many benchmarks like this in Talos; we tend to leave that to the JS team with arewefastyet:

https://arewefastyet.com/

... but I don't see any large changes there either.
Flags: needinfo?(wlachance)
Running mozregression also failed for me. I had narrowed it to a 1-2 day regression range when it switched over to taskcluster builds and things got weird.

E.g.
> $ mozregression --good=2016-01-06 --bad=2016-01-10
>  0:04.37 INFO: Testing good and bad builds to ensure that they are really good and bad...
>  0:04.38 INFO: Downloading build from: https://archive.mozilla.org/pub/firefox/nightly/2016/01/2016-01-06-03-02-25-mozilla-central/firefox-46.0a1.en-US.linux-x86_64.tar.bz2
> ===== Downloaded 100% =====
>  0:13.32 INFO: Running mozilla-central build for 2016-01-06
>  0:25.89 INFO: Launching /tmp/tmp0oDJF2/firefox/firefox
>  0:25.90 INFO: application_buildid: 20160106030225
>  0:25.90 INFO: application_changeset: 9d6ffc7a08b6b47056eefe1e652710a3849adbf7
>  0:25.90 INFO: application_name: Firefox
>  0:25.90 INFO: application_repository: https://hg.mozilla.org/mozilla-central
>  0:25.90 INFO: application_version: 46.0a1
> Was this nightly build good, bad, or broken? (type 'good', 'bad', 'skip', 'retry' or 'exit' and press Enter):

Score: 4000

> $ mozregression --repo=mozilla-central --launch 9d6ffc7a08b6b47056eefe1e652710a3849adbf7
> 
>  0:03.37 INFO: Downloading build from: https://queue.taskcluster.net/v1/task/dJ5fBPeGRJiGVOIUpFc8LQ/runs/0/artifacts/public%2Fbuild%2Ffirefox-46.0a1.en-US.linux-x86_64.tar.bz2
> ===== Downloaded 100% =====
>  0:20.05 INFO: Running mozilla-central build built on 2016-01-06 01:43:50.916000, revision 9d6ffc7a
>  0:31.66 INFO: Launching /tmp/tmpm2QOOp/firefox/firefox
>  0:31.66 INFO: application_buildid: 20160105163504
>  0:31.66 INFO: application_changeset: 9d6ffc7a08b6b47056eefe1e652710a3849adbf7
>  0:31.66 INFO: application_name: Firefox
>  0:31.66 INFO: application_repository: https://hg.mozilla.org/mozilla-central
>  0:31.66 INFO: application_version: 46.0a1

Score: 3400

Do the builds differ, and if so, how?
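
For what it's worth, the two logs above already show one concrete difference: the same changeset but a different buildid (and a ~600-point score difference). A minimal Python sketch, with the application_* values copied from the logs above, to diff the reported fields:

nightly = {
    "application_buildid": "20160106030225",
    "application_changeset": "9d6ffc7a08b6b47056eefe1e652710a3849adbf7",
    "application_version": "46.0a1",
}
taskcluster = {
    "application_buildid": "20160105163504",
    "application_changeset": "9d6ffc7a08b6b47056eefe1e652710a3849adbf7",
    "application_version": "46.0a1",
}

for key in nightly:
    if nightly[key] != taskcluster[key]:
        print(key, nightly[key], "!=", taskcluster[key])
# Only application_buildid differs here; the changeset is identical.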
Hannes, can you take on the investigation of this bug from here? I'm a little worried about it for the 46 release, and want to make sure we figure out how serious the issue is, and if there's anything we can do about it for 46.  For now I'll track the bug and consider it affected for 46. 

If there is any help you need from QE, maybe Florin's team can contribute or do some testing.
Flags: needinfo?(lhenry)
Flags: needinfo?(hv1989)
Flags: needinfo?(florin.mezei)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #25)
> Hannes, can you take on the investigation of this bug from here? I'm a
> little worried about it for the 46 release, and want to make sure we figure
> out how serious the issue is, and if there's anything we can do about it for
> 46.  For now I'll track the bug and consider it affected for 46. 
> 
> If there is any help you need from QE, maybe Florin's team can contribute or
> do some testing.

The issue is tracked in bug 1260435. Bug 1207922 is the cause of it: there we improved the performance and maintainability of the regexp test and exec functions.

What we are seeing here is an extra optimization that no longer happens due to that refactor. The optimization was aimed at speeding up creation of a regexp in a loop; in particular we had to do this for benchmark scores (octane-regexp). In tight loops with an easy/fast regexp, creating the regexp takes most of the time, so the optimization tried to hoist it out of the loop. The refactor accidentally disabled this for most regexp creations, although even with the optimization disabled the refactor was a net win on octane-regexp.

I don't think we need to track this for beta/aurora/etc. On AWFY, which tests the 'official' benchmarks, we haven't seen a regression; Peacekeeper isn't one of them (and Peacekeeper is outdated, according to the benchmark itself). The regressions on the two subtests (the same change also improved two tests) are micro-benchmarks. We will fix the issue, since it will also improve octane-regexp, but I don't think backports or tracking are needed.
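
As an illustration only (a Python analogy of the pattern described above, not SpiderMonkey internals): constructing a regexp inside a tight loop versus hoisting the construction out of it:

import re
import time

TEXT = "speed-battle 12345"
N = 200_000

start = time.perf_counter()
for _ in range(N):
    re.compile(r"\d+").search(TEXT)   # regexp constructed inside the loop
inside = time.perf_counter() - start

pattern = re.compile(r"\d+")          # construction hoisted out of the loop
start = time.perf_counter()
for _ in range(N):
    pattern.search(TEXT)
hoisted = time.perf_counter() - start

# Python caches compiled patterns, so this understates the cost a JS engine
# pays when automatic hoisting of regexp creation is accidentally disabled.
print(f"inside loop: {inside:.3f}s   hoisted: {hoisted:.3f}s")

The engine optimization effectively did this hoisting automatically; losing it in the refactor is what micro-benchmarks like these notice.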
Flags: needinfo?(hv1989)
Sounds like there's nothing more for QA to do here as this is known and tracked separately.

Arthur, I guess you can retry getting some measurements when bug 1260435 is fixed.
Flags: needinfo?(florin.mezei)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #27)
> Sounds like there's nothing more for QA to do here as this is known and
> tracked separately.
> 
> Arthur, I guess you can retry getting some measurements when bug 1260435 is
> fixed.

I'll keep an eye on it after bug 1260435 is fixed. Thanks, all, for investigating. I know Peacekeeper's not being maintained anymore by the author, but it's still useful. Well, at least for me. =/
In the interest of clarity, we're not going to track this generic issue by itself, so I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE
Moving from Core::Untriaged to Core::General https://bugzilla.mozilla.org/show_bug.cgi?id=1407598
Component: Untriaged → General