Closed Bug 1650398 Opened 1 year ago Closed 1 year ago

27.7 - 38.75% raptor-tp6-google-firefox-cold replayed (linux64-shippable, windows10-64-shippable, windows7-32-shippable) regression on push f0365e7eadd029a669a72bc6cb9c5171e735d3e2 (Wed June 3 2020)

Categories

(Core :: Panning and Zooming, defect, P2)

79 Branch
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox78 --- unaffected
firefox79 --- wontfix
firefox80 --- wontfix

People

(Reporter: Bebe, Assigned: botond)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Perfherder has detected a raptor performance regression from push f0365e7eadd029a669a72bc6cb9c5171e735d3e2. As the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

39% raptor-tp6-google-firefox-cold replayed windows10-64-shippable opt 812.71 -> 497.75
37% raptor-tp6-google-firefox-cold replayed windows7-32-shippable opt 740.00 -> 466.17
28% raptor-tp6-google-firefox-cold replayed linux64-shippable opt 850.29 -> 614.75

Improvements:

64% raptor-tp6-google-firefox-cold loadtime windows7-32-shippable opt 1,101.83 -> 397.08
58% raptor-tp6-google-firefox-cold loadtime windows10-64-shippable opt 1,112.67 -> 464.50
31% raptor-tp6-google-firefox-cold not-replayed windows10-64-shippable opt 233.08 -> 160.33
23% raptor-tp6-google-firefox-cold windows7-32-shippable opt 385.83 -> 298.67
21% raptor-tp6-google-firefox-cold windows10-64-shippable opt 384.43 -> 304.04

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(botond)
Component: Performance → Panning and Zooming
Product: Testing → Core

Opening the page locally, it looks the same, but we need to make sure we are not losing any valuable requests.

Will have a look.

Assignee: nobody → botond
Flags: needinfo?(botond)

Looking over the mitm_netlocs artifacts for before/after the regression there appear to be some page loads that are not requesting certain URLs. For example, both include 25 requests for https://www.google.com/search?hl=en&q=barack+obama&cad=h (the test URL), but the image at https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT15FuICR-PSp1_Pv6dLvbYjyu8eUyhMynwX664i0T78vfZYo9YPL_KsVE is requested 16 times in 9c0a4461457638440c1383f3a42f51c3d18f7c69 but only 11 times in f0365e7eadd029a669a72bc6cb9c5171e735d3e2. As every load should be identical, even having 16 requests out of 25 is concerning.

Where are these requests recorded? I assume somewhere within the network code? The image cache can mean the network code never knows about an image request.

We have request recordings made using mitmproxy, and these requests are replayed using the same tool.
During playback we record the URL of each request and the status code we (the tool) reply with, and from that we generate the replayed, not-replayed, and confidence metrics.
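In rough terms, the metric derivation described above could be sketched as follows. This is a hypothetical illustration only; the function and field names are assumptions, not the actual Raptor/mitmproxy implementation.

```python
# Hypothetical sketch of how replay metrics could be derived from a
# playback log. Names are illustrative, not the real Raptor code.

def replay_metrics(recorded_urls, playback_log):
    """recorded_urls: set of URLs present in the mitmproxy recording.
    playback_log: list of (url, status_code) pairs observed during replay."""
    # A request counts as "replayed" if it matched the recording and
    # was answered successfully.
    replayed = sum(1 for url, status in playback_log
                   if url in recorded_urls and status == 200)
    not_replayed = len(playback_log) - replayed
    total = len(playback_log)
    # "confidence": fraction of requests served from the recording.
    confidence = replayed / total if total else 0.0
    return {"replayed": replayed,
            "not-replayed": not_replayed,
            "confidence": confidence}
```

Under this model, a page load that skips some image requests (e.g. because of caching) lowers the replayed count without any content actually being missing, which matches the behavior discussed in this bug.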

botond: Triaging as REO for 79 - can you set a priority and severity on this bug?

Flags: needinfo?(botond)

Marking as P2 until we understand better what's going on here.

However, my investigation so far is pointing to bug 1611660 not being the culprit here.

I ran two Try pushes:

I ran Raptor tests on Linux x64 shippable, and retriggered the google-c job 10 times in each Try push.

I then entered these two revisions into the Perfherder compare view and looked at the entry raptor-tp6-google-firefox-cold replayed opt, which shows the expected 10/10 runs.

The average delta in that entry is +1.00%. There is a fair amount of variation between individual runs, with a standard deviation of ~15%. So, while an individual run may show the patches that include bug 1611660 scoring 20-30% worse, others show them scoring better by a similar amount, and there seems to be no systematic effect.

Am I missing something / looking at the wrong thing here?

Severity: -- → S2
Flags: needinfo?(botond) → needinfo?(fstrugariu)
Priority: -- → P2

Your info is right, but looking at the jobs, you should have compared:
Windows 7 Shippable opt Raptor performance tests on Firefox test-windows7-32-shippable/opt-raptor-tp6-firefox-cold-google-e10s Rap(google-c)

I ran some try builds with your links; hopefully this will show the regression.

Flags: needinfo?(fstrugariu) → needinfo?(botond)

After retriggering those win7 jobs, it's pretty clear they are both essentially the same and nowhere near the "before" numbers mentioned in comment 0.

Flags: needinfo?(fstrugariu)

I did some retriggers on try and the numbers are constant.

Also added some jobs of the same tests on Browsertime to see if it's the framework:

After (bug 1611660 and browsertime on google):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=10d65dc1386d719e9890aff07ba7c958f49b6704

Before (before bug 1611660, i.e. bug 1641291, and browsertime on google):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0df348ca0f393cec382f7a4e1468c8090e58e77f

I'm concerned that this perf test isn't reliable.

We see number of requests almost get cut in half, and a corresponding drop in load time of also about a half. And there doesn't appear to be any content missing from the page. The missing requests were pointed out to be images. The image cache in Firefox could easily explain this if small timing changes affect when entries in the image cache get removed from the cache, and hence if we use a cached image or have to fetch the image again over the network. As someone who works on the image cache this seems entirely plausible to me.

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #11)

Also added some jobs of the same tests on Browsertime to see if it's the framework:

All of these jobs failed with "raptor-main Critical: test 'google' could not be found for firefox"

(In reply to Timothy Nikkel (:tnikkel) from comment #13)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #11)

Also added some jobs of the same tests on Browsertime to see if it's the framework:

All of these jobs failed with "raptor-main Critical: test 'google' could not be found for firefox"

It looks like the browsertime setup failed.
I did a backout of bug 1611660 and ran the browsertime tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59

Mozilla central try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9236835fac366066055eacd3d1bb30a9667290c

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #15)

Compare view between the 2 builds:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&framework=13

Do you see any significant differences? Doesn't look like it to me, but I don't read perf compares very often.

Note: as the before and after are central and the backout, it proves your bug introduced the regression.

Timothy and/or Botond, I'm wondering if we should back out bug 1611660 from Beta in light of comment 18. That would at least buy us more time to investigate without shipping this regression to release. Note that next week is RC week.

Flags: needinfo?(tnikkel)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Found the diffs in the subtest link of the tests:
Raptor:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

I think you meant to post a different link here, as this is also showing browsertime.

Flags: needinfo?(fstrugariu)

(In reply to Botond Ballo [:botond] from comment #20)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Found the diffs in the subtest link of the tests:
Raptor:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

I think you meant to post a different link here, as this is also showing browsertime.

Anyways, starting from that link and going to "Back to all tests and platforms" and then selecting "raptor" from the dropdown, I'm not seeing any significant improvements, certainly nothing comparable to the 27-38% mentioned in the bug description.

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Browsertime:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

Keep in mind the alert is on windows10-64-shippable, the test name is google-search opt cold, and the subtest is loadtime.

After increasing the number of retriggers from 7 to 16, the improvement in this metric went from 17% to 3.6%.

The results of individual test runs are clustered around three values (~380, ~720, and ~1050) both before and after, and the difference in the average seems to be due to variations in which value individual tests hit; this difference gets smaller as the number of test runs increases.

So, I don't think the patch is having a systematic effect here.
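The effect described above — scores clustering around a few modes, with apparent before/after gaps shrinking as retriggers accumulate — can be illustrated with a small sketch. The numbers below are made up for illustration (loosely based on the ~380/~720/~1050 clusters mentioned above), not the actual test data.

```python
# Illustrative only: per-run scores cluster around three modes, so with
# few retriggers the particular mix of modes dominates the average, and
# two samples of the same underlying distribution can look very
# different. As more runs accumulate, the means converge.
import statistics

before = [380, 380, 1050, 720, 380, 1050, 380]   # 7 retriggers
after  = [1050, 380, 720, 1050, 380, 1050, 720]  # 7 retriggers

gap_small = abs(statistics.mean(before) - statistics.mean(after))

# Add more runs drawn from the same mixture of modes:
before_16 = before + [720, 380, 1050, 380, 720, 1050, 380, 720, 380]
after_16  = after + [380, 720, 380, 380, 1050, 380, 720, 380, 1050]

gap_large = abs(statistics.mean(before_16) - statistics.mean(after_16))
# gap_large is well under half of gap_small: the apparent "regression"
# shrinks as the number of retriggers grows.
```

This mirrors what was observed here: going from 7 to 16 retriggers took the reported improvement from 17% to 3.6%.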

Flags: needinfo?(botond)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

Timothy and/or Botond, I'm wondering if we should backout bug 1611660 from Beta in light of comment 18. That would at last buy us more time to investigate without shipping this regression to release. Note that next week is RC week.

A few thoughts on this:

  • Bug 1611660 fixes user-noticeable functional regressions (the STR in bug 1611660 itself, and bug 1604504).
  • In bug 1643604, we investigated a seemingly large tscrollx regression caused by bug 1611660, which turned out to be an artifact of the tscrollx test itself being flaky in the sense of small variations in startup timing resulting in large differences in test outcomes (bug 1645275).
  • The raptor and browsertime regressions discussed in this bug seem to be due to existing variation in test outcomes (i.e. individual test runs producing very different scores), and differences between "before" and "after" diminish with an increasing number of retriggers (comment 10, comment 22).

Based on the above, my recommendation is not to back out bug 1611660. I am open to investigating raptor or browsertime further (particularly if a difference that persists across many retriggers is demonstrated), but based on what the patch in bug 1611660 is doing, I think it's fairly unlikely for it to actually cause a user-noticeable performance regression (as opposed to some effect related to a test harness).

(Timothy, I would be curious to hear your opinion as well.)

(In reply to Botond Ballo [:botond] from comment #23)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

Timothy and/or Botond, I'm wondering if we should back out bug 1611660 from Beta in light of comment 18. That would at least buy us more time to investigate without shipping this regression to release. Note that next week is RC week.

A few thoughts on this:

  • Bug 1611660 fixes user-noticeable functional regressions (the STR in bug 1611660 itself, and bug 1604504).
  • In bug 1643604, we investigated a seemingly large tscrollx regression caused by bug 1611660, which turned out to be an artifact of the tscrollx test itself being flaky in the sense of small variations in startup timing resulting in large differences in test outcomes (bug 1645275).
  • The raptor and browsertime regressions discussed in this bug seem to be due to existing variation in test outcomes (i.e. individual test runs producing very different scores), and differences between "before" and "after" diminish with an increasing number of retriggers (comment 10, comment 22).

Based on the above, my recommendation is not to back out bug 1611660. I am open to investigating raptor or browsertime further (particularly if a difference that persists across many retriggers is demonstrated), but based on what the patch in bug 1611660 is doing, I think it's fairly unlikely for it to actually cause a user-noticeable performance regression (as opposed to some effect related to a test harness).

(Timothy, I would be curious to hear your opinion as well.)

I'm not suggesting a backout, just that we investigate and confirm that the loadtime improvement is real.

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #24)

(In reply to Botond Ballo [:botond] from comment #20)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Found the diffs in the subtest link of the tests:
Raptor:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

I think you meant to post a different link here, as this is also showing browsertime.

Compare view links are so hard to follow :((
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=1994339&newSignature=1994339&framework=10&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

From the graph you can see there is a -23.30% improvement for the loadtime metric

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #26)

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=1994339&newSignature=1994339&framework=10&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

From the graph you can see there is a -23.30% improvement for the loadtime metric

Thanks. I think I overlooked this one because the averaging over subtest results hides larger changes in individual subtests.

(I also admit I'm a bit confused about the relationship between raptor and browsertime. I thought "loadtime" pertained to browsertime only, and for raptor we were looking at the "replayed" metric.)

Anyways, looking at this subtest, I see that without bug 1611660, all individual results are in the 380-480 range, but with bug 1611660, in addition to most results being in that range, there are a few results that are much higher (1000-1100 range), resulting in this 23% overall change. I will try to investigate further.

(In reply to Botond Ballo [:botond] from comment #27)

Anyways, looking at this subtest, I see that without bug 1611660, all individual results are in the 380-480 range, but with bug 1611660, in addition to most results being in that range, there are a few results that are much higher (1000-1100 range), resulting in this 23% overall change. I will try to investigate further.

Interestingly, on Linux the same difference does not manifest: both "before" and "after" runs are mostly in the ~400 range, with a handful of results in the 1100-1200 range.

Here's something interesting: the "loadtime" score for an individual test job is itself derived from the scores of 25 individual replicates. The aggregation method is to take the median of the replicates.

Now, let's have a look at an example set of replicates before:

"replicates": [
            416, 
            1261, 
            1231, 
            382, 
            1171, 
            397, 
            400, 
            1189, 
            1220, 
            409, 
            402, 
            410, 
            385, 
            383, 
            408, 
            401, 
            406, 
            1137, 
            1238, 
            1256, 
            1146, 
            399, 
            397, 
            1243, 
            369
          ], 
          "shouldAlert": true, 
          "unit": "ms", 
          "value": 409

and an example set of replicates after:

"replicates": [
            431, 
            379, 
            1126, 
            1102, 
            1083, 
            1087, 
            404, 
            375, 
            377, 
            1184, 
            1125, 
            391, 
            392, 
            382, 
            1275, 
            389, 
            1165, 
            1085, 
            1102, 
            1144, 
            376, 
            1163, 
            1113, 
            398, 
            1134
          ], 
          "shouldAlert": true, 
          "unit": "ms", 
          "value": 1085

Before, 10 out of 25 replicates (a bit less than half) were around ~1100, with the rest around ~400, leading to a median around ~400.

After, 14 out of 25 replicates (a bit more than half) were around ~1100, with the rest around ~400, leading to a median around ~1100.

So, a fairly small change in the actual numbers (a bit less than half being around ~1100 rather than ~400, vs. a bit more than half) is being amplified into a large difference in median score (~400 vs. ~1100).

Dave, any thoughts on this? It seems like using median rather than mean to average over the replicates is working against us here, and amplifying an insignificant difference into a seemingly significant one.
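The amplification can be demonstrated with a small sketch. The numbers below idealize the replicate sets above into their two modes (~400 ms and ~1100 ms); the before/after split counts (10/25 vs. 14/25 slow replicates) are taken from the actual data.

```python
# A small shift in how many replicates land in the slow (~1100 ms) mode
# flips the median between the two modes, while the mean moves only
# modestly.
import statistics

fast, slow = 400, 1100

before = [fast] * 15 + [slow] * 10  # 10 of 25 slow -> median is a fast value
after = [fast] * 11 + [slow] * 14   # 14 of 25 slow -> median is a slow value

median_shift = statistics.median(after) - statistics.median(before)  # 700 ms
mean_shift = statistics.mean(after) - statistics.mean(before)        # 112 ms
```

So a change of 4 replicates out of 25 moves the median by 700 ms but the mean by only 112 ms, which is exactly the amplification described above.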

Flags: needinfo?(dave.hunt)

:bebe, this alert was incorrectly reassigned to this bug: https://treeherder.mozilla.org/perf.html#/alerts?id=26427

Flags: needinfo?(fstrugariu)

:bebe, this one as well: https://treeherder.mozilla.org/perf.html#/alerts?id=26131
EDIT: Nevermind for this one, it's correct.

(In reply to Botond Ballo [:botond] from comment #29)

Dave, any thoughts on this? It seems like using median rather than mean to average over the replicates is working against us here, and amplifying an insignificant difference into a seemingly significant one.

Excerpts from a conversation in Matrix:
<sparky> I agree with your thoughts on the median, but the median's a little more useful since we often get outliers and bimodality in the data, and it's quite rare that the situation you're in arises (a nearly 50-50 split). So with an average, our results are more susceptible to noise than with the median.
<botond> The issue remains that in this case it's amplifying the magnitude of a difference, such that a 26% difference is reported when in fact nothing actually changed by 26%
<botond> But perhaps that's just something to note when analyzing a regression in this test, not something we need to change
<sparky> Your patch did change the metric though and it's very clear from the time series but it's more the variance of the test that changed rather than the actual value
<sparky> The reason I say it's the variance that changed is because it's the only way for the median to have been pushed into the lower-value mode. With your patch, we are more likely to be faster than we are to be slower.
<botond> Ok, I see. That seems to be consistent with what I've seen in terms of individual replicate results.

This test's results came back to their previous range after the recent page re-recording, although they are still bimodal, which suggests the variance has shifted back. I don't believe there's anything further to investigate here.

:sparky You mentioned that you were going to review a few videos. Did you come to any conclusions?

Flags: needinfo?(dave.hunt) → needinfo?(gmierz2)

So far I have not been able to get a passing browsertime windows run in CI - I say we resolve this bug as wontfix given that there was no real regression.

Flags: needinfo?(gmierz2)

Marking as WONTFIX as there was a shift in variance, but it has since shifted back.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WONTFIX

removing ni?

Flags: needinfo?(fstrugariu)
Flags: needinfo?(tnikkel)