Closed Bug 1650398 Opened 1 year ago Closed 1 year ago

27.7 - 38.75% raptor-tp6-google-firefox-cold replayed (linux64-shippable, windows10-64-shippable, windows7-32-shippable) regression on push f0365e7eadd029a669a72bc6cb9c5171e735d3e2 (Wed June 3 2020)

Categories

(Core :: Panning and Zooming, defect, P2)

79 Branch
defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox78 --- unaffected
firefox79 --- wontfix
firefox80 --- wontfix

People

(Reporter: Bebe, Assigned: botond)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Perfherder has detected a raptor performance regression from push f0365e7eadd029a669a72bc6cb9c5171e735d3e2. As the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

39% raptor-tp6-google-firefox-cold replayed windows10-64-shippable opt 812.71 -> 497.75
37% raptor-tp6-google-firefox-cold replayed windows7-32-shippable opt 740.00 -> 466.17
28% raptor-tp6-google-firefox-cold replayed linux64-shippable opt 850.29 -> 614.75

Improvements:

64% raptor-tp6-google-firefox-cold loadtime windows7-32-shippable opt 1,101.83 -> 397.08
58% raptor-tp6-google-firefox-cold loadtime windows10-64-shippable opt 1,112.67 -> 464.50
31% raptor-tp6-google-firefox-cold not-replayed windows10-64-shippable opt 233.08 -> 160.33
23% raptor-tp6-google-firefox-cold windows7-32-shippable opt 385.83 -> 298.67
21% raptor-tp6-google-firefox-cold windows10-64-shippable opt 384.43 -> 304.04

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(botond)
Component: Performance → Panning and Zooming
Product: Testing → Core

Opening the page locally, it looks the same, but we need to make sure we are not losing any valuable requests.

Will have a look.

Assignee: nobody → botond
Flags: needinfo?(botond)

Looking over the mitm_netlocs artifacts for before/after the regression there appear to be some page loads that are not requesting certain URLs. For example, both include 25 requests for https://www.google.com/search?hl=en&q=barack+obama&cad=h (the test URL), but the image at https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT15FuICR-PSp1_Pv6dLvbYjyu8eUyhMynwX664i0T78vfZYo9YPL_KsVE is requested 16 times in 9c0a4461457638440c1383f3a42f51c3d18f7c69 but only 11 times in f0365e7eadd029a669a72bc6cb9c5171e735d3e2. As every load should be identical, even having 16 requests out of 25 is concerning.

Where are these requests recorded? I assume somewhere within the network code? The image cache can mean the network code never knows about an image request.

We have request recordings made using mitmproxy, and these requests are replayed using the same tool.
During playback we record the URL of each request and the status code we (the tool) reply with, and from that we generate the replayed, not-replayed, and confidence metrics.
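In rough terms, the metric derivation described above could be sketched as follows. This is a hypothetical illustration only; the function and field names are assumptions, not the actual Raptor/mitmproxy implementation.

```python
# Hypothetical sketch of how replay metrics could be derived from a
# playback log. Names are illustrative, not the real Raptor code.

def replay_metrics(recorded_urls, playback_log):
    """recorded_urls: set of URLs present in the mitmproxy recording.
    playback_log: list of (url, status_code) pairs observed during replay."""
    # A request counts as "replayed" if it matched the recording and
    # was answered successfully.
    replayed = sum(1 for url, status in playback_log
                   if url in recorded_urls and status == 200)
    not_replayed = len(playback_log) - replayed
    total = len(playback_log)
    # "confidence": fraction of requests served from the recording.
    confidence = replayed / total if total else 0.0
    return {"replayed": replayed,
            "not-replayed": not_replayed,
            "confidence": confidence}
```

Under this model, a page load that skips some image requests (e.g. because of caching) lowers the replayed count without any content actually being missing, which matches the behavior discussed in this bug.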

botond: Triaging as REO for 79 - can you set a priority and severity on this bug?

Flags: needinfo?(botond)

Marking as P2 until we understand better what's going on here.

However, my investigation so far is pointing to bug 1611660 not being the culprit here.

I ran two Try pushes:

I ran Raptor tests on Linux x64 shippable, and retriggered the google-c job 10 times in each Try push.

I then entered these two revisions into the Perfherder compare view and looked at the entry raptor-tp6-google-firefox-cold replayed opt, which shows the expected 10/10 runs.

The average delta in that entry is +1.00%. There is a fair amount of variation between individual runs, with a standard deviation of ~15%. So, while an individual run may show the patches that include bug 1611660 scoring 20-30% worse, others show them scoring better by a similar amount, and there seems to be no systematic effect.

Am I missing something / looking at the wrong thing here?

Severity: -- → S2
Flags: needinfo?(botond) → needinfo?(fstrugariu)
Priority: -- → P2

Your info is right, but looking at the jobs, you should have compared:
Windows 7 Shippable opt Raptor performance tests on Firefox test-windows7-32-shippable/opt-raptor-tp6-firefox-cold-google-e10s Rap(google-c)

I ran some try builds with your links; hopefully this will show the regression.

Flags: needinfo?(fstrugariu) → needinfo?(botond)

After retriggering those win7 jobs, it's pretty clear they are both essentially the same and nowhere near the "before" numbers mentioned in comment 0.

Flags: needinfo?(fstrugariu)

I did some retriggers on try and the numbers are constant.

Also added some jobs of the same tests on Browsertime to see if it's the framework:

After (bug 1611660 and browsertime on google):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=10d65dc1386d719e9890aff07ba7c958f49b6704

Before (before bug 1611660, i.e. bug 1641291, and browsertime on google):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0df348ca0f393cec382f7a4e1468c8090e58e77f

I'm concerned that this perf test isn't reliable.

We see number of requests almost get cut in half, and a corresponding drop in load time of also about a half. And there doesn't appear to be any content missing from the page. The missing requests were pointed out to be images. The image cache in Firefox could easily explain this if small timing changes affect when entries in the image cache get removed from the cache, and hence if we use a cached image or have to fetch the image again over the network. As someone who works on the image cache this seems entirely plausible to me.

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #11)

Also added some jobs of the same tests on Browsertime to see if it's the framework:

All of these jobs failed with "raptor-main Critical: test 'google' could not be found for firefox"

(In reply to Timothy Nikkel (:tnikkel) from comment #13)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #11)

Also added some jobs of the same tests on Browsertime to see if it's the framework:

All of these jobs failed with "raptor-main Critical: test 'google' could not be found for firefox"

It looks like the browsertime setup failed.
I did a backout of bug 1611660 and ran the browsertime tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59

Mozilla central try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a9236835fac366066055eacd3d1bb30a9667290c

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #15)

Compare view between the 2 builds:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&framework=13

Do you see any significant differences? Doesn't look like it to me, but I don't read perf compares very often.

Note: as the before and after are central and the backout, it proves your bug introduced the regression.

Timothy and/or Botond, I'm wondering if we should back out bug 1611660 from Beta in light of comment 18. That would at least buy us more time to investigate without shipping this regression to release. Note that next week is RC week.

Flags: needinfo?(tnikkel)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Found the diffs in the subtest link of the tests:
Raptor:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

I think you meant to post a different link here, as this is also showing browsertime.

Flags: needinfo?(fstrugariu)

(In reply to Botond Ballo [:botond] from comment #20)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Found the diffs in the subtest link of the tests:
Raptor:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

I think you meant to post a different link here, as this is also showing browsertime.

Anyways, starting from that link and going to "Back to all tests and platforms" and then selecting "raptor" from the dropdown, I'm not seeing any significant improvements, certainly nothing comparable to the 27-38% mentioned in the bug description.

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Browsertime:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

Keep in mind the alert is on windows10-64-shippable, the test name is google-search opt cold, and the subtest is loadtime.

After increasing the number of retriggers from 7 to 16, the improvement in this metric went from 17% to 3.6%.

The results of individual test runs are clustered around three values (~380, ~720, and ~1050) both before and after, and the difference in the average seems to be due to variations in which value individual tests hit; this difference gets smaller as the number of test runs increases.

So, I don't think the patch is having a systematic effect here.
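The effect described above — scores clustering around a few modes, with apparent before/after gaps shrinking as retriggers accumulate — can be illustrated with a small sketch. The numbers below are made up for illustration (loosely based on the ~380/~720/~1050 clusters mentioned above), not the actual test data.

```python
# Illustrative only: per-run scores cluster around three modes, so with
# few retriggers the particular mix of modes dominates the average, and
# two samples of the same underlying distribution can look very
# different. As more runs accumulate, the means converge.
import statistics

before = [380, 380, 1050, 720, 380, 1050, 380]   # 7 retriggers
after  = [1050, 380, 720, 1050, 380, 1050, 720]  # 7 retriggers

gap_small = abs(statistics.mean(before) - statistics.mean(after))

# Add more runs drawn from the same mixture of modes:
before_16 = before + [720, 380, 1050, 380, 720, 1050, 380, 720, 380]
after_16  = after + [380, 720, 380, 380, 1050, 380, 720, 380, 1050]

gap_large = abs(statistics.mean(before_16) - statistics.mean(after_16))
# gap_large is well under half of gap_small: the apparent "regression"
# shrinks as the number of retriggers grows.
```

This mirrors what was observed here: going from 7 to 16 retriggers took the reported improvement from 17% to 3.6%.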

Flags: needinfo?(botond)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

Timothy and/or Botond, I'm wondering if we should backout bug 1611660 from Beta in light of comment 18. That would at last buy us more time to investigate without shipping this regression to release. Note that next week is RC week.

A few thoughts on this:

  • Bug 1611660 fixes user-noticeable functional regressions (the STR in bug 1611660 itself, and bug 1604504).
  • In bug 1643604, we investigated a seemingly large tscrollx regression caused by bug 1611660, which turned out to be an artifact of the tscrollx test itself being flaky in the sense of small variations in startup timing resulting in large differences in test outcomes (bug 1645275).
  • The raptor and browsertime regressions discussed in this bug seem to be due to existing variation in test outcomes (i.e. individual test runs producing very different scores), and differences between "before" and "after" diminish with an increasing number of retriggers (comment 10, comment 22).

Based on the above, my recommendation is not to back out bug 1611660. I am open to investigating raptor or browsertime further (particularly if a difference that persists across many retriggers is demonstrated), but based on what the patch in bug 1611660 is doing, I think it's fairly unlikely for it to actually cause a user-noticeable performance regression (as opposed to some effect related to a test harness).

(Timothy, I would be curious to hear your opinion as well.)

(In reply to Botond Ballo [:botond] from comment #23)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

Timothy and/or Botond, I'm wondering if we should back out bug 1611660 from Beta in light of comment 18. That would at least buy us more time to investigate without shipping this regression to release. Note that next week is RC week.

A few thoughts on this:

  • Bug 1611660 fixes user-noticeable functional regressions (the STR in bug 1611660 itself, and bug 1604504).
  • In bug 1643604, we investigated a seemingly large tscrollx regression caused by bug 1611660, which turned out to be an artifact of the tscrollx test itself being flaky in the sense of small variations in startup timing resulting in large differences in test outcomes (bug 1645275).
  • The raptor and browsertime regressions discussed in this bug seem to be due to existing variation in test outcomes (i.e. individual test runs producing very different scores), and differences between "before" and "after" diminish with an increasing number of retriggers (comment 10, comment 22).

Based on the above, my recommendation is not to back out bug 1611660. I am open to investigating raptor or browsertime further (particularly if a difference that persists across many retriggers is demonstrated), but based on what the patch in bug 1611660 is doing, I think it's fairly unlikely for it to actually cause a user-noticeable performance regression (as opposed to some effect related to a test harness).

(Timothy, I would be curious to hear your opinion as well.)

I'm not suggesting a backout, just that we investigate and confirm that the loadtime improvement is real.

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #24)

(In reply to Botond Ballo [:botond] from comment #20)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #17)

Found the diffs in the subtest link of the tests:
Raptor:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=2602001&newSignature=2602001&framework=13&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

I think you meant to post a different link here, as this is also showing browsertime.

Compare view links are so hard to follow :((
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=1994339&newSignature=1994339&framework=10&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

From the graph you can see there is a -23.30% improvement for the loadtime metric

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #26)

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&newProject=try&newRevision=c4a952e8e5c1d2ac146e0876e0fc015f60a59f59&originalSignature=1994339&newSignature=1994339&framework=10&originalRevision=a9236835fac366066055eacd3d1bb30a9667290c

From the graph you can see there is a -23.30% improvement for the loadtime metric

Thanks. I think I overlooked this one because the averaging over subtest results hides larger changes in individual subtests.

(I also admit I'm a bit confused about the relationship between raptor and browsertime. I thought "loadtime" pertained to browsertime only, and for raptor we were looking at the "replayed" metric.)

Anyways, looking at this subtest, I see that without bug 1611660, all individual results are in the 380-480 range, but with bug 1611660, in addition to most results being in that range, there are a few results that are much higher (1000-1100 range), resulting in this 23% overall change. I will try to investigate further.

(In reply to Botond Ballo [:botond] from comment #27)

Anyways, looking at this subtest, I see that without bug 1611660, all individual results are in the 380-480 range, but with bug 1611660, in addition to most results being in that range, there are a few results that are much higher (1000-1100 range), resulting in this 23% overall change. I will try to investigate further.

Interestingly, on Linux the same difference does not manifest: both "before" and "after" runs are mostly in the ~400 range, with a handful of results in the 1100-1200 range.

Here's something interesting: the "loadtime" score for an individual test job is itself derived from the scores of 25 individual replicates. The aggregation method is to take the median of the replicates.

Now, let's have a look at an example set of replicates before:

"replicates": [
            416, 
            1261, 
            1231, 
            382, 
            1171, 
            397, 
            400, 
            1189, 
            1220, 
            409, 
            402, 
            410, 
            385, 
            383, 
            408, 
            401, 
            406, 
            1137, 
            1238, 
            1256, 
            1146, 
            399, 
            397, 
            1243, 
            369
          ], 
          "shouldAlert": true, 
          "unit": "ms", 
          "value": 409

and an example set of replicates after:

"replicates": [
            431, 
            379, 
            1126, 
            1102, 
            1083, 
            1087, 
            404, 
            375, 
            377, 
            1184, 
            1125, 
            391, 
            392, 
            382, 
            1275, 
            389, 
            1165, 
            1085, 
            1102, 
            1144, 
            376, 
            1163, 
            1113, 
            398, 
            1134
          ], 
          "shouldAlert": true, 
          "unit": "ms", 
          "value": 1085

Before, 10 out of 25 replicates (a bit less than half) were around ~1100, with the rest around ~400, leading to a median around ~400.

After, 14 out of 25 replicates (a bit more than half) were around ~1100, with the rest around ~400, leading to a median around ~1100.

So, a fairly small change in the actual numbers (a bit less than half being around ~1100 rather than ~400, vs. a bit more than half) is being amplified into a large difference in median score (~400 vs. ~1100).

Dave, any thoughts on this? It seems like using median rather than mean to average over the replicates is working against us here, and amplifying an insignificant difference into a seemingly significant one.
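The amplification can be demonstrated with a small sketch. The numbers below idealize the replicate sets above into their two modes (~400 ms and ~1100 ms); the before/after split counts (10/25 vs. 14/25 slow replicates) are taken from the actual data.

```python
# A small shift in how many replicates land in the slow (~1100 ms) mode
# flips the median between the two modes, while the mean moves only
# modestly.
import statistics

fast, slow = 400, 1100

before = [fast] * 15 + [slow] * 10  # 10 of 25 slow -> median is a fast value
after = [fast] * 11 + [slow] * 14   # 14 of 25 slow -> median is a slow value

median_shift = statistics.median(after) - statistics.median(before)  # 700 ms
mean_shift = statistics.mean(after) - statistics.mean(before)        # 112 ms
```

So a change of 4 replicates out of 25 moves the median by 700 ms but the mean by only 112 ms, which is exactly the amplification described above.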

Flags: needinfo?(dave.hunt)

:bebe, this alert was incorrectly reassigned to this bug: https://treeherder.mozilla.org/perf.html#/alerts?id=26427

Flags: needinfo?(fstrugariu)

:bebe, this one as well: https://treeherder.mozilla.org/perf.html#/alerts?id=26131
EDIT: Nevermind for this one, it's correct.

(In reply to Botond Ballo [:botond] from comment #29)

Dave, any thoughts on this? It seems like using median rather than mean to average over the replicates is working against us here, and amplifying an insignificant difference into a seemingly significant one.

Excerpts from a conversation in Matrix:
<sparky> I agree with your thoughts on the median, but the median's a little more useful since we often get outliers and bimodality in the data, and it's quite rare that the situation you're in arises (a nearly 50-50 split). So with an average, our results are more susceptible to noise than with the median.
<botond> The issue remains that in this case it's amplifying the magnitude of a difference, such that a 26% difference is reported when in fact nothing actually changed by 26%
<botond> But perhaps that's just something to note when analyzing a regression in this test, not something we need to change
<sparky> Your patch did change the metric though and it's very clear from the time series but it's more the variance of the test that changed rather than the actual value
<sparky> The reason I say it's the variance that changed is because it's the only way for the median to have been pushed into the lower-value mode. With your patch, we are more likely to be faster than we are to be slower.
<botond> Ok, I see. That seems to be consistent with what I've seen in terms of individual replicate results.

This test's results came back to their previous range after the recent page re-recording, although they are still bimodal, which suggests the variance has shifted back. I don't believe there's anything further to investigate here.

:sparky You mentioned that you were going to review a few videos. Did you come to any conclusions?

Flags: needinfo?(dave.hunt) → needinfo?(gmierz2)

So far I have not been able to get a passing browsertime windows run in CI - I say we resolve this bug as wontfix given that there was no real regression.

Flags: needinfo?(gmierz2)

Marking as WONTFIX as there was a shift in variance, but it has since shifted back.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WONTFIX

removing ni?

Flags: needinfo?(fstrugariu)
Flags: needinfo?(tnikkel)