Closed Bug 1627621 Opened 4 years ago Closed 4 years ago

2.2 - 3.82% raptor-tp6- loadtime (windows10-64-shippable, windows10-64-shippable-qr, windows7-32-shippable) regression on push 787a82df49667019d3353c08d2c8a3f707d502cf (Sat April 4 2020)

Categories

(Testing :: Raptor, defect, P2)

Version 3
defect

Tracking

(firefox76 wontfix, firefox77 wontfix, firefox78 wontfix)

RESOLVED WONTFIX

People

(Reporter: Bebe, Assigned: sparky)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=787a82df49667019d3353c08d2c8a3f707d502cf

You are the author of one of the patches included in that push, so we need your help to address this regression.

Regressions:

4% raptor-tp6-google-mail-firefox-cold loadtime windows10-64-shippable-qr opt 627.04 -> 651.00
4% raptor-tp6-wikipedia-firefox-cold windows10-64-shippable-qr opt 1,356.71 -> 1,404.55
3% raptor-tp6-wikipedia-firefox-cold windows10-64-shippable opt 1,305.28 -> 1,350.84
3% raptor-tp6-wikipedia-firefox-cold fcp windows10-64-shippable opt 1,279.54 -> 1,322.08
3% raptor-tp6-outlook-firefox-cold fcp windows10-64-shippable opt 212.33 -> 219.33
3% raptor-tp6-docs-firefox-cold fcp windows10-64-shippable-qr opt 999.25 -> 1,030.42
3% raptor-tp6-wikipedia-firefox-cold loadtime windows10-64-shippable opt 1,424.42 -> 1,466.92
3% raptor-tp6-docs-firefox-cold fcp windows7-32-shippable opt 1,026.58 -> 1,055.75
3% raptor-tp6-docs-firefox-cold windows7-32-shippable opt 1,088.48 -> 1,118.37
3% raptor-tp6-google-mail-firefox-cold loadtime windows7-32-shippable opt 609.17 -> 625.00
3% raptor-tp6-docs-firefox-cold windows10-64-shippable-qr opt 1,079.91 -> 1,107.85
2% raptor-tp6-slides-firefox-cold loadtime windows7-32-shippable opt 1,948.58 -> 1,991.50

Improvements:

5% raptor-tp6-outlook-firefox-cold-mozproxy-replay windows7-32-shippable opt 87.75 -> 92.25

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=25544

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/TestEngineering/Performance/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/TestEngineering/Performance/Talos/RegressionBugsHandling

Flags: needinfo?(hskupin)
Component: Performance → Raptor
Priority: -- → P2
Flags: needinfo?(gmierz2)

I assume that due to these changes we now get a more realistic picture of the different tests. Before, the tests were overlapping with each other, specifically when a new tab was opened for each individual page cycle. Now we clearly wait. Greg, did you rather expect a regression or an improvement when comparing with the results we get from Browsertime jobs?

Flags: needinfo?(hskupin)

If this patch did cause the regression/improvements, then the metrics we have now should become the new baseline.

Between Browsertime and Raptor, the differences in the mean vary depending on the page (the error/difference is not systematic), but on average the geometric mean of the metrics decreases in Browsertime. I didn't test any of these pages, so that's as much as I can say.

I find it interesting that your code caused such a significant performance change. What's funny is that the changes are not that drastic; we do nearly the same thing as before in the runner. I wonder if we should bring this up with some product teams (I don't know which one).

Flags: needinfo?(gmierz2) → needinfo?(hskupin)

As discussed in the perftest triage meeting we are going with the following steps:

  1. Check some affected tp6 page load tests with the gecko profiler enabled. Maybe there is a CC/GC which runs during page load and negatively affects our page load. If that is the case we could force a GC, or at least add another delay after tab opening and before starting the test.

  2. Run try builds for each individual revision of my patch and check which of those triggered the regressions.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Flags: needinfo?(hskupin)
Priority: P2 → P1

Given that only the cold tests show this problem, it cannot be related to closing/opening tabs and possibly running garbage collections, because in-between each page cycle the browser gets restarted. Also very suspicious is that it only happens for Windows shippable opt builds, and not for plain opt builds or other platforms.

I think we have to push some try builds to check which revision caused this change. I will do that tomorrow.

For the try builds to be triggered for each of the revisions we would need the following Raptor jobs to be run on Windows shippable opt:

tp6-c-5 (docs)
tp6-c-24 (wikipedia)
tp6-c-25 (google-mail)

(In reply to Alexandru Ionescu :alexandrui (needinfo me) from bug 1625892 comment #19)

This is the revision that introduced the regression: [raptor] Refactor webextension code for opening and closing tabs.

So the most likely cause is the removal of the delay when closing/opening the tab. As such we run through the test a lot faster, which might give the browser less time to recover and garbage collect memory from the former page cycle.

I will push a try build shortly with the delay being re-added.

When checking the old code I saw something curious:
https://hg.mozilla.org/integration/autoland/file/d2a334ceab53916e967cbe8e2dbbbd0733956217/testing/raptor/webext/raptor/runner.js#l444

It didn't matter whether we were closing/opening a new tab per page cycle; we always paused the execution of the next page cycle by 1000ms. This is clearly a bug, which got fixed with my commit, and I don't think we should re-introduce it.

Also, we still have the pageCycleDelay of 1000ms, which is executed BEFORE we close/open the tab. I feel that this is a problem, and that it has to be executed right before we start the cycle. That way we give Firefox time to do e.g. garbage collection for the old tab.

With both together we were waiting 2000ms per page cycle, and now it's only 1000ms.

I will move up the code for closing/opening the tab, so that we wait the 1000ms afterward. Here is the try build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=837929ab284c46a9c28ca13fdbdc75ab887fc41a
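
For illustration, here is a minimal sketch of the reordering described above. This is not the actual runner.js code; the helper names (closeTab, openTab, startPageCycle, sleep) are assumptions made for the example, and only pageCycleDelay and its 1000ms value come from the existing runner:

    // Sketch only: close/open the tab first, then wait pageCycleDelay right
    // before the measurement starts, so Firefox has time to e.g. garbage
    // collect the old tab.
    const pageCycleDelay = 1000; // ms, value taken from the existing runner

    function sleep(ms) {
      return new Promise(resolve => setTimeout(resolve, ms));
    }

    async function runNextPageCycle() {
      await closeTab();            // hypothetical helper: drop the previous tab
      await openTab();             // hypothetical helper: open a fresh tab
      await sleep(pageCycleDelay); // settle time now sits just before the cycle
      await startPageCycle();      // hypothetical helper: begin the measurement
    }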

The performance results of this try build are even worse than the current mozilla-central jobs:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=837929ab284c46a9c28ca13fdbdc75ab887fc41a&framework=10&selectedTimeRange=172800

There are large regressions across different page load and benchmark tests.

Greg, I tried to understand when exactly the pageload.js background script starts its timers again for the next page cycle. Maybe I'm missing something, but I cannot spot it right now, and I don't have the time to dive deep into it today. Any hint from your side would be great.

I feel that runner.js should signal the pageload.js script when a new measurement has to start, or even better, the timing should be based on the moment when the page load request is triggered via updateTab().
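
As a rough sketch of the second idea (not the actual Raptor code; recordCycleStart is a hypothetical helper and the updateTab signature shown here is only illustrative), the measurement start could be anchored to the moment the load is requested:

    // Sketch only: base the measurement start on the moment the page load is
    // actually requested, rather than on when pageload.js happens to re-arm.
    async function triggerPageLoad(tabId, url) {
      const cycleStart = Date.now(); // wall-clock ms, comparable across scripts
      recordCycleStart(cycleStart);  // hypothetical: make it available to pageload.js
      await updateTab(tabId, url);   // existing helper that navigates the tab
    }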

Flags: needinfo?(gmierz2)

:whimboo can you retry your last test with more trials (at least 5-10)? We can't say anything about the results with only one trial - those issues might be outliers.

Flags: needinfo?(gmierz2)
Flags: needinfo?(hskupin)

I suspect the Raptor changes in bug 1625892 also caused a 200% (!!) tp6 warm page load regression (and increased variability) on "raptor-tp6-outlook-firefox":

https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=mozilla-central,1983556,1,10&series=mozilla-central,2193876,1,10&series=mozilla-central,2237516,1,10&timerange=5184000

status-firefox77=affected because these regressions are still present in 77 Nightly.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #8)

(In reply to Alexandru Ionescu :alexandrui (needinfo me) from bug 1625892 comment #19)

This is the revision that introduced the regression: [raptor] Refactor webextension code for opening and closing tabs.

So the most likely cause is the removal of the delay when closing/opening the tab. As such we run through the test a lot faster, which might give the browser less time to recover and garbage collect memory from the former page cycle.

The delay might not have been the issue here. Over the weekend we got a great response from Matt Woodrow via bug 1602318:

(In reply to Matt Woodrow (:mattwoodrow) from bug 1602318 comment #44)

I took a local profile of the twitch regression to try to understand it: https://bit.ly/3d1H4uI (sorry, xul symbolication didn't seem to work).

Looking at that it appears that the main document load of twitch returns, but we spend ~33ms waiting for the main-thread to become free, so that we can send it to the content process (which is roughly similar to the magnitude of the regression).

The main thread is busy in NeckoParent::RecvGetExtensionStream, and then we run a layout/paint cycle for the parent process content.

Looking at other profiles, it appears that we have this same work on all pages, but twitch has an http response time of ~50ms, which is really quick and leads to it being ready before the other work completes. My local run of instagram had a response time of 180ms, so the main-thread was idle by the time we needed it.

I think this explains the regression, and why it comes and goes, since it wouldn't take much for the scheduling to work out better.

The extension work appears to be handling calls from raptor's 'postToControlServer' (I suspect the one inside updateTab, followed by the "status" message in nextCycle, but hard to be sure).

Henrik, is this something you've looked at?

I suspect reducing (or delaying) the number of postToControlServer calls during the time we're measuring might lead to better and more consistent results.

Given that my patch actually added a couple more calls to postToControlServer, this might have caused the problem.

I will go ahead and push a patch with all of those changed to normal Raptor log calls.

Flags: needinfo?(hskupin)

Here is the try build which has all the status messages to the control server replaced by simple log calls:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e31f4e625ce002ba3d366b0829e35268e547775a
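
Roughly, the replacement looks like the following sketch (not the actual patch); reportStatus is a hypothetical wrapper, and the postToControlServer signature shown here is only illustrative:

    // Sketch only: keep status reporting out of the measurement window by
    // logging locally instead of posting to the control server.
    function reportStatus(message, measuring = false) {
      if (measuring) {
        console.log(`[raptor] ${message}`); // cheap, stays in-process
      } else {
        postToControlServer("status", message); // illustrative signature
      }
    }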

Matt, could you please try with https://hg.mozilla.org/try/rev/95a75121395bed6bcb5168128aea0b6f8897bfcc and check if that makes a difference for you?

Flags: needinfo?(matt.woodrow)

I chatted with Matt via Riot and he mentioned that there are no differences for twitch with the above try build:

https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=try,2144558,1,10&series=try,2144560,1,10&timerange=5184000

As such I updated my patch to stop logging anything to the console at all. Let's see if that helps:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c44eb9f727e72d3ca5e21ac918794b131bf23362

(In reply to Chris Peterson [:cpeterson] from comment #12)

I suspect the Raptor changes in bug 1625892 also caused a 200% (!!) tp6 warm page load regression (and increased variability) on "raptor-tp6-outlook-firefox":

It's 100%, which means a jump from 200 to 400, and at the same time the noise was reduced a lot. But as you can see there is another recent jump to 600, caused by bug 1624813 upgrading mitmproxy to 5.0.1. Interestingly, the alert as created by Bebe shows improvements for outlook? How can this be? Maybe that's better discussed in a new alert bug. Can you please file one, Bebe?

Flags: needinfo?(fstrugariu)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #17)

(In reply to Chris Peterson [:cpeterson] from comment #12)

I suspect the Raptor changes in bug 1625892 also caused a 200% (!!) tp6 warm page load regression (and increased variability) on "raptor-tp6-outlook-firefox":

It's 100%, which means a jump from 200 to 400, and at the same time the noise was reduced a lot. But as you can see there is another recent jump to 600, caused by bug 1624813 upgrading mitmproxy to 5.0.1. Interestingly, the alert as created by Bebe shows improvements for outlook? How can this be? Maybe that's better discussed in a new alert bug. Can you please file one, Bebe?

I also wonder why this huge regression for raptor-tp6-outlook-firefox isn't reflected in the alert as referenced in the first comment.

Opened Bug 1636048 (Investigate performance changes after updating to mitm 5.0.1) to look into the performance changes after the mitmproxy 5.0.1 update.

Flags: needinfo?(fstrugariu)
Flags: needinfo?(matt.woodrow)

I should have time to continue here by next week.

Assignee: hskupin → gmierz2

We'll accept these changes as the new baseline given how long they've been in place now.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Priority: P1 → P2
Resolution: --- → WONTFIX