Closed Bug 1516540 Opened 4 years ago Closed 4 years ago
.95 - 41.01% tp6-amazon / tp6-facebook / tp6-imdb / tp6-microsoft / tp6-wikia / tp6-yandex regression on push 5d8e428324c662d96fd47677733880ae546266d4 (Sun Dec 23 2018)
Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=5d8e428324c662d96fd47677733880ae546266d4

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  41%  raptor-tp6-microsoft-firefox linux64-qr opt  1,108.81 -> 1,563.52
  40%  raptor-tp6-microsoft-firefox linux64 opt  1,104.64 -> 1,543.98
  36%  raptor-tp6-yandex-firefox windows10-64 pgo  111.72 -> 152.49
  34%  raptor-tp6-yandex-firefox windows7-32 pgo  113.03 -> 150.96
  32%  raptor-tp6-yandex-firefox windows10-64 opt  122.47 -> 161.52
  31%  raptor-tp6-yandex-firefox osx-10-10 opt  236.12 -> 308.80
  29%  raptor-tp6-yandex-firefox windows7-32 opt  122.45 -> 158.46
  28%  raptor-tp6-imdb-firefox windows10-64-qr opt  224.70 -> 288.58
  27%  raptor-tp6-yandex-firefox windows10-64-qr opt  111.28 -> 141.64
  22%  raptor-tp6-microsoft-firefox osx-10-10 opt  1,524.60 -> 1,866.10
  21%  raptor-tp6-yandex-firefox linux64 pgo  118.36 -> 143.14
  20%  raptor-tp6-wikia-firefox windows10-64 opt  153.04 -> 184.38
  20%  raptor-tp6-yandex-firefox linux64-qr opt  121.08 -> 145.79
  20%  raptor-tp6-yandex-firefox linux64 opt  127.23 -> 152.69
  20%  raptor-tp6-wikia-firefox windows10-64-qr opt  149.24 -> 179.00
  18%  raptor-tp6-microsoft-firefox linux64 pgo  1,230.84 -> 1,457.39
  18%  raptor-tp6-wikia-firefox windows7-32 opt  148.49 -> 175.06
  15%  raptor-tp6-imdb-firefox osx-10-10 opt  285.00 -> 327.12
  13%  raptor-tp6-wikia-firefox linux64 pgo  143.61 -> 162.94
  12%  raptor-tp6-wikia-firefox linux64 opt  163.41 -> 183.82
   8%  raptor-tp6-amazon-firefox osx-10-10 opt  1,158.90 -> 1,257.07
   8%  raptor-tp6-amazon-firefox linux64-qr opt  476.83 -> 517.09
   8%  raptor-tp6-amazon-firefox windows10-64 opt  442.58 -> 478.07
   8%  raptor-tp6-amazon-firefox windows10-64-qr opt  444.13 -> 478.96
   7%  raptor-tp6-wikia-firefox linux64-qr opt  180.75 -> 193.73
   7%  raptor-tp6-amazon-firefox linux64 opt  447.90 -> 477.35
   6%  raptor-tp6-amazon-firefox windows7-32 pgo  390.35 -> 414.23
   6%  raptor-tp6-amazon-firefox linux64 pgo  403.62 -> 427.72
   5%  raptor-tp6-amazon-firefox windows10-64 pgo  405.06 -> 424.82
   4%  raptor-tp6-facebook-firefox windows10-64 pgo  365.45 -> 379.12
   3%  raptor-tp6-facebook-firefox linux64 opt  380.79 -> 393.78
   3%  raptor-tp6-facebook-firefox windows10-64 opt  385.94 -> 399.08
   3%  raptor-tp6-facebook-firefox linux64 pgo  354.86 -> 366.01
   3%  raptor-tp6-facebook-firefox windows7-32 pgo  360.03 -> 371.30
   3%  raptor-tp6-facebook-firefox linux64 pgo  355.08 -> 365.55

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=18472

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
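For readers checking the numbers: the percentages in the regression list appear to be plain relative changes of the "after" value against the "before" value. The sketch below assumes that formula (the alert text does not state it) and uses the hypothetical helper name `percent_change`; the two input rows are copied from the list.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent.

    Assumption: this is how the alert percentages are derived; for these
    page-load tests a positive result means the page got slower.
    """
    return (after - before) / before * 100.0


# Two rows copied from the regression list in this bug.
largest = percent_change(1108.81, 1563.52)   # raptor-tp6-microsoft-firefox linux64-qr opt
smallest = percent_change(355.08, 365.55)    # raptor-tp6-facebook-firefox linux64 pgo

print(f"largest:  {largest:.2f}%")
print(f"smallest: {smallest:.2f}%")
```

Under that assumption the two rows work out to 41.01% and 2.95%, which is consistent with the upper bound quoted in the bug summary.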
I am not sure who to needinfo here, but when bug 1514853 landed we saw some big perf improvements (I will comment on the bug shortly), and when it was backed out, we see these improvements reverted. If pageload is so important in 2019 (yes it is 2018) for Mozilla, we should consider leaving in the code from bug 1514853. I am not sure who to needinfo, but if there is more context we can provide, it would be nice to see these improvements come back.
Thanks for filing this bug, Joel, this is really helpful!

First and foremost, is it possible to get some links to the Gecko Profiles for the before/after results? I do have some suspicions on where this regression may be coming from based on some of the previous Talos regressions we've had in this code (my primary suspect is bug 1514340) but I'd like to know if that's a correct guess and/or whether there is anything else to know about...

Also, did we see the same results on beta, out of curiosity?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #1)
> I am not sure who to needinfo here, but when bug 1514853 landed we saw some
> big perf improvements (I will comment on the bug shortly), and when it was
> backed out, we see these improvements reverted.

FWIW that patch was temporarily backed out, but it got relanded, so these improvements should have been added back now.

> If pageload is so important in 2019 (yes it is 2018) for Mozilla, we should
> consider leaving in the code from bug 1514853.

So for 65, the majority of users will indeed run the code that we landed in bug 1514853, so this regression has yet to be shipped. So no need to be alarmed just yet.

But I disagree with your conclusion here. My alternate proposal is that we should get to the bottom of where the regression comes from, and fix it, just like any other regression. :-)

I think the reason we didn't know about this yet is the classic story with Talos: as we do incremental development in the form of small patches, we may be regressing Talos numbers by a small amount over time, and then nobody knows what the total sum would amount to. The bug I linked to above was an example of that, where we first saw it causing multi-second long pauses on Bugzilla (bug 1510275), which was fixed with a band-aid, and I intend to do the proper fix hopefully for 66. Hopefully that's also the reason behind this other regression, but there may be other stuff that has crept in as well.

But let's not jump to solutions before we have a diagnosis on the patient. We won't be able to stitch back the limbs if we first amputate and then diagnose. :-)

> I am not sure who to needinfo, but if there is more context we can provide,
> it would be nice to see these improvements come back.

Talos profiles would be of extreme value here. (BTW, are those something that can be added to the template used for filing these bugs? I find myself almost always requesting them when getting CCed on one of these bugs...)

Thanks again!
Flags: needinfo?(ehsan) → needinfo?(jmaher)
I did see a small dip and reset on mozilla-beta, not 100% sure if this is the same, but it is similar enough:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=autoland,1782727,1,10&series=mozilla-inbound,1772174,1,10&series=mozilla-beta,1797348,1,10

I also don't see this fixed in recent days; the improvement on the 20th and the regression/backout on the 23rd are the main changes.

Here are some Talos profiles (apologies for not getting them earlier):

* treeherder links: https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=a40b74d3e7c07b94330201c8d64eb54afd7216ca&tochange=5d8e428324c662d96fd47677733880ae546266d4&searchStr=raptor%2Ctp6%2C-p%29&group_state=expanded
* linux64 amazon:
** before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FJSDFn39IRpKH0T9g9Hgnbw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-amazon-firefox.zip/calltree/?file=profile_raptor-tp6-amazon-firefox%2Fraptor-tp6-amazon-firefox_pagecycle_2.profile&globalTrackOrder=&localTrackOrderByPid=&thread&v=3
** after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FdQDQyRlCQq27dt73blws-g%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-amazon-firefox.zip
* linux64 facebook:
** before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FJSDFn39IRpKH0T9g9Hgnbw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-google-firefox.zip
** after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FdQDQyRlCQq27dt73blws-g%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-facebook-firefox.zip
* linux64 microsoft:
** before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FIv908CMbT-GxG4rtk8QL4g%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-microsoft-firefox.zip
** after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FI1sFo2KfRtyLU2YSA8eX6w%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-microsoft-firefox.zip

Let me know if you need different profiles; these are easy to generate on the tree.
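The profile links in this comment all follow the same visible pattern: the Taskcluster artifact URL is percent-encoded and appended to `https://perf-html.io/from-url/`. A minimal sketch of building such a link by hand, assuming that pattern is all there is to it (the helper name `profiler_link` is hypothetical, and the artifact URL is the decoded form of the "after" Amazon link):

```python
from urllib.parse import quote

# Prefix observed in the links posted in this bug.
PROFILER_FROM_URL = "https://perf-html.io/from-url/"


def profiler_link(artifact_url: str) -> str:
    """Percent-encode the whole artifact URL (including ':' and '/')
    and append it to the from-url prefix."""
    return PROFILER_FROM_URL + quote(artifact_url, safe="")


# Decoded form of the linux64 amazon "after" profile artifact.
artifact = (
    "https://queue.taskcluster.net/v1/task/dQDQyRlCQq27dt73blws-g/runs/0/"
    "artifacts/public/test_info/profile_raptor-tp6-amazon-firefox.zip"
)
link = profiler_link(artifact)
print(link)
```

Passing `safe=""` matters here: by default `quote` leaves `/` unescaped, which would not reproduce the `%2F` sequences seen in the posted links.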
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #3)
> I did see a small dip and reset on mozilla-beta, not 100% sure if this is
> the same, but it is similar enough:
> https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=autoland,1782727,1,10&series=mozilla-inbound,1772174,1,10&series=mozilla-beta,1797348,1,10

Yup, it's the same. It's less apparent probably because we have far fewer runs there... (BTW we will see it once again once the early beta flag is turned off...)

> I also don't see this fixed in recent days, the improvement on the 20th and
> regression/backout on the 23rd are the main changes.

Hmm, not sure what you mean exactly. But if you mean you don't see it again on central, that's expected: the first landing of bug 1514853 on central changed the value of the network.cookie.cookieBehavior pref on central by mistake, the backout fixed that mistake (also causing this regression again), and the second landing didn't change the pref this time, so there should have been no perf change as a result, which is what happened. All expected.

> let me know if you need different profiles, these are easy to generate on
> the tree.

Thanks, these are great, I'm looking at them now. BTW can I ask how you generate these links? Is it possible to generate them from try pushes as well?
There is an option in Treeherder: when you click on a job, an information panel pops up at the bottom. The panel is divided into left and right parts; the header on the left side has a '...' button, and clicking it opens a menu that includes 'Create Gecko Profile'. This retriggers the job (talos/raptor) with the proper flags to run it as a Gecko profile variant and adjusts the Treeherder flags.

The only caveat is that it does this for the requested job and for the equivalent job on the previous build, so on try it would also add a job to other people's pushes. That is not too worrisome, but if we want to do it on try more often, we can adjust the script to be branch aware. The code lives here: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/actions/gecko_profile.py#40

The second (and I assume last) caveat is that this might be restricted to sheriffs only, as some features are sheriff-specific; in this case I think you just need to be logged in: https://github.com/mozilla/treeherder/blob/42a827cc6ee92b669496131c2bab99353444bfa5/ui/job-view/details/summary/ActionBar.jsx#L66
Great, thanks a lot! I will update this bug when I know more...
Component: General → Tracking Protection
Product: Testing → Firefox
Version: Version 3 → unspecified
Here is a Talos push with the patches for bug 1517014, bug 1517057, bug 1517062, and bug 1517136, comparing network.cookie.cookieBehavior=0 vs 4.

Before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=35cc5b52a0f57ce89386d7f4c231db7e0f83933c
After: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4be132299cf96193be86263234a3208dae0c9cbb
Comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=35cc5b52a0f5&newProject=try&newRevision=4be132299cf9&framework=1

For unknown reasons, only the tests for amazon, facebook, google and youtube are running there (where are the rest?!). Also, no Windows runs due to bug 1517333.

On Linux, the regression for Facebook has now been fixed. The regression for Amazon is still there...

On Mac, the regression for Amazon is now 12.3%, which is weird since that's more than the reported amount in comment 0. This makes me think that I may be measuring something different here! No regression on Facebook though.

I think that there is something wrong with the try syntax/results. I asked to run all tp6 tests, but I only got "1_thread e10s stylo". Joel, what do you think? Am I doing something wrong?
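For anyone wanting to reproduce the 0 vs 4 comparison in a local profile, the pref can be set through a `user.js` line. The sketch below is illustrative only: the thread itself only mentions the values 0 and 4, and the enum member names are an assumption modelled on the `BEHAVIOR_*` constants of `nsICookieService` as I recall them; `user_js_line` is a hypothetical helper.

```python
from enum import IntEnum


class CookieBehavior(IntEnum):
    """Values of network.cookie.cookieBehavior (names are assumed, see above)."""
    ACCEPT = 0           # accept all cookies (the "before" configuration)
    REJECT_FOREIGN = 1   # reject third-party cookies
    REJECT = 2           # reject all cookies
    LIMIT_FOREIGN = 3    # third-party cookies only from previously visited sites
    REJECT_TRACKER = 4   # reject cookies from known trackers (the "after" configuration)


PREF = "network.cookie.cookieBehavior"


def user_js_line(behavior: CookieBehavior) -> str:
    """Format a user.js line that pins the pref to the given value."""
    return f'user_pref("{PREF}", {int(behavior)});'


print(user_js_line(CookieBehavior.ACCEPT))
print(user_js_line(CookieBehavior.REJECT_TRACKER))
```

Running the same page-load test once with each line in the profile's `user.js` is one way to approximate the comparison locally, though the try pushes above remain the authoritative measurement.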
I did "add new jobs" for the two try pushes and added 'raptor tp6' to both linux64 opt and osx opt. In the compare view, I changed the framework to be raptor instead of talos and you can see the comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=35cc5b52a0f5&newProject=try&newRevision=4be132299cf9&framework=10

I agree this is confusing, and we need to make it easier to schedule and view results across the various frameworks.
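The only difference between the two compare links in this bug is the `framework` query parameter: the Talos view used `framework=1` and the Raptor view uses `framework=10`. A small sketch of assembling such a link, with the parameter names taken from the URLs above and the hypothetical helper name `compare_url`:

```python
from urllib.parse import urlencode

COMPARE_BASE = "https://treeherder.mozilla.org/perf.html#/compare?"

# Framework ids as they appear in the compare links posted in this bug.
FRAMEWORK_TALOS = 1
FRAMEWORK_RAPTOR = 10


def compare_url(original_rev: str, new_rev: str, framework: int,
                project: str = "try") -> str:
    """Build a perf compare link for two revisions on the same project."""
    params = {
        "originalProject": project,
        "originalRevision": original_rev,
        "newProject": project,
        "newRevision": new_rev,
        "framework": framework,
    }
    return COMPARE_BASE + urlencode(params)


raptor_view = compare_url("35cc5b52a0f5", "4be132299cf9", FRAMEWORK_RAPTOR)
talos_view = compare_url("35cc5b52a0f5", "4be132299cf9", FRAMEWORK_TALOS)
print(raptor_view)
```

Switching the constant is equivalent to changing the framework dropdown in the compare view, which is why the first comparison showed only a handful of tests.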
Thanks a lot, Joel! So please ignore comment 8 completely!

The regressions remaining are as follows, based on the results that are available so far:

* On Amazon the regression is still present on both Linux and Mac.
* There is a big (33% on OSX) regression on ebay that wasn't originally reported (I think because the job is tier 2?).
* There are still small regressions present on Facebook.
* Microsoft has improved about 5% on Linux, but the regressions are still present and really big on both Linux and OSX.
* The wikia regression is also still present.
* Yandex has improved by about 3% on OSX, but the regressions are still present on both Linux and OSX and are really big.

The summary is that the optimizations done so far have yet to make a large dent. My suspicion is that the majority of the cost here is coming from the overhead related to bug 1514340 every time that a STATE_COOKIES_LOADED notification is dispatched. I do hope that we don't end up having to go as far as fixing bug 1510569 here...
Component: Tracking Protection → Privacy: Anti-Tracking
Product: Firefox → Core
Target Milestone: --- → mozilla67
Whiteboard: [privacy65] → [privacy65][anti-tracking]
Assignee: nobody → ehsan
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: mozilla67 → mozilla68