Closed Bug 1516540 Opened 6 years ago Closed 6 years ago

2.95 - 41.01% tp6-amazon / tp6-facebook / tp6-imdb / tp6-microsoft / tp6-wikia / tp6-yandex regression on push 5d8e428324c662d96fd47677733880ae546266d4 (Sun Dec 23 2018)

Categories

(Core :: Privacy: Anti-Tracking, defect, P1)

defect

Tracking


VERIFIED FIXED
mozilla68
Tracking Status
firefox-esr60 --- unaffected
firefox67 --- wontfix
firefox68 --- fixed

People

(Reporter: jmaher, Assigned: ehsan.akhgari)

References

Details

(Keywords: perf, regression, Whiteboard: [privacy65][anti-tracking])

Raptor has detected a Firefox performance regression from push:
https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=5d8e428324c662d96fd47677733880ae546266d4

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 41% raptor-tp6-microsoft-firefox linux64-qr opt     1,108.81 -> 1,563.52
 40% raptor-tp6-microsoft-firefox linux64 opt        1,104.64 -> 1,543.98
 36% raptor-tp6-yandex-firefox windows10-64 pgo        111.72 -> 152.49
 34% raptor-tp6-yandex-firefox windows7-32 pgo         113.03 -> 150.96
 32% raptor-tp6-yandex-firefox windows10-64 opt        122.47 -> 161.52
 31% raptor-tp6-yandex-firefox osx-10-10 opt           236.12 -> 308.80
 29% raptor-tp6-yandex-firefox windows7-32 opt         122.45 -> 158.46
 28% raptor-tp6-imdb-firefox windows10-64-qr opt       224.70 -> 288.58
 27% raptor-tp6-yandex-firefox windows10-64-qr opt     111.28 -> 141.64
 22% raptor-tp6-microsoft-firefox osx-10-10 opt      1,524.60 -> 1,866.10
 21% raptor-tp6-yandex-firefox linux64 pgo             118.36 -> 143.14
 20% raptor-tp6-wikia-firefox windows10-64 opt         153.04 -> 184.38
 20% raptor-tp6-yandex-firefox linux64-qr opt          121.08 -> 145.79
 20% raptor-tp6-yandex-firefox linux64 opt             127.23 -> 152.69
 20% raptor-tp6-wikia-firefox windows10-64-qr opt      149.24 -> 179.00
 18% raptor-tp6-microsoft-firefox linux64 pgo        1,230.84 -> 1,457.39
 18% raptor-tp6-wikia-firefox windows7-32 opt          148.49 -> 175.06
 15% raptor-tp6-imdb-firefox osx-10-10 opt             285.00 -> 327.12
 13% raptor-tp6-wikia-firefox linux64 pgo              143.61 -> 162.94
 12% raptor-tp6-wikia-firefox linux64 opt              163.41 -> 183.82
  8% raptor-tp6-amazon-firefox osx-10-10 opt         1,158.90 -> 1,257.07
  8% raptor-tp6-amazon-firefox linux64-qr opt           476.83 -> 517.09
  8% raptor-tp6-amazon-firefox windows10-64 opt         442.58 -> 478.07
  8% raptor-tp6-amazon-firefox windows10-64-qr opt      444.13 -> 478.96
  7% raptor-tp6-wikia-firefox linux64-qr opt            180.75 -> 193.73
  7% raptor-tp6-amazon-firefox linux64 opt              447.90 -> 477.35
  6% raptor-tp6-amazon-firefox windows7-32 pgo          390.35 -> 414.23
  6% raptor-tp6-amazon-firefox linux64 pgo              403.62 -> 427.72
  5% raptor-tp6-amazon-firefox windows10-64 pgo         405.06 -> 424.82
  4% raptor-tp6-facebook-firefox windows10-64 pgo       365.45 -> 379.12
  3% raptor-tp6-facebook-firefox linux64 opt            380.79 -> 393.78
  3% raptor-tp6-facebook-firefox windows10-64 opt       385.94 -> 399.08
  3% raptor-tp6-facebook-firefox linux64 pgo            354.86 -> 366.01
  3% raptor-tp6-facebook-firefox windows7-32 pgo        360.03 -> 371.30
  3% raptor-tp6-facebook-firefox linux64 pgo            355.08 -> 365.55

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=18472

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
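(For reference, the percentages above are just the relative change between the "before" and "after" values for each test; a quick Python sketch of the arithmetic, not the sheriffs' actual tooling:)

def regression_pct(before: float, after: float) -> float:
    # Positive result = the test got slower (higher is worse for tp6 page load).
    return (after - before) / before * 100

# First entry above: raptor-tp6-microsoft-firefox linux64-qr opt
print(f"{regression_pct(1108.81, 1563.52):.2f}%")  # ~41.01%, matching the bug title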
I am not sure who to needinfo here, but when bug 1514853 landed we saw some big perf improvements (I will comment on the bug shortly), and when it was backed out, we see these improvements reverted. If pageload is so important in 2019 (yes it is 2018) for Mozilla, we should consider leaving in the code from bug 1514853. I am not sure who to needinfo, but if there is more context we can provide, it would be nice to see these improvements come back.
Flags: needinfo?(tanvi)
Flags: needinfo?(ehsan)
Thanks for filing this bug, Joel, this is really helpful!

First and foremost, is it possible to get some links to the Gecko profiles for the before/after results? I do have some suspicions about where this regression may be coming from based on some of the previous Talos regressions we've had in this code (my primary suspect is bug 1514340), but I'd like to know whether that's a correct guess and/or whether there is anything else to know about... Also, out of curiosity, did we see the same results on beta?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #1)
> I am not sure who to needinfo here, but when bug 1514853 landed we saw some
> big perf improvements (I will comment on the bug shortly), and when it was
> backed out, we see these improvements reverted.

FWIW that patch was only temporarily backed out; it got relanded, so those improvements should have been added back by now.

> If pageload is so important in 2019 (yes it is 2018) for Mozilla, we should
> consider leaving in the code from bug 1514853.

For 65, the majority of users will indeed run the code that we landed in bug 1514853, so this regression has yet to be shipped; no need to be alarmed just yet. But I disagree with your conclusion here. My alternate proposal is that we get to the bottom of where the regression comes from and fix it, just like any other regression. :-)

I think the reason we didn't know about this yet is the classic story with Talos: as we do incremental development in the form of small patches, we may regress the Talos numbers by a small amount each time, and nobody knows what the total sum amounts to. The bug I linked to above was an example of that, where we first saw it causing multi-second-long pauses on Bugzilla (bug 1510275); that was fixed with a band-aid, and I intend to do the proper fix hopefully for 66. Hopefully that's also the reason behind this other regression, but other things may have crept in as well. But let's not jump to solutions before we have a diagnosis on the patient; we won't be able to stitch the limbs back on if we amputate first and diagnose later. :-)

> I am not sure who to needinfo, but if there is more context we can provide,
> it would be nice to see these improvements come back.

Talos profiles would be of extreme value here. (BTW, are those something that can be added to the template used for filing these bugs? I find myself almost always requesting them when getting CCed on one of these bugs...)

Thanks again!
Flags: needinfo?(ehsan) → needinfo?(jmaher)
I did see a small dip and reset on mozilla-beta; not 100% sure if this is the same, but it is similar enough:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=autoland,1782727,1,10&series=mozilla-inbound,1772174,1,10&series=mozilla-beta,1797348,1,10

I also don't see this fixed in recent days; the improvement on the 20th and the regression/backout on the 23rd are the main changes.

Here are some Talos profiles (apologies for not including them initially):
* treeherder links: https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=a40b74d3e7c07b94330201c8d64eb54afd7216ca&tochange=5d8e428324c662d96fd47677733880ae546266d4&searchStr=raptor%2Ctp6%2C-p%29&group_state=expanded
* linux64 amazon:
** before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FJSDFn39IRpKH0T9g9Hgnbw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-amazon-firefox.zip/calltree/?file=profile_raptor-tp6-amazon-firefox%2Fraptor-tp6-amazon-firefox_pagecycle_2.profile&globalTrackOrder=&localTrackOrderByPid=&thread&v=3
** after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FdQDQyRlCQq27dt73blws-g%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-amazon-firefox.zip
* linux64 facebook:
** before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FJSDFn39IRpKH0T9g9Hgnbw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-google-firefox.zip
** after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FdQDQyRlCQq27dt73blws-g%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-facebook-firefox.zip
* linux64 microsoft:
** before: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FIv908CMbT-GxG4rtk8QL4g%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-microsoft-firefox.zip
** after: https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FI1sFo2KfRtyLU2YSA8eX6w%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_raptor-tp6-microsoft-firefox.zip

Let me know if you need different profiles; these are easy to generate on the tree.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #3)
> I did see a small dip and reset on mozilla-beta, not 100% sure if this is
> the same, but it is similar enough:
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=5184000&series=autoland,1782727,1,10&series=mozilla-inbound,
> 1772174,1,10&series=mozilla-beta,1797348,1,10

Yup, it's the same. It's probably less apparent because we have far fewer runs there... (BTW we will see it once again once the early beta flag is turned off...)

> I also don't see this fixed in recent days, the improvement on the 20th and
> regression/backout on the 23rd are the main changes.

Hmm, not sure what you mean exactly. But if you mean you don't see it again on central, that's expected: the first landing of bug 1514853 on central changed the value of the network.cookie.cookieBehavior pref by mistake; the backout fixed that mistake (also causing this regression again); and the second landing didn't change the pref this time, so there should have been no perf change as a result, which is what happened. All expected.

> let me know if you need different profiles, these are easy to generate on
> the tree.

Thanks, these are great, I'm looking at them now. BTW, can I ask how you generate these links? Is it possible to generate them from try pushes as well?
There is an option in treeherder: when you click on a job, an information panel pops up at the bottom. It is divided into left/right parts; the header on the left side has a '...' button, and clicking it opens a menu with a 'Create Gecko Profile' entry. This retriggers the job (talos/raptor) with the proper flags to run it as a gecko-profile variant and adjusts the treeherder flags.

The first caveat is that it does this for the requested job and for the previous build's equivalent job, so on try it would add jobs to other people's pushes as well. Not that worrisome, but if we want to do it on try more often, we can adjust the script to be branch aware. The code lives here: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/actions/gecko_profile.py#40

The second (and I assume last) caveat is that this might be restricted to sheriffs only, as some features are sheriff-specific; in this case I think you just need to be logged in: https://github.com/mozilla/treeherder/blob/42a827cc6ee92b669496131c2bab99353444bfa5/ui/job-view/details/summary/ActionBar.jsx#L66
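For completeness, the perf-html.io links in comment 3 are just the profile zip artifact URL percent-encoded and appended to https://perf-html.io/from-url/. A small helper for building one by hand (my own convenience sketch, not part of any Mozilla tooling):

from urllib.parse import quote

def profiler_link(artifact_url: str) -> str:
    # perf-html.io expects the artifact URL fully percent-encoded, including ':' and '/'.
    return "https://perf-html.io/from-url/" + quote(artifact_url, safe="")

print(profiler_link(
    "https://queue.taskcluster.net/v1/task/JSDFn39IRpKH0T9g9Hgnbw/runs/0/"
    "artifacts/public/test_info/profile_raptor-tp6-amazon-firefox.zip"))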
Depends on: 1517014
Great, thanks a lot! I will update this bug when I know more...
Depends on: 1517057
Depends on: 1517062
Depends on: 1517136
Flags: needinfo?(tanvi)
Component: General → Tracking Protection
Product: Testing → Firefox
Version: Version 3 → unspecified
Bug 1514340 also shows up in microsoft.com profiles, as I had expected.
Depends on: 1514340
See Also: → 1517333
Here is a Talos push with the patches for bug 1517014, bug 1517057, bug 1517062, and bug 1517136, comparing network.cookie.cookieBehavior=0 vs 4.

Before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=35cc5b52a0f57ce89386d7f4c231db7e0f83933c
After: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4be132299cf96193be86263234a3208dae0c9cbb
Comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=35cc5b52a0f5&newProject=try&newRevision=4be132299cf9&framework=1

For unknown reasons, only the tests for amazon, facebook, google and youtube are running there (where are the rest?!). Also, there are no Windows runs due to bug 1517333.

On Linux, the regression for Facebook has now been fixed, but the regression for Amazon is still there... On Mac, the regression for Amazon is now 12.3%, which is weird since that's more than the amount reported in comment 0; this makes me think that I may be measuring something different here! No regression on Facebook there, though.

I think there is something wrong with the try syntax/results. I asked to run all tp6 tests, but I only got "1_thread e10s stylo". Joel, what do you think? Am I doing something wrong?
Flags: needinfo?(jmaher)
I did "add new jobs" for the two try pushes and added 'raptor tp6' to both linux64 opt and osx opt. In the compare view, I changed the framework to be raptor instead of talos and you can see the comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=35cc5b52a0f5&newProject=try&newRevision=4be132299cf9&framework=10
Flags: needinfo?(jmaher)
I did "add new jobs" for the two try pushes and added 'raptor tp6' to both linux64 opt and osx opt. In the compare view, I changed the framework to be raptor instead of talos and you can see the comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=35cc5b52a0f5&newProject=try&newRevision=4be132299cf9&framework=10 I agree this is confusing and need to make sure it is easier to schedule and view results across various frameworks.
Thanks a lot, Joel! So please ignore comment 8 completely. Based on the results available so far, the remaining regressions are as follows:

* Amazon: the regression is still present on both Linux and Mac.
* eBay: there is a big (33% on OSX) regression that wasn't originally reported (I think because the job is tier 2?).
* Facebook: there are still small regressions present.
* Microsoft: improved by about 5% on Linux, but the regressions are still present and really big on both Linux and OSX.
* Wikia: the regression is also still present.
* Yandex: improved by about 3% on OSX, but the regressions are still present on both Linux and OSX and are really big.

The summary is that the optimizations done so far have yet to make a large dent. My suspicion is that the majority of the cost here comes from the overhead related to bug 1514340 every time a STATE_COOKIES_LOADED notification is dispatched. I do hope that we don't end up having to go as far as fixing bug 1510569 here...
Depends on: 1518491

Here is a status update on what I've tried so far.

Bug 1514340 makes things a bit better; for example, we no longer need to encode the certificate alongside the notifications we send to the parent process, but the JS code we execute still shows up in the profile. :-(

In order to mitigate the impact of that a little, I tried to delay the dispatching of OnContentBlockingEvent() notifications in the nsBrowserStatusFilter object, similar to how status and progress notifications are delay-batched there. Sadly that approach isn't going to work either, for two reasons. a) OnContentBlockingEvent() actually depends on its nsIWebProgress* argument, so we can't get away with just passing null for it; if we wanted to do this properly we would need some rather complex logic in order to batch notifications per nsIWebProgress object. But more importantly, b) the profiles I was looking at don't actually dispatch enough of these notifications for the batching to be effective, even with my quick-and-dirty attempt, which batched every notification no matter which nsIWebProgress/nsIRequest object it was dispatched for. The best result I got was suppressing just one of these notifications, so this wasn't a very useful optimization anyway.
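For illustration only, here is a minimal standalone sketch (plain Python with made-up names, not Gecko code) of the per-object delay-batching idea described above: coalesce the event bits per nsIWebProgress-like source and only forward them when the batch is flushed.

from collections import defaultdict

class BatchingFilter:
    """Stands in for the delay-batching that nsBrowserStatusFilter would need to do."""

    def __init__(self, listener):
        self._listener = listener          # downstream OnContentBlockingEvent consumer
        self._pending = defaultdict(int)   # source object -> OR'ed event bits

    def on_content_blocking_event(self, web_progress, event_bits):
        # Coalesce instead of forwarding immediately; this only pays off if the
        # same source fires many events within one batching window, which the
        # profiles above showed is rarely the case.
        self._pending[web_progress] |= event_bits

    def flush(self):
        # In a real patch this would run off a short timer, like the existing
        # status/progress delay-batching in nsBrowserStatusFilter.
        for web_progress, bits in self._pending.items():
            self._listener(web_progress, bits)
        self._pending.clear()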

This basically leaves us with bug 1510569, which is a humongous amount of work... So I am trying to come up with a strategy to avoid doing all of that work. My hope is that now that OnContentBlockingEvent is a separate nsIWebProgressListener notification, we can port only that one to C++ and leave the rest for someone else to fix...

Priority: -- → P1
Whiteboard: [privacy65]

One case where this regresses performance is when a page has a ton of images. This simple test case loads 10,000 images: https://pastoral-salesman-1.glitch.me/

The profiles of that test case easily show the impact of bug 1518491.
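If that glitch.me page ever goes away, a rough local approximation (my assumption about what the test case does, based only on the description above) is a page with a large number of distinct <img> URLs, e.g.:

N = 10_000  # the test case above reportedly loads 10,000 images
imgs = "\n".join(
    f'<img src="https://example.com/img?{i}" width="1" height="1">'
    for i in range(N)
)
with open("many-images.html", "w", encoding="utf-8") as f:
    # Each distinct URL triggers its own load, so the per-load overhead shows up N times.
    f.write(f"<!doctype html>\n<title>{N} images</title>\n{imgs}\n")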

Depends on: 1519205

I think I finally uncovered the underlying reason behind the regressions we are seeing here. Interestingly, I was led down an incorrect path from the beginning of my investigation until Friday, when I first started to pay attention to which network resources are loaded in the raptor tests with network.cookie.cookieBehavior set to 0 and 4. It turns out that the regressions are caused by the affected pages loading a different set of resources depending on the configuration:

Test                           cookieBehavior=0   cookieBehavior=1   cookieBehavior=4
raptor-tp6-microsoft-firefox   507 requests       538 requests       630 requests
raptor-tp6-yandex-firefox       84 requests        84 requests        85 requests
raptor-tp6-imdb-firefox         65 requests        65 requests        65 requests
raptor-tp6-wikia-firefox        63 requests        64 requests        63 requests
raptor-tp6-facebook-firefox     94 requests        94 requests        94 requests
raptor-tp6-amazon-firefox      147 requests       147-163 requests   153 requests

I captured these numbers using devtools by manually running ./mach raptor-test locally. As can be seen, in some of the tests the number of resources loaded depends on whether third-party cookies are blocked and exactly how. The other notable difference was that in some tests a lot of 404 requests were being generated (requests that did not have a recorded response), but looking at the URLs of the captured resources also showed that pages may actually load different resources with the different values of the pref.
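If anyone wants to re-capture these counts, one low-tech way (an assumption about workflow, not necessarily what was used above beyond devtools) is to export the devtools Network panel as a HAR file for each network.cookie.cookieBehavior value and count the entries:

import json
import sys

# network.cookie.cookieBehavior values, for reference:
#   0 = accept all cookies, 1 = reject all third-party cookies,
#   4 = reject cookies from known trackers (the mode under test here).

def summarize(har_path):
    with open(har_path, encoding="utf-8") as f:
        entries = json.load(f)["log"]["entries"]
    total = len(entries)
    not_found = sum(1 for e in entries if e["response"].get("status") == 404)
    no_response = sum(1 for e in entries if e["response"].get("status", 0) <= 0)
    return total, not_found, no_response

for path in sys.argv[1:]:
    total, not_found, no_response = summarize(path)
    print(f"{path}: {total} requests, {not_found} 404s, {no_response} without a recorded response")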

For example, profiling the live yandex.ru site with cookieBehavior set to 0 (https://perfht.ml/2RqlWIe) vs cookieBehavior set to 4 (https://perfht.ml/2RqlYjk) shows an example of the site loading different resources during its page load in the two modes.

The takeaway is that while this bug prompted a very valuable investigation into the performance of this feature, the raptor regressions here are a combination of the browser processing a different workload and the browser being a little bit less efficient due to the dependencies captured here. Fixing all of the dependencies shouldn't be expected to fix the regressions captured here, since the pages being tested can and do change their behavior when their third-party tracking resources can access storage versus when they cannot. That being said, there is still more work I would like to do here in order to finish all of the performance improvements that this investigation has uncovered.

Depends on: 1520879
Depends on: 1521595
Depends on: 1521598
Depends on: 1522370
Depends on: 1522490
Depends on: 1522630
Depends on: 1523106
See Also: → 1523705

So I have some more interesting updates to share here.

Firstly, this is where we currently stand with all of the fixes that have landed so far: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1b73a3b810a4b8edab6451c106a866429fc59d95&newProject=try&newRevision=15313fe8f75a63db2264ea8391c8f9342c6bb7a5&framework=10&showOnlyImportant=1 Things are still not looking good, although I suspect bug 1270059 has changed the baselines here quite significantly; I'm not actually sure how meaningful it is to compare the geometric means of the subtests before/after that change any more.

I decided to follow an approach of eliminating possible culprits, by pushing trial patches to the try server that progressively disabled code that is hidden behind the checks for network.cookie.cookieBehavior being equal to 4, especially now that I discovered bug 1523705.

Lo and behold: this patch https://hg.mozilla.org/try/rev/80f83ae8e6284dd7c71ac6772c92a06697f43e1f erases all of the regressions: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1b73a3b810a4b8edab6451c106a866429fc59d95&newProject=try&newRevision=a08c3e5f54a7c14017a1c259181496ecb34666c0&framework=10&showOnlyImportant=1. IOW, all of the regressions are coming from loading images through this codepath: https://searchfox.org/mozilla-central/rev/78cd247b5d7a08832f87d786541d3e2204842e8e/toolkit/components/antitracking/AntiTrackingCommon.cpp#1341-1401

http://bit.ly/2RYnF82 should be a profile that captures the issue in comment 15 (taken from this test case: https://pastoral-salesman-1.glitch.me/. Beware that the test case loads 1 million images!)

Thanks for this debugging. Today I want to investigate why we have such a horrible regression in image loading with cookie behavior == 4.

Depends on: 1524313
Depends on: 1524465

This created a talos regression also:
== Change summary for alert #19108 (as of Fri, 01 Feb 2019 11:35:53 GMT) ==

Regressions:

2% tp5o_webext responsiveness linux64 opt e10s stylo 1.44 -> 1.46

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=19108

(In reply to Florin Strugariu [:Bebe] from comment #18)
> This created a talos regression also:
> == Change summary for alert #19108 (as of Fri, 01 Feb 2019 11:35:53 GMT) ==
>
> Regressions:
>
> 2% tp5o_webext responsiveness linux64 opt e10s stylo 1.44 -> 1.46
>
> For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=19108

Isn't that related to bug 1522630?

Flags: needinfo?(fstrugariu)

(In reply to :Ehsan Akhgari from comment #19)
> (In reply to Florin Strugariu [:Bebe] from comment #18)
> > This created a talos regression also:
> > == Change summary for alert #19108 (as of Fri, 01 Feb 2019 11:35:53 GMT) ==
> >
> > Regressions:
> >
> > 2% tp5o_webext responsiveness linux64 opt e10s stylo 1.44 -> 1.46
> >
> > For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=19108
>
> Isn't that related to bug 1522630?

You are right, it is actually related to bug 1522630.

Flags: needinfo?(fstrugariu)
Depends on: 1525208

This profile https://perfht.ml/2Sr96Jl is captured from a build after the patches in bug 1525208.

This profile https://perfht.ml/2D6AKBC is captured from a build after the patch in bug 1525356. Now you can see that nsContentUtils::IsThirdPartyWindowOrChannel() is no longer dominating the time spent under AntiTrackingCommon::MaybeIsFirstPartyStorageAccessGrantedFor().

Depends on: 1525356
Component: Tracking Protection → Privacy: Anti-Tracking
Product: Firefox → Core
Target Milestone: --- → mozilla67
Whiteboard: [privacy65] → [privacy65][anti-tracking]

This profile https://perfht.ml/2DbZzfF is captured from a build after the patches in bug 1525502. Now you can see that CheckContentBlockingAllowList() is almost eliminated in this profile. Of course this is because of the specific nature of this test case which is a best case for this type of cache.
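As a rough illustration of what this kind of cache buys (a standalone sketch, not the actual implementation from bug 1525502): memoizing the allow-list answer per top-level origin means a page that loads thousands of subresources, like the image test case above, only pays for the expensive lookup once.

from functools import lru_cache

ALLOW_LIST = {"https://example.com"}  # stand-in for the real permission-manager data

@lru_cache(maxsize=256)
def is_on_content_blocking_allow_list(top_level_origin: str) -> bool:
    # In Gecko the expensive part is the permission-manager lookup; a set
    # membership test stands in for it here.
    return top_level_origin in ALLOW_LIST

for _ in range(10_000):  # e.g. one check per <img> during a single page load
    is_on_content_blocking_allow_list("https://pastoral-salesman-1.glitch.me")
print(is_on_content_blocking_allow_list.cache_info())  # 1 miss, 9,999 cache hits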

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=878491318f3e17f19dc0500046993c2110bad886&newProject=try&newRevision=9de01b6ce5c2dd3180a04386dade7b2bb2599b54&framework=10 shows the cumulative impact of all of the optimizations from comment 21 up to here, and as you can see we still haven't managed to make a sufficient dent. :-(

Depends on: 1525502
No longer depends on: 1525502
Depends on: 1525502
Assignee: nobody → ehsan
Status: NEW → ASSIGNED
Depends on: 1525245
Depends on: 1526542
Depends on: 1548349

This regression will finally be completely fixed in bug 1548349.

Fixed by bug 1548349.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: mozilla67 → mozilla68
Status: RESOLVED → VERIFIED