Bug 1393910 (Open)

39.89% sessionrestore_many_windows (windows7-32) regression on push 3cbad3a6fc6e (Thu Aug 24 2017)

Categories: Toolkit :: Safe Browsing, defect, P3
Version: 53 Branch
Tracking Status: firefox57 --- fix-optional
People: Reporter: jmaher; Assignee: Unassigned
Keywords: perf, regression, talos-regression

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=3cbad3a6fc6e

You authored one of the patches included in that push, so we need your help to address this regression.

Regressions:

 40%  sessionrestore_many_windows windows7-32 pgo e10s     2,597.00 -> 3,632.83

Improvements:

 25%  sessionrestore_many_windows windows10-64 opt e10s     4,107.08 -> 3,079.17


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=9006

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
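
For reference, the regression and improvement percentages above appear to be the relative change between the old and new values; a quick check of that reading in Python:

# Relative change between the before/after values quoted in the summary
# (assuming that is how the percentages are computed).
old, new = 2597.00, 3632.83             # sessionrestore_many_windows, windows7-32 pgo e10s
print((new - old) / old * 100)          # ~39.89 -> the 40% regression in the title

old, new = 4107.08, 3079.17             # sessionrestore_many_windows, windows10-64 opt e10s
print((old - new) / old * 100)          # ~25.03 -> the 25% improvement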
I am not sure why we see a large improvement on win10 opt (non-PGO) and a large regression on win7 PGO.

I know we only changed some preferences in the test harnesses. I see a few options here:
* the preferences really do affect session restore and we need to document this
* the regression is related to a PGO-specific issue, which is a black box to us
* there are possibly differences between 32-bit and 64-bit builds, but I doubt it, as win7 opt didn't change and neither did win10 pgo

:francois, can you look into this and help figure out if there is a good explanation for this?
Flags: needinfo?(francois)
Component: Untriaged → Safe Browsing
Product: Firefox → Toolkit
The first thing would be to figure out which of these prefs actually changes the sessionrestore Talos numbers:

https://hg.mozilla.org/integration/autoland/diff/3cbad3a6fc6e/testing/talos/talos/config.py

We can ignore browser.safebrowsing.downloads.enabled since it won't do anything during session restore.

It's one of these two:

- plugins.flashBlock.enabled
- privacy.trackingprotection.annotate_channels

Is it just a matter of changing one of these at a time and pushing to Try or is there something else to worry about?

What Try chooser settings should I use to replicate the results from comment 0?
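
For illustration, the bisection being proposed would look roughly like this; the dict name and the boolean values below are assumptions for the sketch, not the real contents of config.py:

# Hypothetical sketch of the experiment: start from the prefs the regressing
# push added to the Talos config, and generate one variant per candidate pref
# with that single pref dropped, to be pushed to try separately.
BASE_PREFS = {
    "browser.safebrowsing.downloads.enabled": False,        # can be ignored for session restore
    "plugins.flashBlock.enabled": False,                     # candidate 1 (value assumed)
    "privacy.trackingprotection.annotate_channels": False,   # candidate 2 (value assumed)
}

CANDIDATES = ["plugins.flashBlock.enabled",
              "privacy.trackingprotection.annotate_channels"]

def variants(prefs, candidates):
    """Yield (dropped_pref, remaining_prefs) pairs, one per try push."""
    for name in candidates:
        yield name, {k: v for k, v in prefs.items() if k != name}

for dropped, prefs in variants(BASE_PREFS, CANDIDATES):
    print("try push without", dropped, "->", sorted(prefs))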
Flags: needinfo?(francois) → needinfo?(jmaher)
./mach try -b o -p win32,win64 -u none -t other-e10s --rebuild 3
Flags: needinfo?(jmaher)
From the Try pushes above, the one with annotations enabled seems to show a slightly bigger regression. I did another set of pushes, using https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=07c2203ca05e5fd802f61e740b4511b02aafbe48&tochange=072ae808c7b3fdd26ff74f5b9221c44a31f21224 as a baseline.

Baseline, before the settings were added: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e06c7ca29a2dacce8a46925bd206b4a8fcbb606

privacy.trackingprotection.annotate_channels disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=20a250bb8cfc52a4f673c0b581e8aae1af6719ec

plugins.flashBlock.enabled disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=619533164980d8d0397b7a435bb31d442e8e00e6

Once these reruns finish, I'll come back with the proper conclusion.
OK, results are now in. It looks like by commenting out plugins.flashBlock.enabled and leaving the privacy.trackingprotection.annotate_channels setting on, we can reproduce a >38% regression on the same test.

In other words, the privacy.trackingprotection.annotate_channels pref is the culprit for this bug. François, please review the Try pushes I did to confirm my findings; I want to be sure I pushed the code changes correctly.
Flags: needinfo?(francois)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #6)
> In other words, the privacy.trackingprotection.annotate_channels is the
> culprit for this bug. François, please review the Try pushes I did, to
> confirm my findings. I want to be sure I pushed the code changes correctly.

These commits look good to me:

https://hg.mozilla.org/try/rev/20a250bb8cfc52a4f673c0b581e8aae1af6719ec
https://hg.mozilla.org/try/rev/619533164980d8d0397b7a435bb31d442e8e00e6

However, I have no idea why we are seeing both a regression and an improvement in session_restore. This pref controls a few perf-related features from Quantum that depend on the tracking protection list. I could understand a perf impact one way or the other:

1. session restore is slower because the tracking annotations cause more work (CPU) to happen, or
2. session restore is faster because unnecessary resources are throttled or loaded in a more efficient way

but if I understand the results correctly, we're seeing both of these at the same time on different platforms. That doesn't make sense to me.

I disabled it in bug 1388938 assuming it would be better not to rely on an external service (and a list that may not be fully downloaded by the time the Talos tests start), similar to how we disable tracking protection and Safe Browsing in these tests. Do we want to have Talos results rely on a feature with a network dependency?
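
To make the trade-off concrete, here is a rough sketch of the two options; the update URL pref and the local endpoint are illustrative assumptions, not verified against the tree:

# Option A: disable the list-dependent features in the Talos profile, as the
# regressing push did (values shown here are assumed for the sketch).
PREFS_DISABLE_FEATURES = {
    "privacy.trackingprotection.annotate_channels": False,
    "plugins.flashBlock.enabled": False,
    "browser.safebrowsing.downloads.enabled": False,
}

# Option B: keep the features on, but serve the lists from a local fixture so
# the result does not depend on whether a download from an external service
# finished before the test started.  The pref name and endpoint below are
# assumptions used only for illustration.
PREFS_LOCAL_LISTS = {
    "privacy.trackingprotection.annotate_channels": True,
    "browser.safebrowsing.provider.mozilla.updateURL":
        "http://127.0.0.1:8000/safebrowsing-dummy/update",
}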
Flags: needinfo?(francois)
(In reply to François Marier [:francois] from comment #7)
> Do we want to have Talos results rely on a
> feature with a network dependency?
Ni?ing test authors and Joel on that.
Flags: needinfo?(mdeboer)
Flags: needinfo?(jmaher)
Flags: needinfo?(dteller)
(In reply to François Marier [:francois] from comment #7)
> I disabled it in bug 1388938 assuming it would be better not to rely on an
> external service (and a list that may not be fully downloaded by the time
> the Talos tests start), similar to how we disable tracking protection and
> Safe Browsing in these tests. Do we want to have Talos results rely on a
> feature with a network dependency?

I think that question applies to Talos in its entirety, so it's better left for Joel to answer.

However, my gut tells me that the only network dependencies these tests should rely on are LAN and loopback. WAN should be out of the equation, because the WWW infrastructure is obviously not reliable.
Localhost network traffic in particular is something we can include and treat as a constant, because we control that segment of the infrastructure.
Flags: needinfo?(mdeboer)
Mmmmh... I thought we had managed to remove network dependencies from that test. If not, that's not intended.
Flags: needinfo?(dteller)
All unit tests and Talos tests run in our CI system should not rely on external servers outside of the original setup and cleanup.

this means:
* setting preferences/env variables to set server names/ports to be localhost or invalid
* setting preferences to disable services where possible
* ensuring the tests themselves have all the resources in-tree (or in a pageset bundle)

Typically I would load a test and look at the network monitor in devtools, but for sessionrestore I am not sure that will be possible. We could use tcpdump or similar software to look for external network access and filter it on the Firefox processes.
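
For example, something along these lines could approximate the tcpdump idea without packet capture; this is a rough sketch, not something the harness does today, and it assumes psutil is available on the test machine:

# Rough sketch: list non-loopback network connections opened by Firefox
# processes while a Talos test runs, to spot external network access.
import ipaddress
import psutil

def external_connections():
    for proc in psutil.process_iter(["name"]):
        if not (proc.info["name"] or "").lower().startswith("firefox"):
            continue
        try:
            conns = proc.connections(kind="inet")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        for conn in conns:
            # Skip sockets with no remote address and anything bound to loopback.
            if conn.raddr and not ipaddress.ip_address(conn.raddr.ip).is_loopback:
                yield proc.pid, conn.raddr

for pid, raddr in external_connections():
    print(pid, "->", raddr.ip, raddr.port)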
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #11)
> all unittests and talos tests run in our CI system should not rely on
> external servers outside of original setup and cleanup.

Ok, so my assumption was correct.

Based on that, it sounds like the next step here is to do the first option from comment 1:

- document that privacy.trackingprotection.annotate_channels affects session_restore
Mike, when you get back from PTO, do you by any chance have cycles to get a profile from a Talos run with the annotate_channels pref turned on? It's pretty worrying that this pref causes a regression, since we're shipping with it.  :-(
Flags: needinfo?(mconley)
Wasn't able to reproduce locally, so I've got try pushes cooking that should hopefully result in profiles:

annotate disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=80d431811c60bda9c0dfac01b631da19da3feddd
annotate enabled (regression): https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0226ffad0cd9c13bf8f3f6305433645507bd77c

Leaving ni? so that I can come back and post the profiles when these things are done.
Frustratingly, neither of those pushes resulted in profiles, which is especially annoying because it worked for me locally. Classic Talos.

I have a Windows 7 machine that I might be able to reproduce the problem on locally, so I'm trying that next.
One thing to keep in mind here (and this is just my theory) is that the same cset caused a large regression on one platform and a large improvement on another (unless I'm misunderstanding comment 0).

With that in mind, I think it's more likely that the change shifted the timing of when something particularly expensive happens, making it fall inside the timed portion of the test on one platform and outside it on another. IOW, it's _probably_ not a regression in the sense that more work is being done; it's just a matter of the timing of things.

(I could be wrong, of course.)
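
A toy illustration of that theory, with made-up numbers that are not from the test itself:

# Toy illustration: the same fixed amount of extra work, shifted slightly in
# time, can land inside or outside the measured window and flip the number.
WINDOW = (0, 3000)        # ms during which the test is measuring
EXTRA_WORK = 600          # ms of expensive work whose timing shifted

def measured_score(base, work_start):
    start, end = WINDOW
    # Only the part of the extra work overlapping the window shows up in the score.
    overlap = max(0, min(end, work_start + EXTRA_WORK) - max(start, work_start))
    return base + overlap

print(measured_score(2600, work_start=2500))  # overlaps the window -> looks like a regression
print(measured_score(2600, work_start=3100))  # falls after the window -> score unchanged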
Priority: -- → P3
mconley: Kevin's perf report listed this ticket as an open/deferred item from 57. Can you comment on the priority of this regression so the P3 decision is documented? Thank you.
Hey elan, and everybody,

Sorry about the delay here. I haven't really had much luck determining what's going wrong here yet. I've got profiles working on Try, so that's good. I'm trying to get PGO profiles out now.

I think this is worth investigating, but I think P3 is the right call - specifically because this test appears to be bi-modal:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=mozilla-inbound,1514529,1,1&series=autoland,1514549,1,1

I suspect something about tracking protection annotation just happened to push us off of some kind of cliff.

But if you look at the graph I linked to, we seem to hop between these two modes, and we should probably figure out why to make this test more reliable.

So I'm 70% sure this is a test issue, and the test needs to be made less bi-modal. I could be wrong, but that's my read on this.
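
For what it's worth, a quick way to eyeball that from the raw replicates would be something like this (the values below are made up; the real ones come from the graphs linked above):

# Quick, illustrative look at whether the replicates cluster into two modes.
from collections import Counter

runs = [2590, 2610, 2605, 3630, 3640, 2598, 3625, 2602, 3650, 2595]

buckets = Counter(round(value, -2) for value in runs)  # bucket to the nearest 100 ms
for bucket in sorted(buckets):
    print(bucket, "#" * buckets[bucket])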

Is that sufficient?

I'm going to keep the needinfo? on me to get those PGO profiles. Soon!
Flags: needinfo?(elancaster)
:mconley, thank you so much. 70% is acceptable; we just wanted to make sure we are being deliberate in our choices for perf-related tickets. You've done due diligence, thanks again! If something changes, we'll see your comment, or you can ping me.
Wow, that took a while. I finally have some data here:

No regression profile: https://perfht.ml/2jXE5gj
Regression profile: https://perfht.ml/2jWoL3y

It's still not super clear what's happening, but it looks like in the regressing case there's a longer gap before the test starts (and completes) restoring tab content. The actual loading doesn't appear to take longer; it's just that the onStateChange that ContentRestore.jsm is waiting for seems to arrive later.

Hey francois, are there other threads besides the main threads in the parent and content processes that I should gather profiles from to help shed light on what the delay is here?
Flags: needinfo?(mconley) → needinfo?(francois)
(In reply to Mike Conley (:mconley) (:⚙️) - Backlogged on reviews from comment #21)
> Hey francois, are there other threads besides the main threads in the parent
> and content processes that I should gather profiles from to help shed light
> on what the delay is here?

The classification occurs on the main thread:

https://searchfox.org/mozilla-central/rev/f6dc0e40b51a37c34e1683865395e72e7fca592c/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#1771

but I wonder whether the Safe Browsing update thread is interfering with this. Henry Chang did a lot of work in the past few months to mitigate that (bug 1338017) so it's probably fine, but it might be worth a quick profile.
Flags: needinfo?(francois)
Flags: needinfo?(elancaster)
Hey Mike, for 57 we're thinking fix-optional based on your feedback. Are you ok with that?
Flags: needinfo?(mconley)
I think that's the right call.
Flags: needinfo?(mconley)
Severity: normal → S3