Last Comment Bug 1317312 - [e10s-multi] Talos regressions with multiple content processes
: [e10s-multi] Talos regressions with multiple content processes
Status: RESOLVED FIXED
[e10s-multi:?]
:
Product: Core
Classification: Components
Component: DOM: Content Processes (show other bugs)
: unspecified
: Unspecified Unspecified
-- normal with 1 vote (vote)
: mozilla53
Assigned To: Gabor Krizsanits [:krizsa :gabor]
:
: Bill McCloskey (:billm)
Mentors:
Depends on: 1317304 1324033
Blocks: 1303113 1323319
  Show dependency treegraph
 
Reported: 2016-11-14 06:49 PST by Gabor Krizsanits [:krizsa :gabor]
Modified: 2017-01-27 05:07 PST (History)
17 users (show)
gkrizsanits: needinfo? (mrbkap)
mconley: needinfo? (jmuizelaar)
ryanvm: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed
fixed


Attachments
part1: Use single content process in tabpaint (1.75 KB, patch)
2016-12-30 04:22 PST, Gabor Krizsanits [:krizsa :gabor]
mconley: review+
Details | Diff | Splinter Review
part2: Tps should filter out glyph cache init (2.48 KB, patch)
2016-12-30 04:24 PST, Gabor Krizsanits [:krizsa :gabor]
mconley: review+
jcristau: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 06:49:42 PST
From Bug 1303113 Comment 11:

(In reply to Joel Maher ( :jmaher) from comment #11)
> as a note this had a large impact on talos:
> == Change summary for alert #4055 (as of November 09 2016 13:46 UTC) ==
> 
> Regressions:
> 
> 270%  tabpaint summary osx-10-10 opt e10s       58.11 -> 214.79
> 263%  tabpaint summary windowsxp pgo e10s       50.45 -> 183.01
> 223%  tabpaint summary windows7-32 pgo e10s     49.69 -> 160.26
> 217%  tabpaint summary linux64 pgo e10s         59.5 -> 188.54
> 214%  tabpaint summary linux64 opt e10s         68.86 -> 216.27
> 205%  tabpaint summary windows8-64 opt e10s     59.03 -> 180.29
> 196%  tabpaint summary windowsxp opt e10s       73.17 -> 216.68
> 177%  tabpaint summary windows7-32 opt e10s     71.44 -> 197.61
>  67%  tps summary windows7-32 pgo e10s          33.25 -> 55.64
>  58%  tps summary windows7-32 opt e10s          40.02 -> 63.24
>  42%  tps summary osx-10-10 opt e10s            39.78 -> 56.29
>  27%  tps summary windowsxp opt e10s            39.5 -> 50.02
>  25%  tps summary windowsxp pgo e10s            33.74 -> 42.19
>  16%  damp summary windows7-32 pgo e10s         222.94 -> 259.04
>  14%  damp summary windows8-64 opt e10s         267.11 -> 303.89
>  13%  damp summary osx-10-10 opt e10s           302.7 -> 342.71
>  11%  damp summary linux64 opt e10s             293.99 -> 326.18
>  10%  damp summary linux64 pgo e10s             245.15 -> 270.28
>   9%  tps summary linux64 opt e10s              41.7 -> 45.47
>   8%  damp summary windows7-32 opt e10s         300.24 -> 325.07
>   8%  tps summary linux64 pgo e10s              36.23 -> 39.22
> 
> Improvements:
> 
>   3%  tps summary windows8-64 opt e10s     37.1 -> 35.86
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=4055
> 
> 
> please address these talos issues prior to landing this again.
Comment 1 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 06:53:56 PST
Content process startup optimization will fix some of this regression, we should decide what is the acceptable regression for release (process startup and script execution won't happen without a cost).

We also have to decide if we want to force single content process for these tests for now or we want to block turning on 2cp's on nightly.
Comment 2 User image Joel Maher ( :jmaher) 2016-11-14 06:56:44 PST
let me know if I can help answer questions or debug stuff.
Comment 3 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 07:25:20 PST
Jan, I will try to delay some of these scripts, but realistically we will need something like bug 876173 for this as well. Are you working on something related these days?
Comment 4 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 07:26:56 PST
Eh sorry for the spam this was the wrong bug :(
Comment 5 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 07:34:35 PST
(In reply to Joel Maher ( :jmaher) from comment #2)
> let me know if I can help answer questions or debug stuff.

Thanks! So tabpaint is hardly surprising but I'm not sure about the damp regression. Opening new tab or window is expected to start slower, is this something that would explain the damp regression or do you think there is something else?
Comment 6 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 07:37:28 PST
Mike, is it expected that tps is regressed by multiple content processes? I would expect it not to open tabs during the run but the extra process startup might cause some delays at the start of the test... which should not matter that much either... do you have any ideas why this test is so sensitive about an extra content process?
Comment 7 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-14 07:39:27 PST
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #1)
> We also have to decide if we want to force single content process for these
> tests for now or we want to block turning on 2cp's on nightly.

:jimm thinks if we decide to turn it on on nightly with this regression, we should not hide it by forcing single content process.
Comment 8 User image Joel Maher ( :jmaher) 2016-11-14 07:51:15 PST
regarding damp, looking at the subtests:
linux64: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=mozilla-inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1

osx: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=mozilla-inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&framework=1

you can see similar patterns where we have some improvements and many regressions.  specifically in the open/reload measurements for the different tools, the close measurements have little to no regressions.  The wins that we see are in the reload typically, so that leaves the smoking gun pointing at the open measurements for damp- let me include bgrins on here as he might have some valuable insight into the damp test and specifically the open measurements.
Comment 9 User image (Unavailable until Apr 3) [:bgrins] 2016-11-15 08:09:33 PST
(In reply to Joel Maher ( :jmaher) from comment #8)
> regarding damp, looking at the subtests:
> linux64:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=mozilla-
> inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=
> mozilla-
> inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignatur
> e=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c
> 99daa61ac983f21a6ac&framework=1
> 
> osx:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=mozilla-
> inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=
> mozilla-
> inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignatur
> e=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951
> c8db24518c7e0ceddce&framework=1
> 
> you can see similar patterns where we have some improvements and many
> regressions.  specifically in the open/reload measurements for the different
> tools, the close measurements have little to no regressions.  The wins that
> we see are in the reload typically, so that leaves the smoking gun pointing
> at the open measurements for damp- let me include bgrins on here as he might
> have some valuable insight into the damp test and specifically the open
> measurements.

FYI there's a possibility for some noise in between DAMP subtests right now, so if RDP requests are still ongoing after opening the toolbox then it might bleed into the reload time.  I'm hoping that Bug 1291815 will prevent that from happening.

Generally, open time is how long it takes from the start of opening the toolbox until the toolbox and specific tool are fully loaded.  Most of the work is in the parent process (loading a bunch of modules and building UI), but it's also waiting for the RDP (which is running in the content process) to return whatever is needed for the tool.  In some tools there can be quite a bit of traffic on a complicated page (like the inspector needs a lot of info about the DOM even at startup).
Comment 10 User image Mike Conley (:mconley) 2016-11-15 12:15:48 PST
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #6)
> Mike, is it expected that tps is regressed by multiple content processes? I
> would expect it not to open tabs during the run but the extra process
> startup might cause some delays at the start of the test... which should not
> matter that much either... do you have any ideas why this test is so
> sensitive about an extra content process?

Nothing comes to mind, I'm afraid. By the time that we start measuring tab switch times, the tp5 pages should have finished loading (and presumably, the content processes are both fully started-up).

I think we'll need to profile it to get a sense of what's going wrong here.
Comment 11 User image Mike Conley (:mconley) 2016-11-18 14:36:14 PST
ni?ing myself to get some profiles to gabor / mrbkap next week.
Comment 12 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-22 07:23:24 PST
(In reply to Mike Conley (:mconley) from comment #11)
> ni?ing myself to get some profiles to gabor / mrbkap next week.

tps e10s-multi
https://cleopatra.io/#report=a9685a46f48e7aedd4a06cbfd7d7e39fa61f12c4&selection=0,8

tps e10s-single
https://cleopatra.io/#report=4c2b79c10b4d6240c6890fc4e3e18cc49863d19d&selection=0,8

It seems to me that in the multi process case we run the first batch of test in one process and the second in another, plus I'm not sure we wait for the process to fully initialize (loading all the process scripts  frame scripts and what not). I'm not sure why we can't see what is in the XRE_InitChildProcess but probably we don't start the measurements early enough, which is OK as long as we start the test measurement after the process warmed up.

Anyway, what I have assumed from this test that it loads some of the tabs in one process and some in the other does not seem to happen and the regression seem to be an indicator that this test should be adjusted for e10s-multi.
Comment 13 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-22 07:49:32 PST
tabpaint e10s-single
https://cleopatra.io/#report=8f75b775d7210af02374f8e06df648f7936d7bac&selection=0,8

tabpaint e10s-multi
https://cleopatra.io/#report=ab2845f370e9282e28ee2d4a2fe6258465c50f1f

This test isn't exactly fair either. The single content process version seem to run the test in the original content process that is already warmed up mostly, while the e10s-multi version does not (so we basically cheating for e10s-single). On the plus side the e10s-multi case gives a nice view about the time it takes about some scripts to run at startup (I don't see the process scripts though).
Comment 14 User image Gabor Krizsanits [:krizsa :gabor] 2016-11-23 05:53:33 PST
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #12)
> It seems to me that in the multi process case we run the first batch of test
> in one process and the second in another, plus I'm not sure we wait for the
> process to fully initialize

http://searchfox.org/mozilla-central/source/testing/talos/talos/tests/tabswitch/bootstrap.js#204
This should guarantee that the process that owns the tab we're switching to is initialized, but does not seem to guarantee that all the other processes finished their initialization, so that can be a source of regression but I'm not sure. We should probably do this when we open the batch of tabs with an array of promises.

I still don't get what's up with the profiler, the multi-process case is weird, I cannot interpret it. Here is one with processCount set to 5:
https://cleopatra.io/#report=2f611a922c19ca2e72566711a2e2a9040d2c6649
Comment 15 User image Mike Conley (:mconley) 2016-12-09 20:29:59 PST
Here's some interesting data.

With bug 1322758 fixed, I was able to get some profiles from try on OS X. I thought this was kind of interesting - here's a single tab switch to an Amazon page about a Kindle (part of the tp5 set), in both e10s-multi and in e10s-single:

e10s-multi: https://cleopatra.io/#report=af72e1a87b2296fcfe11862c922b770868902d58&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A134258,%22end%22%3A162499%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A153384,%22end%22%3A162499%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A160241,%22end%22%3A162499%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A160412,%22end%22%3A160540%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A160421,%22end%22%3A160527%7D%5D&selection=0,5509,5510,5511,5512,5101,5513,8,9,10,11,12,13,14,15,16,17,20,21,22,23,24,25,26,139,140,141,3731,5562,5652,5653,5654,5655,32,33,34,35,36,37,39,54,55,56,90,91,92,93,94,5820,5821,96,97,98,99,1757

e10s-single: https://cleopatra.io/#report=9e0b12ae6cc1cccca154f0622c1f5eab7ddfdb01&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A203410,%22end%22%3A214004%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A211142,%22end%22%3A214004%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A211222,%22end%22%3A211386%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A211266,%22end%22%3A211322%7D%5D&selection=0,5117,5118,5119,5120,4712,5121,173,174,175,176,177,178,179,180,181,182,183,184,2,185,186,187,188,379,380,381,3606,5144,5147,5148,5149,5150,269,270,271,272,273,274,276,284,285,296,297,298,299,300,301,5151,5152,303,304,305,306,3451

There is a huge difference in the amount of time we're taking to paint here.

79 samples are being spent in nsPresShell::Paint in e10s-multi. Only 34 samples are being spent in nsPresShell::Paint in e10s-single. That's a pretty substantial increase.

Digging in there, it looks like the biggest difference has to do with nsDisplayText::Paint. There are 46 samples in that frame in e10s-multi, and only 5 in e10s-single.

We might want to get someone from the Graphics team to weigh in on this. Hey jrmuizel - anything illuminating about this?
Comment 16 User image Gabor Krizsanits [:krizsa :gabor] 2016-12-22 14:48:51 PST
Filtering out the glyph caching difference between single and multi cp cases from the tps test seems to be doing the trick: https://treeherder.mozilla.org/#/jobs?repo=try&revision=09f2e414b80dc9184563a3d4527ea96663fde46d
Comment 17 User image Gabor Krizsanits [:krizsa :gabor] 2016-12-30 04:22:05 PST
Created attachment 8822647 [details] [diff] [review]
part1: Use single content process in tabpaint

We don't measure the content process startup time for the single content process case in this test, so we should not measure it for the multi case either. The easiest way to do that is just forcing single cp.
Comment 18 User image Gabor Krizsanits [:krizsa :gabor] 2016-12-30 04:24:57 PST
Created attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

Thanks again Mike for tracking this down :)
Comment 20 User image Mike Conley (:mconley) 2016-12-30 20:15:34 PST
Comment on attachment 8822647 [details] [diff] [review]
part1: Use single content process in tabpaint

Review of attachment 8822647 [details] [diff] [review]:
-----------------------------------------------------------------

Reasoning is sound and the patch looks fine to me. You'll need to bump the revision of the tabpaint test's install.rdf[1], and also package / re-sign[2] the add-on (I know, it kinda blows).

[1]: http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/testing/talos/talos/tests/tabpaint/install.rdf#7
[2]: https://wiki.mozilla.org/EngineeringProductivity/HowTo/SignExtensions#Signing_an_Addon
Comment 21 User image Mike Conley (:mconley) 2016-12-30 20:17:08 PST
Comment on attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

Review of attachment 8822648 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good here too (modulo typo). Same need as above, re: bumping the version[1] and signing.

[1]: http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/testing/talos/talos/tests/tabswitch/install.rdf#7

::: testing/talos/talos/tests/tabswitch/bootstrap.js
@@ +405,5 @@
> +      // Let's do an initial run to warm up any paint related caches
> +      // (like glyph caches for text). In the next loop we will start with
> +      // a GC before each switch so we don't need here.
> +      // Note: in case of multiple content processes, closing all the tabs
> +      // would close the related conent processes, and even if we kept them

typo: "conent" -> "content".
Comment 22 User image Pulsebot 2017-01-03 05:03:00 PST
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d0d55af22bcb
Tps should filter out glyph cache overhead. r=mconley
https://hg.mozilla.org/integration/mozilla-inbound/rev/851c282874b1
Tabpaint test should not measure content process startup. r=mconley
Comment 23 User image Gabor Krizsanits [:krizsa :gabor] 2017-01-03 05:05:26 PST
(In reply to Mike Conley (:mconley) - PTO until Jan 3 from comment #20)
> Reasoning is sound and the patch looks fine to me. You'll need to bump the
> revision of the tabpaint test's install.rdf[1], and also package /
> re-sign[2] the add-on (I know, it kinda blows).

I had no idea... and thanks for the links it saved me a lot of time :)
Comment 25 User image Joel Maher ( :jmaher) 2017-01-04 06:11:19 PST
we have improvements in the tps test after this landed:
== Change summary for alert #4695 (as of January 03 2017 13:02 UTC) ==

Improvements:

 38%  tps summary windows8-64 opt e10s     40.39 -> 24.85
 35%  tps summary windows7-32 opt e10s     42.37 -> 27.56
 19%  tps summary osx-10-10 opt            70.13 -> 56.96
 11%  tps summary linux64 opt              59.15 -> 52.9
 10%  tps summary windows7-32 opt          73.54 -> 66
 10%  tps summary windows8-64 opt          68.5 -> 61.48

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=4695


I assume these are expected and probably a test specific change.
Comment 26 User image Mike Conley (:mconley) 2017-01-19 08:21:25 PST
Comment on attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

Approval Request Comment
[Feature/Bug causing the regression]:

Bug 1323319

[User impact if declined]:

None - but we'll show a talos regression on the tps test.

[Is this code covered by automated tests?]:

Yes.

[Has the fix been verified in Nightly?]:

Yes.

[Needs manual test from QE? If yes, steps to reproduce]: 

No.

[List of other uplifts needed for the feature/fix]:

None.

[Is the change risky?]:

No.

[Why is the change risky/not risky?]:

It's updating a talos test to more correctly measure what it's supposed to measure, and avoiding alarms going off if and when bug 1323319 uplifts.

[String changes made/needed]:

None.
Comment 27 User image Julien Cristau [:jcristau] 2017-01-20 08:08:50 PST
Comment on attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

talos test change, aurora52+, needed for bug 1323319
Comment 29 User image Joel Maher ( :jmaher) 2017-01-27 05:07:44 PST
to confirm, we are fine with the damp regressions?

I see these changes after we landed this change:
== Change summary for alert #4876 (as of January 23 2017 13:16 UTC) ==

Regressions:

 15%  damp summary osx-10-10 opt       312.58 -> 357.95
 13%  damp summary windows7-32 opt     310.46 -> 349.73
 10%  damp summary linux64 pgo         252.03 -> 277.76
  6%  damp summary windows8-64 pgo     248.59 -> 263.63

Improvements:

 16%  tps summary osx-10-10 opt     30.11 -> 25.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=4876

Note You need to log in before you can comment on or make changes to this bug.