The default bug view has changed. See this FAQ.

[e10s-multi] Talos regressions with multiple content processes

RESOLVED FIXED in Firefox 52

Status

()

Core
DOM: Content Processes
RESOLVED FIXED
4 months ago
19 days ago

People

(Reporter: krizsa, Assigned: krizsa, NeedInfo)

Tracking

unspecified
mozilla53
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox52 fixed, firefox53 fixed)

Details

(Whiteboard: [e10s-multi:?])

Attachments

(2 attachments)

(Assignee)

Description

4 months ago
From Bug 1303113 Comment 11:

(In reply to Joel Maher ( :jmaher) from comment #11)
> as a note this had a large impact on talos:
> == Change summary for alert #4055 (as of November 09 2016 13:46 UTC) ==
> 
> Regressions:
> 
> 270%  tabpaint summary osx-10-10 opt e10s       58.11 -> 214.79
> 263%  tabpaint summary windowsxp pgo e10s       50.45 -> 183.01
> 223%  tabpaint summary windows7-32 pgo e10s     49.69 -> 160.26
> 217%  tabpaint summary linux64 pgo e10s         59.5 -> 188.54
> 214%  tabpaint summary linux64 opt e10s         68.86 -> 216.27
> 205%  tabpaint summary windows8-64 opt e10s     59.03 -> 180.29
> 196%  tabpaint summary windowsxp opt e10s       73.17 -> 216.68
> 177%  tabpaint summary windows7-32 opt e10s     71.44 -> 197.61
>  67%  tps summary windows7-32 pgo e10s          33.25 -> 55.64
>  58%  tps summary windows7-32 opt e10s          40.02 -> 63.24
>  42%  tps summary osx-10-10 opt e10s            39.78 -> 56.29
>  27%  tps summary windowsxp opt e10s            39.5 -> 50.02
>  25%  tps summary windowsxp pgo e10s            33.74 -> 42.19
>  16%  damp summary windows7-32 pgo e10s         222.94 -> 259.04
>  14%  damp summary windows8-64 opt e10s         267.11 -> 303.89
>  13%  damp summary osx-10-10 opt e10s           302.7 -> 342.71
>  11%  damp summary linux64 opt e10s             293.99 -> 326.18
>  10%  damp summary linux64 pgo e10s             245.15 -> 270.28
>   9%  tps summary linux64 opt e10s              41.7 -> 45.47
>   8%  damp summary windows7-32 opt e10s         300.24 -> 325.07
>   8%  tps summary linux64 pgo e10s              36.23 -> 39.22
> 
> Improvements:
> 
>   3%  tps summary windows8-64 opt e10s     37.1 -> 35.86
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=4055
> 
> 
> please address these talos issues prior to landing this again.
(Assignee)

Comment 1

4 months ago
Content process startup optimization will fix some of this regression, we should decide what is the acceptable regression for release (process startup and script execution won't happen without a cost).

We also have to decide if we want to force single content process for these tests for now or we want to block turning on 2cp's on nightly.
Depends on: 1317304
Flags: needinfo?(mrbkap)
Whiteboard: [e10s-multi:?]

Comment 2

4 months ago
let me know if I can help answer questions or debug stuff.
(Assignee)

Updated

4 months ago
Flags: needinfo?(jdemooij)
(Assignee)

Comment 3

4 months ago
Jan, I will try to delay some of these scripts, but realistically we will need something like bug 876173 for this as well. Are you working on something related these days?
(Assignee)

Comment 4

4 months ago
Eh sorry for the spam this was the wrong bug :(
Flags: needinfo?(jdemooij)
(Assignee)

Comment 5

4 months ago
(In reply to Joel Maher ( :jmaher) from comment #2)
> let me know if I can help answer questions or debug stuff.

Thanks! So tabpaint is hardly surprising but I'm not sure about the damp regression. Opening new tab or window is expected to start slower, is this something that would explain the damp regression or do you think there is something else?
(Assignee)

Comment 6

4 months ago
Mike, is it expected that tps is regressed by multiple content processes? I would expect it not to open tabs during the run but the extra process startup might cause some delays at the start of the test... which should not matter that much either... do you have any ideas why this test is so sensitive about an extra content process?
Flags: needinfo?(mconley)
(Assignee)

Comment 7

4 months ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #1)
> We also have to decide if we want to force single content process for these
> tests for now or we want to block turning on 2cp's on nightly.

:jimm thinks if we decide to turn it on on nightly with this regression, we should not hide it by forcing single content process.

Comment 8

4 months ago
regarding damp, looking at the subtests:
linux64: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=mozilla-inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignature=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c99daa61ac983f21a6ac&framework=1

osx: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=mozilla-inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&framework=1

you can see similar patterns where we have some improvements and many regressions.  specifically in the open/reload measurements for the different tools, the close measurements have little to no regressions.  The wins that we see are in the reload typically, so that leaves the smoking gun pointing at the open measurements for damp- let me include bgrins on here as he might have some valuable insight into the damp test and specifically the open measurements.
(In reply to Joel Maher ( :jmaher) from comment #8)
> regarding damp, looking at the subtests:
> linux64:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=mozilla-
> inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=
> mozilla-
> inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignatur
> e=edaec66500db21d37602c99daa61ac983f21a6ac&newSignature=edaec66500db21d37602c
> 99daa61ac983f21a6ac&framework=1
> 
> osx:
> https://treeherder.mozilla.org/perf.html#/
> comparesubtest?originalProject=mozilla-
> inbound&originalRevision=fb8d5979d0f1e4f17416e5d027a46e7a2068e1c5&newProject=
> mozilla-
> inbound&newRevision=085035586d2b4d47ed03dee2ea727df6f2de6679&originalSignatur
> e=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951
> c8db24518c7e0ceddce&framework=1
> 
> you can see similar patterns where we have some improvements and many
> regressions.  specifically in the open/reload measurements for the different
> tools, the close measurements have little to no regressions.  The wins that
> we see are in the reload typically, so that leaves the smoking gun pointing
> at the open measurements for damp- let me include bgrins on here as he might
> have some valuable insight into the damp test and specifically the open
> measurements.

FYI there's a possibility for some noise in between DAMP subtests right now, so if RDP requests are still ongoing after opening the toolbox then it might bleed into the reload time.  I'm hoping that Bug 1291815 will prevent that from happening.

Generally, open time is how long it takes from the start of opening the toolbox until the toolbox and specific tool are fully loaded.  Most of the work is in the parent process (loading a bunch of modules and building UI), but it's also waiting for the RDP (which is running in the content process) to return whatever is needed for the tool.  In some tools there can be quite a bit of traffic on a complicated page (like the inspector needs a lot of info about the DOM even at startup).
See Also: → bug 1291815
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #6)
> Mike, is it expected that tps is regressed by multiple content processes? I
> would expect it not to open tabs during the run but the extra process
> startup might cause some delays at the start of the test... which should not
> matter that much either... do you have any ideas why this test is so
> sensitive about an extra content process?

Nothing comes to mind, I'm afraid. By the time that we start measuring tab switch times, the tp5 pages should have finished loading (and presumably, the content processes are both fully started-up).

I think we'll need to profile it to get a sense of what's going wrong here.
Flags: needinfo?(mconley)
ni?ing myself to get some profiles to gabor / mrbkap next week.
Flags: needinfo?(mconley)
(Assignee)

Comment 12

4 months ago
(In reply to Mike Conley (:mconley) from comment #11)
> ni?ing myself to get some profiles to gabor / mrbkap next week.

tps e10s-multi
https://cleopatra.io/#report=a9685a46f48e7aedd4a06cbfd7d7e39fa61f12c4&selection=0,8

tps e10s-single
https://cleopatra.io/#report=4c2b79c10b4d6240c6890fc4e3e18cc49863d19d&selection=0,8

It seems to me that in the multi process case we run the first batch of test in one process and the second in another, plus I'm not sure we wait for the process to fully initialize (loading all the process scripts  frame scripts and what not). I'm not sure why we can't see what is in the XRE_InitChildProcess but probably we don't start the measurements early enough, which is OK as long as we start the test measurement after the process warmed up.

Anyway, what I have assumed from this test that it loads some of the tabs in one process and some in the other does not seem to happen and the regression seem to be an indicator that this test should be adjusted for e10s-multi.
(Assignee)

Comment 13

4 months ago
tabpaint e10s-single
https://cleopatra.io/#report=8f75b775d7210af02374f8e06df648f7936d7bac&selection=0,8

tabpaint e10s-multi
https://cleopatra.io/#report=ab2845f370e9282e28ee2d4a2fe6258465c50f1f

This test isn't exactly fair either. The single content process version seem to run the test in the original content process that is already warmed up mostly, while the e10s-multi version does not (so we basically cheating for e10s-single). On the plus side the e10s-multi case gives a nice view about the time it takes about some scripts to run at startup (I don't see the process scripts though).
(Assignee)

Comment 14

4 months ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #12)
> It seems to me that in the multi process case we run the first batch of test
> in one process and the second in another, plus I'm not sure we wait for the
> process to fully initialize

http://searchfox.org/mozilla-central/source/testing/talos/talos/tests/tabswitch/bootstrap.js#204
This should guarantee that the process that owns the tab we're switching to is initialized, but does not seem to guarantee that all the other processes finished their initialization, so that can be a source of regression but I'm not sure. We should probably do this when we open the batch of tabs with an array of promises.

I still don't get what's up with the profiler, the multi-process case is weird, I cannot interpret it. Here is one with processCount set to 5:
https://cleopatra.io/#report=2f611a922c19ca2e72566711a2e2a9040d2c6649
Here's some interesting data.

With bug 1322758 fixed, I was able to get some profiles from try on OS X. I thought this was kind of interesting - here's a single tab switch to an Amazon page about a Kindle (part of the tp5 set), in both e10s-multi and in e10s-single:

e10s-multi: https://cleopatra.io/#report=af72e1a87b2296fcfe11862c922b770868902d58&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A134258,%22end%22%3A162499%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A153384,%22end%22%3A162499%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A160241,%22end%22%3A162499%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A160412,%22end%22%3A160540%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A160421,%22end%22%3A160527%7D%5D&selection=0,5509,5510,5511,5512,5101,5513,8,9,10,11,12,13,14,15,16,17,20,21,22,23,24,25,26,139,140,141,3731,5562,5652,5653,5654,5655,32,33,34,35,36,37,39,54,55,56,90,91,92,93,94,5820,5821,96,97,98,99,1757

e10s-single: https://cleopatra.io/#report=9e0b12ae6cc1cccca154f0622c1f5eab7ddfdb01&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A203410,%22end%22%3A214004%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A211142,%22end%22%3A214004%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A211222,%22end%22%3A211386%7D,%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A211266,%22end%22%3A211322%7D%5D&selection=0,5117,5118,5119,5120,4712,5121,173,174,175,176,177,178,179,180,181,182,183,184,2,185,186,187,188,379,380,381,3606,5144,5147,5148,5149,5150,269,270,271,272,273,274,276,284,285,296,297,298,299,300,301,5151,5152,303,304,305,306,3451

There is a huge difference in the amount of time we're taking to paint here.

79 samples are being spent in nsPresShell::Paint in e10s-multi. Only 34 samples are being spent in nsPresShell::Paint in e10s-single. That's a pretty substantial increase.

Digging in there, it looks like the biggest difference has to do with nsDisplayText::Paint. There are 46 samples in that frame in e10s-multi, and only 5 in e10s-single.

We might want to get someone from the Graphics team to weigh in on this. Hey jrmuizel - anything illuminating about this?
Flags: needinfo?(mconley) → needinfo?(jmuizelaar)
Depends on: 1324033
(Assignee)

Updated

3 months ago
Blocks: 1303113
(Assignee)

Comment 16

3 months ago
Filtering out the glyph caching difference between single and multi cp cases from the tps test seems to be doing the trick: https://treeherder.mozilla.org/#/jobs?repo=try&revision=09f2e414b80dc9184563a3d4527ea96663fde46d
(Assignee)

Comment 17

3 months ago
Created attachment 8822647 [details] [diff] [review]
part1: Use single content process in tabpaint

We don't measure the content process startup time for the single content process case in this test, so we should not measure it for the multi case either. The easiest way to do that is just forcing single cp.
Attachment #8822647 - Flags: review?(mconley)
(Assignee)

Comment 18

3 months ago
Created attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

Thanks again Mike for tracking this down :)
Attachment #8822648 - Flags: review?(mconley)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b5356ff3472
Comment on attachment 8822647 [details] [diff] [review]
part1: Use single content process in tabpaint

Review of attachment 8822647 [details] [diff] [review]:
-----------------------------------------------------------------

Reasoning is sound and the patch looks fine to me. You'll need to bump the revision of the tabpaint test's install.rdf[1], and also package / re-sign[2] the add-on (I know, it kinda blows).

[1]: http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/testing/talos/talos/tests/tabpaint/install.rdf#7
[2]: https://wiki.mozilla.org/EngineeringProductivity/HowTo/SignExtensions#Signing_an_Addon
Attachment #8822647 - Flags: review?(mconley) → review+
Comment on attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

Review of attachment 8822648 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good here too (modulo typo). Same need as above, re: bumping the version[1] and signing.

[1]: http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/testing/talos/talos/tests/tabswitch/install.rdf#7

::: testing/talos/talos/tests/tabswitch/bootstrap.js
@@ +405,5 @@
> +      // Let's do an initial run to warm up any paint related caches
> +      // (like glyph caches for text). In the next loop we will start with
> +      // a GC before each switch so we don't need here.
> +      // Note: in case of multiple content processes, closing all the tabs
> +      // would close the related conent processes, and even if we kept them

typo: "conent" -> "content".
Attachment #8822648 - Flags: review?(mconley) → review+
Blocks: 1323319

Comment 22

3 months ago
Pushed by gkrizsanits@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d0d55af22bcb
Tps should filter out glyph cache overhead. r=mconley
https://hg.mozilla.org/integration/mozilla-inbound/rev/851c282874b1
Tabpaint test should not measure content process startup. r=mconley
(Assignee)

Comment 23

3 months ago
(In reply to Mike Conley (:mconley) - PTO until Jan 3 from comment #20)
> Reasoning is sound and the patch looks fine to me. You'll need to bump the
> revision of the tabpaint test's install.rdf[1], and also package /
> re-sign[2] the add-on (I know, it kinda blows).

I had no idea... and thanks for the links it saved me a lot of time :)

Comment 24

3 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d0d55af22bcb
https://hg.mozilla.org/mozilla-central/rev/851c282874b1
Status: NEW → RESOLVED
Last Resolved: 3 months ago
status-firefox53: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
we have improvements in the tps test after this landed:
== Change summary for alert #4695 (as of January 03 2017 13:02 UTC) ==

Improvements:

 38%  tps summary windows8-64 opt e10s     40.39 -> 24.85
 35%  tps summary windows7-32 opt e10s     42.37 -> 27.56
 19%  tps summary osx-10-10 opt            70.13 -> 56.96
 11%  tps summary linux64 opt              59.15 -> 52.9
 10%  tps summary windows7-32 opt          73.54 -> 66
 10%  tps summary windows8-64 opt          68.5 -> 61.48

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=4695


I assume these are expected and probably a test specific change.
Comment on attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

Approval Request Comment
[Feature/Bug causing the regression]:

Bug 1323319

[User impact if declined]:

None - but we'll show a talos regression on the tps test.

[Is this code covered by automated tests?]:

Yes.

[Has the fix been verified in Nightly?]:

Yes.

[Needs manual test from QE? If yes, steps to reproduce]: 

No.

[List of other uplifts needed for the feature/fix]:

None.

[Is the change risky?]:

No.

[Why is the change risky/not risky?]:

It's updating a talos test to more correctly measure what it's supposed to measure, and avoiding alarms going off if and when bug 1323319 uplifts.

[String changes made/needed]:

None.
Attachment #8822648 - Flags: approval-mozilla-aurora?
Comment on attachment 8822648 [details] [diff] [review]
part2: Tps should filter out glyph cache init

talos test change, aurora52+, needed for bug 1323319
Attachment #8822648 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee: nobody → gkrizsanits

Comment 28

2 months ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/773f4844d11b
https://hg.mozilla.org/releases/mozilla-aurora/rev/d37e0d774b83
status-firefox52: --- → fixed
Flags: in-testsuite+
to confirm, we are fine with the damp regressions?

I see these changes after we landed this change:
== Change summary for alert #4876 (as of January 23 2017 13:16 UTC) ==

Regressions:

 15%  damp summary osx-10-10 opt       312.58 -> 357.95
 13%  damp summary windows7-32 opt     310.46 -> 349.73
 10%  damp summary linux64 pgo         252.03 -> 277.76
  6%  damp summary windows8-64 pgo     248.59 -> 263.63

Improvements:

 16%  tps summary osx-10-10 opt     30.11 -> 25.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=4876

Updated

19 days ago
Flags: needinfo?(mrbkap)
You need to log in before you can comment on or make changes to this bug.