Closed Bug 1179732 Opened 9 years ago Closed 8 years ago

11-147% linux*/winxp tresize regression when comparing e10s to non e10s on the same build

Categories

(Testing :: Talos, defect, P2)

defect

Tracking

(e10s+)

RESOLVED FIXED
Tracking Status
e10s + ---

People

(Reporter: jmaher, Assigned: gw280)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

comparing:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=aad95360a002&exclusion_profile=false&filter-searchStr=talos

we have:
linux:
:( tresize               17.8 +/-  2% (7#)  [ +11.7%]      19.9 +/-  1% (7#)
linux64:
:( tresize               15.7 +/-  1% (7#)  [ +17.8%]      18.4 +/-  3% (7#)
win7:
:) tresize               15.7 +/-  1% (7#)  [ -14.3%]      13.4 +/-  1% (7#)
winxp:
:( tresize               11.8 +/-  2% (7#)  [+147.0%]      29.2 +/-  3% (7#)
win8:
:) tresize               12.4 +/-  1% (7#)  [ -11.3%]      11.0 +/-  1% (7#)


notice win7/win8 have improvements.  osx10.10 doesn't have any difference.

in general the linux non-e10s vs e10s follow a similar pattern:
http://graphs.mozilla.org/graph.html#tests=[[254,1,45],[254,1,41],[254,1,35],[254,1,43],[254,1,33],[254,1,37]]&sel=1428061558484,1435837558484&displayrange=90&datatype=geo

winxp is just wonky.  fyi, winxp is wonky for tscrollx and tp5o_scroll as well.
Flags: needinfo?(bgirard)
Some bad links in there. Let me try that again.
non-e10s:
* http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/5a82eda28a55fc4e1f6780a90d5569be714655d3c9ee2eb6dc9dcc88404eced842464ef36ba01772264dcc809ede04f5648747cac060f276f1ee033eaa4d3c79&pathInZip=profile_tresize/tresize/cycle_0.sps#report=6a79a4841916fbac7e10910c8f39574f018fed2c
* http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/5a82eda28a55fc4e1f6780a90d5569be714655d3c9ee2eb6dc9dcc88404eced842464ef36ba01772264dcc809ede04f5648747cac060f276f1ee033eaa4d3c79&pathInZip=profile_tresize/tresize/cycle_1.sps#report=d5145bee3af8afcc9e60fc81d8c15545ef3ef0de
* http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/5a82eda28a55fc4e1f6780a90d5569be714655d3c9ee2eb6dc9dcc88404eced842464ef36ba01772264dcc809ede04f5648747cac060f276f1ee033eaa4d3c79&pathInZip=profile_tresize/tresize/cycle_2.sps#report=45d247924d8a2ef7c1b4b961fbaf0230276356b1


e10s:
* http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/950f0b481ae4dd67389b8503d4f0194d91779636e3a7aed215b45708b8a3366723b39605cfd38bc61debf86269cad8e59a3206ae41f8d861411dbd6a822175ef&pathInZip=profile_tresize/tresize/cycle_0.sps#report=56d6b1001887bdd67b9ec77743498cbbb532e923
* http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/950f0b481ae4dd67389b8503d4f0194d91779636e3a7aed215b45708b8a3366723b39605cfd38bc61debf86269cad8e59a3206ae41f8d861411dbd6a822175ef&pathInZip=profile_tresize/tresize/cycle_1.sps#report=11b2e4d053e518828788c6a96672dd6a885094fb
* http://people.mozilla.org/~bgirard/cleopatra/?zippedProfile=http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try-Non-PGO/sha512/950f0b481ae4dd67389b8503d4f0194d91779636e3a7aed215b45708b8a3366723b39605cfd38bc61debf86269cad8e59a3206ae41f8d861411dbd6a822175ef&pathInZip=profile_tresize/tresize/cycle_2.sps#report=fd0f120b5c19ae1393baaddd996aa39bbaafa3ae
The thing that's sticking out at me right now is that we seem to call nsPresContext::FireDOMPaintEvent farrrrrrr more often in the content process (which gets dispatched synchronously to the parent via the add-on shims), than we do with non-e10s.

Looking at at the first profiles for non-e10s and e10s that I listed...

Calls to nsPresContext::FireDOMPaintEvent:

non-e10s: 30 samples
e10s (content process): 1418 samples
My gut feeling would be to look into what's the deal with that PushGroup. It's very costly on the e10s profile. If we improve that then the RPC message would get through a lot faster.

Killing that RpcMessage as well would be wonderful.

FireDOMPaintEvent showing up as well seems suspicious.

We likely need an answer for all three of these if we want this regression to go away.
Flags: needinfo?(bgirard)
Assignee: nobody → gwright
Priority: -- → P1
For the record, that's a profile from my Linux box running tresize. We're trying to debug why there's such a huge startup time for the content process.
So, what appears to be happening here is that tresize uses an addon to run the test, and that addon is using shims which is really killing our startup performance. We seem to spend around 4-5 seconds initialising before we actually start to run the test, which on my box at least accounts for the discrepancy between e10s and non-e10s. 

I haven't been able to conclusively prove this as the first 5s of any profile I take is a black hole with no samples recorded. Mike and I have looked into trying to get the profiler to sample but haven't had any luck.

Focusing my efforts now on killing the use of shims in this addon and hoping the issue goes away.
good find- fixing the addon is fair game!
(In reply to Joel Maher (:jmaher) from comment #13)
> good find- fixing the addon is fair game!

Good find, but not sure if fixing the addon is the first thing I'd do.

The addon doesn't use any special features IIRC, and it communicates cross process using messages, so not sure what the shims would be used for TBH. Maybe for the resize commands themselves? IIRC making it work with e10s was done with the advice of the e10s guys (don't recall the bug number, Joel?), so it shouldn't be doing something really bad.

Even if it does end up using shims, this might happen with other addons in the wild too, so it's probably worth understanding where does the slow-down come from, at the very least (which according to comment 12 is being done already, which is great).
This is a good first step imho. It narrows the gap but doesn't quite get us there. Investigating further.
Attachment #8712404 - Flags: review?(jmaher)
try shows that e10s vs non e10s we have:
* win7 is equal
* win8 is slightly faster
* linux64 is much slower
Comment on attachment 8712404 [details] [diff] [review]
0001-Bug-1179732-Make-the-tresize-addon-e10s-compatible.patch

Review of attachment 8712404 [details] [diff] [review]:
-----------------------------------------------------------------

these changes look good.  Try doesn't have any issues with it :)
Attachment #8712404 - Flags: review?(jmaher) → review+
Keywords: leave-open
https://treeherder.mozilla.org/perf.html#/graphs?timerange=86400&series=[mozilla-inbound,ff4ce004871790022f413b1abb21632e7c788f65,1]&series=[mozilla-inbound,2c2a8f5bf693abaac9441bda475152c167bf24e0,1]

The initial results for WinXP non-PGO are looking good. e10s tresize timings are now basically the same as non-e10s. PGO is still showing the same though, but it's small.
now that we have more data, this is looking much more aligned!  tresize is bi-modal on opt, but for opt e10s, we seem to match (or beat) the upper mode of the results- and in e10s land we have a single mode.
Jeff, given that Windows and OS X are now no longer showing a regression or a very slight one (~5%), and it's only Linux that's showing a significant regression (~25%), how much of a priority is this now?
Flags: needinfo?(jgriffiths)
(In reply to George Wright (:gw280) (:gwright) from comment #24)
> Jeff, given that Windows and OS X are now no longer showing a regression or
> a very slight one (~5%), and it's only Linux that's showing a significant
> regression (~25%), how much of a priority is this now?

Lower priority, for sure. What are my options? This should not block and also I would not call it a P1, but I think we need to understand why Linux is regressing.
Flags: needinfo?(jgriffiths)
Let's kick this back into perf triage then.
Priority: P1 → --
Priority: -- → P2
Joel, according to your measurements at bug 1144120 comment 14, tresize on e10s linux is now on par (within 1%) compared to non-e10s.

Can we use this data to close this bug?
Flags: needinfo?(jmaher)
Though it's still worse by 7-12% on windows 7/XP.
linux is fixed, winxp is the only one that is measurably slower, win7/osx10.10 are similar and win8 has an improvement.  Close this out if we are not going to work on winxp.
Flags: needinfo?(jmaher)
Based on https://treeherder.mozilla.org/perf.html#/e10s?filter=tresize it looks like we can close this bug. The release criteria state we can regress this test by up to 5%, so to me it looks like we have hit that.

Jeff, what do you think?
Flags: needinfo?(jgriffiths)
(In reply to George Wright (:gw280) (:gwright) from comment #30)
> Based on https://treeherder.mozilla.org/perf.html#/e10s?filter=tresize it
> looks like we can close this bug. The release criteria state we can regress
> this test by up to 5%, so to me it looks like we have hit that.
> 
> Jeff, what do you think?

I would like to see a followup bug for the winxp regression in particular, but besides that I agree.
Flags: needinfo?(jgriffiths)
Blocks: 1252947
Filed bug 1252947
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: