1369662 - 17.99% tabpaint (osx-10-10) regression on push e398680c483332f16c829adbddb29528c566c180 (Thu Jun 1 2017)

Reporter

Description

•

8 years ago

Talos has detected a Firefox performance regression from push: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=e398680c483332f16c829adbddb29528c566c180 As author of one of the patches included in that push, we need your help to address this regression. Regressions: 18% tabpaint summary osx-10-10 opt e10s 53.74 -> 63.41 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=6999 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running *** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

Ionuț Goldan [:igoldan]

Reporter

Updated

•

8 years ago

Blocks: 1346783

Product: Firefox → Core

Ionuț Goldan [:igoldan]

Reporter

Comment 1

•

8 years ago

This regression appeared after a bigger patch, with mozilla-inbound changesets from 675c893a028d and up to e398680c4833, here: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7abafd0fead900aec67b0de342f71ec856f561cf&tochange=e398680c483332f16c829adbddb29528c566c180 The ones related to this issue seem to be Bug 558184, Bug 1363323 and Bug 1369140. Could you please state which of the bugs affected this?

Flags: needinfo?(peter.vanderwoude)

Flags: needinfo?(ehsan)

Joel Maher ( :jmaher ) (UTC -8)

Comment 2

•

8 years ago

pushed to try: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=3f053f6a785f9ac9f768e44cd37fd56600fb695d&tochange=44a141d40ff31168a2091f108fdedda0d3c6953f of course I didn't annotate the commits well, so looking at: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&group_state=expanded&filter-searchStr=osx%20talos%20other&tochange=e398680c483332f16c829adbddb29528c566c180&fromchange=0f04ee5c2e1c916ae454c911bcead18a04927460 the pushes are in this order: hg update 980375a7b029: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f053f6a785f9ac9f768e44cd37fd56600fb695d hg update 9255719d469c: https://treeherder.mozilla.org/#/jobs?repo=try&revision=acce7fc1bad446bb03aa409e715173e32a496f6c hg update 87c1327b918d: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc0b5284d8fd94d95cce1d491b36849c7d99dea6 hg update e398680c4833: https://treeherder.mozilla.org/#/jobs?repo=try&revision=44a141d40ff31168a2091f108fdedda0d3c6953f

(no longer active)

Comment 3

•

8 years ago

mconley and I talked about this on IRC. Bug 1369140 _may_ be the culprit because it removed a layout flush which used to occur during tab opening, however I'd have expected that to make things faster not slower... However it's possible (and maybe even likely) that it has moved the timing of something in a way that is making this test complain. Mike offered help looking into this.

Flags: needinfo?(ehsan) → needinfo?(mconley)

Joel Maher ( :jmaher ) (UTC -8)

Comment 4

•

8 years ago

ok, new pushes: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=d7d8751e8a0666aef72ea4016fcafeacebe98936&tochange=0f5aca96f7830c4beb2c0852a7a38e7c34864900

Peter Van der Beken [:peterv]

Comment 5

•

8 years ago

Bug 558184 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2ec31aacaf0569cbe1c08fc787963188c8947590&newProject=try&newRevision=6095abfded6d84b0337b42a233b4128336c4d26e&framework=1&filter=tabpaint&showOnlyImportant=0 55.96 -> 55.51 -0.80% Bug 1363323 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6095abfded6d84b0337b42a233b4128336c4d26e&newProject=try&newRevision=673603c2c36e107062f484ee5a90b05594d2084d&framework=1&filter=tabpaint&showOnlyImportant=0 55.51 -> 54.51 -2.53%

Flags: needinfo?(peter.vanderwoude)

Joel Maher ( :jmaher ) (UTC -8)

Comment 6

•

8 years ago

and a 3rd attempt: https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=2c979be5a3b6acf76582e8c34cdfd340e9dc091f&tochange=212f61c16dfeaa783821abf2cca075ea6b257ed7 this time without --artifact

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 7

•

8 years ago

One thing to note is that the regression is almost 16ms, which is roughly 1 frame. Current theory: TL;DR: I think ehsan's patch made us run faster, which ironically makes us report worse results. ehsan's patch makes focus changes a lot cheaper because we get to avoid flushing layout. When the tabpaint test opens a new tab from the parent, the following things happen in this order: 1. The remote <xul:browser> is created 2. The browser is told to load the target URI 3. The tab for that browser is switched to, which involves some focus switching I believe - both before and after the TabParent is told to activate the child. 4. The target URI, after the load event fires, waits for the next MozAfterPaint event, which it uses to determine the delta between the tab was requested to open, and the time that the tab content was presented. When step (3) happens, a race starts: we send a message to the TabChild on the main thread asking it to activate itself (and paint), and we also send a message along the ProcessHangMonitor thread asking it to interrupt JavaScript if any is running and _force_ paint. The winner of that race will cause the MozAfterPaint event in (4) to run. Here's where I think ehsan's patch changed things: Because focus became a lot cheaper, I think the time between steps 2 through 3 occur faster now. Fast enough, in fact, that I suspect that the content process is busy enough loading the target page that the force paint message (which interrupts JS) is more likely to "win" the race. I suspect that when we've interrupted JS, this is _before_ the target page has had the opportunity to set up the MozAfterPaint listener. This means that we force paint, MozAfterPaint fires, but nobody hears it. Then, the page loads as per normal, and then on the next tick we paint _again_, and this time MozAfterPaint hears it. Whereas before ehsan's, I _think_ it's possible that the slow focus work gave the content process enought ime to set up the MozAfterPaint handler before the tab switch / force paint occurred. That's my theory, anyhow. I've got a try build cooking at [1] to try to prove it. Keeping needinfo so that I know to look at this try build when results are in. [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=946624498935b274aaab1940b4b04ad68dcb45ec

Joel Maher ( :jmaher ) (UTC -8)

Comment 8

•

8 years ago

87c1327b918d is where the regression started which corresponds to :ehsan's patch from bug 1369140. It sounds like what :mconley is up to will be the right thing.

(no longer active)

Comment 9

•

8 years ago

(In reply to Mike Conley (:mconley) from comment #7) > Current theory: Fascinating! > Keeping needinfo so that I know to look at this try build when results are > in. That didn't build!

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 10

•

8 years ago

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #9) > That didn't build! How embarrassing! Re-pushed.

Kohei Yoshino

Updated

•

8 years ago

Component: Untriaged → DOM

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 11

•

7 years ago

Quick update here - I've got a potential fix here, but I'm trying to cross all t's, dot all i's (and ensuring this isn't just blind luck) before posting it for review. I've got some Talos profiling builds cooking for confirmation.

Comment hidden (mozreview-request)

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

7 years ago

Attachment #8875384 - Flags: review?(jmaher)

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 13

•

7 years ago

Got some math wrong - new patch coming up.

Comment hidden (mozreview-request)

Joel Maher ( :jmaher ) (UTC -8)

Comment 16

•

7 years ago

mozreview-review

Comment on attachment 8875384 [details] Bug 1369662 - Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps. https://reviewboard.mozilla.org/r/146818/#review150936 don't forget to sign the extension and add it to the patch.

Attachment #8875384 - Flags: review?(jmaher) → review+

Comment hidden (mozreview-request)

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 18

•

7 years ago

Garrr. Even with this test fix, ehsan's patch is apparently causing a regression. I'm now less confident that there's not a real regression here in our time to paint and upload the tab. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=55a453c8c048&newProject=try&newRevision=5100eb0dc0aa&framework=1&showOnlyImportant=0 What I need - desperately - is to get a profile from one of these Talos machines. Unfortunately, this has proven extremely difficult. Hopefully the information in bug 1370907 will allow me to figure out how to do it. So stay tuned, more to do here...

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 19

•

7 years ago

Before anyone points out that the comparison link I posted shows that ehsan's patch causes a win, I'm comparing mozilla-central with a backout - so the _backout_ is what's showing the win. :/

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

7 years ago

Depends on: 1371054

Mark Banner (:standard8)

Updated

•

7 years ago

Blocks: 1371198

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 20

•

7 years ago

I was _finally_ able to tease a (unfortunately unsymbolicated) profile out of our Talos automation. With the patch backed out (no regression): https://perfht.ml/2rPSgnT With the patch in (regression): https://perfht.ml/2rPQliX

Mike Conley (:mconley) (:⚙️)

Assignee

Comment 21

•

7 years ago

Even though these profiles are unsymbolicated, I think I've found something useful. Here is a zoomed in view of the profiles: No regression (ehsan's patch backed out): https://perfht.ml/2r6AjEj Regression (mozilla-central): https://perfht.ml/2r6e55C This is the timeline view of these two profiles, and the thing I want to draw attention to is in the "No regression" profile. Notice that just before the 11.230 timestamp, there's a very short (0.005ms) rasterization. That's extremely short, but I think it does cause a MozAfterPaint event to fire. That MozAfterPaint event appears to have a bounding rect with 0 height and 0 width. This actually sounds similar to a problem that ehsan was facing in a separate bug - bug 1343728. We have this mysterious early paint occurring, and by all accounts appears to be empty. But it's enough to fire a MozAfterPaint event, which the test detects. This little extra paint is missing in the "regression" profile. So what I think I'm getting at is that this test has been likely flawed from the start, and has not properly measured meaningful paint of content. I've pushed patches to try that wait for the first MozAfterPaint event that has a rect with some dimensions to it, both with and without ehsan's patch: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b80e1f83576e&newProject=try&newRevision=3534252d4864&framework=1&showOnlyImportant=0 The comparison shows _no difference_ taking that into account. So my recommendation here is: 1) Update my patch in this bug to include ignoring MozAfterPaint events that have boundingClientRects with no height or width 2) File a follow-up bug to investigate and understand why that event was being fired, and why it seemed to only be an issue in this bug on OS X. 3) Leave bug 1369140 in the tree.

Flags: needinfo?(mconley)

Joel Maher ( :jmaher ) (UTC -8)

Comment 22

•

7 years ago

do you think for all tests/platforms (at least in talos) we should do #1 (ignore MozAfterPaint events w/no boundingClientRects)

Mike Conley (:mconley) (:⚙️)

Assignee

Updated

•

7 years ago

Comment 23

•

7 years ago

(In reply to Joel Maher ( :jmaher) from comment #22) > do you think for all tests/platforms (at least in talos) we should do #1 > (ignore MozAfterPaint events w/no boundingClientRects) It might be a good idea, yes.

Comment hidden (mozreview-request)

Pulsebot

Comment 26

•

7 years ago

Pushed by mconley@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/132095a40bde Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps. r=jmaher

Andrew Overholt [:overholt]

Updated

•

7 years ago

Priority: -- → P1

Ionuț Goldan [:igoldan]

Reporter

Comment 27

•

7 years ago

This dimmed down the osx regression to half, and brought regressions on the other platforms. == Change summary for alert #7141 (as of June 08 2017 17:31 UTC) == Regressions: 29% tabpaint summary linux64 pgo e10s 51.05 -> 66.09 27% tabpaint summary linux64 opt e10s 54.44 -> 69.11 25% tabpaint summary windows10-64 opt e10s56.45 -> 70.55 10% tabpaint summary osx-10-10 opt e10s 62.28 -> 68.73 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7141

Joel Maher ( :jmaher ) (UTC -8)

Comment 28

•

7 years ago

well it actually added 10% onto the existing 18%. This is probably expected as we are measuring mozAfterPaint for something that is painting, not a blank event.

Carsten Book [:Tomcat]

Comment 29

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/132095a40bde

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox55: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Joel Maher ( :jmaher ) (UTC -8)

Comment 30

•

7 years ago

in summary: * original 18% regresion on osx landed * mconley changed the tabpaint test to measure proper mozAfterPaint events * when that landed all platforms regressed by 25-30% except osx which regressed 10% (+ original 18% = 28%) * now all platforms measure the right thing and they all have a similar change in results.

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox54: --- → unaffected

status-firefox-esr52: --- → unaffected

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

Assignee: nobody → mconley

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: DOM → DOM: Core & HTML