Closed Bug 1369662 Opened 7 years ago Closed 7 years ago
.99% tabpaint (osx-10-10) regression on push e398680c483332f16c829adbddb29528c566c180 (Thu Jun 1 2017)
Bug 1369662 - Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps.
59 bytes, text/x-review-board-request
Talos has detected a Firefox performance regression from push: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=e398680c483332f16c829adbddb29528c566c180 As author of one of the patches included in that push, we need your help to address this regression. Regressions: 18% tabpaint summary osx-10-10 opt e10s 53.74 -> 63.41 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=6999 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running *** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
This regression appeared after a bigger patch, with mozilla-inbound changesets from 675c893a028d and up to e398680c4833, here: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7abafd0fead900aec67b0de342f71ec856f561cf&tochange=e398680c483332f16c829adbddb29528c566c180 The ones related to this issue seem to be Bug 558184, Bug 1363323 and Bug 1369140. Could you please state which of the bugs affected this?
pushed to try: https://email@example.com&fromchange=3f053f6a785f9ac9f768e44cd37fd56600fb695d&tochange=44a141d40ff31168a2091f108fdedda0d3c6953f of course I didn't annotate the commits well, so looking at: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&group_state=expanded&filter-searchStr=osx%20talos%20other&tochange=e398680c483332f16c829adbddb29528c566c180&fromchange=0f04ee5c2e1c916ae454c911bcead18a04927460 the pushes are in this order: hg update 980375a7b029: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3f053f6a785f9ac9f768e44cd37fd56600fb695d hg update 9255719d469c: https://treeherder.mozilla.org/#/jobs?repo=try&revision=acce7fc1bad446bb03aa409e715173e32a496f6c hg update 87c1327b918d: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc0b5284d8fd94d95cce1d491b36849c7d99dea6 hg update e398680c4833: https://treeherder.mozilla.org/#/jobs?repo=try&revision=44a141d40ff31168a2091f108fdedda0d3c6953f
mconley and I talked about this on IRC. Bug 1369140 _may_ be the culprit because it removed a layout flush which used to occur during tab opening, however I'd have expected that to make things faster not slower... However it's possible (and maybe even likely) that it has moved the timing of something in a way that is making this test complain. Mike offered help looking into this.
Flags: needinfo?(ehsan) → needinfo?(mconley)
Bug 558184 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2ec31aacaf0569cbe1c08fc787963188c8947590&newProject=try&newRevision=6095abfded6d84b0337b42a233b4128336c4d26e&framework=1&filter=tabpaint&showOnlyImportant=0 55.96 -> 55.51 -0.80% Bug 1363323 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6095abfded6d84b0337b42a233b4128336c4d26e&newProject=try&newRevision=673603c2c36e107062f484ee5a90b05594d2084d&framework=1&filter=tabpaint&showOnlyImportant=0 55.51 -> 54.51 -2.53%
and a 3rd attempt: https://firstname.lastname@example.org&fromchange=2c979be5a3b6acf76582e8c34cdfd340e9dc091f&tochange=212f61c16dfeaa783821abf2cca075ea6b257ed7 this time without --artifact
87c1327b918d is where the regression started which corresponds to :ehsan's patch from bug 1369140. It sounds like what :mconley is up to will be the right thing.
(In reply to Mike Conley (:mconley) from comment #7) > Current theory: Fascinating! > Keeping needinfo so that I know to look at this try build when results are > in. That didn't build!
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #9) > That didn't build! How embarrassing! Re-pushed.
Quick update here - I've got a potential fix here, but I'm trying to cross all t's, dot all i's (and ensuring this isn't just blind luck) before posting it for review. I've got some Talos profiling builds cooking for confirmation.
Attachment #8875384 - Flags: review?(jmaher)
Got some math wrong - new patch coming up.
Comment on attachment 8875384 [details] Bug 1369662 - Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps. https://reviewboard.mozilla.org/r/146818/#review150936 don't forget to sign the extension and add it to the patch.
Attachment #8875384 - Flags: review?(jmaher) → review+
Garrr. Even with this test fix, ehsan's patch is apparently causing a regression. I'm now less confident that there's not a real regression here in our time to paint and upload the tab. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=55a453c8c048&newProject=try&newRevision=5100eb0dc0aa&framework=1&showOnlyImportant=0 What I need - desperately - is to get a profile from one of these Talos machines. Unfortunately, this has proven extremely difficult. Hopefully the information in bug 1370907 will allow me to figure out how to do it. So stay tuned, more to do here...
Before anyone points out that the comparison link I posted shows that ehsan's patch causes a win, I'm comparing mozilla-central with a backout - so the _backout_ is what's showing the win. :/
I was _finally_ able to tease a (unfortunately unsymbolicated) profile out of our Talos automation. With the patch backed out (no regression): https://perfht.ml/2rPSgnT With the patch in (regression): https://perfht.ml/2rPQliX
Even though these profiles are unsymbolicated, I think I've found something useful. Here is a zoomed in view of the profiles: No regression (ehsan's patch backed out): https://perfht.ml/2r6AjEj Regression (mozilla-central): https://perfht.ml/2r6e55C This is the timeline view of these two profiles, and the thing I want to draw attention to is in the "No regression" profile. Notice that just before the 11.230 timestamp, there's a very short (0.005ms) rasterization. That's extremely short, but I think it does cause a MozAfterPaint event to fire. That MozAfterPaint event appears to have a bounding rect with 0 height and 0 width. This actually sounds similar to a problem that ehsan was facing in a separate bug - bug 1343728. We have this mysterious early paint occurring, and by all accounts appears to be empty. But it's enough to fire a MozAfterPaint event, which the test detects. This little extra paint is missing in the "regression" profile. So what I think I'm getting at is that this test has been likely flawed from the start, and has not properly measured meaningful paint of content. I've pushed patches to try that wait for the first MozAfterPaint event that has a rect with some dimensions to it, both with and without ehsan's patch: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b80e1f83576e&newProject=try&newRevision=3534252d4864&framework=1&showOnlyImportant=0 The comparison shows _no difference_ taking that into account. So my recommendation here is: 1) Update my patch in this bug to include ignoring MozAfterPaint events that have boundingClientRects with no height or width 2) File a follow-up bug to investigate and understand why that event was being fired, and why it seemed to only be an issue in this bug on OS X. 3) Leave bug 1369140 in the tree.
do you think for all tests/platforms (at least in talos) we should do #1 (ignore MozAfterPaint events w/no boundingClientRects)
(In reply to Joel Maher ( :jmaher) from comment #22) > do you think for all tests/platforms (at least in talos) we should do #1 > (ignore MozAfterPaint events w/no boundingClientRects) It might be a good idea, yes.
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/132095a40bde Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps. r=jmaher
This dimmed down the osx regression to half, and brought regressions on the other platforms. == Change summary for alert #7141 (as of June 08 2017 17:31 UTC) == Regressions: 29% tabpaint summary linux64 pgo e10s 51.05 -> 66.09 27% tabpaint summary linux64 opt e10s 54.44 -> 69.11 25% tabpaint summary windows10-64 opt e10s56.45 -> 70.55 10% tabpaint summary osx-10-10 opt e10s 62.28 -> 68.73 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=7141
well it actually added 10% onto the existing 18%. This is probably expected as we are measuring mozAfterPaint for something that is painting, not a blank event.
in summary: * original 18% regresion on osx landed * mconley changed the tabpaint test to measure proper mozAfterPaint events * when that landed all platforms regressed by 25-30% except osx which regressed 10% (+ original 18% = 28%) * now all platforms measure the right thing and they all have a similar change in results.
You need to log in before you can comment on or make changes to this bug.