Closed Bug 1369662 Opened 7 years ago Closed 7 years ago

17.99% tabpaint (osx-10-10) regression on push e398680c483332f16c829adbddb29528c566c180 (Thu Jun 1 2017)


(Core :: DOM: Core & HTML, defect, P1)




Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed


(Reporter: igoldan, Assigned: mconley)



(Keywords: perf, regression, talos-regression)


(1 file)

Talos has detected a Firefox performance regression from push:

As author of one of the patches included in that push, we need your help to address this regression.


 18%  tabpaint summary osx-10-10 opt e10s     53.74 -> 63.41

You can find links to graphs and comparison views for each of the above tests at:

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see:

For information on reproducing and debugging the regression, either on try or locally, see:

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations:
Blocks: 1346783
Product: Firefox → Core
This regression appeared after a bigger patch, with mozilla-inbound changesets from 675c893a028d and up to e398680c4833, here:

The ones related to this issue seem to be Bug 558184, Bug 1363323 and Bug 1369140.

Could you please state which of the bugs affected this?
Flags: needinfo?(peter.vanderwoude)
Flags: needinfo?(ehsan)
mconley and I talked about this on IRC.  Bug 1369140 _may_ be the culprit because it removed a layout flush which used to occur during tab opening, however I'd have expected that to make things faster not slower...  However it's possible (and maybe even likely) that it has moved the timing of something in a way that is making this test complain.

Mike offered help looking into this.
Flags: needinfo?(ehsan) → needinfo?(mconley)
One thing to note is that the regression is almost 16ms, which is roughly 1 frame.

Current theory:

TL;DR: I think ehsan's patch made us run faster, which ironically makes us report worse results.

ehsan's patch makes focus changes a lot cheaper because we get to avoid flushing layout.

When the tabpaint test opens a new tab from the parent, the following things happen in this order:

1. The remote <xul:browser> is created
2. The browser is told to load the target URI
3. The tab for that browser is switched to, which involves some focus switching I believe - both before and after the TabParent is told to activate the child.
4. The target URI, after the load event fires, waits for the next MozAfterPaint event, which it uses to determine the delta between the tab was requested to open, and the time that the tab content was presented.

When step (3) happens, a race starts: we send a message to the TabChild on the main thread asking it to activate itself (and paint), and we also send a message along the ProcessHangMonitor thread asking it to interrupt JavaScript if any is running and _force_ paint.

The winner of that race will cause the MozAfterPaint event in (4) to run.

Here's where I think ehsan's patch changed things:

Because focus became a lot cheaper, I think the time between steps 2 through 3 occur faster now. Fast enough, in fact, that I suspect that the content process is busy enough loading the target page that the force paint message (which interrupts JS) is more likely to "win" the race.

I suspect that when we've interrupted JS, this is _before_ the target page has had the opportunity to set up the MozAfterPaint listener. This means that we force paint, MozAfterPaint fires, but nobody hears it. Then, the page loads as per normal, and then on the next tick we paint _again_, and this time MozAfterPaint hears it.

Whereas before ehsan's, I _think_ it's possible that the slow focus work gave the content process enought ime to set up the MozAfterPaint handler before the tab switch / force paint occurred.

That's my theory, anyhow. I've got a try build cooking at [1] to try to prove it.

Keeping needinfo so that I know to look at this try build when results are in.

87c1327b918d is where the regression started which corresponds to :ehsan's patch from bug 1369140.  It sounds like what :mconley is up to will be the right thing.
(In reply to Mike Conley (:mconley) from comment #7)
> Current theory:


> Keeping needinfo so that I know to look at this try build when results are
> in.

That didn't build!
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #9)
> That didn't build!

How embarrassing! Re-pushed.
Component: Untriaged → DOM
Quick update here - I've got a potential fix here, but I'm trying to cross all t's, dot all i's (and ensuring this isn't just blind luck) before posting it for review.

I've got some Talos profiling builds cooking for confirmation.
Attachment #8875384 - Flags: review?(jmaher)
Got some math wrong - new patch coming up.
Comment on attachment 8875384 [details]
Bug 1369662 - Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps.

don't forget to sign the extension and add it to the patch.
Attachment #8875384 - Flags: review?(jmaher) → review+
Garrr. Even with this test fix, ehsan's patch is apparently causing a regression. I'm now less confident that there's not a real regression here in our time to paint and upload the tab.

What I need - desperately - is to get a profile from one of these Talos machines. Unfortunately, this has proven extremely difficult. Hopefully the information in bug 1370907 will allow me to figure out how to do it.

So stay tuned, more to do here...
Before anyone points out that the comparison link I posted shows that ehsan's patch causes a win, I'm comparing mozilla-central with a backout - so the _backout_ is what's showing the win. :/
Depends on: 1371054
Blocks: 1371198
I was _finally_ able to tease a (unfortunately unsymbolicated) profile out of our Talos automation.

With the patch backed out (no regression):
With the patch in (regression):
Even though these profiles are unsymbolicated, I think I've found something useful.

Here is a zoomed in view of the profiles:

No regression (ehsan's patch backed out):
Regression (mozilla-central):

This is the timeline view of these two profiles, and the thing I want to draw attention to is in the "No regression" profile. Notice that just before the 11.230 timestamp, there's a very short (0.005ms) rasterization. That's extremely short, but I think it does cause a MozAfterPaint event to fire.

That MozAfterPaint event appears to have a bounding rect with 0 height and 0 width. This actually sounds similar to a problem that ehsan was facing in a separate bug - bug 1343728. We have this mysterious early paint occurring, and by all accounts appears to be empty. But it's enough to fire a MozAfterPaint event, which the test detects.

This little extra paint is missing in the "regression" profile.

So what I think I'm getting at is that this test has been likely flawed from the start, and has not properly measured meaningful paint of content. I've pushed patches to try that wait for the first MozAfterPaint event that has a rect with some dimensions to it, both with and without ehsan's patch:

The comparison shows _no difference_ taking that into account.

So my recommendation here is:

1) Update my patch in this bug to include ignoring MozAfterPaint events that have boundingClientRects with no height or width
2) File a follow-up bug to investigate and understand why that event was being fired, and why it seemed to only be an issue in this bug on OS X.
3) Leave bug 1369140 in the tree.
Flags: needinfo?(mconley)
do you think for all tests/platforms (at least in talos) we should do #1 (ignore MozAfterPaint events w/no boundingClientRects)
See Also: → 1371332
(In reply to Joel Maher ( :jmaher) from comment #22)
> do you think for all tests/platforms (at least in talos) we should do #1
> (ignore MozAfterPaint events w/no boundingClientRects)

It might be a good idea, yes.
Pushed by
Make tabpaint talos test rely on MozAfterPaint paint timestamps as opposed to event processing timestamps. r=jmaher
Priority: -- → P1
This dimmed down the osx regression to half, and brought regressions on the other platforms.

== Change summary for alert #7141 (as of June 08 2017 17:31 UTC) ==


 29%  tabpaint summary linux64 pgo e10s     51.05 -> 66.09
 27%  tabpaint summary linux64 opt e10s     54.44 -> 69.11
 25%  tabpaint summary windows10-64 opt e10s56.45 -> 70.55
 10%  tabpaint summary osx-10-10 opt e10s   62.28 -> 68.73

For up to date results, see:
well it actually added 10% onto the existing 18%.  This is probably expected as we are measuring mozAfterPaint for something that is painting, not a blank event.
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
in summary:
* original 18% regresion on osx landed
* mconley changed the tabpaint test to measure proper mozAfterPaint events
* when that landed all platforms regressed by 25-30% except osx which regressed 10% (+ original 18% = 28%)
* now all platforms measure the right thing and they all have a similar change in results.
Assignee: nobody → mconley
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.