Closed Bug 1456411 Opened 6 years ago Closed 6 years ago

22.5 - 34.46% tp5o responsiveness (linux64, windows10-64, windows7-32) regression on push a7fc392e48ab (Fri Apr 20 2018)

Categories

(Firefox :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- unaffected
firefox60 --- unaffected
firefox61 + wontfix

People

(Reporter: igoldan, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=4895ec59cf2b5240ad79fa35d3a9d32cc1f206d9&tochange=a7fc392e48ab87dc3d02c155188c9e433425c7d7

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 34%  tp5o responsiveness windows10-64 pgo e10s stylo     0.31 -> 0.41
 25%  tp5o responsiveness windows7-32 pgo e10s stylo      0.33 -> 0.41
 23%  tp5o responsiveness linux64 pgo e10s stylo          0.49 -> 0.60
 23%  tp5o responsiveness linux64 opt e10s stylo          0.60 -> 0.73


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=12840

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → General
:mconley It looks like big perf regressions got in since bug 1358712 landed. Can we do something to resolve this? Or should we back this out and find another approach?
Flags: needinfo?(mconley)
Thanks, investigating.
Just for the record, "big" is pretty ill-defined when it comes to e10s responsiveness regressions. .31 -> .41 is probably not a user-observable difference if it happens in consistent, small chunks. If it happens in 3 or 4 big chunks, it probably is.

We really need responsive numbers broken into percentiles...
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #3)
> Just for the record, "big" is pretty ill-defined when it comes to e10s
> responsiveness regressions. .31 -> .41 is probably not a user-observable
> difference if it happens in consistent, small chunks.

I suspect this is what's happening.

Having looked over some instrumented profiles, I think I know what's happening.

Before, with the sync layout flushes for the StatusPanel, when the parent process wanted to update the string in the StatusPanel and display it, it'd set the label in the (invisible) panel, and then synchronously flush layout in script to determine on which side to reveal the StatusPanel. That means that usually, in the parent, we'd have a sync flush, and then a paint to show the StatusPanel during the next refresh driver tick.

With bug 1358712, in order to avoid sync flushes, we set the label on the StatusPanel, and then wait for the next refresh driver tick to get the panel's dimensions, and then make the panel visible in a requestAnimationFrame.

That means we went from;

Update label
*sync layout flush*
Make panel visible
Refresh driver tick
  Style
  Layout
  Paint

to:

Update label
Refresh driver tick
  Style
  Layout
  Paint
Get cheap dimensions for panel, wait for next refresh driver tick...
Refresh driver tick
  Make panel visible
  Style
  Layout
  Paint


So we're spreading the work out a little bit in the main thread of the parent process over multiple refresh driver ticks, which I think is affecting the responsiveness score, since we're more likely to impact more EventTracer samples.

I don't think this is user perceivable, and I believe the removal of the layout flush is worth it. I'm going to WONTFIX this for now.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(mconley)
Resolution: --- → WONTFIX
Thanks for the detailed analysis, Mike.
You need to log in before you can comment on or make changes to this bug.