Closed Bug 1325910 Opened 3 years ago Closed 3 years ago

8.17 - 15.68% tps (linux64, osx-10-10, windows7-32, windows8-64) regression on push c2f49fba3f4d316188afc39c5d85a7336c17b75d (Sun Dec 25 2016)

Categories

(Core :: ImageLib, defect)

53 Branch
defect
Not set

Tracking


RESOLVED FIXED
mozilla53
Tracking Status
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- unaffected
firefox53 --- fixed

People

(Reporter: ashiue, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push c2f49fba3f4d316188afc39c5d85a7336c17b75d. Since you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 16%  tps summary linux64 opt e10s         42.56 -> 49.23
 15%  tps summary linux64 pgo e10s         35.96 -> 41.48
 11%  tps summary windows8-64 opt e10s     38.11 -> 42.48
  8%  tps summary windows7-32 opt e10s     41.23 -> 44.63
  8%  tps summary osx-10-10 opt e10s       39.95 -> 43.22


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=4635

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
After doing some retriggers, it looks like this issue might be caused by this patch:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=d83cea57d22f0c74afa56611e6ce9ec6b2b25a92&tochange=c2f49fba3f4d316188afc39c5d85a7336c17b75d

Hi Timothy, as you are the patch author, can you take a look at this and determine the root cause? Thanks!
Blocks: 1318540, 1317562
Flags: needinfo?(tnikkel)
I pushed a guess at a fix to the try server to see if it helps.
Looks like my guess had no effect.
My current theory is that all the runnables used to async notify for each image are flooding the main thread, delaying the "tab has switched" message. Non-e10s probably uses a different method to determine when the tab switch has completed, so that might explain why this is e10s only.
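To illustrate the theory, here is a rough event-loop sketch (TypeScript, runnable in a browser console; this is only an analogy, not the actual ImageLib code, and every name in it is made up): if each image posts its own notification task to the main thread ahead of the "tab has switched" message, that message is delayed by the total cost of draining the notifications.

// Sketch: many per-image notification tasks queued ahead of the
// "tab has switched" message delay its delivery by their total run time.
function notifyImageLoaded(): void {
  const end = performance.now() + 0.2;        // pretend each notification costs ~0.2 ms
  while (performance.now() < end) { /* simulated notification work */ }
}

const queuedAt = performance.now();

// One task per image, queued first.
for (let i = 0; i < 1000; i++) {
  setTimeout(notifyImageLoaded, 0);
}

// The "tab has switched" message runs only after all notifications drain.
setTimeout(() => {
  console.log(`tab switch message ran ${(performance.now() - queuedAt).toFixed(1)} ms after being queued`);
}, 0);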
Component: Untriaged → ImageLib
Product: Firefox → Core
Still investigating. It seems part (but not all) of the regression is caused because we are able to draw an image that hasn't notified yet (because of bug 1325297); we then notify after the paint, and then we invalidate and have to paint the image again. The theory from comment 4 might be responsible for the rest of the regression, but I haven't verified that yet.

Meanwhile I backed out the offending part of bug 1317562 because it's looking like the fix for this isn't going to be quick and easy.
One thing that might make us more immune to this sort of thing would be to provide the compositing timestamp with the MozAfterPaint event (bug 1264798). Currently, we have a number of tests that start a stopwatch, do a thing, and stop that stopwatch once MozAfterPaint is seen. However, MozAfterPaint might be handled an arbitrary number of milliseconds after the paint / composite actually occurred, which makes us sensitive to things like many runnables sitting in the queue.

If we had the composition time in the MozAfterPaint event, we could use that instead, and we'd be measuring more of the right thing.
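A rough sketch of that measurement pattern (TypeScript-flavoured test pseudocode, not actual Talos code; the paintTimeStamp property here stands in for whatever bug 1264798 ends up exposing and is an assumption, not a confirmed API):

const start = performance.now();

window.addEventListener("MozAfterPaint", function onPaint(event: Event) {
  window.removeEventListener("MozAfterPaint", onPaint);

  // What the stopwatch measures today: it stops when the handler runs, so it
  // includes however long the event sat behind other main-thread runnables.
  const measuredAtHandler = performance.now() - start;

  // With a compositing timestamp on the event (assumed property), the
  // measurement could end at the composite itself instead.
  const paintTime = (event as any).paintTimeStamp as number | undefined;
  const measuredAtComposite = paintTime !== undefined ? paintTime - start : measuredAtHandler;

  console.log({ measuredAtHandler, measuredAtComposite });
});

// ...start the action being timed (e.g. switch tabs) here...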
Fixed by the backout. When we reland, we should make sure not to re-regress this.
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(tnikkel)
Resolution: --- → FIXED
Target Milestone: --- → mozilla53