Closed Bug 1643604 Opened 4 years ago Closed 4 years ago

2.59 - 48.3% displaylist_mutate / tabswitch / tp5o / tp5o_scroll / tscrollx (linux64-shippable|-qr, windows10-64-shippable|-qr, windows7-32-shippable) regression on push 5a874f1887ff0644f19936203132cc05ff346ec7 (Wed June 3 2020)

Categories

(Core :: Panning and Zooming, defect)

defect

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox-esr68 --- unaffected
firefox77 --- unaffected
firefox78 --- unaffected
firefox79 --- fixed

People

(Reporter: alexandrui, Assigned: botond, NeedInfo)

References

(Regression)

Details

(4 keywords)

Attachments

(1 file)

Perfherder has detected a talos performance regression from push 5a874f1887ff0644f19936203132cc05ff346ec7. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

48% tscrollx linux64-shippable-qr opt e10s stylo 1.01 -> 1.49
25% tscrollx windows10-64-shippable-qr opt e10s stylo 0.73 -> 0.92
25% tscrollx windows10-64-shippable-qr opt e10s stylo 0.73 -> 0.91
18% tp5o_scroll linux64-shippable-qr opt e10s stylo 2.24 -> 2.64
16% tscrollx linux64-shippable opt e10s stylo 0.66 -> 0.76
15% tscrollx windows10-64-shippable opt e10s stylo 0.59 -> 0.68
15% tp5o_scroll windows10-64-shippable-qr opt e10s stylo 1.69 -> 1.94
8% displaylist_mutate windows10-64-shippable opt e10s stylo 1,356.71 -> 1,459.12
7% tabswitch windows10-64-shippable opt e10s stylo 9.36 -> 10.03
6% displaylist_mutate linux64-shippable opt e10s stylo 1,408.10 -> 1,492.34
5% displaylist_mutate windows7-32-shippable opt e10s stylo 1,448.26 -> 1,524.52
5% tabswitch windows7-32-shippable opt e10s stylo 10.87 -> 11.42
3% displaylist_mutate linux64-shippable-qr opt e10s stylo 2,915.59 -> 3,007.15
3% tp5o linux64-shippable opt e10s stylo 200.23 -> 205.43

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(botond)
Component: Performance → Panning and Zooming
Product: Testing → Core

Set release status flags based on info from the regressing bug 1611660

I don't have an initial theory of why this patch would have caused talos regressions, but I suppose I can try to reproduce it locally and investigate.

One thing that's interesting is that QR platforms are seeing larger regressions than non-QR platforms, e.g. on Linux, the tscrollx regression is 45% (QR) vs. 18% (non-QR), on Windows it's 25% (QR) vs. 15% (non-QR).

Took some profiles of tscrollx to compare.

Before
After

Just wanted to post an update to say I've been actively investigating this. Markus identified a suspicious difference between the above profiles (the presence of "waiting for paint" markers in the "before" profile but not in the "after" profile), and I see the same thing during local talos runs. I've been debugging it locally to try to pin down the root cause but haven't gotten anywhere meaningful yet.

Assignee: nobody → botond
Flags: needinfo?(botond)

I've tracked the issue down to this line causing an extra repaint after every first-paint of a scroll frame. This repaint appears to be unnecessary, as we're just accepting the scroll frame's layout viewport for the first time.

I don't understand the exact mechanism by which this extra repaint results in the observed performance regression, but it seems to somehow make it so that going into nsRefreshDriver::GetTransactionId(), mOutstandingTransactionId is always the same as mCompletedTransaction (whereas before the change, mOutstandingTransactionId would typically be ahead of mCompletedTransaction by 1), causing us to avoid setting mWaitingForTransaction, which in turn causes us to avoid this early exit in Tick() and reach the place where we stop the refresh timer much more often.

In the first-paint case, we are accepting the incoming scroll metadata
wholesale, so most of the consequences of this code block will happen
anyways -- except for the extra repaint, which is not actually needed
and in fact undesirable in the first-paint case.

Blocks: 1557133
Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d4c8b8c748ee Move the check for a resized layout viewport in NotifylayersUpdated to the non-first-paint case. r=kats

Can we file something for the flakiness you observed in tscroll? At least so it's documented.

Flags: needinfo?(botond)

Oh yeah I was thinking of mentioning this but forgot: https://wiki.mozilla.org/TestEngineering/Performance/Talos/Tests#tscrollx would be a good place to add a "Possible regression causes" section with notes.

(In reply to Timothy Nikkel (:tnikkel) from comment #9)

Can we file something for the flakiness you observed in tscroll? At least so it's documented.

Yup, filed bug 1645275 with a summary of my investigation into said flakiness.

Flags: needinfo?(botond)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Botond, it looks like the displaylist_mutate regression was only partially fixed by the fix here. Do you have any ideas on why that might be?

Flags: needinfo?(botond)
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: