1546490 - Stopped refresh drivers have high start latency

Reporter

Description

•

6 years ago

As revealed by both the enhancement in bug 1521786 and the regression in bug 1546421, stopped refresh drivers have a high start latency.

This is caused by having to wait for the next vblank after enabling the refresh timer.

Two ways to lower this latency comes to mind:

We could expand the logic introduced by bug 1521786 so that more timers have delayed stops. This will keep the timer running for a short period of time, at the cost of some (but not as much) CPU.
We can trigger a timer tick on restart, so that we do not have to wait for next vblank. This removes any start latency and lets us terminate all timers immediately, leading to no wasted CPU.

#2 leads to both the best latency and lowest resource utilization, and with my ultimate goal being CPU/power improvements, that would be my preference (running but empty refresh drivers make the difference between 5% CPU and 0.7% CPU on my ultra-book).

It would mean that the first refresh in a series would not be on a vblank, but as we are not compositing, this is unlikely to matter. Considering that the content processes are level-triggered ("vsync compression", as Silk.rst calls it), it would seem that the design does not intend to be powered by exact vsync.

Emilio Cobos Álvarez (:emilio)

Updated

•

6 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Emilio Cobos Álvarez (:emilio)

Updated

•

6 years ago

Blocks: 1546421

Kenny Levinsen :kennylevinsen

Reporter

Comment 1

•

6 years ago

Attached file Bug 1546490 - Tick refresh drivers immediately on timer change in EnsureTimerStarted. — Details

Enabling a disabled timer leaves a worst-case latency to first tick of a full
timer period, delaying the start of whatever work requested the timer.

Ticking the driver immediately on timer change (e.g. timer started) allows us
to avoid this start latency without having to use grace periods for timer
shutdowns.

Kenny Levinsen :kennylevinsen

Reporter

Comment 2

•

6 years ago

I reverted the timer stop fix, and did some tests from before the the fix to after, and from before fix to the the patch in this bug (fire timer immediately on activation).

From before timer stop fix to after timer stop fix
From before timer stop fix to immediate timer fire

Interesting observations:

There's a bunch of improvements with good confidence, so not bad.
The tsvg_static test hasn't improved (and actually now pulled OS X into the problem too).
Tab switch is killed by this. I think that's a bug, as Thaw() also seems to fire an additional DoRefresh() before calling EnsureTimerStarted, which then fires its own additional DoRefresh().
tabpaint is a mixed bag
Not sure what twinopen is, or what happened to it.

Kenny Levinsen :kennylevinsen

Reporter

Comment 3

•

6 years ago

From before timer stop fix to immediate timer fire with extra DoRefresh() in Thaw() removed. Unfortunately, something made the Windows machines upset with regards to I/O, which tainted the results a little bit. Didn't fix tab switch. My next suspicion is that the test relied on timers not being fired under the switch.

Interesting thing: In this test where DoRefresh() is dispatched with a 2ms delay, you see up to 29% improvement on about_preferences, possibly due to being ready to do some work shortly after timer start.

Emilio Cobos Álvarez (:emilio)

Comment 4

•

6 years ago

It may be worth getting some profiles to be able to compare... Ionut, do you know how to get profiles from treeherder talos pushes?

(In reply to Kenny Levinsen :kennylevinsen from comment #3)

Interesting thing: In this test where DoRefresh() is dispatched with a 2ms delay, you see up to 29% improvement on about_preferences, possibly due to being ready to do some work shortly after timer start.

Hmm, maybe we could use the idle queue for this, cancelling if we get to vsync before idle?

Emilio Cobos Álvarez (:emilio)

Comment 5

•

6 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

It may be worth getting some profiles to be able to compare... Ionut, do you know how to get profiles from treeherder talos pushes?

Flags: needinfo?(igoldan)

Kenny Levinsen :kennylevinsen

Reporter

Comment 6

•

6 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

Hmm, maybe we could use the idle queue for this, cancelling if we get to vsync before idle?

I had considered something like this, but then other other hand, firing on high-priority gives improved scroll and responsiveness scores, which are lost if delayed.

We basically have 3 situations:

Some tests, in particular responsiveness tests are improved by immediate tick.
Some tests see greater improvement if the tick is delayed until the test was ready, potentially through use of the idle queue.
Some tests regress, likely due to now doing more work than it did before if it could sneak in between timer fires, effectively never observing refresh.

Maybe we'd be able to categorize the entrypoints, so that listeners registered that affect responsiveness cause immediate fire, whereas others cause delayed (idle-queue) and cancellable fire (for freeze(), upgrade to immediate fire, or true vsync arriving), while potentially some entrypoints do nothing at all.

However, while that would allow us to realize some nice perf numbers, it still leaves one odd-ball benchmark: The tsvg_static regression that started this work in the first place.

So far, the observations have been:

Original stop of some timers: Good
Stop of all timers, no immediate tick: Bad
Stop of all timers, immediate tick: Bad
Stop of all timers, 2ms delayed tick: Bad

There are two things that between all these runs:

The number and exact timing of refresh events.
Whether ::Tick() sees regular execution outside of need. The call to IsWaitingForPaint, for example, could be keeping a root refresh driver ticking due to an AddRefreshObserver call on it.

The test would have to be awfully finicky about the timing of ticks for it to be #1, so I'm starting to suspect #2.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 7

•

6 years ago

Pushed to try
Without patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=47e5e1d86cad6de388734811f35179e6b9aa11c0
with patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5989a56ef8ab20463f89561be2be2497c766a9a

Emilio Cobos Álvarez (:emilio)

Updated

•

6 years ago

Blocks: 1546937

Emilio Cobos Álvarez (:emilio)

Comment 8

•

6 years ago

Kenny, do you know the status of this work? Can I help getting it through the finish line?

Thanks!

Flags: needinfo?(bugzilla)

Emilio Cobos Álvarez (:emilio)

Updated

•

6 years ago

Assignee: nobody → bugzilla

Priority: -- → P2

Kenny Levinsen :kennylevinsen

Reporter

Comment 9

•

6 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #8)

Kenny, do you know the status of this work? Can I help getting it through the finish line?

Thanks!

Ah, I have been handicapped by an inability to successfully run talos on try that suddenly showed up and persisted ever since, something due to missing entitlements on a single node (IIRC, an ARM windows device). While I have implemented review comments, but I have kept them in the drawer due to lacking talos runs. Any idea as to how to proceed to fix that would be appreciated!

The last results I remember were that there were some rather nice improvements to many things, although I do not believe it fixed all the regressions. Without talos I can't say much, but I think it would be good to also make a test with solution #1 (delay stop, but for all timers) to compare. We could also merge the two for a "best of both worlds".

Flags: needinfo?(bugzilla) → needinfo?(emilio)

Emilio Cobos Álvarez (:emilio)

Comment 10

•

6 years ago

Ah, I have been handicapped by an inability to successfully run talos on try that suddenly showed up and persisted ever since, something due to missing entitlements on a single node (IIRC, an ARM windows device). While I have implemented review comments, but I have kept them in the drawer due to lacking talos runs. Any idea as to how to proceed to fix that would be appreciated!

D'oh, that sucks. That looks like an infra issue. Could you file a bug for that and cc me? I can try to get it looked at or diagnosed, but it doesn't really ring a bell to me.

Anyhow for now I pushed it to try for you:

(Pushed with $ ./mach try fuzzy --preset perf --rebuild 10, but if you want something else let me know)

The last results I remember were that there were some rather nice improvements to many things, although I do not believe it fixed all the regressions. Without talos I can't say much, but I think it would be good to also make a test with solution #1 (delay stop, but for all timers) to compare. We could also merge the two for a "best of both worlds".

Yeah, there's issues with "delay stop for all timers" that make me a bit skeptical of it since during load ticking too often is not good... But what we're doing right now is also not ideal, so I pushed:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=94bc7bdb6dab8e3301f827e6e745b3ae2bc67b03

With something that maybe works, comments welcome (patch is https://hg.mozilla.org/try/rev/ebda715c7a7909fa92e0fb0a0d2e3706812784cf, and the comment hopefully hints what the code is trying to accomplish).

Thanks for all your work on this btw :)

Flags: needinfo?(emilio)

Alexandru Ionescu (needinfo me) [:alexandrui]

Updated

•

5 years ago

status-firefox69: --- → affected

status-firefox70: --- → affected

Ionuț Goldan [:igoldan]

Comment 11

•

5 years ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

It may be worth getting some profiles to be able to compare... Ionut, do you know how to get profiles from treeherder talos pushes?

Gecko profiles for tsvg_static on Linux shippable:

before bug 1546098: https://profiler.firefox.com/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FHBN0KKB_RU-U0K4V00ND3w%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_tsvg_static.zip

after: https://profiler.firefox.com/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FVM7R4E-oRdOx3B-Ic7L-LA%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_tsvg_static.zip

Flags: needinfo?(igoldan)

BugBot [:suhaib / :marco/ :calixte]

Comment 12

•

3 years ago

The bug assignee didn't login in Bugzilla in the last months and this bug has priority 'P2'.
:dholbert, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: bugzilla → nobody

Flags: needinfo?(dholbert)

Daniel Holbert [:dholbert]

Comment 13

•

3 years ago

--> Resetting priority, and setting severity to S3.

emilio, looks like you were helping out here; do you know if this bug and its patch (now 3 years old) are still useful?

Severity: normal → S3

Flags: needinfo?(dholbert) → needinfo?(emilio)

Priority: P2 → --

Emilio Cobos Álvarez (:emilio)

Comment 14

•

3 years ago

Olli has a better grasp of the refresh driver subtlety.

Flags: needinfo?(emilio) → needinfo?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 15

•

3 years ago

Isn't this fixed? We have extra tick and catch-up tick.
bug 1708325 and bug 1675614.

Please reopen if you think we're still missing some case.

Status: NEW → RESOLVED

Closed: 3 years ago

Flags: needinfo?(bugs)

Resolution: --- → WORKSFORME