Closed Bug 1546490 Opened 5 years ago Closed 2 years ago

Stopped refresh drivers have high start latency

Categories

(Core :: Layout, defect)

68 Branch
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox68 --- affected
firefox69 --- affected
firefox70 --- affected

People

(Reporter: kennylevinsen, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

As revealed by both the enhancement in bug 1521786 and the regression in bug 1546421, stopped refresh drivers have a high start latency.

This is caused by having to wait for the next vblank after enabling the refresh timer.

Two ways to lower this latency comes to mind:

  1. We could expand the logic introduced by bug 1521786 so that more timers have delayed stops. This will keep the timer running for a short period of time, at the cost of some (but not as much) CPU.
  2. We can trigger a timer tick on restart, so that we do not have to wait for next vblank. This removes any start latency and lets us terminate all timers immediately, leading to no wasted CPU.

#2 leads to both the best latency and lowest resource utilization, and with my ultimate goal being CPU/power improvements, that would be my preference (running but empty refresh drivers make the difference between 5% CPU and 0.7% CPU on my ultra-book).

It would mean that the first refresh in a series would not be on a vblank, but as we are not compositing, this is unlikely to matter. Considering that the content processes are level-triggered ("vsync compression", as Silk.rst calls it), it would seem that the design does not intend to be powered by exact vsync.

Status: UNCONFIRMED → NEW
Ever confirmed: true

Enabling a disabled timer leaves a worst-case latency to first tick of a full
timer period, delaying the start of whatever work requested the timer.

Ticking the driver immediately on timer change (e.g. timer started) allows us
to avoid this start latency without having to use grace periods for timer
shutdowns.

I reverted the timer stop fix, and did some tests from before the the fix to after, and from before fix to the the patch in this bug (fire timer immediately on activation).

From before timer stop fix to after timer stop fix
From before timer stop fix to immediate timer fire

Interesting observations:

  1. There's a bunch of improvements with good confidence, so not bad.
  2. The tsvg_static test hasn't improved (and actually now pulled OS X into the problem too).
  3. Tab switch is killed by this. I think that's a bug, as Thaw() also seems to fire an additional DoRefresh() before calling EnsureTimerStarted, which then fires its own additional DoRefresh().
  4. tabpaint is a mixed bag
  5. Not sure what twinopen is, or what happened to it.

From before timer stop fix to immediate timer fire with extra DoRefresh() in Thaw() removed. Unfortunately, something made the Windows machines upset with regards to I/O, which tainted the results a little bit. Didn't fix tab switch. My next suspicion is that the test relied on timers not being fired under the switch.

Interesting thing: In this test where DoRefresh() is dispatched with a 2ms delay, you see up to 29% improvement on about_preferences, possibly due to being ready to do some work shortly after timer start.

It may be worth getting some profiles to be able to compare... Ionut, do you know how to get profiles from treeherder talos pushes?

(In reply to Kenny Levinsen :kennylevinsen from comment #3)

Interesting thing: In this test where DoRefresh() is dispatched with a 2ms delay, you see up to 29% improvement on about_preferences, possibly due to being ready to do some work shortly after timer start.

Hmm, maybe we could use the idle queue for this, cancelling if we get to vsync before idle?

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

It may be worth getting some profiles to be able to compare... Ionut, do you know how to get profiles from treeherder talos pushes?

Flags: needinfo?(igoldan)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

Hmm, maybe we could use the idle queue for this, cancelling if we get to vsync before idle?

I had considered something like this, but then other other hand, firing on high-priority gives improved scroll and responsiveness scores, which are lost if delayed.

We basically have 3 situations:

  1. Some tests, in particular responsiveness tests are improved by immediate tick.
  2. Some tests see greater improvement if the tick is delayed until the test was ready, potentially through use of the idle queue.
  3. Some tests regress, likely due to now doing more work than it did before if it could sneak in between timer fires, effectively never observing refresh.

Maybe we'd be able to categorize the entrypoints, so that listeners registered that affect responsiveness cause immediate fire, whereas others cause delayed (idle-queue) and cancellable fire (for freeze(), upgrade to immediate fire, or true vsync arriving), while potentially some entrypoints do nothing at all.

However, while that would allow us to realize some nice perf numbers, it still leaves one odd-ball benchmark: The tsvg_static regression that started this work in the first place.

So far, the observations have been:

  1. Original stop of some timers: Good
  2. Stop of all timers, no immediate tick: Bad
  3. Stop of all timers, immediate tick: Bad
  4. Stop of all timers, 2ms delayed tick: Bad

There are two things that between all these runs:

  1. The number and exact timing of refresh events.
  2. Whether ::Tick() sees regular execution outside of need. The call to IsWaitingForPaint, for example, could be keeping a root refresh driver ticking due to an AddRefreshObserver call on it.

The test would have to be awfully finicky about the timing of ticks for it to be #1, so I'm starting to suspect #2.

Kenny, do you know the status of this work? Can I help getting it through the finish line?

Thanks!

Flags: needinfo?(bugzilla)
Assignee: nobody → bugzilla
Priority: -- → P2

(In reply to Emilio Cobos Álvarez (:emilio) from comment #8)

Kenny, do you know the status of this work? Can I help getting it through the finish line?

Thanks!

Ah, I have been handicapped by an inability to successfully run talos on try that suddenly showed up and persisted ever since, something due to missing entitlements on a single node (IIRC, an ARM windows device). While I have implemented review comments, but I have kept them in the drawer due to lacking talos runs. Any idea as to how to proceed to fix that would be appreciated!

The last results I remember were that there were some rather nice improvements to many things, although I do not believe it fixed all the regressions. Without talos I can't say much, but I think it would be good to also make a test with solution #1 (delay stop, but for all timers) to compare. We could also merge the two for a "best of both worlds".

Flags: needinfo?(bugzilla) → needinfo?(emilio)

Ah, I have been handicapped by an inability to successfully run talos on try that suddenly showed up and persisted ever since, something due to missing entitlements on a single node (IIRC, an ARM windows device). While I have implemented review comments, but I have kept them in the drawer due to lacking talos runs. Any idea as to how to proceed to fix that would be appreciated!

D'oh, that sucks. That looks like an infra issue. Could you file a bug for that and cc me? I can try to get it looked at or diagnosed, but it doesn't really ring a bell to me.

Anyhow for now I pushed it to try for you:

(Pushed with $ ./mach try fuzzy --preset perf --rebuild 10, but if you want something else let me know)

The last results I remember were that there were some rather nice improvements to many things, although I do not believe it fixed all the regressions. Without talos I can't say much, but I think it would be good to also make a test with solution #1 (delay stop, but for all timers) to compare. We could also merge the two for a "best of both worlds".

Yeah, there's issues with "delay stop for all timers" that make me a bit skeptical of it since during load ticking too often is not good... But what we're doing right now is also not ideal, so I pushed:

With something that maybe works, comments welcome (patch is https://hg.mozilla.org/try/rev/ebda715c7a7909fa92e0fb0a0d2e3706812784cf, and the comment hopefully hints what the code is trying to accomplish).

Thanks for all your work on this btw :)

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #5)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

It may be worth getting some profiles to be able to compare... Ionut, do you know how to get profiles from treeherder talos pushes?

Gecko profiles for tsvg_static on Linux shippable:

before bug 1546098: https://profiler.firefox.com/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FHBN0KKB_RU-U0K4V00ND3w%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_tsvg_static.zip

after: https://profiler.firefox.com/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FVM7R4E-oRdOx3B-Ic7L-LA%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_tsvg_static.zip

Flags: needinfo?(igoldan)

The bug assignee didn't login in Bugzilla in the last months and this bug has priority 'P2'.
:dholbert, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: bugzilla → nobody
Flags: needinfo?(dholbert)

--> Resetting priority, and setting severity to S3.

emilio, looks like you were helping out here; do you know if this bug and its patch (now 3 years old) are still useful?

Severity: normal → S3
Flags: needinfo?(dholbert) → needinfo?(emilio)
Priority: P2 → --

Olli has a better grasp of the refresh driver subtlety.

Flags: needinfo?(emilio) → needinfo?(bugs)

Isn't this fixed? We have extra tick and catch-up tick.
bug 1708325 and bug 1675614.

Please reopen if you think we're still missing some case.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(bugs)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: