Closed Bug 1783405 Opened 3 years ago Closed 2 years ago

Add optional tolerance to timers, and coalesce timers that could be run at the same time

Categories: Core :: XPCOM (task)

Status: RESOLVED FIXED

Target: 114 Branch

Tracking Status: firefox114 --- fixed

People

(Reporter: mozbugz, Assigned: jlink)

References

Details

Attachments

(6 files)

Lots of timers probably don't need to run exactly as specified, for example multi-second timeouts to run garbage collection, or to backup some data.
These timers could also specify a tolerance (something like "I can run up to 20% later", or "I can run up to 4s later").
Then the TimerThread (which manages the list of timers and runs lapsed ones) could coalesce timers whose ranges of acceptable firing times overlap.

The goal is to reduce the number of wake-ups, each of which consumes some energy.
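For illustration only, here is a minimal sketch of what an opt-in tolerance could look like, using std::chrono rather than the real Gecko types; the names (TolerantTimer, slack) are hypothetical and not an existing API:

#include <chrono>
#include <functional>

// A timer that declares how late it may acceptably fire. The timer thread
// would then be free to coalesce timers whose [deadline, deadline + slack]
// windows overlap into a single wake-up.
struct TolerantTimer {
  std::chrono::steady_clock::time_point deadline;  // earliest intended firing time
  std::chrono::milliseconds slack;                 // "I can run up to this much later"
  std::function<void()> callback;                  // work to run when the timer fires
};

int main() {
  auto now = std::chrono::steady_clock::now();
  // A multi-second fallback task that is happy to run up to 4s late.
  TolerantTimer gcFallback{now + std::chrono::seconds(4),
                           std::chrono::seconds(4),
                           [] { /* run fallback work */ }};
  (void)gcFallback;
}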

Note that garbage collection mostly runs off of IdleTaskRunner these days, so it's already doing something like this. "Run me the next time there's idle time, or in 4s, whichever comes first." Maybe what's described here would additionally coalesce that 4s fallback with other timers? The GC would be fine with that, though note that you might want to intentionally spread things out some if the coalesced tasks run long enough to produce noticeable pauses. Even the fallback GC run will normally try to limit its time budget to something unnoticeable, but if it starts getting batched up with other tasks, then the TimerThread ought to be doing some budgeting as well to prevent the overall chunk of work from janking the UI.

(In reply to Steve Fink [:sfink] [:s:] from comment #1)

Even the fallback GC run will normally try to limit its time budget to something unnoticeable, but if it starts getting batched up with other tasks, then the TimerThread ought to be doing some budgeting as well to prevent the overall chunk of work from janking the UI.

I think if taskcontroller priorities are set correctly on the tasks, we should be able to queue multiple tasks on the event queue without needing to worry about user events being delayed, as they should have higher priority.

(In reply to Florian Quèze [:florian] (PTO until August 29) from comment #2)

(In reply to Steve Fink [:sfink] [:s:] from comment #1)

Even the fallback GC run will normally try to limit its time budget to something unnoticeable, but if it starts getting batched up with other tasks, then the TimerThread ought to be doing some budgeting as well to prevent the overall chunk of work from janking the UI.

I think if taskcontroller priorities are set correctly on the tasks, we should be able to queue multiple tasks on the event queue without needing to worry about user events being delayed, as they should have higher priority.

Ah! That's a better solution.

I would also worry about animation stutters, but those should already be prevented by the deadline logic.

I will note that in theory if the GC starts running first and is given the whole of the idle time (eg the inter-frame idle time), but then a user event comes in, it should stop almost immediately since there's an interrupt mechanism that will pause the GC slice. In practice, this seems to happen very rarely, seemingly because there are very few interrupts sent to the scheduler. I feel like it should be more common, especially when I'm running something with artificially frequent GCs and hammering on the keyboard. But I've never tracked down what's going on there.

One important thing to note, which I hope will alleviate some concerns, is that this proposed timer tolerance will be opt-in.
So when starting on this adventure, we can carefully pick which timers we think can safely be shifted, and by how much; and then if something goes wrong, it should be easy to change the tolerance again.

Sorry if that was not clear.
And my initial description did mention "Lots of timers probably don't need to run exactly as specified, for example multi-second timeouts to run garbage collection". It's helpful to know that we should be careful before making that one too loose.

Thanks for all the comments so far.

Note: As per bug 1768822 comment 4, when we implement this (which may require a modification of how timers are stored), we should make sure that we keep the desired behavior of not notifying the timer thread when adding a timer that's scheduled after the next already-scheduled wake-up.
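To illustrate that note, a minimal sketch of the check (member names are assumed, not the actual TimerThread code):

#include <chrono>

using Clock = std::chrono::steady_clock;

struct TimerThreadSketch {
  Clock::time_point mScheduledWakeup = Clock::time_point::max();

  // AddTimer() only needs to notify the timer thread when the new timer must
  // fire before the wake-up that is already scheduled; otherwise the thread
  // will see it when it wakes up anyway.
  bool NeedsWakeupFor(Clock::time_point aNewDeadline) const {
    return aNewDeadline < mScheduledWakeup;
  }
};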

To note, when I think about coalescing timers, I often think about how we use lookups to create shared thread pools. We could probably implement some shared timer instances based on time similarly, and issue timer wakeups in groups for timers enrolled in their shared timer instance.

Assignee: mozbugz → nobody

(In reply to Kris Wright :KrisWright from comment #6)

To note, when I think about coalescing timers, I often think about how we use lookups to create shared thread pools. We could probably implement some shared timer instances based on time similarly, and issue timer wakeups in groups for timers enrolled in their shared timer instance.

We actually do something like that for DOM timers in TimeoutManager
https://searchfox.org/mozilla-central/rev/0a2eba79c24300ce0539f91c1bebac2e75264e58/dom/base/TimeoutManager.cpp#691-692

One very simple way to increase the number of timers that we process in a single wake-up (and thus reduce the frequency of wake-ups) would seem to be to schedule TimerThread to wake up not at the exact time when the next timer needs to fire (as it currently does) but rather some amount of time after that. Some initial statistics that I gathered indicated to me that allowing even a small delay in triggering timers would go a long way toward reducing the needed frequency of wake-ups.
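As a rough sketch of this (illustrative names, std::chrono standing in for mozilla::TimeStamp/TimeDuration):

#include <chrono>

using Clock = std::chrono::steady_clock;

// Instead of sleeping until exactly the next deadline, sleep a fixed amount
// past it so that timers due shortly afterwards are serviced in the same wake-up.
Clock::time_point ComputeWakeupTime(Clock::time_point aNextDeadline,
                                    std::chrono::milliseconds aFixedDelay) {
  return aNextDeadline + aFixedDelay;  // e.g. a 100ms aFixedDelay in this PoC
}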

I haven't noticed any problems in some initial local testing (with a 100ms delay) but I'm interested to hear what others think of this option.

I can think of a couple of possible downsides to this approach:

  1. Some clients might expect their timers to be run as close as possible to their requested time.
  2. In cases where there isn't at least one timer close behind the "next" timer, we are delaying that next timer needlessly.
  3. Some timers might be more or less tolerant of being delayed, which this doesn't account for; with a single fixed delay, we would have to use the smallest tolerance that any client would want.

Regarding 1), I would hope that it wouldn't be the case that any clients would be very sensitive to the start times of the timers as, even in the current version, there is no guarantee that a timer fires at exactly its requested time.
2) could, of course, be avoided by looking ahead in the buffer at the upcoming timers and only delaying by the amount needed to cover the timers that we want to include in the next run (see the sketch below).
For 3), the extent to which this is a downside is going to depend entirely on how much variation there is in tolerance for timer delays.
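To make point 2) concrete, here is a sketch of that look-ahead (assumed types, deadlines already sorted; not the actual TimerThread code):

#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;

// Only delay past the first deadline by as much as is needed to pick up the
// timers already close behind it, capped by a maximum extra delay.
Clock::time_point ComputeWakeupTime(const std::vector<Clock::time_point>& aSortedDeadlines,
                                    std::chrono::milliseconds aMaxExtraDelay) {
  if (aSortedDeadlines.empty()) {
    return Clock::time_point::max();  // nothing to wake up for
  }
  const Clock::time_point latestAllowed = aSortedDeadlines.front() + aMaxExtraDelay;
  Clock::time_point wakeup = aSortedDeadlines.front();
  for (const Clock::time_point& deadline : aSortedDeadlines) {
    if (deadline > latestAllowed) {
      break;
    }
    // Wake up at the last deadline that still fits in the allowed window so
    // that every timer up to it fires in a single wake-up.
    wakeup = deadline;
  }
  return wakeup;
}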

I've added some counters (locally) to keep track of how many timers we execute in a single wake-up and, as expected, I am seeing an increase with this change, though not nearly as large an increase as I had hoped. It seems to be the case quite frequently (both with and without my change) that we wake up and don't have any timers to run and that is unexpected to me so I'm looking into that right now.

tldr: There might be a quick and simple way to reduce the frequency of TimerThread wake-ups. I'm interested in hearing any thoughts from others on whether I'm missing anything with this approach.

(In reply to Justin Link from comment #9)

I haven't noticed any problems in some initial local testing (with a 100ms delay) but I'm interested to hear what others think of this option.

A potential issue with doing this is that some pages may use short timers to schedule things (eg. animations written in JavaScript), and waiting 100ms might cause the page to skip a frame (and cause visible flicker). Arguably there are now much better ways to write these animations (eg. requestAnimationFrame), but old scripts might still exist. I don't know how to assess how big of an issue that would be.

It seems to be the case quite frequently (both with and without my change) that we wake up and don't have any timers to run and that is unexpected to me so I'm looking into that right now.

This happens if a timer was added, and then removed before it fires. The current code isn't able to cancel the already planned wake-up of the timer thread when canceling a timer.
On Nightly this will happen frequently because the Background Hang Reporter code adds 128ms timers whenever a piece of JS starts, and removes the timer when that piece of JS ends.

Thanks for the feedback, Florian. I had thought that 100ms would be too high and that I would see some kind of artifacts, but I wanted to test an extreme case to see how noticeable the delays would be (and to try to suss out any bugs where clients were implicitly relying on a certain response time for correctness). I also tested at 10s and 1s to verify that my change could have an observable impact, and a decrease in responsiveness of some things was clearly evident at those levels.

I also figured out what was happening with the frequent wake-ups and it was mostly due to a problem that I introduced with my PoC - in the case where there were no pending timers I ended up requesting to sleep for a negative amount of time (due to overflow/wrap-around in TimeDuration) which, of course, caused TimerThread to wake back up almost immediately. Once I fixed that, the numbers that I was tracking started being more as expected.

So now I've been mostly measuring the impact of delaying timers on reducing the number of wake-ups. To that end, I've started by measuring the number of wake-ups required to service a fixed number (2000 in my tests) of timers. Going from the current behavior (a 0ms wake-up delay) to a fixed 10ms wake-up delay results in a 6.8% decrease in the number of wake-ups needed. Going from the current behavior to a fixed 100ms delay results in a 31.8% decrease.

A 31.8% decrease seems pretty nice, but a possible 100ms delay in firing a timer might be too much for at least some timers (while probably completely acceptable for others), so some more investigation will be needed here to determine what can actually be rolled out.

Next I am going to compute the wake-up delay based on the actual timers that are pending so that we won't wait any more than is actually necessary to bundle additional timers into the same wake-up.

In order to determine the impact of that change I'm going to start tracking the mean (and maybe the RMS) delay between when a timer requested to fire and when it actually fires to see how that changes.
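For reference, a minimal sketch of that bookkeeping (illustrative only, not the counters actually added to TimerThread):

#include <chrono>
#include <cmath>
#include <cstdint>

class FiringDelayStats {
 public:
  // Record the delay between when a timer asked to fire and when it actually fired.
  void Record(std::chrono::duration<double, std::milli> aDelay) {
    ++mCount;
    mSumMs += aDelay.count();
    mSumSquaredMs += aDelay.count() * aDelay.count();
  }
  double MeanMs() const { return mCount ? mSumMs / mCount : 0.0; }
  double RmsMs() const { return mCount ? std::sqrt(mSumSquaredMs / mCount) : 0.0; }

 private:
  uint64_t mCount = 0;
  double mSumMs = 0.0;
  double mSumSquaredMs = 0.0;
};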

As a side note, I'm not really sure if the Javascript timers go through/use TimerThread at all. I had initially thought that the JS timers were the primary use case (for this work and for TimerThread in general) but I haven't seen any evidence that this is the case and, based on some code exploration that Doug and I did a couple weeks ago, it seems that the JS timers are going through https://searchfox.org/mozilla-central/source/dom/base/TimeoutManager.cpp. Am I missing a connection between the two? (Granted, I haven't spent a lot of time looking at TimeoutManager.)

Assignee: nobody → jlink

(In reply to Justin Link from comment #11)

A 31.8% decrease seems pretty nice, but a possible 100ms delay in firing a timer might be too much for at least some timers (while probably completely acceptable for others), so some more investigation will be needed here to determine what can actually be rolled out.

Have you considered testing a delay that would be a percentage of the timer duration? A 100ms delay is a lot of a 5ms timer, but for a timer of multiple seconds, it might be totally acceptable (and a longer delay would be acceptable too). Maybe a 10% delay could be fine in most cases?
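A sketch of that percentage idea (the function and its parameters are hypothetical):

#include <algorithm>
#include <chrono>

// The allowed firing slack scales with the requested duration, with a small
// floor so that very short timers still get a little coalescing headroom.
std::chrono::milliseconds ToleranceFor(std::chrono::milliseconds aTimerDuration,
                                       double aFraction = 0.10,
                                       std::chrono::milliseconds aFloor = std::chrono::milliseconds(1)) {
  auto scaled = std::chrono::milliseconds(
      static_cast<long long>(aTimerDuration.count() * aFraction));
  return std::max(scaled, aFloor);  // 5ms timer -> 1ms slack, 5s timer -> 500ms slack
}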

As a side note, I'm not really sure if the Javascript timers go through/use TimerThread at all.

Seeing what uses the timer thread is easy using the profiler. Here I grabbed a profile of a random google.com content process I had in my current Nightly (I captured the profile from about:processes) https://share.firefox.dev/3Tqxvua
The 'setTimeout callback' markers of the main thread are aligned with the 'PostTimerEvent - TimeoutExecutor Runnable' markers of the timer thread.

It's been too long since I provided any update here on this ticket.

Since the last comment I implemented the "dynamic" wake-up delay (where the amount of additional time that we sleep is determined by what timers are next in line behind the "next" timer that we need to process). While testing this with a zero "additional wait" tolerance and comparing against the existing behavior, I noticed a problem and tracked it back to a bug in Gerald's changes which meant that the TimerThread's list of "to-be-fired" timers wasn't always actually sorted. This has been fixed with the patch set that I pushed on top of this stack...

At that point I returned to the topic of how to decide how the additional wait time should be calculated. For the record, here are the options that I think are worth considering:

  1. Wake up at exactly when the next timer is to be fired (current behavior).
  2. Wake up some fixed amount of time after the next timer is to be fired (my initial proof-of-concept).
  3. Wake up some variable amount of time after the next timer is to be fired...
    a) ... where each timer has a fixed tolerance for how late it can fire.
    b) ... where each timer has a variable tolerance for how late it can fire.
  4. A combination of 2 and some version of 3 where we look forward for timers to bundle but always delay by at least a little bit from when the next timer should fire.

However ... while chatting with Doug a couple weeks ago I mentioned that, even prior to any of my (or Gerald's) changes, I had measured an average timer delay (the time difference between when the timer actually fires and when it requested to fire) of 8ms. After looking into that a bit more closely (and mistakenly thinking that TimerThread itself was taking a long time to actually go to sleep after calculating how long to sleep for), and getting some help from Florian, it is clear now that this is due to the 64Hz resolution of the scheduler that decides when to wake up the timer thread.

This is a very interesting piece of information because it means that all timers are already subjected to a delay that is regularly as large as ~16ms. This means that we are already getting some coalescing of timers whether we like/knew it or not (although the delay, and the amount of coalescing, is a bit haphazard).
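For reference, the arithmetic (assuming deadlines land uniformly within a scheduler tick): one tick at 64Hz is 1/64 s ≈ 15.6ms, so a timer waits anywhere from 0 to ~15.6ms for the next tick, or ~7.8ms on average, which lines up with the ~8ms average delay measured earlier.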

This also explains why I only saw a ~7% decrease in the wake-up rate when I tried out a fixed 10ms wake-up delay before. (Rather than that change taking the delay from 0ms to 10ms, it takes it from [0, 16ms] to [10ms, 26ms].)

I am of the opinion that we should start with a simple approach that makes an improvement and then evaluate further changes from there. All timers have already been subject to an up-to-16ms delay in their firing, so I don't think that we need to be too worried about short-duration timers needing small delay tolerances (feel free to disagree/prove me wrong!).

Thus I would propose going with either a fixed wake-up delay of 10-20ms or a dynamic delay based on looking ahead at the timers that are currently scheduled (with a minimum delay of, say, 8ms, in all cases).

I will try these two options out and get some numbers of their respective impacts but I'd still be interested to hear other opinions/ideas.

Status update:

  • The current patch stack (in https://bugzilla.mozilla.org/show_bug.cgi?id=1787974) causes a very repeatable failure in WebRtcIceGatherTest.TestStunTcpServerTrickle/TestStunTcpAndUdpServerTrickle on Linux and OSX platforms (those tests do not run on other platforms).
    -- The test failures start specifically with https://phabricator.services.mozilla.com/D164294. This is interesting because that patch set should be fixing what would be unquestionably a bug (timers not always being added in the correct spot).
    -- The failure appears to be related to timing. The test is waiting for some work to complete and, for some reason, when the timers are ordered properly, it takes slightly longer for the work in the test to complete.
    --- If I adjust the test to wait longer, the failures go away.
    --- If I disable the background hang timer, then the failures in try go away for Linux. For Doug, who can reproduce the problem locally in OSX, disabling BHR makes the failure completely disappear.

  • After adding some logging and running some unit tests locally (specifically SimpleTimerTest.* which are failing intermittently unrelated to my change - see https://bugzilla.mozilla.org/show_bug.cgi?id=1744404) I am seeing a lot of calls to TimerThread::RemoveTimer for timers that are not currently "in" the timer thread. I suspect that this is mostly defensive (ie the client has a timer that they want to use and isn't sure if they have cleaned it up properly so they remove it redundantly right before they add it back in). I don't know if this behavior carries over to Firefox or not but it is wasteful. I would like to come back to take a quick look at this once this work is completed.

Next steps:

  • Add some more exhaustive logging and figure out why exactly having a properly sorted timer list is causing the change in behavior that is resulting in the failures.

(In reply to Justin Link from comment #14)

Status update:

  • The current patch stack (in https://bugzilla.mozilla.org/show_bug.cgi?id=1787974) causes a very repeatable failure in WebRtcIceGatherTest.TestStunTcpServerTrickle/TestStunTcpAndUdpServerTrickle on Linux and OSX platforms (those tests do not run on other platforms).
    -- The test failures start specifically with https://phabricator.services.mozilla.com/D164294. This is interesting because that patch set should be fixing what would be unquestionably a bug (timers not always being added in the correct spot).
    -- The failure appears to be related to timing. The test is waiting for some work to complete and, for some reason, when the timers are ordered properly, it takes slightly longer for the work in the test to complete.
    --- If I adjust the test to wait longer, the failures go away.

I wasn't sure if it was possible to profile gtests, but it turned out to be easy: MOZ_PROFILER_STARTUP_FILTERS="*" MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN="$(pwd)/gtest.json" ./mach gtest '*TestStunTcpAndUdpServerTrickle*' && samply load gtest.json
(with samply installed with cargo install samply)

I reproduced the failure only once out of many runs, and unfortunately I didn't have the profiler running at the time.

But in profiles of local passing runs, I see something that I think is unexpected behavior in the Timer thread
https://share.firefox.dev/3GVzfsc and https://share.firefox.dev/3IFhCOA <-- there's a 128ms BackgroundHangThread_timer that gets removed after 1s.
https://share.firefox.dev/3IAeIuC <-- there's a 128ms BackgroundHangThread_timer that gets dispatched after 794ms

Is it possible that after inserting the timer in the correct position, we don't re-compute the next wait time correctly for the timer thread?

Interesting observations! Let me see what I can figure out ...

Ahh, hmm, I have an idea. The logic to determine whether or not the timer thread should be woken up when a timer is added is fairly simple - it just looks to see if the newly-added timer is the first entry in the list. This is insufficient and incorrect because it doesn't account for the fact that there can be cancelled timers at the front of the list.

With the previous version of the insertion, the slot that a newly-inserted timer would go into was not consistent (it was not even guaranteed to be at the front of the list), but with the fixed version the new timer reliably goes into the last of the empty timer slots.

I will try out a quick fix to see if I can confirm this theory.
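A quick sketch of the suspected fix (assumed data structures, not the real code): compare the new timer against the first live entry rather than literally the first slot, since cancelled timers may still occupy the front of the list.

#include <chrono>
#include <optional>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Entry {
  std::optional<Clock::time_point> deadline;  // nullopt == cancelled slot
};

bool NewTimerNeedsWakeup(const std::vector<Entry>& aSortedEntries,
                         Clock::time_point aNewDeadline) {
  for (const Entry& e : aSortedEntries) {
    if (e.deadline) {
      // Only an earlier-firing timer forces an earlier wake-up.
      return aNewDeadline < *e.deadline;
    }
  }
  return true;  // only cancelled timers were in the list, so nothing was scheduled
}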

The above theory does seem to be correct (see try run at https://treeherder.mozilla.org/jobs?revision=0cfa75857e3042bcc5f683cd811e73c4a9362568&repo=try&selectedTaskRun=WQLxlZVGTH-8Ak0OXyZGTQ.0). I took a quick look through the rest of the TimerThread code to see if a similar incorrect assumption might be lying somewhere else and didn't notice anything that looked suspect.

I think that there is still some work that could be done in TimerThread to clean up/simplify timer list management and efficiency but I'm not sure if that would be warranted at this time.

Is there a bug to file on the timer insertion code here?

Flags: needinfo?(jlink)

The timeout of the next timer is a good proxy for when we are planning
to wake up only as long as we always intend to wake up for the next
timer, but that is going to stop being the case soon.

Depends on D173101

On Windows this resulted in a significant (~48%) reduction in the number
of times that AddTimer() woke up the main thread (for the sole purpose
of re-scheduling its wake-up) with a minimal (1.3% / 0.08ms) increase to
the average timer firing delay (difference between when a timer
requested to fire and when it actually did fire).
The tolerance is very large (8ms) right now on Windows on account of
the low resolution with which we can control our waking up.

Depends on D173102

This results in a shift in the distribution of how many timers we fire per wake-up, with a ~11% increase in the rate at which we wake up and fire 3 timers, a 38% increase in the rate at which we fire 4 timers, and a 76% increase in the rate at which we fire 5 timers.

This sounds really impressive but the final behavior is still dominated by all of the wake-ups in which we fire one or two timers.

Depends on D173104

Treeherder link for the current patch stack: https://treeherder.mozilla.org/jobs?repo=try&revision=59a18ac6f03a66f6c6df51729de9460f9d9cc3b9 (looks okay to me).

Flags: needinfo?(jlink)
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a66b4ef398e7 Add detailed statistics pt 1 r=smaug,bwc
https://hg.mozilla.org/integration/autoland/rev/3451a46f8ef8 Explicitly track intended wake-up time r=smaug,bwc
https://hg.mozilla.org/integration/autoland/rev/4eacdb9fd6ac Don't wake up in AddTimer() if the currently-scheduled wake-up time can work r=smaug,bwc
https://hg.mozilla.org/integration/autoland/rev/c3ede0ba2cca Added statistics for early wake-ups r=smaug,bwc
https://hg.mozilla.org/integration/autoland/rev/3f1b33e32142 Look at the upcoming timers to see what can be bundled together when determining when to wake-up. r=smaug,bwc
https://hg.mozilla.org/integration/autoland/rev/970281a0f13d Longer duration timers can have longer firing delays r=smaug
Regressions: 1828427
Blocks: 1830139
See Also: → 1830139
Blocks: 1838557