Closed Bug 1363829 Opened 7 years ago Closed 7 years ago

consider using a single nsITimer for setTimeout() in windows

Categories

(Core :: DOM: Core & HTML, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Performance Impact low
Tracking Status
firefox55 --- fixed

People

(Reporter: bkelly, Assigned: bkelly)

References

(Depends on 1 open bug, Regressed 2 open bugs)

Details

Attachments

(17 files, 61 obsolete files)

5.21 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
8.95 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
20.18 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.01 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
10.53 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
25.96 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
6.92 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
14.50 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
3.38 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
2.69 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
1.28 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
9.52 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
2.47 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
2.77 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
6.31 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.23 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
10.75 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
Currently we register a separate nsITimer for every setTimeout()/setInterval() call in nsGlobalWindow.  This is unnecessary since we only really need to be woken up for the *next* timer.  We can then schedule another nsITimer for the next one and so on.

This would be beneficial because we still have some performance issues with large numbers of nsITimers.  In particular, trying to cancel many timers so we can reschedule them triggers a lot of mutex contention with the timer thread.  This can block the main thread.  We do this sort of thing when switching tabs and rescheduling throttled timers due to the window coming to the foreground.

If we make this change we would be relying on the ThrottledEventQueue less.  The queue would still be useful to yield between timers from many windows in the same TabGroup, but timer floods would not back up the queue any more.  So the back pressure mechanism would likely need to change to look at the TimeoutManager's list of timers instead.

I'm going to see if I can get this into FF55.
(In reply to Ben Kelly [not reviewing due to deadline][:bkelly] from comment #0)
> This would be beneficial because we still have some performance issues with
> large numbers of nsITimers.  In particular, trying to cancel many timers so
> we can reschedule them triggers a lot of mutex contention with the timer
> thread.  This can block the main thread.

Would it be equally feasible to add APIs for canceling multiple timers in one go instead?
Flags: needinfo?(bkelly)
(In reply to Nathan Froyd [:froydnj] from comment #1)
> Would it be equally feasible to add APIs for canceling multiple timers in
> one go instead?

We could, but I think it would be strictly worse than what I am proposing.  Keeping the current many-timers approach with bulk cancellation would:

1) Use more memory for nsITimers objects
2) Lock the timer thread mutex more frequently for insertions
3) Generate more runnables for timeouts firing near the same time
4) Require building the list of timers to cancel, which would probably not scale well

Using a single timer per window avoids all these issues.  It just does less work for the same functional behavior.
Flags: needinfo?(bkelly)
Ehsan asked me to put a profile showing the problem:

https://perfht.ml/2r2LQRy

This is from running the stress test at:

https://people-mozilla.org/~bkelly/timer-flood/index.html

Search for "CriticalSectionContended".

Cycle collection is also bad here, but its unclear to me if there is anything we can do to improve analyzing so many cycle collectable timer objects.
Attached patch hmm (obsolete) — Splinter Review
Depends on: 1365111
No longer depends on: 1365111
This is still a work-in-progress.  I have the executor written and some cleanup done, but I still have to make TimeoutManager use the executor instead of InitTimer().
Blocks: 1362322
Whiteboard: [qf] → [qf:p3]
These changes look pretty promising locally.  They greatly smooth out the servo tiles demo.  Lets get a build on some other platforms to see how it behaves there (and if it breaks tests):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=838a4951ae9352a0cfa590e702cdbe0a3933198b

I still need to figure out what to do about the back pressure mechanism.  I either need to rewrite it not to depend on ThrottledEventQueue or just remove it.
Hopefully this clears up most of the orange:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4f72aac2d49ef43c10c5fc5d8c7dc6e82d252a8

It looks like there are some tests, though, that race setTimeout(f, 0) against other behavior and expect setTimeout() to lose.  Since this patch speeds up setTimeout() these tests are failing.  I'll have to work through these things.

The reason setTimeout(f, 0) is much faster now is because we uses to schedule an nsITimer for it, but now we just dispatch a runnable to the current thread to execute the timer.  I believe this is correct behavior and its the tests that are broken in this case.
Depends on: 1277657
Depends on: 1366642
No longer blocks: TimeToFirstPaint_FB
Depends on: 1366783
Depends on: 1366791
Attachment #8869231 - Attachment is obsolete: true
Attachment #8869232 - Attachment is obsolete: true
Unfortunately the talos test improvements were due to a bug where we would often fire timers earlier than before.  I've fixed that and now its looking like this bug is talos neutral.
The first few patches here will refactor TimeoutManager so that I can remove dependencies on the nsITimer and the specific Timeout in RunTimeout.

To start, this patch removes some dead code that used to involve passing nullptr to RunTimeout() for the Timeout arg.  This code was orphaned by my changes previously in bug 1303167.
Attachment #8870116 - Attachment is obsolete: true
Attachment #8870244 - Flags: review?(ehsan)
Next, replace the Timeout argument to RunTimeout() with simply the deadline we wish to execute up to.  Most of the code in RunTimeout() only needs the deadline.

The code that checks to see if the we executed the target Timeout is removed.  This was really previously needed for cases where we did aggressive yielding.  Consider:

1. Timer t1 is scheduled
2. Timer t2 is scheduled
3. nsITimer for t2 fires, executes t1, and yields after only one timer
4. When t1 is executed its nsITimer is canceled.
5. No nsITimer ever fires again and t2 is never executed.

With the remaining patches in this bug this will no longer be a problem at all.  Also, after bug 1343912 we yield much less aggressively.  Its highly unlikely we will hit this condition even if someone were to run on just this patch and not the others in this bug.
Attachment #8870117 - Attachment is obsolete: true
Attachment #8870245 - Flags: review?(ehsan)
Next, we need to change how the Timeout objects are held alive.  The nsITimer for each Timeout holds a ref and that will be going away.  Also, manual AddRef()/Release() is used for holding a ref while the timer is in the list which is error prone.

This patch simplifies everything by just using LinkedList<RefPtr<Timeout>> now that we support it.  This makes things much easier to reason about.

The ref held during the nsITimer callback is implemented as a self RefPtr<Timeout>.  This is also easier to understand, but it will be going away in a later patch anyway.
Attachment #8870118 - Attachment is obsolete: true
Attachment #8870248 - Flags: review?(ehsan)
The last place the nsITimer is used is for extracting the actual scheduled delay when adjusting timers.  This patch just stores this value on the Timeout instead so we don't need to access the nsITimer.
Attachment #8870119 - Attachment is obsolete: true
Attachment #8870251 - Flags: review?(ehsan)
This patch adds the bulk of the new execution code.  Its contained in a new TimeoutExecutor class:

* The TimeoutExecutor is can be scheduled in "Immediate" mode as a runnable or "Delayed" mode as an nsITimerCallback.
* There is the concept of "flex time" representing how far into the future we may fire a Timeout.  This matches the rounding performed by TimerThread today.
* If the next deadline has passed or less than "flex time" in the future then Immediate mode is triggered as a runnable dispatch.
* Otherwise the deadline is scheduled as a nsITimer in the future.
* The TimeoutExecutor provides a MaybeSchedule(deadline) method that can be called frequently.  It only reschedules the TimeoutExecutor if it sees an earlier deadline than its already scheduled for.
* The TimeoutExecutor must deal with duplicate execution calls.  The timer may still fire even if its canceled and we cannot cancel the nsIRunnable dispatch.

The next patch will actually hook this code up into TimeoutManager.
Attachment #8870147 - Attachment is obsolete: true
Attachment #8870254 - Flags: review?(ehsan)
This integrates the TimeoutExecutor into the TimeoutManager.  This mainly involves calling MaybeSchedule() whenever we get a new Timeout or when we know the executor has been stopped.

The only places we need to explicitly stop the executor are:

1. If the next soonest Timeout is canceled with ClearTimeout().
2. If all the timeouts are canceled.
3. If the window is suspended.
Attachment #8870121 - Attachment is obsolete: true
Attachment #8870256 - Flags: review?(ehsan)
This removes the now unused Timeout nsITimer code.
Attachment #8870122 - Attachment is obsolete: true
Attachment #8870257 - Flags: review?(ehsan)
This removes the back pressure mechanism I added for FF52.  It is designed around every Timeout getting a runnable in the ThrottledEventQueue, but that is not longer the case.

I considered ways of implementing a new backpressure mechanism, but I decided not to for a few reasons:

1. We already provide some back pressure by limiting how many timers we will implement at once.  In particular, once you get many Timeout objects the cycle collector tends to limit things as well.
2. The old back pressure mechanism didn't work very well.  It prevented memory from growing rapidly, but it couldn't stop it for every work load.  It also triggered jank in this work loads due to ResetTimersForThrottleReduction().
3. The backpressue mechanism is complicated and hard to maintain.  The old one had to be tuned a few times as the browser changed.

Overall I think we are better off without a backpressure mechanism for now.  If I can think up a simple way to do it in the future we can always add it again.

My limited testing showed the CC tended to slow things down and memory growth slowed after 1+GB of Timeout objects.
Attachment #8870123 - Attachment is obsolete: true
Attachment #8870260 - Flags: review?(ehsan)
Olli, I asked you about this test last week, I think.  Basically, it assumes that we can call setTimeout(f, 0) and that various async work will be done.  This is racy and fails with the new setTimeout() here.  Since we dispatch a runnable instead of using nsITimer we can execute slightly faster.

This patch converts to use requestAnimationFrame() to wait for next paint.  A double rAF is used since paint has not occurred after the first one.  (I tried with a single rAF and it did not fix the test.)
Attachment #8870261 - Flags: review?(bugs)
David, I think I asked you about this reftest a little in IRC.  It turned out not to be a problem with the reftest harness.

Instead, this reftest does a setTimeout(f, 0) on load and then tries to call focus() on an image map.  Unfortunately bug 1366642 means this won't work if first paint has not occurred yet.

This patch changes the reftest to use requestAnimationFrame() to wait for first paint before starting.
Attachment #8870262 - Flags: review?(dbaron)
This try is basically the same as the code I just asked for review on.  It just has the round.patch separated from the P5 here.  For review I folded it into P5.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=47262140282a6aa0e5273b69d3c25b93c116a1e8
It seems this test is failing about 10% of the time with these patches:

https://dxr.mozilla.org/mozilla-central/source/browser/base/content/test/general/browser_bug462289.js

I'll try to figure out what is wrong there before anything lands.
Comment on attachment 8870254 [details] [diff] [review]
P5 Add an executor class to represent the single timer executing timers for a window. r=ehsan

Sorry, I did not actually fold the rounding patch into here yet.
Attachment #8870254 - Flags: review?(ehsan)
Here is a profile with these patches of the timer-flood stress test:

https://perfht.ml/2qR48aC

Compare to comment 3.

There are still some long pauses, but they are all due to cycle collection now.  The only CriticalSectionContended sample is in a malloc call.
What was the failure mode for the image map reftests in P10 without the patch?
Flags: needinfo?(bkelly)
(In reply to David Baron :dbaron: ⌚️UTC-4 from comment #49)
> What was the failure mode for the image map reftests in P10 without the
> patch?

It depended on the timing, but typically one of them would run their init() method before first paint.  This would cause the image page to miss the focus() and not trigger any rendering of the dotted outline.  This would then fail the image comparison.  Sometimes it was the -1.html output that was wrong and sometimes it was the -1-ref.html output that was wrong.

I put more description of the underlying problem in bug 1366642.
Flags: needinfo?(bkelly)
Comment on attachment 8870261 [details] [diff] [review]
P9 Make docshell chrome tests wait for next paint to avoid races. r=smaug

rs+
Attachment #8870261 - Flags: review?(bugs) → review+
Comment on attachment 8870262 [details] [diff] [review]
P10 Make image map reftest wait for first paint before triggering focus/blur. r=dbaron

OK, r=dbaron, I guess.

I'm a little concerned that there might be tests that are no longer testing what they were originally designed to test, that you didn't detect through test failures, though.  Is there a way you could find such tests?
Attachment #8870262 - Flags: review?(dbaron) → review+
(In reply to David Baron :dbaron: ⌚️UTC-4 from comment #53)
> I'm a little concerned that there might be tests that are no longer testing
> what they were originally designed to test, that you didn't detect through
> test failures, though.  Is there a way you could find such tests?

Hmm, I'm not sure how if they don't fail in try.  To be clear, I believe this is a race in the test that has always been there due to the underlying image map issue.  Unfortunately I don't know a good way to find setTimeout() races in non-failing tests.

I will, however, do a new try without this patch to see if its still necessary with the final rounding changes I put in.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9a86663f19f4cd74b32e4a3993c3b278373acb47
On my windows desktop with a opt non-pgo build the patches here make setTimeout(f, 0) about 5us faster.  A t-test shows this is statistically significant.

I need to adjust some of the delayed rounding code, though.  My tests show that the current patches fire about 500us later than before.  The old code did no rounding, so I am probably being too conservative here.
Comment on attachment 8870256 [details] [diff] [review]
P6 Use the TimeoutExecutor in TimeoutManager. r=ehsan

Going to change this patch as well to fix some of the rounding issues.
Attachment #8870256 - Flags: review?(ehsan)
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #54)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=9a86663f19f4cd74b32e4a3993c3b278373acb47

P9 and P10 are still necessary.
Lets try with a different rounding strategy.  This gives us the most accurate firing time in my local tests.  For a 25ms timer:

current m-c: mean delay of 27.5ms with stdev of 2.4ms
this patch set: mean delay of 25.1ms with stdev of 0.7ms

I wonder if our tests can handle this improved accuracy:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e7192e9c9fac05998e9ebf559d851ee73898eb9
I need to investigate these two test failures:

devtools/client/debugger/test/mochitest/browser_dbg_worker-window.js
browser/base/content/test/general/browser_bug462289.js
See comment 38 for the main description.

This update patch further adjusts how precise delays are converted to integral milliseconds for nsITimer.  The previous version of the patch was trying to do a more typical round operation while previous m-c truncated in many cases.

After doing some testing this code now favors rounding down pretty aggressively.  If we fire too early we then schedule another runnable or 0ms timer to get us to the right time.  This provides greatly improved accuracy at the cost of only a couple more runnables.  Even on a fast machine it does not fire large number of runnables.  Also, this cost won't be incurred if the browser is busy as we are unlikely to get the timer fired early in that case.
Attachment #8870254 - Attachment is obsolete: true
Attachment #8870575 - Flags: review?(ehsan)
See comment 39 for the main description.

This updated patch just adjusted how RunTimeout() uses the now timestamp a bit.
Attachment #8870256 - Attachment is obsolete: true
Attachment #8870577 - Flags: review?(ehsan)
I'm going to make any additional changes as follow-on patches, so the current set can be reviewed.  I won't change them any more.  Sorry for the churn.
So, every time we fail browser_dbg_worker-window.js the WorkerPrivate is cycle collected after the test tab-leak-check.  Every time we succeed the WorkerPrivate is cycle collected first.

While I didn't trace the object graph completely, I believe this test is creating a cycle between the window and worker.  It depends on CC to fire in order to clean up properly.

In the past we happened to always get the CC to run before the end of the test, but it was not guaranteed.  With the new setTimeout() implementation sometimes the test finishes just before the CC and we fail.

Ryan, this patch expands the debugger test cleanup routine to perform a CC in addition to the existing forceGC().  I also made it wait for an exact GC to complete for good measure.

This seemed more robust for the overall test suite instead of trying to manually cut the cycle in this particular test.  We don't want to have to hunt these things down again in the future.  I don't think manually cutting cycles is the point of these tests.

Here's a try build with 20 retriggers that shows it seems to fix the problems:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f227fff8334685ffa5c73af25687286370dc242a

I believe its because of the race with the CC timer I described above, but I suppose it could also just be changing timing again.

Here's a full try build with the new patch set:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cc9a071d92b9eaf9dced6f197125bd56e64def8
Attachment #8870627 - Flags: review?(jryans)
I think just browser/base/content/test/general/browser_bug462289.js to sort out now.
Comment on attachment 8870627 [details] [diff] [review]
P11 Make debugger mochitests perform a full CC in addition to the forced GC during test cleanup. r=jryans

Review of attachment 8870627 [details] [diff] [review]:
-----------------------------------------------------------------

We do similar things in other DevTools test directories, and you appear to have a good reason for it, so looks good to me.
Attachment #8870627 - Flags: review?(jryans) → review+
Ryan convinced me in another bug we should not be firing early as often as we are.  So I'm removing the "flex time" to fire into the future:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3435f35fee4504063c1b2dd2f22d37a8cf26f186

Using this tool to measure precision:

https://people-mozilla.org/~bkelly/timeout-precision?mode=delta&delay=10&count=100
This patch removes the "flex time" concept introduced in P5.  Ryan convinced me its actually a compat problem to fire setTimeout() early.  Our old implementation does it, but my testing shows that most other browsers don't.

Let's fix this by removing the flex time and preventing timers from firing early.

I still round down our nsITimer delay by 100 microseconds in order to enable the high precision mode.  My next patch will avoid this in background tabs.

Again, for reference I am using this tool to measure our timeout precision:

https://people-mozilla.org/~bkelly/timeout-precision/?mode=delta&delay=10&count=100

On windows I get about 25% of the timeouts scheduled by this page firing early.  Chrome doesn't fire any early.  Edge fires maybe 2% early.
Attachment #8870952 - Flags: review?(ehsan)
This patch introduces the concept of low and high precision.  In low precision we favor rounding up our nsITimer delay.  In high precision we favor rounding down our nsITimer delay.

The significance is that when we round down the nsITimer delay we may end up dispatching some additional 0-ms nsITimer or immediate runnables while we wait for the last few microseconds to count down to the deadline.  This is not that expensive, but lets avoid doing it on background tabs.  Who cares if they are a few milliseconds late.
Attachment #8870953 - Flags: review?(ehsan)
Note, I profiled the "high precision mode" where we may fire extra runnables until the deadline expires.  It doesn't even show up in the profile at all:

https://perfht.ml/2qW4ooP
https://perfht.ml/2qWvMTP
Attachment #8870244 - Flags: review?(ehsan) → review+
Comment on attachment 8870245 [details] [diff] [review]
P2 Removing the Timeout argument to TimeoutManager::RunTimeout(). r=ehsan

Review of attachment 8870245 [details] [diff] [review]:
-----------------------------------------------------------------

Fantastic!
Attachment #8870245 - Flags: review?(ehsan) → review+
Comment on attachment 8871014 [details] [diff] [review]
P14 Make pushPrefEnv() use setTimeout(1) since setTimeout(0) fires slightly faster now. r=ehsan

So it seems specialpowers in general and pushPrefEnv in particular use setTimeout(f, 0) to determine "completion".  Now that setTimeout(f, 0) runs slightly faster this causes some tests fail.

For example, this test becomes intermittent:

https://dxr.mozilla.org/mozilla-central/source/layout/base/tests/test_bug394057.html

I can't see anything in this test other than the pushPrefEnv that could be async.

This patch bumps up the setTimeout calls to a delay of 1 which will ensure an nsITimer is used.  It fixes the test linked here and at least one other.

I feel this patch is kind of lame, but I don't have a better solution since it seems these operations in specialpowers don't have good completion events to really base this stuff on.
Attachment #8871014 - Flags: review?(ehsan)
Comment on attachment 8870248 [details] [diff] [review]
P3 Improve Timeout ref counting to avoide bare AddRef/Release calls. r=ehsan

Review of attachment 8870248 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, I think this is a great improvement over the current state of things!

::: dom/base/Timeout.h
@@ +100,5 @@
>  
>    // The language-specific information about the callback.
>    nsCOMPtr<nsITimeoutHandler> mScriptHandler;
>  
> +  RefPtr<Timeout> mClosureSelfRef;

Nit: This isn't really a closure, unless I'm missing something?  How about simply calling it mSelfRef?
Attachment #8870248 - Flags: review?(ehsan) → review+
Attachment #8870251 - Flags: review?(ehsan) → review+
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #75)
> > +  RefPtr<Timeout> mClosureSelfRef;
> 
> Nit: This isn't really a closure, unless I'm missing something?  How about
> simply calling it mSelfRef?

This member is removed in P6 anyway, so I probably won't bother renaming this, if thats ok.
Comment on attachment 8870577 [details] [diff] [review]
P6 Use the TimeoutExecutor in TimeoutManager. r=ehsan

Review of attachment 8870577 [details] [diff] [review]:
-----------------------------------------------------------------

This was a joy to review, thanks so much for doing this!

::: dom/base/TimeoutManager.cpp
@@ +1033,5 @@
>      return true;
>    }
>  
>    // Reschedule the OS timer. Don't bother returning any error codes if
>    // this fails since the callers of this method don't care about them.

This comment isn't accurate any more.

::: dom/base/TimeoutManager.h
@@ +219,5 @@
>  
>    // Each nsGlobalWindow object has a TimeoutManager member.  This reference
>    // points to that holder object.
>    nsGlobalWindow&             mWindow;
> +  RefPtr<TimeoutExecutor>     mExecutor;

Hmm, why is this an indirect member and not just an inline |TimeoutExecutor mExecutor;|?  I haven't yet looked at all of the remaining parts but if something else doesn't need the indirection please remove it.
Attachment #8870577 - Flags: review?(ehsan) → review+
Comment on attachment 8870575 [details] [diff] [review]
P5 Add an executor class to represent the single timer executing timers for a window. r=ehsan

Review of attachment 8870575 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/TimeoutExecutor.cpp
@@ +220,5 @@
> +
> +NS_IMETHODIMP
> +TimeoutExecutor::SetName(const char* aName)
> +{
> +  return NS_ERROR_NOT_IMPLEMENTED;

Can you please implement this method similar to existing nsINamed implementations as a debug-only thing?

::: dom/base/TimeoutExecutor.h
@@ +22,5 @@
> +  nsCOMPtr<nsITimer> mTimer;
> +  TimeStamp mDeadline;
> +  const TimeDuration mFlexTime;
> +
> +  enum class Mode

Can you please add some comment about what this means?  I didn't really understand this well before reading the next part of the patch series, it would be helpful to have some description of the states the object can live in here.

@@ +56,5 @@
> +  void
> +  Shutdown();
> +
> +  nsresult
> +  MaybeSchedule(const TimeStamp& aDeadline);

I found the names of these methods a bit hard to follow, especially in the .cpp file.  I'd consider avoiding using "Maybe" in the names of the methods.  But that requires a lot of code changes, so it's probably not worth it!  I don't really feel that strongly.
Attachment #8870575 - Flags: review?(ehsan) → review+
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #76)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #75)
> > > +  RefPtr<Timeout> mClosureSelfRef;
> > 
> > Nit: This isn't really a closure, unless I'm missing something?  How about
> > simply calling it mSelfRef?
> 
> This member is removed in P6 anyway, so I probably won't bother renaming
> this, if thats ok.

Of course!  (I'm reviewing these as I go, sorry about missing stuff like this.)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #77)
> This was a joy to review, thanks so much for doing this!

Thanks!

> Hmm, why is this an indirect member and not just an inline |TimeoutExecutor
> mExecutor;|?  I haven't yet looked at all of the remaining parts but if
> something else doesn't need the indirection please remove it.

In theory the execution runnable/timer can be scheduled when the nsGlobalWindow/TimeoutManager is destroyed.  It seemed easiest to handle this by making TimeoutExecutor implement the nsIRunnable/nsITimerCallback interfaces and be its own ref-counted object.

I don't think its too bad because we are saving a ton of heap allocations per nsITimer with the patches in this bug.  We don't have to allocate a separate set of nsTimer/nsTimerImpl/runnable objects for each Timeout like we used to.  Instead we just have a single TimeoutExecutor that stays alive and gets re-used.
Comment on attachment 8870257 [details] [diff] [review]
P7 Remove the Timeout nsITimer code. r=ehsan

Review of attachment 8870257 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/Timeout.cpp
@@ -27,5 @@
>      mPopupState(openAllowed)
>  {
>  }
>  
>  Timeout::~Timeout()

And then this will go also.

::: dom/base/Timeout.h
@@ +30,4 @@
>    : public LinkedListElement<RefPtr<Timeout>>
>  {
>  public:
>    Timeout();

Nit: now you could also have |~Timeout() = default;|.
Attachment #8870257 - Flags: review?(ehsan) → review+
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #78)
> > +NS_IMETHODIMP
> > +TimeoutExecutor::SetName(const char* aName)
> > +{
> > +  return NS_ERROR_NOT_IMPLEMENTED;
> 
> Can you please implement this method similar to existing nsINamed
> implementations as a debug-only thing?

I'm not sure what you mean, many (most?) are implemented with NS_ERROR_NOT_IMPLEMENTED:

https://dxr.mozilla.org/mozilla-central/source/dom/base/nsContentSink.cpp#1641
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsGlobalWindow.cpp#665
https://dxr.mozilla.org/mozilla-central/source/dom/events/EventStateManager.cpp#247
etc...
Comment on attachment 8870260 [details] [diff] [review]
P8 Remove TimeoutManager back pressure mechanism. r=ehsan

Review of attachment 8870260 [details] [diff] [review]:
-----------------------------------------------------------------

This is fine with me, but I would like Olli to sign off on this since originally he reviewed your patches for it.  In general I agree that it's a complicated and hard to maintain feature and I trust your judgement on whether it serves us more than it hurts us.
Attachment #8870260 - Flags: review?(ehsan) → review?(bugs)
Comment on attachment 8870952 [details] [diff] [review]
P12 Prevent setTimeout() from firing early by removing TimeoutExecutor flex time. r=ehsan

Review of attachment 8870952 [details] [diff] [review]:
-----------------------------------------------------------------

If your version control setup makes it easy, please fold this into part 5, so that we don't add this and take it away needlessly!
Attachment #8870952 - Flags: review?(ehsan) → review+
Comment on attachment 8870953 [details] [diff] [review]
P13 Creat a low precision mode and use it for scheduling TimeoutExecutor in background tabs. r=ehsan

Review of attachment 8870953 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/TimeoutExecutor.h
@@ +16,5 @@
>  
> +enum class TimeoutPrecision : uint8_t {
> +  Low,
> +  High
> +};

The patch comment (which is great!) should probably go here.
Attachment #8870953 - Flags: review?(ehsan) → review+
Comment on attachment 8871014 [details] [diff] [review]
P14 Make pushPrefEnv() use setTimeout(1) since setTimeout(0) fires slightly faster now. r=ehsan

Review of attachment 8871014 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/specialpowers/content/specialpowersAPI.js
@@ +707,5 @@
>      if (typeof window != 'undefined')
>        setTimeout(callback, 0);
>      // for mochitest-plain
>      else
> +      content.window.setTimeout(callback, 1);

Please add a scary comment here warning people that it's important for this to not go below 1!

@@ +721,5 @@
>           if (typeof window != 'undefined')
>             setTimeout(aCallback, 0);
>           // For mochitest-plain
>           else
> +           content.window.setTimeout(aCallback, 1);

Ditto.
Attachment #8871014 - Flags: review?(ehsan) → review+
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #82)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #78)
> > > +NS_IMETHODIMP
> > > +TimeoutExecutor::SetName(const char* aName)
> > > +{
> > > +  return NS_ERROR_NOT_IMPLEMENTED;
> > 
> > Can you please implement this method similar to existing nsINamed
> > implementations as a debug-only thing?
> 
> I'm not sure what you mean, many (most?) are implemented with
> NS_ERROR_NOT_IMPLEMENTED:
> 
> https://dxr.mozilla.org/mozilla-central/source/dom/base/nsContentSink.
> cpp#1641
> https://dxr.mozilla.org/mozilla-central/source/dom/base/nsGlobalWindow.
> cpp#665
> https://dxr.mozilla.org/mozilla-central/source/dom/events/EventStateManager.
> cpp#247
> etc...

Oh... you're right, my memory is failing me.  Runnable::SetName() is *the exception* not the rule.  Sorry, my bad.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #80)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #77)
> > This was a joy to review, thanks so much for doing this!
> 
> Thanks!
> 
> > Hmm, why is this an indirect member and not just an inline |TimeoutExecutor
> > mExecutor;|?  I haven't yet looked at all of the remaining parts but if
> > something else doesn't need the indirection please remove it.
> 
> In theory the execution runnable/timer can be scheduled when the
> nsGlobalWindow/TimeoutManager is destroyed.  It seemed easiest to handle
> this by making TimeoutExecutor implement the nsIRunnable/nsITimerCallback
> interfaces and be its own ref-counted object.

Oh, so it's lifetime can actually survive past the window/timeout manager... OK, that's true in theory.  Now that I think about it more, it may be better to actually not codify any lifetime assumptions in the code here and keep this a RefPtr.  But can you please add a comment explaining the lifetime of the member?

> I don't think its too bad because we are saving a ton of heap allocations
> per nsITimer with the patches in this bug.  We don't have to allocate a
> separate set of nsTimer/nsTimerImpl/runnable objects for each Timeout like
> we used to.  Instead we just have a single TimeoutExecutor that stays alive
> and gets re-used.

Yeah, the patches are improving things overall so I'm not very concerned about this.  :-)  And your argument about is actually a strong reason *not* to have this as an inline member.
Attachment #8870577 - Attachment is obsolete: true
Attachment #8871461 - Flags: review+
Attachment #8870257 - Attachment is obsolete: true
Attachment #8871463 - Flags: review+
I added a comment.  I also decided we should use 1ms throughout for the setTimeout() calls here.  I was confused and thought some of them were using nsITimer jsm thing, but they are all using the nsGlobalWindow version.
Attachment #8871465 - Flags: review+
Maybe P13 changes to specialpowers setTimeout is causing more trouble than it fixed.  Lets see what it looks like without that:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ef58152d396e2d5c2d44a11f95ccc568cd4f3aa8
Comment on attachment 8870260 [details] [diff] [review]
P8 Remove TimeoutManager back pressure mechanism. r=ehsan

I'm ok removing this. TimeoutManager has gotten so complicated that some code removals are good.
Attachment #8870260 - Flags: review?(bugs) → review+
Comment on attachment 8871465 [details] [diff] [review]
P13 Make pushPrefEnv() use setTimeout(1) since setTimeout(0) fires slightly faster now. r=ehsan

In retrospect I don't think this patch is necessary.
Attachment #8871465 - Attachment is obsolete: true
Comment on attachment 8871468 [details] [diff] [review]
P12 Create a low precision mode and use it for scheduling TimeoutExecutor in background tabs. r=ehsan

I think I have a better solution than the "high precision mode" thing that does a short async runnable spin of the event loop.
Attachment #8871468 - Attachment is obsolete: true
The root cause of the precision problems in this bug are that we require an integer number of milliseconds be passed to nsITimer.  This is an artificial limitation, though.  The underlying implementation already supports a finer grained measurement.

This patch simply pipes TimeDuration into the nsITimer using a new PreciseInitWithCallback() method.  I only bothered with the nsITimerCallback variant since that is what I needed.  Others could be added if desired.

I also preserved TimeDuration throughout the TimerThread run loop, Monitor, and CondVar layers.  This allows the precise delay to be passed down to the OS level call.

You may be asking, why do we need this now and not before?

To answer, consider this situation:

1. setTimeout(f, 1);
2. do some work for 500us
3. setTimeout(g, 1);

This should fire f() and g() 500us apart.

In our existing code this works because each setTimeout() gets its on nsITimer.  The 1ms is preserved and pass through to nsITimer which converts it to a TimeStamp, etc.

With the patches in this bug there is only one nsITimer.  We fire f() and then try to schedule a new nsITimer for g().  Its only 500us in the future, though!  What do we do?  We either fire early and spin for a while or fire late.

The best solution I've found is to preserve the accuracy through the nsITimer implementation.  This avoids extra runnables and wakeups, while also improving our overall accuracy.

On my machine using this approach changed the output on this site:

https://people-mozilla.org/~bkelly/timeout-precision/?mode=delta&delay=10&count=100

From:

min:0.095 mean:0.640 median:0.525 max:6.860 stdev:0.727

To:

min:0.015 mean:0.842 median:0.052 max:5.055 stdev:1.589

As you can see, our median deviation from the target firing time is greatly improved.
Attachment #8871910 - Flags: review?(nfroyd)
Ehsan, this patch replaces the old patch to create a High/Low precision enumeration.  Instead of doing that we instead just use the new API to pass the precise delay to nsITimer.  This avoid the need to round at all.
Attachment #8871912 - Flags: review?(ehsan)
Attachment #8871912 - Flags: review?(ehsan)
Comment on attachment 8871910 [details] [diff] [review]
P12 Expose nsITimer::PreciseInitWithCallback() that takes a TimeDuration delay. r=froydnj

Review of attachment 8871910 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is OK; it would be nice if your explanation was in the commit message and/or found a nice comment in the source code.  I can't think of a good place to drop the comment in the source code, though.

r-'ing just for the wait() thing.  The naming thing is just complaining, but if you have a better idea for the name...InitWithCallbackAndTimeDuration?

::: xpcom/threads/CondVar.h
@@ -64,5 @@
>  #ifdef MOZILLA_INTERNAL_API
>      GeckoProfilerThreadSleepRAII sleep;
>  #endif //MOZILLA_INTERNAL_API
> -    if (aInterval == PR_INTERVAL_NO_TIMEOUT) {
> -      mImpl.wait(*mLock);

Can we please try and retain this behavior of calling the no-timeout wait?  The no-argument/PR_INTERVAL_NO_TIMEOUT case is overwhemingly the common case and involves significantly less work on all of our platforms.

::: xpcom/threads/nsITimer.idl
@@ +170,5 @@
> +   * @param aFunc      nsITimerCallback interface to call when timer expires
> +   * @param aDelay     The high precision interval
> +   * @param aType      Timer type per TYPE* consts defined above
> +   */
> +  [noscript] void preciseInitWithCallback(in nsITimerCallback aCallback,

I wish there was a different adjective to use to describe this.  We have "precise" in the TYPE* constants which has (had) a completely different meaning than the one here.  Unfortunately, I can't think of a different adjective to use here, and I think providing a function overload doesn't work unless we [binaryName] the existing initWithCallback...which just moves the naming issue somewhere else and makes the patch for this bug much bigger.
Attachment #8871910 - Flags: review?(nfroyd) → review-
(In reply to Nathan Froyd [:froydnj] from comment #103)
> ::: xpcom/threads/CondVar.h
> @@ -64,5 @@
> >  #ifdef MOZILLA_INTERNAL_API
> >      GeckoProfilerThreadSleepRAII sleep;
> >  #endif //MOZILLA_INTERNAL_API
> > -    if (aInterval == PR_INTERVAL_NO_TIMEOUT) {
> > -      mImpl.wait(*mLock);
> 
> Can we please try and retain this behavior of calling the no-timeout wait? 
> The no-argument/PR_INTERVAL_NO_TIMEOUT case is overwhemingly the common case
> and involves significantly less work on all of our platforms.

I can put this back, but the underlying primitive explicitly handles TimeDuration::Forever():

https://dxr.mozilla.org/mozilla-central/source/mozglue/misc/ConditionVariable_posix.cpp#126
https://dxr.mozilla.org/mozilla-central/source/mozglue/misc/ConditionVariable_windows.cpp#69
(In reply to Nathan Froyd [:froydnj] from comment #103)
> I think this is OK; it would be nice if your explanation was in the commit
> message and/or found a nice comment in the source code.  I can't think of a
> good place to drop the comment in the source code, though.

I'll put the explanation in a comment in the next patch where TimeoutExecutor uses this API.
Comment on attachment 8871938 [details] [diff] [review]
P12 Expose nsITimer::InitWithCallbackAndTimeDuration() method. r=froydnj

Review of attachment 8871938 [details] [diff] [review]:
-----------------------------------------------------------------

Updated per our IRC conversation.
Attachment #8871938 - Flags: review?(nfroyd)
Updated with new method name and added explanatory comment.
Attachment #8871912 - Attachment is obsolete: true
Attachment #8871938 - Attachment description: P12 Expose nsITimer::PreciseInitWithCallback() that takes a TimeDuration delay. r=froydnj → P12 Expose nsITimer::InitWithCallbackAndTimeDuration() method. r=froydnj
Attachment #8871954 - Attachment description: Use the nsITimer::InitWithCallbackAndTimeDuration() instead of rounding. r=ehsan → P13 Use the nsITimer::InitWithCallbackAndTimeDuration() instead of rounding. r=ehsan
With instrumentation I see we still get 1 or 2 spurious wakeups when nsITimer fires early.  Unfortunately I don't think we can do any better right now without allowing early setTimeout() callbacks.  I think that is very undesirable from a web compat perspective.
Attachment #8871954 - Flags: review?(ehsan)
Attachment #8871014 - Attachment is obsolete: true
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #109)
> With instrumentation I see we still get 1 or 2 spurious wakeups when
> nsITimer fires early.  Unfortunately I don't think we can do any better
> right now without allowing early setTimeout() callbacks.  I think that is
> very undesirable from a web compat perspective.

Any idea why? Is this just the OS being wonky? To wit, Sleep(1) doesn't sleep for 1ms but rather the shortest supported period, which is usually 0.5ms (at least after calling timeBeginPeriod(1), as I believe we do). Sleep(0) almost does the same thing as SwitchToThread() (that is, yields the rest of its slice to other threads), except that it also allows I/O to execute. Sleep(n) with n >= 2 work as you would expect.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #111)
> Any idea why? Is this just the OS being wonky? To wit, Sleep(1) doesn't
> sleep for 1ms but rather the shortest supported period, which is usually
> 0.5ms (at least after calling timeBeginPeriod(1), as I believe we do).
> Sleep(0) almost does the same thing as SwitchToThread() (that is, yields the
> rest of its slice to other threads), except that it also allows I/O to
> execute. Sleep(n) with n >= 2 work as you would expect.

Uh, because I missed some bits in TImerThread that still use the old integer based interval logic.  If I fix those I stop getting early wake ups.

Also, FWIW, this code does not use sleep.  It waits on a condition variable.
Attachment #8871938 - Flags: review?(nfroyd)
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #112)
> (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #111)
> > Any idea why? Is this just the OS being wonky? To wit, Sleep(1) doesn't
> > sleep for 1ms but rather the shortest supported period, which is usually
> > 0.5ms (at least after calling timeBeginPeriod(1), as I believe we do).
> > Sleep(0) almost does the same thing as SwitchToThread() (that is, yields the
> > rest of its slice to other threads), except that it also allows I/O to
> > execute. Sleep(n) with n >= 2 work as you would expect.
> 
> Uh, because I missed some bits in TImerThread that still use the old integer
> based interval logic.  If I fix those I stop getting early wake ups.
> 
> Also, FWIW, this code does not use sleep.  It waits on a condition variable.

I'm going to have to investigate this a bit more.  I'm not sure how safe it is to try to change this logic given that underlying windows code is based on integer milliseconds:

https://dxr.mozilla.org/mozilla-central/source/mozglue/misc/ConditionVariable_windows.cpp#72

I don't want to accidentally make the TimerThread spin a bunch.

I think review of this will have to wait for next Wednesday after the holiday.
Dao, I found that browser_bug462289.js fails frequently with these patches.  I saw that you changed this file to use executeSoon() instead of setTimeout(f, 0) in bug  1102017.  Was there a reason for that?

I spoke with Blake and he suggested the significant change bug 1102017 was waiting for TabSwitchDone and that the executeSoon() changes may not be strictly necessary.

If I switch the executeSoon() calls back to setTimeout(f, 0) the test is no longer intermittent.

Given that the timing of setTimeout is slightly different its possible this test is racing with a setTimeout somewhere else in browser chrome code.  Using setTimeout here would provide some ordering guarantees that executeSoon does not.  That is my best theory about why it works here.
Attachment #8871998 - Flags: review?(dao+bmo)
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #112)
> (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #111)
> > Any idea why? Is this just the OS being wonky? To wit, Sleep(1) doesn't
> > sleep for 1ms but rather the shortest supported period, which is usually
> > 0.5ms (at least after calling timeBeginPeriod(1), as I believe we do).
> > Sleep(0) almost does the same thing as SwitchToThread() (that is, yields the
> > rest of its slice to other threads), except that it also allows I/O to
> > execute. Sleep(n) with n >= 2 work as you would expect.
> 
> Uh, because I missed some bits in TImerThread that still use the old integer
> based interval logic.  If I fix those I stop getting early wake ups.

This was not why we were getting spurious wakeups.  The issue is that TimerThread will execute a timer up to have the resolution of a PRInterval early.  On windows PRInterval is 1ms, so this means TimerThread will fire a timer up to 500us early.

I'm not sure I feel comfortable changing that here.
Or maybe it will fire 250us early.  In any case, its a fairly large number of microseconds that TimerThread explicitly decides to fire early on windows.
Comment on attachment 8872003 [details] [diff] [review]
P12 Expose nsITimer::InitWithCallbackAndTimeDuration() method. r=froydnj

Ok, this implements the changes we discussed in IRC:

1. Separate CondVar::Wait() in non-debug.
2. Change the name to InitWithCallbackAndTimeDuration()

I also changed TimerThread::Run() to use TimeDuration throughout.

Finally, I was using TimeDuration::Resolution() in TimerThread::Run().  I don't think this is equivalent to the halfMicroSecondsIntervalResolution, though.  I switched it back to use the TimerThread's specific half resolution calculation instead.

As I mentioned above, we do still get early wakeups from nsITimer with these changes on windows.  They are, however, greatly reduced.
Attachment #8872003 - Flags: review?(nfroyd)
Comment on attachment 8871954 [details] [diff] [review]
P13 Use the nsITimer::InitWithCallbackAndTimeDuration() instead of rounding. r=ehsan

Review of attachment 8871954 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/TimeoutExecutor.cpp
@@ +70,5 @@
> +  // 3. setTimeout(g, 1);
> +  //
> +  // This should fire f() and g() 500us apart.
> +  //
> +  // In the past worked because each setTimeout() got its own nsITimer.  The 1ms

Nit: the first sentence seems like it's missing a word somewhere in the middle.
Attachment #8871954 - Flags: review?(ehsan) → review+
Comment on attachment 8872003 [details] [diff] [review]
P12 Expose nsITimer::InitWithCallbackAndTimeDuration() method. r=froydnj

On further reflection I'm not sure this patch is a good idea.

Removing the PR_MicrosecondsToInterval() call here effectively removes some rounding:

https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#514

This patch results in use doing a floor to the nearest interval instead.  So its likely spinning the TimerThread.

I'll post a new patch that just uses TimeDuration to set the nsTimerImpl::mTimeout.  The logic for scheduling the wakeup will not change, though.

The end result is that the patch set in this bug will probably make us fire late a bit more frequently than we do today.
Attachment #8872003 - Flags: review?(nfroyd)
Attachment #8871998 - Flags: review?(dao+bmo) → review+
In the interest of trying to maintain our old behavior, lets see if we can allow early firing again.  This will only be limited to the amount allowed by TimerThread.  If our TimerThread ever becomes more accurate then this would automatically adjust.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=69f9ff7e02187f35180197dca6fa9fd798905161
Ok, this latest try is looking good.  I don't see spurious wakeups locally.  The precision is about the same as we have had historically.

I'm going to clean these remaining patches up so maybe I can get this reviewed and landed on Wednesday.
There is a new persistent failure I need to investigate:

devtools/client/netmonitor/test/browser_net_simple-request-data.js

Looks like this test was just changed recently for bug 1242505.
Paolo, do you have any advice on how to approach this error?  It started showing up with my patchset here after you landed bug 1242505.

https://treeherder.mozilla.org/logviewer.html#?job_id=102884059&repo=try&lineNumber=5879

I don't actually do anything specific with this test, but my code does slightly change setTimeout() timings.  I don't see any specific setTimeout() calls in this test though.
Flags: needinfo?(paolo.mozmail)
It seems the failure occurs in devtools/client/netmonitor/test/head.js line 408. If this is intermittent, it may be one of the test shutdown issues with devtools, and in bug 1242505 we just just whitelisted them in these three places:

https://dxr.mozilla.org/mozilla-central/search?q=%22tabTarget+is+null%22&redirect=true

If this is deterministic after your patch lands, then you should either debug and address the issue if it's easy to fix, or use expectUncaughtRejection in the individual test file.
Flags: needinfo?(paolo.mozmail)
(In reply to :Paolo Amadini from comment #133)
> If this is deterministic after your patch lands, then you should either
> debug and address the issue if it's easy to fix, or use
> expectUncaughtRejection in the individual test file.

I'll do some more pushes, but it does not look deterministic to me.  For example, in my try build it triggered in win7 VM opt, but not in win7 opt.
Attachment #8872661 - Attachment description: Bug 1363829 P17 Delay browser_net_simple-request-data.js teardown to allow all events to fire. r=Honza → P17 Delay browser_net_simple-request-data.js teardown to allow all events to fire. r=Honza
Comment on attachment 8872422 [details] [diff] [review]
P12 Expose nsITimer::InitWithCallbackAndTimeDuration() method. r=froydnj

Review of attachment 8872422 [details] [diff] [review]:
-----------------------------------------------------------------

Nathan, this patch is similar to the last one I submitted except:

1. It only includes the bits letting us pass a TimeDuration into nsITimer.
2. It does not change the Monitor or CondVar APIs since underneath they do not support better than millisecond resolution on all platforms.
3. It does not modify the main TimerThread::Run() loop logic at all.

Basically it preserves the rounding TimerThread::Run() that we have always done.  The benefit of the patch is to avoid rounding at the nsITimer API layer as well.
Attachment #8872422 - Flags: review?(nfroyd)
Comment on attachment 8872423 [details] [diff] [review]
P15 Expose nsITimer::GetAllowedEarlyFiringMicroseconds(). r=froydnj

Review of attachment 8872423 [details] [diff] [review]:
-----------------------------------------------------------------

Nathan, this patch exposes an AllowedEarlyFiringTime() method on nsITimer.  The TimerThread internally decides to fire timers up to 250us early.  I'd like to make the setTimeout() logic aware of this value so it can avoid scheduling additional timers if its within this range of the deadline.

I did this as a method on nsITimer so that it can be adjusted if we ever support different timer implementations based on platform, etc.
Attachment #8872423 - Flags: review?(nfroyd)
Comment on attachment 8872424 [details] [diff] [review]
P16 Allow setTimeout() to fire early based on nsIThread::GetAllowedEarlyFiringMicroseconds(). r=ehsan

Review of attachment 8872424 [details] [diff] [review]:
-----------------------------------------------------------------

Ehsan, this patch restores the ability of setTimeout() to fire slightly early.  Previous patches called this "flex time".  In this patch, however, the value is based on the actual calculated value from our nsITimer implementation.  If the underlying implementation changes then setTimeout() will automatically adjust as well.

AFAICT, we currently use 250us for this value on our main platforms today.
Attachment #8872424 - Flags: review?(ehsan)
Jan, I was getting intermittent failures in browser_net_simple-request-data.js with my patches here.  They change the timing of things slightly which tends to reveal race conditions.

In this case browser_net_simple-request-data.js has a race condition in that waitForNetworkEvents() resolves its promise when it first receives the last expected event.  The STARTED_RECEIVING_RESPONSE event, however, has asynchronous steps that need to be completed before teardown.

This patch makes the test explicitly wait for all of the queued async test steps.

I also changed the cookies test assertions since they just don't seem accurate any more.  The .requestCookies and .responseCookies attributes don't have a .cookies attribute.  They are just arrays themselves.  I don't think they were getting picked up by the test previously because .once() doesn't catch exceptions.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=31ca9c4aaadb88c217b89818070be7040df50c80
Attachment #8872725 - Flags: review?(odvarko)
Attachment #8872661 - Attachment is obsolete: true
Ugh, the browser_net_simple-request-data.js test is picking up an extra cookie from browser_net_filter* tests.
Comment on attachment 8872755 [details] [diff] [review]
P17 Make browser_net_simple-request-data.js wait for all async tests to complete before finishing. r=honza

Jan, please seem comment 139 for the initial description.

This patch further adds a Service.cookies.removeAll() call to the test directories cleanup handler.  The test was failing in automation because a cookie was leaking from another test to this one.

Looks green on try now.
Attachment #8872755 - Flags: review?(odvarko)
Latest try is looking pretty good:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8eb8d48591c95155f932879b9fea8d1fd77a2dc

Assuming the last few reviews go ok I'm hoping to land this tomorrow.
Comment on attachment 8872755 [details] [diff] [review]
P17 Make browser_net_simple-request-data.js wait for all async tests to complete before finishing. r=honza

Unfortunately this test still gives periodic errors like this.

https://treeherder.mozilla.org/logviewer.html#?job_id=103177945&repo=try&lineNumber=4469
Attachment #8872755 - Flags: review?(odvarko)
Attachment #8872755 - Attachment is obsolete: true
Comment on attachment 8872844 [details] [diff] [review]
P17 Make browser_net_simple-request-data.js wait for all async tests to complete before finishing. r=honza

Jan, sorry for the flag spam here.  See comment 139 and comment 143 for the previous descriptions.

I had to further update the test here to check for null requestItem in the various waitUntil() polling functions.  It seems its possible for the poll to sometimes run fast enough to get an undefined item in the list.

Again, I don't think the test reported these kinds of errors before because once() does not catch exceptions.  This patch now converts them to promise rejections.
Attachment #8872844 - Flags: review?(odvarko)
Comment on attachment 8872844 [details] [diff] [review]
P17 Make browser_net_simple-request-data.js wait for all async tests to complete before finishing. r=honza

Review of attachment 8872844 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.

R+ assuming Try is green

Thanks for the patch!

Honza
Attachment #8872844 - Flags: review?(odvarko) → review+
Comment on attachment 8872422 [details] [diff] [review]
P12 Expose nsITimer::InitWithCallbackAndTimeDuration() method. r=froydnj

Review of attachment 8872422 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the name change.

::: xpcom/threads/nsITimer.idl
@@ +172,5 @@
> +   * @param aType      Timer type per TYPE* consts defined above
> +   */
> +  [noscript] void InitWithCallbackAndTimeDuration(in nsITimerCallback aCallback,
> +                                                  [const] in TimeDuration aDelay,
> +                                                  in unsigned long aType);

Having had a chance to think about it a little more, WDYT about calling this InitHighResolutionWithCallback?  This name seems preferable to me, and is resilient in the face of TimeDuration renaming changes or similar.
Attachment #8872422 - Flags: review?(nfroyd) → review+
Comment on attachment 8872424 [details] [diff] [review]
P16 Allow setTimeout() to fire early based on nsIThread::GetAllowedEarlyFiringMicroseconds(). r=ehsan

Review of attachment 8872424 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/TimeoutExecutor.h
@@ +22,5 @@
>    nsCOMPtr<nsITimer> mTimer;
>    TimeStamp mDeadline;
>  
> +  // Limits how far we allow timers to fire into the future from their
> +  // deadline.  Starts of at zero, but is then adjusted when we start

Nit: "Starts *off* at zero..."
Comment on attachment 8872423 [details] [diff] [review]
P15 Expose nsITimer::GetAllowedEarlyFiringMicroseconds(). r=froydnj

Review of attachment 8872423 [details] [diff] [review]:
-----------------------------------------------------------------

These comments-as-a-previous-review are great, but they'd be better as proper commit messages...

::: xpcom/threads/nsITimer.idl
@@ +252,5 @@
> +  /**
> +   * The number of microseconds this nsITimer implementation can possibly
> +   * fire early.
> +   */
> +  [noscript] readonly attribute unsigned long allowedEarlyFiringMicroseconds;

Make this [noscript,infallible] so the C++ side can just call GetAllowedEarlyFiringMicroseconds()...we don't ever expect that this could fail, right?
Attachment #8872423 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #151)
> Make this [noscript,infallible] so the C++ side can just call
> GetAllowedEarlyFiringMicroseconds()...we don't ever expect that this could
> fail, right?

I tried doing [infallible], but that requires the interface to be [builtinclass].  That seemed to be problematic with the way nsTimer/nsTimerImpl are currently written.  I was trying not to change too much here, so I just skipped the infallible for now.
Attachment #8871463 - Attachment is obsolete: true
Attachment #8873204 - Flags: review+
Attachment #8872424 - Flags: review?(ehsan) → review+
This includes the suggested rename.
Attachment #8872422 - Attachment is obsolete: true
Attachment #8873241 - Flags: review+
Rename the method in the commit message as well.
Attachment #8873241 - Attachment is obsolete: true
Attachment #8873242 - Flags: review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e778ec02f89
P1 Remove some deadcode in TimeoutManager related to processing pending timers. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/7eec06a678cb
P2 Removing the Timeout argument to TimeoutManager::RunTimeout(). r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/60ce370ec87b
P3 Improve Timeout ref counting to avoide bare AddRef/Release calls. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/aea41a0174ae
P4 Store the scheduled delay on Timeout instead of relying on the nsITimer to hold it. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/9ac282f67370
P5 Add an executor class to represent the single timer executing timers for a window. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/5214ce6bed25
P6 Use the TimeoutExecutor in TimeoutManager. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/32c98a6c8750
P7 Remove the Timeout nsITimer code. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd452eda2e83
P8 Remove TimeoutManager back pressure mechanism. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/a70379a2937c
P9 Make docshell chrome tests wait for next paint to avoid races. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/cd8ae8487d00
P10 Make image map reftest wait for first paint before triggering focus/blur. r=dbaron
https://hg.mozilla.org/integration/mozilla-inbound/rev/833f4fe14182
P11 Make debugger mochitests perform a full CC in addition to the forced GC during test cleanup. r=jryans
https://hg.mozilla.org/integration/mozilla-inbound/rev/79c1d7e0d539
P12 Expose nsITimer::InitHighResolutionWithCallback() method. r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c99a1222200
P13 Use the nsITimer::InitHighResolutionWithCallback() instead of rounding. r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b5fb2901eac
P14 Make browser_bug462289.js test use setTimeout(f, 0) like it did before bug 1102017. r=dao
https://hg.mozilla.org/integration/mozilla-inbound/rev/d27d106fc45d
P15 Expose nsITimer::GetAllowedEarlyFiringMicroseconds(). r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/95421442eec4
P16 Allow setTimeout() to fire early based on nsIThread::GetAllowedEarlyFiringMicroseconds(). r=ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/7c6116ac71e9
P17 Make browser_net_simple-request-data.js wait for all async tests to complete before finishing. r=honza
Note, I did not measure any improvement on speedometer with this patch.  Dropping the dependency.
No longer blocks: Speedometer_V2
For future reference, here is a profile of:

https://mozdevs.github.io/servo-experiments/experiments/tiles/

On my mac book pro using 55.0a1 (2017-05-28) (64-bit):

https://perfht.ml/2soSOQz

Once a nightly spins with the patches from this bug I will compare.
I took some profiles before and after this bug landed on:

Tiles:
  * https://mozdevs.github.io/servo-experiments/experiments/tiles/
    * before:
      * https://perfht.ml/2sx78Xn
      * https://perfht.ml/2sxlsPv
    * after:
      * https://perfht.ml/2swX0xI
      * https://perfht.ml/2rzrUbu

Here you can see in the "before" case that there are large 200ms to 500ms pauses.  During this period there is a lot of setTimeout() and nsTimer runnables executing.  About ~65ms to ~165ms are spent in setTimeout.

In the after case pauses are limited to ~80ms.  Of these pauses setTimeout takes ~30ms.

I don't completely understand this result, but there are a couple things that could be helping here:

1) Our setTimeout() doesn't always lock with TimerThread any more.  So we can execute more continuously on the main thread without forcing context switching.
2) We are dispatching many fewer timer events, so the possibility to intermix nsTimer runnables during these periods is reduced.
3) Processing timeouts faster allows us to drain the timeout list more quickly reducing the amount of timeouts we have to search in Insert().

Flood:
  * https://people-mozilla.org/~bkelly/timer-flood/index.html
    * before:
      * https://perfht.ml/2sxkPFT
      * https://perfht.ml/2sxdDtm
    * after:
      * https://perfht.ml/2rzqm1n
      * https://perfht.ml/2rz3cIk

The main take away from these profiles is that we end up contending with the TimerThread a lot less. If you search for "Acquire" you can see how often mutexes show up in the profile (this is windows).  There is a lot more contention in the "before" case.

Also, there is a huge amount of contention during back pressure reduction.  This is one reason I chose to remove the back pressure mechanism for now.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #175)
> I took some profiles before and after this bug landed on:
> 
> Tiles:
>   * https://mozdevs.github.io/servo-experiments/experiments/tiles/
>     * before:
>       * https://perfht.ml/2sx78Xn

This profile isn't a good example.  All of the cost here is bug 1369494 (around 5% is nsITimer related).

>       * https://perfht.ml/2sxlsPv
>     * after:
>       * https://perfht.ml/2swX0xI
>       * https://perfht.ml/2rzrUbu
> 
> Here you can see in the "before" case that there are large 200ms to 500ms
> pauses.  During this period there is a lot of setTimeout() and nsTimer
> runnables executing.  About ~65ms to ~165ms are spent in setTimeout.
> 
> In the after case pauses are limited to ~80ms.  Of these pauses setTimeout
> takes ~30ms.
> 
> I don't completely understand this result, but there are a couple things
> that could be helping here:
> 
> 1) Our setTimeout() doesn't always lock with TimerThread any more.  So we
> can execute more continuously on the main thread without forcing context
> switching.
> 2) We are dispatching many fewer timer events, so the possibility to
> intermix nsTimer runnables during these periods is reduced.
> 3) Processing timeouts faster allows us to drain the timeout list more
> quickly reducing the amount of timeouts we have to search in Insert().

Well the fact remains that setTimeout() is still freaking expensive!  In the second before profile, around 11-12% of the cost of setTimeout() is about creating the nsITimer, the rest is Timeouts::Insert()!  That function is *really* inefficient.

> Flood:
>   * https://people-mozilla.org/~bkelly/timer-flood/index.html
>     * before:
>       * https://perfht.ml/2sxkPFT
>       * https://perfht.ml/2sxdDtm
>     * after:
>       * https://perfht.ml/2rzqm1n
>       * https://perfht.ml/2rz3cIk
> 
> The main take away from these profiles is that we end up contending with the
> TimerThread a lot less. If you search for "Acquire" you can see how often
> mutexes show up in the profile (this is windows).  There is a lot more
> contention in the "before" case.

Indeed, that's exactly working as planned!  \o/

The other interesting difference between these two cases is that Timeouts::Insert() doesn't show up that much in this profile.  And the reason is that this test case won't cause you to end up having a long queue of timeouts.  This difference bothered me a bit, so I decided to do the math.  I broke in the JS debugger in the servo demo to see how many tiles I have.  I had a 91*50 grid.  That means we'll iterate over 4549 timeouts *in the last iteration of the loop*, and I bet the real number is much lower than that.  And this loop only runs one time in the beginning of each round of a new effect rolling over a new picture (so not frequently at all).  This really shouldn't be that slow, and we should barely see it in the profile...

> Also, there is a huge amount of contention during back pressure reduction. 
> This is one reason I chose to remove the back pressure mechanism for now.

Seems like it indeed was a good idea!
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #176)
> >       * https://perfht.ml/2sxlsPv
> >     * after:
> >       * https://perfht.ml/2swX0xI
> >       * https://perfht.ml/2rzrUbu
> > 
> > Here you can see in the "before" case that there are large 200ms to 500ms
> > pauses.  During this period there is a lot of setTimeout() and nsTimer
> > runnables executing.  About ~65ms to ~165ms are spent in setTimeout.
> > 
> > In the after case pauses are limited to ~80ms.  Of these pauses setTimeout
> > takes ~30ms.
> > 
> > I don't completely understand this result, but there are a couple things
> > that could be helping here:
> > 
> > 1) Our setTimeout() doesn't always lock with TimerThread any more.  So we
> > can execute more continuously on the main thread without forcing context
> > switching.
> > 2) We are dispatching many fewer timer events, so the possibility to
> > intermix nsTimer runnables during these periods is reduced.
> > 3) Processing timeouts faster allows us to drain the timeout list more
> > quickly reducing the amount of timeouts we have to search in Insert().
> 
> Well the fact remains that setTimeout() is still freaking expensive!  In the
> second before profile, around 11-12% of the cost of setTimeout() is about
> creating the nsITimer, the rest is Timeouts::Insert()!  That function is
> *really* inefficient.

I think the other factor in the painting pauses on the tiles before was that it might have been triggering back pressure.  We started at 5000 timers and I think tiles does more than that.  So that could partially be why there are those pauses.

> The other interesting difference between these two cases is that
> Timeouts::Insert() doesn't show up that much in this profile.  And the
> reason is that this test case won't cause you to end up having a long queue
> of timeouts.  This difference bothered me a bit, so I decided to do the
> math.  I broke in the JS debugger in the servo demo to see how many tiles I
> have.  I had a 91*50 grid.  That means we'll iterate over 4549 timeouts *in
> the last iteration of the loop*, and I bet the real number is much lower
> than that.  And this loop only runs one time in the beginning of each round
> of a new effect rolling over a new picture (so not frequently at all).  This
> really shouldn't be that slow, and we should barely see it in the profile...

It seems like windows high resolution TimeStamp might be expensive to compare.  The CheckQPC thing seems pretty complex and we do it a lot.

Making Insert scale better in general, though, would be good.  We can discuss in bug 1369494.
Depends on: 1371020
Depends on: 1378149
Depends on: 1378394
Depends on: 1410930
Depends on: 1489463
Component: DOM → DOM: Core & HTML
No longer depends on: 1489463
Regressions: 1489463
Regressions: 1683560
Performance Impact: --- → P3
Whiteboard: [qf:p3]
Regressions: 1824472
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: