Closed Bug 1300659 Opened 8 years ago Closed 8 years ago

prevent setTimeout/setInterval from flooding the main thread event queue

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: bkelly, Assigned: bkelly)

References

(Depends on 1 open bug, Blocks 2 open bugs, Regressed 1 open bug)

Details

Attachments

(7 files, 35 obsolete files)

19.54 KB, image/png
Details
17.43 KB, image/png
Details
18.72 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
8.66 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
9.38 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
7.74 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
4.02 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
Bug 1300118 adapts the xpcom TaskQueue class a bit to be used to prevent main thread runnable flooding.  This bug will use TaskQueue in nsGlobalWindow to throttle runnables from setTimeout() and setInterval().

For example, this bug will keep this script from killing main thread responsiveness:

  function boom() { setTimeout(boom, 0); setTimeout(boom, 0); }
  boom();
I believe there is still a fair bit of orange with these patches.

Also, I'd like to add code to freeze timers if we detect that the MainThreadTaskQueue is falling behind.  In the worker patches I block the worker thread if I see a backlog > 5000 runnables.  Here I plan to use the existing "frozen" timers mechanism.  I'll detect when to resume timers by inserting a runnable in the task queue so I know when the backlog has flushed.
Lets just limit this to linux x64 until its more green:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=df894db43221
Trying a different approach to managing the shutdown of the task queue.  Also, this shares a task queue between connected parent/child windows (but not window.opener yet).
My new shutdown approach is to make a TaskQueue wrapper that just DTRT.  It internally handles shutdown when the TaskQueue is no longer referenced or when the browser is shutting down.  For browser shutdown new Dispatch calls are issues to the original base target.
Attachment #8788302 - Attachment is obsolete: true
Attachment #8788303 - Attachment is obsolete: true
Attachment #8788304 - Attachment is obsolete: true
Attachment #8788305 - Attachment is obsolete: true
This exposes the auto-closing TaskQueue on the outer window.  A TaskQueue is shared with all child windows and between openers.
Updated patch to make setTimeout() and setInterval() use the task queue.
These patches pass all our content tests.  There are maybe 5 browser-chrome tests that fail currently.  Trying to run them locally on an unmodified tree tends to trigger the same failures, though.  I think they are just racy.  We would need to fix them before proceeding with this, though.
My inability to run these tests actually had more to do with my xvfb args.  Apparently the tests are designed to run on 1600x1200.  I was using 640x480.
It seems there is some timeout message at the end of some browser chrome tests that races with the window closing.  If the window closes before the timer runs then the test hangs.  My changes made this happen more frequently because I no longer let more than one timer run synchronously in the same runnable.

This updates the patch to allow up to 10 timers to execute in the same runnable.  It lets me pass at least one of these browser chrome tests locally:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a668bbeee477

I'll see if I can dig out which timer specifically was causing the problem.
Attachment #8790265 - Attachment is obsolete: true
Blocks: QuantumDOM
It seems devtools wants to check to see that certain UX components have been selected and that these selections can happen via timeouts.  So throttling timeouts makes these tests fail due to missing their carefully tuned races.

As a way around this, lets try going back to my earlier idea of exempting chrome windows from throttling.  This would make both the browser chrome harness and devtools work the same as they do today while throttling content script.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9e77137a8cba
Exempt chrome windows from the "single timeout per runnable" limitation like I intended:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=24441e5e1ee9
One of the difficulties with using a TaskQueue for main thread things like windows is being able to shutdown the TaskQueue cleanly.  In a worker thread we can AwaitShutdownAndIdle(), but we can't do that on the main thread.

This helper class tries to solve the problem.  It wraps the TaskQueue as an nsIEventTarget and provides a self-closing semantic.  If the target goes out of scope then we shutdown the TaskQueue asynchronously.  If xpcom-shutdown comes, then we shutdown the TaskQueue and spin the event loop until its drained.  This lets us largely pass the nsIEventTarget wrapper around trusting that it will DTRT when we need to stop using it.

I don't love the name or that this is separate from TaskQueue itself, but I think its a good step towards figuring out what we need here.  I didn't want to just put this in DOM since I already have two different places I want to use.  I'm just using it in the window case first.

Once things stabilize I figure we could revisit the name and possibly consider rolling it into TaskQueue itself.  That is much more work, though, since we need to negotiate with the other consumers of TaskQueue.  I'd like to prove this out first.

Also note, I would have liked to implement TaskQueue as an interface here, but there is no TaskQueue interface yet.  Another enhancement we could make would be adding an nsITaskQueue or similar abstract class.  We probably can't do anything like that until Bobby comes back from PTO, though.

Anyway, I'd love to proceed for now and then incrementally improve this class if possible.  Thanks!
Attachment #8790263 - Attachment is obsolete: true
Attachment #8790729 - Flags: review?(nfroyd)
Attachment #8790952 - Flags: review?(amarchesini)
Try is green.  I intend to flag for review once I can coordinate the review request with bz.
Comment on attachment 8790932 [details] [diff] [review]
P2 Expose a main thread TaskQueue on nsGlobalWindow. r=bz

Here is my review comment so I don't lose it:

This patch creates a single TaskQueue for each constellation of windows.  The TaskQueue is kept on the outer window.  The existence of a TaskQueue is not guaranteed, so we provide a GetMainThreadTaskQueue() getter that can return nullptr.

The task queue can be nullptr under these circumstances:

1) Its a chrome window.  We avoid throttling chrome windows partially in order to prioritize chrome work and partially to avoid the resulting browser-chrome/devtools orange it produces.
2) The window has not yet had a docshell set.
3) The window in a new constellation has a docshell set during xpcom shutdown.

Note that we do not explicitly clear the TaskQueue when the docshell is cleared.  Its unclear if all runnables have executed by that point so we maintain the TaskQueue to preserve the ordering of runnables trailing in after de-activation.

The shutdown of the queue is handled auto-magically by the SelfCloseTaskQueueTarget wrapper class.  See P1 for the gory details.  Basically it shuts down when the target receives its last Release() or when xpcom shutdown starts.  All runnables are flushed to the main thread before the task queue is destroyed in either case.
Comment on attachment 8790933 [details] [diff] [review]
P3 Make setTimeout() and setInterval() use the MainThreadTaskQueue. r=bz

Review of attachment 8790933 [details] [diff] [review]:
-----------------------------------------------------------------

Here is my review comment:

This patch does two things:

1) It makes the nsTimeout nsITimers fire their runnables at the window's main thread TaskQueue if its present.
2) It limits the number of timeouts we will run in a single runnable before we yield back the main thread.

Both of these are important.  Without (1) we can swamp the main thread event loop with runnables.  Without (2) we will end up blocking the main thread in a single runnable for extremely long periods when there are excessive timers in play.

We exempt chrome windows from (2) just like we exempt chrome windows from the main thread TaskQueue throttling mechanism.

Note that the limit on timeouts per runnable was a bit tricky because we can end up re-ordering nsITimer callbacks relative to our window nsTimeout list somehow.  I believe this can happen during suspend and resume.  We handle this case by always ensuring we fire at least the timer that the callback was created before.  This avoids the situation where we skip a timeout, but then the next timer runnable does not fire because the other timers were cancelled or cleared.
Comment on attachment 8790932 [details] [diff] [review]
P2 Expose a main thread TaskQueue on nsGlobalWindow. r=bz

Olli, Boris won't be able to review this for a while.  Do you mind looking at the patches?

The description for this patch is in comment 26.
Attachment #8790932 - Flags: review?(bugs)
Comment on attachment 8790933 [details] [diff] [review]
P3 Make setTimeout() and setInterval() use the MainThreadTaskQueue. r=bz

Description for this patch is in comment 27.
Attachment #8790933 - Flags: review?(bugs)
Suspend the window when we detect that the main thread TaskQueue is backing up.  I don't have an automated test that triggers this yet, but I verified locally that it works as expected.

I also took a couple screenshots of the process cpu and memory to show the benefits of applying this kind of backpressure.
Attachment #8791052 - Flags: review?(bugs)
This shows that memory just sky rockets with runnable throttling, but no back pressure to the page.
This shows the CPU and memory of the process when we add in the P5 patch applying back pressure.  Suspending the window when the TaskQueue is backed up has a big impact.
New try with P4 and P5.  Also rebased on latest m-c:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=60aa651bb72f
I'm starting to think maybe we should just rename:

  SelfClosingTaskQueueTarget

To something more like:

  TaskQueueTarget

And just use it everywhere; both workers and windows.  I don't really want to expose AbstractThread or TaskQueue APIs further in the code than I have to.
Comment on attachment 8790729 [details] [diff] [review]
P1 Add a TaskQueue wrapper that automatically shuts down when appropriate. r=froydnj

Review of attachment 8790729 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is reasonable.  I have a few questions, though.

::: xpcom/threads/SelfClosingTaskQueueTarget.cpp
@@ +96,5 @@
> +      new TaskQueue(target.forget(), false /* tail dispatch */);
> +
> +    RefPtr<Inner> ref = new Inner(aBaseTarget, taskQueue);
> +
> +    nsresult rv = obs->AddObserver(ref, kShutdownTopic, false /* weak ref */);

According to nsIObserverService.idl, passing false here means that we're holding a strong ref, so the comment should be fixed.

@@ +114,5 @@
> +    MOZ_ASSERT(!strcmp(aTopic, kShutdownTopic));
> +
> +    MaybeStartShutdown();
> +
> +    while (!mTaskQueue->IsEmpty()) {

How are we sure that all threads are done dispatching to the task queue at this point?  (i.e. how are we guaranteed that we're not going to race deciding that this is empty, and then somebody is going to dispatch an event to it behind our back?)

@@ +124,5 @@
> +
> +  void
> +  MaybeStartShutdown()
> +  {
> +    // Any thread

Do you think it's worth explaining here why two threads can't race to this function?  (One thread would have to be destroying the outer object, and the main thread would have to be going through xpcom-shutdown.  But if the main thread is calling this function at xpcom-shutdown, then we can't be destroying the outer object, because the inner object is still holding a strong ref to it.  Right?)

@@ +169,5 @@
> +  nsresult
> +  DispatchFromScript(nsIRunnable* aEvent, uint32_t aFlags)
> +  {
> +    // Any thread
> +    nsIEventTarget* target = mShutdownStarted ? mBaseTarget : mTaskQueueTarget;

Factor this computation out into a helper method, because you use it in a number of places?

Oh, you can't use NS_FORWARD_NSIEVENTTARGET because it doesn't work correctly with move-only classes.  Boo.
Attachment #8790729 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd [:froydnj] from comment #35)
> > +    nsresult rv = obs->AddObserver(ref, kShutdownTopic, false /* weak ref */);
> 
> According to nsIObserverService.idl, passing false here means that we're
> holding a strong ref, so the comment should be fixed.

Well, I was trying to indicate the argument was aWeakRef boolean.  Clearly it would be nice if this took an enum. :-\

> > +    MaybeStartShutdown();
> > +
> > +    while (!mTaskQueue->IsEmpty()) {
> 
> How are we sure that all threads are done dispatching to the task queue at
> this point?  (i.e. how are we guaranteed that we're not going to race
> deciding that this is empty, and then somebody is going to dispatch an event
> to it behind our back?)

Because MaybeStartShutdown() sets the Atomic<bool> mShutdownStarted.  After that we force all Dispatch() routines to the base target instead of the task queue.  I can add a comment.

> > +  void
> > +  MaybeStartShutdown()
> > +  {
> > +    // Any thread
> 
> Do you think it's worth explaining here why two threads can't race to this
> function?  (One thread would have to be destroying the outer object, and the
> main thread would have to be going through xpcom-shutdown.  But if the main
> thread is calling this function at xpcom-shutdown, then we can't be
> destroying the outer object, because the inner object is still holding a
> strong ref to it.  Right?)

Well, the first thing this method does is check-and-set the Atomic<bool> mShutdownStarted.  That was my anti-race mitigation.  Maybe I need a mutex around mShutdownStarted instead.  It does seem two threads could get passed the conditional before either one sets the atomic.

I'll add the mutex.

> > +    nsIEventTarget* target = mShutdownStarted ? mBaseTarget : mTaskQueueTarget;
> 
> Factor this computation out into a helper method, because you use it in a
> number of places?

Will do.  That will make it easier to use the mutex as well.
(In reply to Ben Kelly [:bkelly] from comment #36)
> (In reply to Nathan Froyd [:froydnj] from comment #35)
> > > +    nsresult rv = obs->AddObserver(ref, kShutdownTopic, false /* weak ref */);
> > 
> > According to nsIObserverService.idl, passing false here means that we're
> > holding a strong ref, so the comment should be fixed.
> 
> Well, I was trying to indicate the argument was aWeakRef boolean.  Clearly
> it would be nice if this took an enum. :-\

Story of my life.

> > > +    MaybeStartShutdown();
> > > +
> > > +    while (!mTaskQueue->IsEmpty()) {
> > 
> > How are we sure that all threads are done dispatching to the task queue at
> > this point?  (i.e. how are we guaranteed that we're not going to race
> > deciding that this is empty, and then somebody is going to dispatch an event
> > to it behind our back?)
> 
> Because MaybeStartShutdown() sets the Atomic<bool> mShutdownStarted.  After
> that we force all Dispatch() routines to the base target instead of the task
> queue.  I can add a comment.

Ah, that's a good explanation.  Please do add the comment.

> > > +  void
> > > +  MaybeStartShutdown()
> > > +  {
> > > +    // Any thread
> > 
> > Do you think it's worth explaining here why two threads can't race to this
> > function?  (One thread would have to be destroying the outer object, and the
> > main thread would have to be going through xpcom-shutdown.  But if the main
> > thread is calling this function at xpcom-shutdown, then we can't be
> > destroying the outer object, because the inner object is still holding a
> > strong ref to it.  Right?)
> 
> Well, the first thing this method does is check-and-set the Atomic<bool>
> mShutdownStarted.  That was my anti-race mitigation.  Maybe I need a mutex
> around mShutdownStarted instead.  It does seem two threads could get passed
> the conditional before either one sets the atomic.
> 
> I'll add the mutex.

If you were going to check-and-set, what you want is Atomic::compareExchange:

http://dxr.mozilla.org/mozilla-central/source/mfbt/Atomics.h#575

Using that should ensure only one thread gets into this method, and then you don't have to bother with the mutex.
(In reply to Nathan Froyd [:froydnj] from comment #37)
> If you were going to check-and-set, what you want is Atomic::compareExchange:
> 
> http://dxr.mozilla.org/mozilla-central/source/mfbt/Atomics.h#575

Perfect!  I forgot we had this.
(In reply to Ben Kelly [:bkelly] from comment #36)
> > > +    MaybeStartShutdown();
> > > +
> > > +    while (!mTaskQueue->IsEmpty()) {
> > 
> > How are we sure that all threads are done dispatching to the task queue at
> > this point?  (i.e. how are we guaranteed that we're not going to race
> > deciding that this is empty, and then somebody is going to dispatch an event
> > to it behind our back?)
> 
> Because MaybeStartShutdown() sets the Atomic<bool> mShutdownStarted.  After
> that we force all Dispatch() routines to the base target instead of the task
> queue.  I can add a comment.

Thinking about this some more, what if you have:

SomeThread: checks mShutdownStarted in Dispatch*, finds it false
SomeThread: preempted
MainThread: calls MaybeStartShutdown
MainThread: spins task queue event loop
MainThread: finishes loop
SomeThread: dispatches to the task queue

Just setting the boolean isn't enough; you have to ensure that all threads have somehow finished anything they might have been doing in Dispatch.
(In reply to Nathan Froyd [:froydnj] from comment #39)
> (In reply to Ben Kelly [:bkelly] from comment #36)
> > > > +    MaybeStartShutdown();
> > > > +
> > > > +    while (!mTaskQueue->IsEmpty()) {
> > > 
> > > How are we sure that all threads are done dispatching to the task queue at
> > > this point?  (i.e. how are we guaranteed that we're not going to race
> > > deciding that this is empty, and then somebody is going to dispatch an event
> > > to it behind our back?)
> > 
> > Because MaybeStartShutdown() sets the Atomic<bool> mShutdownStarted.  After
> > that we force all Dispatch() routines to the base target instead of the task
> > queue.  I can add a comment.
> 
> Thinking about this some more, what if you have:
> 
> SomeThread: checks mShutdownStarted in Dispatch*, finds it false
> SomeThread: preempted
> MainThread: calls MaybeStartShutdown
> MainThread: spins task queue event loop
> MainThread: finishes loop
> SomeThread: dispatches to the task queue
> 
> Just setting the boolean isn't enough; you have to ensure that all threads
> have somehow finished anything they might have been doing in Dispatch.

Well, the worst case here is that the TaskQueue dispatch returns an error.  This doesn't seem too unreasonable if a thread is trying to use this thing during xpcom shutdown from a non-main-thread.

If you'd like, I could check the Dispatch() call for failure and then retry with the base target if we have entered shutdown state.

  nsIEventTarget*
  InternalTarget() const
  {
    // This can be called by any thread.  Consult our Atomic<bool> shutdown
    // flag.  If we have not been shutdown then return the task queue.  Otherwise
    // return the base target.
    return mShutdownStarted ? mBaseTarget : mTaskQueueTarget;
  }

  nsresult
  Dispatch(already_AddRefed<nsIRunnable> aEvent, uint32_t aFlags)
  {
    // Any thread
    nsresult rv = InternalTarget()->Dispatch(Move(aEvent), aFlags);
    if (NS_FAILED(rv) && mShutdownStarted) {
      rv = InternalTarget()->Dispatch(Move(aEvent), aFlags)
    }
    return rv;
  }

Or I can just wrap all of Dispatch() in a mutex.
Flags: needinfo?(nfroyd)
Addressed review feedback.  I implemented the "retry if we fail dispatch and are shutting down" idea from the previous comment.
Attachment #8790729 - Attachment is obsolete: true
Flags: needinfo?(nfroyd)
Attachment #8791369 - Flags: review?(nfroyd)
Now with more weak ref bare boolean commenting.
Attachment #8791369 - Attachment is obsolete: true
Attachment #8791369 - Flags: review?(nfroyd)
Attachment #8791375 - Flags: review?(nfroyd)
Comment on attachment 8790952 [details] [diff] [review]
P4 Attach the worker TaskQueue to the window TaskQueue if present. r=baku

Review of attachment 8790952 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +3957,4 @@
>    }
>  
>    // TODO: If we have a window, try to use its MainThreadTaskQueue as the
>    //       target for our sub-queue.

remove this TODO.
Attachment #8790952 - Flags: review?(amarchesini) → review+
Comment on attachment 8791375 [details] [diff] [review]
P1 Add a TaskQueue wrapper that automatically shuts down when appropriate. r=froydnj

Review of attachment 8791375 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the below two changes.  Please double-check my reasoning on the Dispatch comment.

::: xpcom/threads/SelfClosingTaskQueueTarget.cpp
@@ +14,5 @@
> +using mozilla::services::GetObserverService;
> +
> +namespace {
> +
> +static const char* kShutdownTopic = "xpcom-shutdown";

Please make this static const char kShutdownTopic[].

@@ +80,5 @@
> +  InternalTarget() const
> +  {
> +    // This can be called by any thread.  Consult our Atomic<bool> shutdown
> +    // flag.  If we have not been shutdown then return the task queue.  Otherwise
> +    // return the base target.

Can you reverse the order of the last two sentences to correspond to the order of the condition?

@@ +195,5 @@
> +    nsresult rv = InternalTarget()->Dispatch(Move(aEvent), aFlags);
> +    // Between selecting the target in InternalTarget() and calling
> +    // Dispatch() the TaskQueue might get shutdown.  If that happens,
> +    // then try again.  This time we will get the base target which
> +    // should succeed.

Is, uh, that really guaranteed?  Looking at TaskQueue::DispatchLocked, I see a lot of different failure conditions.  Though I guess we won't hit the AbortIfFlushing condition, and dispatch to the underlying event target really ought to be infallible.  So maybe we'll only fail if we're shutting down.
Attachment #8791375 - Flags: review?(nfroyd) → review+
> > +    nsresult rv = InternalTarget()->Dispatch(Move(aEvent), aFlags);
> > +    // Between selecting the target in InternalTarget() and calling
> > +    // Dispatch() the TaskQueue might get shutdown.  If that happens,
> > +    // then try again.  This time we will get the base target which
> > +    // should succeed.
> 
> Is, uh, that really guaranteed?  Looking at TaskQueue::DispatchLocked, I see
> a lot of different failure conditions.  Though I guess we won't hit the
> AbortIfFlushing condition, and dispatch to the underlying event target
> really ought to be infallible.  So maybe we'll only fail if we're shutting
> down.

Well, AbortIfFlushing is dead code. Nothing I could find sets mIsFlushing.  (Have I mentioned I don't love some of the stuff in AbstractThread/TaskQueue land?)

That just leaves shutdown and real dispatch failure.  I think propagating the real dispatch failure back is reasonable.  Double-attempting it should not have any impact other than a perf penalty during this rare error condition.
Comment on attachment 8791052 [details] [diff] [review]
P5 Suspend a window if its main thread TaskQueue falls to far behind. r=smaug

I think nsGlobalWindow::SuspendTimeouts() is probably buggy at the moment.  I'm going to write a separate bug to fix that first.
Attachment #8791052 - Flags: review?(bugs)
Depends on: 1303167
Comment on attachment 8790932 [details] [diff] [review]
P2 Expose a main thread TaskQueue on nsGlobalWindow. r=bz


>+
>+  // Attempt to use the task queue already present in our constellation.
We're about to change the naming. You mean here 'tab group'.
(document group is then the similar-origin-group of related globals)

>+  // Maybe create a TaskQueue for dispatching runnables to the main thread
>+  // from this window.  This TaskQueue will throttle these runnables and
>+  // prevent the window content from significantly impacting main thread
>+  // responsiveness.  Certain situations will result in a nullptr task
>+  // queue even after calling this method.
>+  void
>+  MaybeCreateMainThreadTaskQueue();
Why the task queue is called MainThreadTaskQueue? That sounds like something bound to the main thread, but here we can have actually
several task queues, one per tab group.
Perhaps call it GetTabGroupTaskQueue? Though, I'm not quite happy with that either, since only some tasks are supposed to be queued here.
GetThrottlableTabGroupTaskQueue is quite horrible, but is closer to what the thing is about, right? I'd prefer something like that.

r+ if the name of the task queue is changed to something which hints what it is about.
Attachment #8790932 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #47)
> >+  void
> >+  MaybeCreateMainThreadTaskQueue();
> Why the task queue is called MainThreadTaskQueue? That sounds like something
> bound to the main thread, but here we can have actually
> several task queues, one per tab group.

Well, a TaskQueue can target any thread or nsIEventTarget.  So I was trying to indicate which thread the runnables for the TaskQueue would eventually run on.  While this may seem implicit here, its an important distinction for worker threads.  And I was trying to use uniform naming throughout.

I guess I was thinking the window would have a "task queue that goes to the main thread", and its somewhat abstracted which task queue that is.  The window picks the right one when its activated.  It may create its own or use the tab group's TaskQueue.  The caller of the code should not care, though.  It should just use the TaskQueue associated with the window regardless of the source.

Does any of that make sense?
Flags: needinfo?(bugs)
Well, not really :) Window always lives on the main thread, so why would you ever get non-main-thread event queue. So, I would rather indicate in the name how it is supposed to be called.
Flags: needinfo?(bugs)
I guess if the name is the most controversial thing here I'll call it a win.
Comment on attachment 8790933 [details] [diff] [review]
P3 Make setTimeout() and setInterval() use the MainThreadTaskQueue. r=bz

Dropping review here until I fix bug 1303167.  Simplifying our timeout code should make this review easier.
Attachment #8790933 - Flags: review?(bugs)
Rebase.  I still need to address action items.
Attachment #8790932 - Attachment is obsolete: true
Attachment #8804279 - Flags: review+
Update to use the new API from bug 1303167.
Attachment #8791052 - Attachment is obsolete: true
Looks like a new test failure to fix in W(1):

TEST-UNEXPECTED-TIMEOUT | /media-source/mediasource-seek-beyond-duration.html | Test seeking beyond media duration. - Test timed out
Attachment #8804279 - Attachment is obsolete: true
Rewrote the patches slightly to keep the TaskQueue on the TabGroup.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f42908cd89a7e4fe6b13aa33a44d4492a331284
Depends on: 1303196
Attachment #8805586 - Attachment is obsolete: true
https://treeherder.mozilla.org/#/jobs?repo=try&revision=819353f929d26854ab9cce606227c927af10f460&selectedJob=30117035

Still have to figure out why

  dom/tests/mochitest/bugs/test_bug479143.html

Is hanging in non-e10s.  At least it reproduces locally.
Attachment #8805589 - Attachment is obsolete: true
Attachment #8805697 - Flags: review+
The issue with the failure in comment 65 is:

1) A setTimeout() fires running js code from within the TaskQueue context
2) The js code calls showModalDialog() which spins the event loop
3) The showModalDialog() uses a setTimeout() to close itself
4) The nested setTimeout() never gets to run because the TaskQueue is still stuck on the runnable from (1).

I think the cleanest way to solve this is to make the nsITimer notify to the TaskQueue as my patches do now, but then to issue another runnable to the main thread to execute js code.  This means we have at most 2 runnables in the main thread event queue at once instead of just 1 runnable.  I don't love it, but it should be ok.

I have a hacked up version which passes the test with this locally.  I need to clean it up and do another full try run, though.
Actually, we could probably use a microtask or "tail dispatch" instead of a normal runnable dispatch here.  That would avoid any additional delay in running the timer.
Does that matter too much? But anyhow, please don't add new microtask usage right now. I have patches to make Promises to use microtasks but still working on to fix our devtools and browsers tests.
(only 1 failing devtool test left, browser_responsiveui_window_close.js, written by you! ;), and dozens of failing browser tests)

Hmm, now that I think. Running setTimeout code during microtask checkpoint would be quite surprising, and against the spec.
So, after working on this bug for a while and talking to :billm I finally decided we need something slightly different from TaskQueue.

TaskQueue was designed for feeding a thread pool.  As such it wants to guarantee that no two events are running on the thread pool at once.  To that end it dispatches the next event only after all processing for the current event is done.

That works for thread pools, but is not quite right for a mechanism that dispatches to a single thread which may be its current thread.  In that world we need to handle cases where the event might spin the event loop.  If we only dispatch the next runnable after processing the current event then spinning the event loop from an event stalls the TaskQueue.

This class, in contrast, dispatches its next runnable *before* processing its current event.  This solves the event loop problem, but makes it not suitable for thread pool targets.

Writing a new class also allowed us to more easily work around some other idiosyncrasies in TaskQueue.  For example, this class is a simple nsIEventTarget which ensures it will work with the majority of the gecko tree.

Most of this patch is the same as the previous SelfClosingTaskQueueTarget class I previously had reviewed here.  I simply swapped out the TaskQueue for an nsEventQueue, a mutex, and a runnable.  (I believe the nsEventQueue is likely better optimized than TaskQueue's std::queue as well.)
Attachment #8791375 - Attachment is obsolete: true
Attachment #8806780 - Flags: review?(nfroyd)
This patch exposes the ThrottledEventQueue on the TabGroup and nsPIDOMWindow objects.

A TabGroup always has a ThrottledEventQueue if we're not in shutdown, but we only expose it to windows that are both non-chrome and have an outer window assigned (because we need it to get the TabGroup).
Attachment #8805695 - Attachment is obsolete: true
Attachment #8806784 - Flags: review?(bugs)
I'm still running a few final tests before flagging this patch for review.
Attachment #8805696 - Attachment is obsolete: true
This is a rewrite of the previous P4 patch that made workers feed the window's TaskQueue.

This patch makes workers now use ThrottledEventQueue instead of TaskQueue.  It also feeds into the window TaskQueue as we did before.
Attachment #8805697 - Attachment is obsolete: true
Attachment #8806787 - Flags: review?(amarchesini)
Comment on attachment 8806787 [details] [diff] [review]
P4 Use ThrottledEventQueue in workers. r=baku

Review of attachment 8806787 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +4002,5 @@
> +  // Throttle events to the main thread using a ThrottledEventQueue specific to
> +  // this worker thread.  This may return nullptr during shutdown.
> +  mMainThreadThrottledEventQueue = ThrottledEventQueue::Create(target);
> +
> +  // If we were able to creat the throttled event queue, then use it for

to create

@@ +4624,5 @@
>      // If the worker thread is spamming the main thread faster than it can
>      // process the work, then pause the worker thread until the MT catches
>      // up.
> +    if (mMainThreadThrottledEventQueue &&
> +        mMainThreadThrottledEventQueue->Length() > 5000) {

Maybe this should be a #define MAX_THROTTLEDEVENTQUEUE_LENGTH
Attachment #8806787 - Flags: review?(amarchesini) → review+
Comment on attachment 8806780 [details] [diff] [review]
P1 Add the ThrottledEventQueue class. r=froydnj

Need to sort out some try failures.
Attachment #8806780 - Flags: review?(nfroyd)
Minor tweak to fix building on windows where the compiler is more strict about ternary types.

Also, I had to make child worker threads inherit their parent worker's ThrottledEventQueue.  Unlike TaskQueue we must currently create ThrottledEventQueue on the main thread.
Attachment #8806787 - Attachment is obsolete: true
Attachment #8806831 - Flags: review+
Comment on attachment 8806784 [details] [diff] [review]
P2 Expose a ThrottledEventQueue on TabGroup and nsPIDOMWindow. r=smaug

> TabGroup::TabGroup()
>-{}
>+{
>+  nsCOMPtr<nsIThread> mainThread;
>+  NS_GetMainThread(getter_AddRefs(mainThread));
>+  MOZ_DIAGNOSTIC_ASSERT(mainThread);
>+
>+  // This may return nullptr during xpcom shutdown.  This is ok as we
>+  // do not guarantee a task queue will be present.
>+  mThrottledEventQueue = ThrottledEventQueue::Create(mainThread);
>+}
I don't think we should create mThrottledEventQueue for chrome tab group, since then
accessing the queue from window returns null, but accessing queue from TabGroup is non-null
Attachment #8806784 - Flags: review?(bugs) → review-
This adds an argument to the TabGroup constructor so we can avoid creating the ThrottledEventQueue for the chrome TabGroup.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc8529e302395177acb25432946825492d999d33
Attachment #8806784 - Attachment is obsolete: true
Attachment #8806844 - Flags: review?(bugs)
Comment on attachment 8806844 [details] [diff] [review]
P2 Expose a ThrottledEventQueue on TabGroup and nsPIDOMWindow. r=smaug

Sorry, too quick to flag here.
Attachment #8806844 - Flags: review?(bugs)
>+  MOZ_ASSERT(!sChromeTabGroup);
Clearly this assertion may fire when non-chrome tab groups are created. So, remove the assertion.

With that, r+ for the patch fixing that issue.
Another worker tweak.  We don't want to shutdown the ThrottledEventQueue explicitly since it might be shared with child workers.  Instead just rely on the automatic shutdown mechanism.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a092dee584b24fd4654240857aac0d67ceb0762
Attachment #8806831 - Attachment is obsolete: true
Attachment #8806900 - Flags: review+
Comment on attachment 8806780 [details] [diff] [review]
P1 Add the ThrottledEventQueue class. r=froydnj

Ok, I didn't need to change anything here.  Please see patch description in comment 69.
Attachment #8806780 - Flags: review?(nfroyd)
Comment on attachment 8806785 [details] [diff] [review]
P3 Make setTimeout() and setInterval() use the TabGroup ThrottledEventQueue. r=smaug

Review of attachment 8806785 [details] [diff] [review]:
-----------------------------------------------------------------

This patch modifies our timer code to dispatch runnables using the window's ThrottledEventQueue.

In addition, we also try to avoid running multiple callbacks in a single runnable.  Chrome windows, however, will still batch similar to how chrome windows bypass the throttling mechanism.
Attachment #8806785 - Flags: review?(bugs)
Comment on attachment 8806788 [details] [diff] [review]
P5 Suspend a window if its main thread TaskQueue falls to far behind. r=smaug

Add a mechanism to apply back pressure to a window.  If we detect the ThrottledEventQueue is falling behind simply Suspend() the window.  We place a runnable at the end of the queue that will Resume() the window (when the queue is presumably clear).

Right now I trigger this from the timer code. As more subsystems start using the ThrottledEventQueue() they should also check for the back pressure condition.

This back pressure mechanism is important to avoid memory problems.  Throttling events by itself keeps the browser responsive, but moves the cpu problem to a memory problem by buffering events in the queue.  The back pressure is then a way to solve the memory problem by not filling the queue any more after it hits a threshold.
Attachment #8806788 - Flags: review?(bugs)
Try build from comment 83 is looking pretty green!  Hoping I can land this before it bit rots.
Comment on attachment 8806788 [details] [diff] [review]
P5 Suspend a window if its main thread TaskQueue falls to far behind. r=smaug

It feels wrong to suspend and yet expect the event queue to be processed.
I must be missing something here.
I would have expected us to throttle timeout handling in case the queue gets too long. Though, that wouldn't help with postMessage messages coming from workers. In that case worker should be throttled.
But looks like this doesn't anyhow deal with postMessages... so, I'm lost here.


Either I'm missing something here, or this adds some new odd state where we're suspended but yet process stuff.
Attachment #8806788 - Flags: review?(bugs) → review-
Comment on attachment 8806788 [details] [diff] [review]
P5 Suspend a window if its main thread TaskQueue falls to far behind. r=smaug

ok, after the explanation on irc, this make more sense.

Please add some comment that this back pressure should apply only in worst case scenarios.
Attachment #8806788 - Flags: review- → review+
Comment on attachment 8806785 [details] [diff] [review]
P3 Make setTimeout() and setInterval() use the TabGroup ThrottledEventQueue. r=smaug

>+  // Run a limited number of timers at once to avoid janking the main
>+  // thread.  Any timers beyond this will get picked up on the next
>+  // timer runnable which should fire immediately.  Chrome windows are
>+  // exempty from this limit.
>+  static const uint32_t kMaxSequentialTimeouts = 1;
I don't understand what this stuff has to do with this bug, but fine.

>   // The timeout list is kept in deadline order. Discover the latest timeout
>   // whose deadline has expired. On some platforms, native timeout events fire
>   // "early", but we handled that above by setting deadline to aTimeout->mWhen
>   // if the timer fired early.  So we can stop walking if we get to timeouts
>   // whose mWhen is greater than deadline, since once that happens we know
>   // nothing past that point is expired.
>   last_expired_timeout = nullptr;
>+  uint32_t count = 0;
>+  bool foundTarget = false;
>   for (Timeout* timeout = mTimeouts.getFirst();
>        timeout && timeout->mWhen <= deadline;
>        timeout = timeout->getNext()) {
>     if (timeout->mFiringDepth == 0) {
>       // Mark any timeouts that are on the list to be fired with the
>       // firing depth so that we can reentrantly run timeouts
>       timeout->mFiringDepth = firingDepth;
>       last_expired_timeout = timeout;
>+
>+      // Note that we have seen the timer this runnable was dispatched
>+      // to fire.
>+      if (timeout == aTimeout) {
>+        foundTarget = true;
>+      }
>+
>+      // Apply throttling to the number of timers we will fire in a
>+      // single runnable before yielding the main thread.
>+      //
>+      // While we would like to throttle immediately after finding our
>+      // limited number of timers, we MUST fire our target timer.  We
>+      // cannot rely on another timer's runnable firing because that
>+      // timer might get cleared as a result of this runnable.
>+      //
>+      // Note, the fact that our target timeout is not always the first
>+      // timer we encounter suggests we have other bugs.  We should
>+      // receive timer callbacks in the same order we store our mTimers
>+      // list.  It seems they can come slightly out of order after
>+      // suspend/resume, however.
>+      count += 1;
>+      if (foundTarget && count >= kMaxSequentialTimeouts && !IsChromeWindow()) {
>+        break;
>+      }
So this doesn't guarantee at most kMaxSequentialTimeouts timeouts, if timeout == aTimeout isn't true for the first timeout.
Per IRC, remove kMaxSequentialTimeouts and just rely on 'timeout == aTimeout' and document that one or more, but usually just one timer ends up running.
Attachment #8806785 - Flags: review?(bugs) → review+
Comment on attachment 8806780 [details] [diff] [review]
P1 Add the ThrottledEventQueue class. r=froydnj

Review of attachment 8806780 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, I just have a couple of comments on the documentation (which is great!).  r=me with comments fixed up.

::: xpcom/threads/ThrottledEventQueue.cpp
@@ +17,5 @@
> +using mozilla::services::GetObserverService;
> +
> +namespace {
> +
> +static const char* kShutdownTopic = "xpcom-shutdown";

Nit: please make this |static const char kShutdownTopic[]|

@@ +158,5 @@
> +    Unused << event->Run();
> +    --mExecutionDepth;
> +
> +    // If shutdown was started and the queue is now empty we can now
> +    // finalize the shutdown.  This is performed separate at the end

"...performed separately", perhaps?

@@ +174,5 @@
> +    MOZ_ASSERT(IsEmpty());
> +
> +    nsCOMPtr<nsIObserverService> obs = GetObserverService();
> +
> +    // This may cause immediate destruction of `this`.

That can't be the case, because the NewRunnableMethod executing ShutdownComplete is already holding a strong ref to `this`, right?

@@ +266,5 @@
> +  AwaitIdle() const
> +  {
> +    // Any thread, except the main thread or our base target.  Blocking the
> +    // main thread is forbidden and block the base target is guaranteed to
> +    // produce a deadlock.

Does this mean to say "...blocking the base target..."?  Not sure if it makes sense otherwise.

::: xpcom/threads/ThrottledEventQueue.h
@@ +3,5 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#ifndef mozilla_ThrottledEventQueue_h

I've been asking people to put a one-line comment here so DXR etc. will display a little information about files.  Maybe:

// nsIEventTarget wrapper for throttling event dispatch.

?

@@ +13,5 @@
> +
> +// A ThrottledEventQueue is an event target that can be used to throttle
> +// events being dispatched to another base target.  It maintains its
> +// own queue of events and only dispatches one at a time to the wrapped
> +// target.  This can be used to avoid flooding the base target.

Thank you for this comment.

It's not immediately obvious from the implementation, or from this comment, how this works: after all, a normal thread maintains its own queue of events and only runs ("dispatches") one at a time.  I think something needs to be added about how Length/AwaitIdle can be used to throttle events coming into the queue?

@@ +53,5 @@
> +  Create(nsIEventTarget* aBaseTarget);
> +
> +  // Determine if there are any events pending in the queue.
> +  bool
> +  IsEmpty() const;

Nit: I think we only use this style with the return type of a separate line for definitions, or if the complete declaration is too long to fit on one line.  Can you fix the declarations up to be on one line?

@@ +72,5 @@
> +  // In order to determine when shutdown completes you can either dispatch
> +  // a runnable before calling MaybeStartShutdown() or you can block
> +  // until the queue is empty using AwaitIdle().
> +  void
> +  MaybeStartShutdown();

Is there value in making this a public function, or could it be protected/private and made visible via `friend` declarations instead?
Attachment #8806780 - Flags: review?(nfroyd) → review+
Thanks for the review!

(In reply to Nathan Froyd [:froydnj] from comment #91)
> @@ +72,5 @@
> > +  // In order to determine when shutdown completes you can either dispatch
> > +  // a runnable before calling MaybeStartShutdown() or you can block
> > +  // until the queue is empty using AwaitIdle().
> > +  void
> > +  MaybeStartShutdown();
> 
> Is there value in making this a public function, or could it be
> protected/private and made visible via `friend` declarations instead?

I was using it in my workers patch, but stopped doing so.  I can make it private for now and change the comments to indicate that people should depend on the self-closing behavior.
I mean, I think explicitly shutting down is a valid use case, but maybe we don't need it right now.
(In reply to Ben Kelly [:bkelly] from comment #93)
> I mean, I think explicitly shutting down is a valid use case, but maybe we
> don't need it right now.

I think that could be a valid use case, too, but I would prefer that we make the API non-visible right now and let the person who needs it make a case for why it should be public later on.
Should we MOZ_COUNT_CTOR/DTOR the Inner class, to ensure the observer service isn't holding a bunch of these things alive?
Flags: needinfo?(bkelly)
(In reply to Nathan Froyd [:froydnj] from comment #95)
> Should we MOZ_COUNT_CTOR/DTOR the Inner class, to ensure the observer
> service isn't holding a bunch of these things alive?

NS_IMPL_ISUPPORTS uses ref-count logging which is supposedly preferred over ctor/dtor logging.
Flags: needinfo?(bkelly)
One last try to make sure I didn't break something while addressing review feedback:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ba29c73421cc2e05439bb8edd4badbf0fc28aa9
Attachment #8806788 - Attachment is obsolete: true
Attachment #8808284 - Flags: review+
I don't want to hold this up, but is there a reason the queue is on the TabGroup and not the DocGroup?
Flags: needinfo?(bkelly)
(In reply to Bill McCloskey (:billm) from comment #102)
> I don't want to hold this up, but is there a reason the queue is on the
> TabGroup and not the DocGroup?

Its the safest starting point given all documents in the TabGroup have sync access to each other in one form or another.  In the future we can create DocGroup specific queues that feed into the TabGroup queue.  Timers could then use those DocGroup queues.  Or we could even have window specific timer queues.

I guess I view this as a starting point with the expectation that we will continue to refine it.
Flags: needinfo?(bkelly)
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/537dea427441
P1 Add the ThrottledEventQueue class. r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/a7b4c0d350ef
P2 Expose a ThrottledEventQueue on TabGroup and nsPIDOMWindow. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/00efae73e570
P3 Make setTimeout() and setInterval() use the TabGroup ThrottledEventQueue. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/2ff33f3c59d5
P4 Use ThrottledEventQueue in workers. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/24d97dc514d4
P5 Suspend a window if its main thread TaskQueue falls to far behind. r=smaug
Blocks: 1315803
Depends on: 1315867
Depends on: 1315895
Depends on: 1318084
Blocks: 1320235
Depends on: 1334537
No longer depends on: 1334537
Depends on: 1336598
No longer depends on: 1336598
Depends on: 1342854
Depends on: 1345645
Depends on: 1355809
Depends on: 1448926
No longer depends on: 1448926
See Also: → 1448926
Depends on: 1504692
Component: DOM → DOM: Core & HTML
Regressions: 1609296
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: