Closed Bug 1312960 Opened 3 years ago Closed 3 years ago

Each message in IPC queue should be associated with the runnable that will run it

Categories

(Core :: IPC, defect)


Tracking


RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: billm, Assigned: billm)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Right now, we enqueue IPC messages in an mPending queue and post runnables to dequeue them. Each runnable just picks off the first message, whatever it happens to be. We only need to ensure that there are at least as many runnables as there are messages in the queue.

This design causes problems if we want to put different runnables in different event queues, which is something we want to do for bug 1305926 and bug 1302270. As an example, we want vsync and input messages to go in a higher priority event queue. To do that, each runnable needs to know what message it's going to run before we post it to the event loop.
Since the refactoring work that Kyle did, we only post Gecko runnables to the chromium MessageLoop. Consequently, all the runnables are already refcounted. Prior to that, we needed to use RefCountedTask if we wanted to post the same task to the event loop more than once. Now we no longer need RefCountedTask, and it can be removed. This patch stops using it for mOnChannelConnectedTask.
Attachment #8804551 - Flags: review?(dvander)
This code makes no sense at all. We only call FlushPendingInterruptQueue on a PBackground channel. But PBackground doesn't use any intr messages, and this code is a no-op if the queue has no intr messages in it. Let's get rid of it.
Attachment #8804555 - Flags: review?(dvander)
Attached patch main patchSplinter Review
This patch gets rid of the OnMaybeDequeueOne thing. Instead, each element in mPending is actually a runnable (called MessageTask) that owns an IPC Message. The mPending list is stored as a linked list of these runnables.

When we go to run a MessageTask, we check if the task is still in mPending. If it's not, we assume it was removed by ProcessPendingMessages and we don't run it. Additionally, we may choose not to run the message due to a timeout. In that case, the MessageTask remains in mPending but it's not in the event queue. When the timeout ends, we may post the MessageTask back to the event queue so it can be run.

I tried to preserve the way that intr messages work as much as I could. There's one place in EnqueuePendingMessages where I removed a loop that posts tasks for all mDeferred messages. I don't think this loop is needed. MaybeUndeferIncall will move a message from mDeferred to mPending if mDeferred is non-empty. When it does this, it posts the new message to the event loop. When that task runs, it will call MaybeUndeferIncall again, and that task can remove another mDeferred element and post it. So I think the loop is not needed.
Attachment #8804559 - Flags: review?(dvander)
Attachment #8804551 - Flags: review?(dvander) → review+
Attachment #8804555 - Flags: review?(dvander) → review+
Bill, I was on the verge of changing this code to implement bug 1284369.  I was planning to make us queue only one DequeueTask in the loop at a time.  Once it executed one message, it would then requeue itself to post the next message.

The effect would be that a remote process could place at most one runnable in the target process's main event queue at once.  This would avoid flooding and prevent a remote process from overwhelming the target process.  It's the same anti-flooding approach I've used in bug 1300658 and soon in bug 1300659.

Since you are rewriting this code, do you think you could build in this logic now?  Or do you think your changes are hostile to it?  This code is a bit weird in that it seems to go to the message loop directly instead of using an nsIEventTarget.

Thanks.
Flags: needinfo?(wmccloskey)
(In reply to Ben Kelly [:bkelly] from comment #4)
> Bill, I was on the verge of changing this code to implement bug 1284369.  I
> was planning to make us queue only one DequeueTask in the loop at a time. 
> Once it executed one message, it would then requeue itself to post the next
> message.

I'm worried that would make our responsiveness worse in cases where there's no flooding. It would basically give runnables that are generated in the content process priority over runnables that come in over IPC. Since some of the most important runnables, like the vsync notification and input events, come in over IPC, that seems a little backwards.

Maybe I'm misunderstanding though. Do you want to do this for the chrome process, the content process, or both? I could see an argument for doing it for the chrome process.

> Since you are rewriting this code, do you think you could build in this
> logic now?  Or do you think your changes are hostile to it?

Well, the next step in my plan is to allow you to specify a custom nsIEventTarget for each actor. When we post a given message to the event loop, we'll first look up the actor for it and see if it or any of its ancestors has a custom nsIEventTarget defined. If it does then we'll post to that.

Once that happens it would probably make sense to use TaskQueues as the event targets for some of the lower priority actors. Note that we'll have to think pretty hard about ordering issues when we do that, but that's fine. (Also, we'll need to figure out what to do about the annoying fact that a TaskQueue is not an nsIEventTarget.)

> This code is a
> bit weird in that it seems to go to the message loop directly instead of
> using an nsIEventTarget.

Some of our MessageLoops don't run on XPCOM threads, so we have to fix that before we can use nsIEventTargets generically.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #6)
> I'm worried that would make our responsiveness worse in cases where there's
> no flooding. It would basically give runnables that are generated in the
> content process priority over runnables that come in over IPC. Since some of
> the most important runnables, like the vsync notification and input events,
> come in over IPC, that seems a little backwards.

Things like vsync are being changed to be high-priority IPC, which would bypass the mechanism.

Also, slamming N runnables into the queue on top of the runnables the child was already going to produce is pretty much guaranteed to produce more jank.  Right now the parent process does *zero* throttling to the child because there is no concept of IPC backpressure.  It will flood the target process.  Just look at what the IDB subsystem does in this benchmark:

http://nolanlawson.github.io/database-comparison/

If you select 100,000 docs with IDB in the main thread environment, we jank hard.  This is because we flood the child process event queue with 100,000 IDBRequest complete events.  If we allowed animation and paint runnables to be interleaved, however, we would have much less jank and possibly no jank in this demo.

In addition, keeping runnables in the MessageChannel allows us to detect when we are backing up.  We can look at the depth of the MessageChannel.  If it exceeds a depth of N, then we can signal the parent process to throttle itself.  It would be nice if the actor base classes had "BackPressureEnabled()" and "BackPressureDisabled()" virtual method hooks.  Then actors could catch the back pressure and stop producing messages.  For example, IDB could halt its disk I/O until the back pressure is alleviated.

> Maybe I'm misunderstanding though. Do you want to do this for the chrome
> process, the content process, or both? I could see an argument for doing it
> for the chrome process.

It should be everywhere in my opinion.  A remote process should not be able to flood another process.

For example, I've also seen our plugin flash process create problems by flooding the parent process.  See bug 1302731.

> > Since you are rewriting this code, do you think you could build in this
> > logic now?  Or do you think your changes are hostile to it?
> 
> Well, the next step in my plan is to allow you to specify a custom
> nsIEventTarget for each actor. When we post a given message to the event
> loop, we'll first look up the actor for it and see if it or any of its
> ancestors has a custom nsIEventTarget defined. If it does then we'll post to
> that.
> 
> Once that happens it would probably make sense to use TaskQueues as the
> event targets for some of the lower priority actors. Note that we'll have to
> think pretty hard about ordering issues when we do that, but that's fine.
> (Also, we'll need to figure out what to do about the annoying fact that a
> TaskQueue is not an nsIEventTarget.)

Ok, that was my original plan.  I bailed on it when I saw IPC did not use event targets internally.

If this is where you are going, then that works for me.  I'll just step back and work on some other tasks.  Just let me know if you need my help with the task queue stuff.

As far as TaskQueue not being an nsIEventTarget, I'm going to land a hybrid thing in bug 1284369.  It implements nsIEventTarget and will also automatically handle closing gracefully when de-refed or at browser shutdown.

Thanks!
Sorry, the hybrid task queue thing is in bug 1300659.  See:

https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1300659&attachment=8791375

The name is annoying, but the behavior makes things easier.  We can evolve the name over time to something better.
OK, sounds good. I definitely agree that backpressure is important.
Attached patch test fixSplinter Review
This test needs to wait for about:blank to load in the remote iframe. Otherwise, the appendChild call can get run before about:blank has even begun loading, in which case it gets overwritten. For some reason this happens a lot more with the patches in this bug.
Attachment #8805273 - Flags: review?(poirot.alex)
Comment on attachment 8805273 [details] [diff] [review]
test fix

I'm wondering if it would be safer to wait for the load event from the parent process before doing *anything*?
i.e. wait for load event on the <iframe remote> created in the beginning of this test. Then, only after that, add message listeners and execute the frame script.

But your patch looks totally fine and is somewhat equivalent.
Attachment #8805273 - Flags: review?(poirot.alex) → review+
Comment on attachment 8804559 [details] [diff] [review]
main patch

Review of attachment 8804559 [details] [diff] [review]:
-----------------------------------------------------------------

This is great. It was very sketchy how OnMaybeDequeueOne and mPending were only vaguely tied together.

::: ipc/glue/MessageChannel.cpp
@@ -945,5 @@
> -        if (it != mPending.rend()) {
> -            // This message type has compression enabled, and the queue holds
> -            // a message with the same message type and routed to the same destination.
> -            // Erase it.  Note that, since we always compress these redundancies, There Can
> -            // Be Only One.

It would be good to keep the comment "since we always compress these redundancies, there can be only one."

@@ +990,5 @@
>          NotifyWorkerThread();
>      }
>  
>      if (shouldPostTask) {
> +        task->Post();

It'd read a little nicer if the Post function was on the MessageChannel and took in a MessageTask.

@@ +997,5 @@
>  
>  void
>  MessageChannel::PeekMessages(mozilla::function<bool(const Message& aMsg)> aInvoke)
>  {
> +    // FIXME: We shouldn't be holding the lock for aInvoke!

FIXME?

@@ +1539,3 @@
>  
> +    if (!Connected()) {
> +        ReportConnectionError("OnMaybeDequeueOne");

nit: Error string is out of date (unless that's intentional)

@@ +1611,5 @@
> +
> +    if (!isInList()) {
> +        return NS_OK;
> +    }
> +    remove();

This looks a little dangerous if called while iterating mPending, so maybe the function deserves a comment.
Attachment #8804559 - Flags: review?(dvander) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bfee38f564ed
Fix test_child_docshell.html (r=ochameau)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8e766e68c91e
Use CancelableRunnable for mOnChannelConnectedTask (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/8470f22948e8
Remove FlushPendingInterruptQueue (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6f081b175ca
Associate each message in IPC queue with the runnable that will run it (r=dvander)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a171aae3ae49
MessageChannel compiler fixes on a CLOSED TREE
Backed bug 1312960 and bug 1310547 out for failing splitText-normalize.html in reftests on Windows 7 VM debug:

Bug 1312960:
https://hg.mozilla.org/integration/mozilla-inbound/rev/be5f087a2c4e3fcfd64a56921429f2a5b76dc02f
https://hg.mozilla.org/integration/mozilla-inbound/rev/07f30746ec45034db4a2d6da54796e6742ab7dfb
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b390287fd434a9256a424a88f59d37cf1073362
https://hg.mozilla.org/integration/mozilla-inbound/rev/4c5b55684c21a261a1f4187b290d6ac7b4d5891c
https://hg.mozilla.org/integration/mozilla-inbound/rev/c4f2be33044f293bba6518622bbf4043b48e57bb

Bug 1310547:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b38d9a1cd7995c1cf79d7fc9e5351839a1a2d95c
https://hg.mozilla.org/integration/mozilla-inbound/rev/0e4c28cbee9c44b1e1a97846212f03f95f4b11fb
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac349d1b3c10546da0193046d6ce4b7f065eb070
https://hg.mozilla.org/integration/mozilla-inbound/rev/d9c679129b669805d737c09c97a4768a8e6f7e6f
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e9f99645168e37e0f7298faebeba3e263491bbc
https://hg.mozilla.org/integration/mozilla-inbound/rev/c5a67edeb6239cc576be71fc8ec4955e544a4ca4
https://hg.mozilla.org/integration/mozilla-inbound/rev/e419e663860780657e7825bd7a0108e414c1a134
https://hg.mozilla.org/integration/mozilla-inbound/rev/e0730d3f61026ed2d45b4f339b7a1082199bcedb
https://hg.mozilla.org/integration/mozilla-inbound/rev/ebd1c1823d617cabfc57d3d9325f2be4a28330ec

First not busted push which exposes the failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a171aae3ae491dd52e184735c51924c7a467a0bf
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=38420599&repo=mozilla-inbound

01:36:25     INFO - REFTEST TEST-UNEXPECTED-FAIL | file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/selection/splitText-normalize.html | load failed: timed out waiting for pending paint count to reach zero (waiting for MozAfterPaint)
Flags: needinfo?(wmccloskey)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f1d9568cc12f
Use CancelableRunnable for mOnChannelConnectedTask (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/13826b3d90c7
Remove FlushPendingInterruptQueue (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/774891f4556d
Associate each message in IPC queue with the runnable that will run it (r=dvander)
Flags: needinfo?(wmccloskey)
Depends on: 1334097