Closed Bug 901789 Opened 7 years ago Closed 6 years ago

Combine AsyncChannel, SyncChannel, RPCChannel into one class

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla27

People

(Reporter: dvander, Assigned: dvander)

References

Details

Attachments

(2 files, 5 obsolete files)

The Async/Sync/RPC protocol implementations are implemented as separate classes that inherit from each other. This would make sense if they each strictly built on each other's functionality, and top-level actors opted into Async or Sync-only channels. However, with the introduction of urgent messages, the former isn't true, and the latter hasn't really occurred and doesn't seem that useful. We're left with RPCChannel being the only implementation active in practice, and it's difficult to understand since all the code is spread out and some of it is dead.

When adding urgent message support, we were debating whether to bite the bullet and do a large refactoring. We decided not to. But now we're looking to add a new protocol that allows us to make certain sync messages re-entrant (still, from child to parent only), and doing it on the current design would be quite painful. So it feels like the right time to refactor.

The first step is just to remove the Async/Sync/RPCChannel distinction and have a combined "MessageChannel" class. It'll preserve most of the existing logic, just eliminating redundancy. I'd also like to split out the Link/ProcessLink/ThreadLink classes into MessageLink.h/cpp just to make the implementation and headers easier to read.

Should have a prototype patch very soon.
Attached patch rough draft, v0 (obsolete) — Splinter Review
Prototype of the new channel implementation, nowhere near ready for checkin yet but should be good enough for feedback. Aside from combining the three existing protocol files, the "urgent" protocol has undergone major simplification. It has been restricted such that:
 * Only the parent process may initiate an urgent call.
 * An urgent call may not be initiated within an RPC outcall.

With these two restrictions, the "mUrgent" queue can only have one message at most, so it has become a single-entry field. In addition, the new urgent semantics basically match the "sync" protocol, so the guts of Send(msg, reply) have been factored into a new "SendAndWait" function, which both Call(urgent) and Send(msg) use. This also means the mysterious queue mNonUrgentDeferred goes away, since queuing of deferred messages just works as normal.

With these changes it should be very trivial to implement the next protocol features we need.

Note, the RPC implementation have mostly been copied wholesale.

TODOs:
 * Make sure comments are up to date and have not been cut from the original files.
 * Revive RPC_ASSERT as IPC_ASSERT and use where appropriate.
 * Get rid of all the diff noise around having moved Side to global scope.
 * Move the urgent bit from the priority field to a new protocol bit.
Attachment #787090 - Flags: feedback?(cjones.bugs)
Attached patch rough draft v1 (obsolete) — Splinter Review
* Renamed Child/Parent/Unknown to ChildSide/ParentSide/UnknownSide because of dom::ipc::Blob
 * hg remove [Async|RPC|Sync]Channel.cpp
 * Moved "urgent" protocol to a special bit instead of the priority field

Working through test failures now.
Attachment #787090 - Attachment is obsolete: true
Attachment #787090 - Flags: feedback?(cjones.bugs)
Attachment #787873 - Flags: feedback?(cjones.bugs)
Comment on attachment 787873 [details] [diff] [review]
rough draft v1

This looking about how I expected.  Some things I'd like to see if possible

 - hg mv RPCChannel.{cpp,h} MessageChannel.{cpp,h}.  There's some crazy stuff in the RPCChannel history I'd like to preserve.  The others less so.
 - I was expecting the sync-send and rpc-call methods to share more code or be the same helper method, but I didn't look particularly closely to see why they're not.
 - minor style thing, but I like to keep the worker-thread methods clearly separated in the source file from the IO-thread methods.  (That may have "regressed" in the current sources, not sure.)
Attachment #787873 - Flags: feedback?(cjones.bugs) → feedback+
Attached patch rough draft v2 (obsolete) — Splinter Review
Addressed feedback comments, passes tests, appears okay on try. Going to do some manual testing on Windows before setting r?
Attachment #787873 - Attachment is obsolete: true
Attached patch v3 (obsolete) — Splinter Review
fixes some Windows stuff, re-introduces RPC_ASSERT as IPC_ASSERT and revives the old assert messages.
Attachment #793271 - Attachment is obsolete: true
Attached patch v4 (obsolete) — Splinter Review
This is starting to look green everywhere.
Attachment #794976 - Attachment is obsolete: true
Attachment #796229 - Flags: review?(cjones.bugs)
re: SendAndWait and RPCCall not sharing code. Since RPC has a lot of extra details, I wanted them to be separate so it was clearer how they behaved differently. I can combine them back again though if it's better to combine them. I didn't think too hard about it at the time.
Attached patch v5Splinter Review
This version just factors SendAndWait callers a little better to avoid code duplication. I'll post an interdiff, it's small.
Attachment #796229 - Attachment is obsolete: true
Attachment #796229 - Flags: review?(cjones.bugs)
Attachment #797540 - Flags: review?(cjones.bugs)
Heh, I had just starting reviewing v4 ;).
Comment on attachment 797540 [details] [diff] [review]
v5

There's a fair amount of orthogonal change in here, and for future
reference it would have been a little faster to review if split up
more.

>diff --git a/ipc/glue/RPCChannel.cpp b/ipc/glue/MessageChannel.cpp

>+using mozilla::MonitorAutoLock;
>+using mozilla::MonitorAutoUnlock;

These aren't necessary, you can nuke them.

>+void
>+MessageChannel::OnMessageReceivedFromLink(const Message& aMsg)
>+{

>+    // Urgent messages cannot be compressed.
>+    MOZ_ASSERT(!aMsg.compress() || !aMsg.is_urgent());

It would be good to add an ipc/ipdl/test/ipdl/error/ test for this
case.

>diff --git a/ipc/glue/RPCChannel.h b/ipc/glue/MessageChannel.h

>+    // Any protocol that requires blocking until a reply arrives, will send its
>+    // outgoing message through this function. Currently, two protocols do this:

(Terminology nit: a "protocol" is a set of messages and a state
machine.  This code uses "messaging semantics" for what you're
referring to here.)

>+    // Worker-thread only; type we're expecting for the reply to a sync
>+    // out-message. This will never be greater than 1.
>+    size_t mPendingSyncReplies;
>+

Use a bool?

>+    // Worker-thread only; type we're expecting for the reply to an
>+    // urgent out-message. This will never be greater than 1.
>+    size_t mPendingUrgentReplies;

"never be greater than 1" --- for the time being, right?

>diff --git a/ipc/glue/moz.build b/ipc/glue/moz.build

(I don't know anything about these files, so this is a rubber-stamp.)

>diff --git a/ipc/ipdl/ipdl/type.py b/ipc/ipdl/ipdl/type.py

>+        if mtype.isUrgent() and (mtype.isIn() or mtype.isInout()):
>+            self.error(
>+                loc,
>+                "urgent child-to-parent messages are verboten (here, message `%s' in protocol `%s')",
>+                mname, pname)

Please add an ipc/ipdl/test/ipdl/error/ (compiler-only) test for this.

>diff --git a/ipc/ipdl/test/cxx/PTestDataStructures.ipdl b/ipc/ipdl/test/cxx/PTestDataStructures.ipdl

> include "mozilla/GfxMessageUtils.h";
> 
>+include "mozilla/GfxMessageUtils.h";
>+

Accident?

This is basically a review of everything but MessageChannel.cpp.  I'm
going to need to come back and make another pass over that.

There's some code here that doesn't look familiar to me, so it's
probably a good idea for bent or bsmedberg to peek at this too.
Attachment #797540 - Flags: review?
Comment on attachment 797540 [details] [diff] [review]
v5

Feel free to tag in bsmedberg if you're overloaded or not comfortable reviewing this.
Attachment #797540 - Flags: review? → review?(bent.mozilla)
I should add that this is looking really good :).
(In reply to Chris Jones [:cjones] [:warhammer] from comment #11)
>
> There's a fair amount of orthogonal change in here, and for future
> reference it would have been a little faster to review if split up
> more.

Yeah, I apologize for the patch bomb, I didn't really notice until the end.

> >+    // Worker-thread only; type we're expecting for the reply to an
> >+    // urgent out-message. This will never be greater than 1.
> >+    size_t mPendingUrgentReplies;
> 
> "never be greater than 1" --- for the time being, right?

Yeah, indeed.
Comment on attachment 797540 [details] [diff] [review]
v5

>diff --git a/ipc/glue/RPCChannel.cpp b/ipc/glue/MessageChannel.cpp

>+    NS_ABORT_IF_FALSE(aReply->is_sync(), "reply is not sync");

Let's replace all the ABORT_IF_FALSE()s here with IPC_ASSERT()s while
we're at it.

I don't see anything dissuading me from r+, so here it is :).  Please
address the comments above.
Attachment #797540 - Flags: review?(cjones.bugs) → review+
(In reply to Chris Jones [:cjones] [:warhammer] from comment #15)
> Comment on attachment 797540 [details] [diff] [review]
> v5
> 
> >diff --git a/ipc/glue/RPCChannel.cpp b/ipc/glue/MessageChannel.cpp
> 
> >+    NS_ABORT_IF_FALSE(aReply->is_sync(), "reply is not sync");
> 
> Let's replace all the ABORT_IF_FALSE()s here with IPC_ASSERT()s while
> we're at it.
> 
> I don't see anything dissuading me from r+, so here it is :).  Please
> address the comments above.

I just noticed that RPC_ASSERT was a release-mode assert, which is why it wasn't used everywhere in the old RPCChannel. I'll assume it's okay to change this to a debug-only thing? :)
Change which to debug-only? RPC_ASSERT was very intentionally a release-mode assert, with the following general guidelines:

* It should be used for programming errors using the RPC library
* It should not be used for conditions where corrupted data from a child could trigger the abort in a parent (parents should kill the child but should not crash themselves)
* It may be used if bad data is passed to a child (child will crash, parent should recover normally)
Yes, RPC_ASSERT probably wasn't the greatest name because of the potential confusion.  RPC_RUNTIMEABORT would have been better.  You're both right that both sets of checks/aborts should be re-audited, but at this low level of code I generally prefer to fail-safe if a core invariant is violated than fail-soft.  I'm happy to defer that discussion / followup work though.
Sorry, I'm late to this party. Chris, which code is unfamiliar to you (when you flipped r? to me)? I'd rather not re-review all the stuff you have already reviewed.
Flags: needinfo?(cjones.bugs)
The changes that seem to have come in from bug 893242, as I recall.  The patch here is basically moving stuff around so and to that end looked fine to me, but I can't review any deeper than that.  If there's anything there you're concerned about, I'd encourage you to take a look.

Also if you'd like to take a look at the overall change, that'd be good too, but up to you.
Flags: needinfo?(cjones.bugs)
Comment on attachment 797540 [details] [diff] [review]
v5

Review of attachment 797540 [details] [diff] [review]:
-----------------------------------------------------------------

I think this looks great!

You've got some excess/trailing whitespace scattered throughout that needs to be cleaned up, but other than that I just have two questions:

::: ipc/glue/WindowsMessageLoop.cpp
@@ +693,5 @@
>        if (msg.message == WM_QUIT) {
>            NS_ERROR("WM_QUIT received in SpinInternalEventLoop!");
>        } else {
>            TranslateMessage(&msg);
> +          ::DispatchMessageW(&msg);

Nit: Is this needed somehow?

::: ipc/ipdl/ipdl/lower.py
@@ +2954,5 @@
>                                             params=params, ret=_Result.Type()))
> +
> +            if not switch:
> +              crash = StmtExpr(ExprCall(ExprVar('MOZ_ASSUME_UNREACHABLE'),
> +                               args=[ExprLiteral.String('message protocol not supported')]))

This won't actually force a crash in release builds... If you want a crash you need MOZ_CRASH.
Attachment #797540 - Flags: review?(bent.mozilla) → review+
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #21)
> You've got some excess/trailing whitespace scattered throughout that needs
> to be cleaned up, but other than that I just have two questions:

Thanks, fixed

> >        if (msg.message == WM_QUIT) {
> >            NS_ERROR("WM_QUIT received in SpinInternalEventLoop!");
> >        } else {
> >            TranslateMessage(&msg);
> > +          ::DispatchMessageW(&msg);
> 
> Nit: Is this needed somehow?

Yeah, DispatchMessage is a class member now.
https://hg.mozilla.org/mozilla-central/rev/9cc90a4b6475
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla27
I am looking at bug 798219 and it appears on many talos runs we fail to quit the browser, I see a line from the patches on this bug in the output only when we fail to close the browser:
08:24:43     INFO -  NOTE: child process received `Goodbye', closing down

I don't know why this is a problem about 33% of the time while running talos?  Is there a way I can shut things down differently?  Here is how we quit:
http://hg.mozilla.org/build/talos/file/e11c30b59cef/talos/pageloader/chrome/quit.js
Flags: needinfo?(dvander)
this happens mostly on osx.
That message just means that the content process saw a message from the parent telling it to shut down, and it does so.
ok, so some reason we have a content process sitting around and not allowing Firefox to terminate.  This only happens when I see the above mentioned message in the log file.  For the other 66% of the test runs on talos, we don't see this message and terminate correctly.  I am open to suggestions.
Did anyone file a bug about the clobber?
(In reply to Joel Maher (:jmaher) from comment #25)
> I am looking at bug 798219 and it appears on many talos runs we fail to quit
> the browser, I see a line from the patches on this bug in the output only
> when we fail to close the browser:
> 08:24:43     INFO -  NOTE: child process received `Goodbye', closing down
> 
> I don't know why this is a problem about 33% of the time while running
> talos?  Is there a way I can shut things down differently?  Here is how we
> quit:
> http://hg.mozilla.org/build/talos/file/e11c30b59cef/talos/pageloader/chrome/
> quit.js

Answered in IRC earlier, but for posterity: comment #27 is right, this message always prints when the child process exits. To debug the TBPL problem it would help to have stacks of the child and parent processes.
Flags: needinfo?(dvander)
Depends on: 924121
Depends on: 976838
You need to log in before you can comment on or make changes to this bug.