Last Comment Bug 660774 - e10s necko: refactor channelEventQueue to allow async resume/flush
: e10s necko: refactor channelEventQueue to allow async resume/flush
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Networking (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla7
Assigned To: Jason Duell [:jduell] (needinfo? me)
:
Mentors:
Depends on:
Blocks: 637339
  Show dependency treegraph
 
Reported: 2011-05-31 02:52 PDT by Jason Duell [:jduell] (needinfo? me)
Modified: 2011-06-13 14:07 PDT (History)
5 users (show)
mounir: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Simple fix. (9.11 KB, patch)
2011-05-31 02:52 PDT, Jason Duell [:jduell] (needinfo? me)
josh: review+
Details | Diff | Review
More extensive refactoring (applies on top of simple fix) (39.83 KB, patch)
2011-05-31 02:54 PDT, Jason Duell [:jduell] (needinfo? me)
josh: review+
Details | Diff | Review
fixes issue with test_rentrancy_wrap.js (1.29 KB, patch)
2011-06-11 13:44 PDT, Jason Duell [:jduell] (needinfo? me)
no flags Details | Diff | Review

Description Jason Duell [:jduell] (needinfo? me) 2011-05-31 02:52:32 PDT
Created attachment 536259 [details] [diff] [review]
Simple fix.

In bug 637339 we've moved mCallonResume and FlushEventQueue() to an asyc callback (instead of right within Resume()) so that we don't flush the queue (and call client callbacks) before the client's Resume call completes.  Alas, this breaks the existing, rather hack-y ChannelEventQueue logic.

In particular, if the suspend happens within OnStartReq, etc., we wind up hitting FlushEventQueue (when OnStartReq's AutoEventEnqueuer goes out of scope), and that barfs on the mQueuePhase==PHASE_FINISHED_QUEUEING we set in Resume().  If we remove the error check and allow PHASE_FINISHED_QUEUEING the queue will get flushed before we get a chance to run mCallOnResume, which would be out of order.  I suppose we might be able to fix that by adding another state to mQueuePhase, or possibly even (now that I think about it) by setting mQueuePhase=PHASE_QUEUEING in HttpChannelChild::Resume(), plus removing the NS_ABORT_IF_FALSE(mQueuePhase != PHASE_UNQUEUED) in FlushEventQueue().

But in general I'm finding the ChannelEventQueue logic to be too fragile and unintuitive:  for example, when suspended we wind up queueing messages while state=PHASE_UNQUEUED.  For this bug I offer two approaches--one quick fix, and one more extensive refactoring.  I think the second is better.

Note: both of these patches are built on top of the patches in bug 637339, but you don't really need to understand them at all, other than than HttpChannelChild::Resume now launches an async "CompleteResume()" method, which is where we want to actually resume the EventQueue.

Short term fix: I've built in formal Suspend/Resume methods into ChannelEventQueue.  This allows us to delay resuming the ChannelEventQueue until we get to CompleteResume().  It also cleans up the caller code a bit (I could also remove all the template logic from ChannelEventQueue, but I haven't bothered to here).
Comment 1 Jason Duell [:jduell] (needinfo? me) 2011-05-31 02:54:07 PDT
Created attachment 536260 [details] [diff] [review]
More extensive refactoring (applies on top of simple fix)

2nd, more ambituous patch.  Follow on the first patch, but change ChannelEventQueue from using a single state variable and instead check for the various conditions (suspended, in a 'critical section', flushing) that we've been fudging with an enum.  I think this makes the logic a lot easier to follow.

Also convert ChannelEventQueue to be a non-template class, and make it a member of channels, rather than a base class.
Comment 2 Josh Matthews [:jdm] 2011-05-31 05:39:01 PDT
Comment on attachment 536260 [details] [diff] [review]
More extensive refactoring (applies on top of simple fix)

I really like the way the logic is presented here; it's significantly easier to follow. My comments are all superficial things.

>+#include <nsIChannel.h>

I think this can be forward-declared.

>+  bool mForced;
>   bool mSuspended;
>+  bool mFlushing;

These could be bitfields. I don't feel strongly one way or the other, though.

>+  NS_ASSERTION(!( (answer == false) && !mEventQueue.IsEmpty()),
>+               "Should always enqueue if ChannelEventQueue not empty");

This hurts every time I think about it. Can we apply DeMorgan's laws and get |answer == true || mEventQueue.IsEmpty()| instead?

>+  NS_ASSERTION(!mSuspended,

NS_ABORT_IF_FALSE

>+  NS_ASSERTION(mSuspended,

NS_ABORT_IF_FALSE

>+  NS_ASSERTION(!mForced, "MaybeFlushQueue called inside critical section");

NS_ABORT_IF_FALSE

>+    mEventQ.Resume();    // TODO: make this async: see HttpChannelChild::Resume

Get a bug filed on that?
Comment 3 Josh Matthews [:jdm] 2011-05-31 05:40:44 PDT
However, please add the test changes from the first patch into the second one.
Comment 4 Josh Matthews [:jdm] 2011-05-31 05:41:21 PDT
Actually ignore comment 3, I can't read.
Comment 5 Jason Duell [:jduell] (needinfo? me) 2011-06-01 16:02:32 PDT
> >+#include <nsIChannel.h>
> 
> I think this can be forward-declared.

Done. We'll slow global warming down a little from that, except that I suspect all the code that includes this file will eventually include nsIChannel.h too.

 
> These could be bitfields. I don't feel strongly one way or the other, though.

Not going to bother with bitfields. If we had 12+ bools, then sure.


> >+  NS_ASSERTION(!( (answer == false) && !mEventQueue.IsEmpty()),
> >+               "Should always enqueue if ChannelEventQueue not empty");
> 
> This hurts every time I think about it. Can we apply DeMorgan's laws and get
> |answer == true || mEventQueue.IsEmpty()| instead?

I like it.  Sold!

> >+  NS_ASSERTION(!mSuspended,
> 
> NS_ABORT_IF_FALSE

What do you like about NS_ABORT so much?  You want to crash in the field, not just get orange on tbpl?

> >+    mEventQ.Resume();    // TODO: make this async: see HttpChannelChild::Resume
> 
> Get a bug filed on that?

Yes--bug 657076.

Pushed to try...
Comment 6 Josh Matthews [:jdm] 2011-06-01 19:56:14 PDT
Well, I conceptually like NS_ABORT for things we're asserting that are meant to be actual invariants. The cases I highlighted fill that role in my mind; if you disagree then I'll defer to you.
Comment 7 Jason Duell [:jduell] (needinfo? me) 2011-06-02 15:57:25 PDT
Meh--causing test_reentrancy_wrap.js to fail.  Will look into it...
Comment 8 Josh Matthews [:jdm] 2011-06-08 07:15:36 PDT
With regards to asserting, want to compromise on using MOZ_ASSERT from mfbt/Util.h to get hard assertions in debug and nothing in release?
Comment 9 Jason Duell [:jduell] (needinfo? me) 2011-06-09 12:28:55 PDT
MOZ_ASSERT is so much shorter to spell than NS_ABORT_IF_FALSE--that's great!  Is there any other semantic difference?  (both seem to drop dead if debug, else nothing: MOZ_ASSERT doesn't let you set a msg). 

Alas, MOZ_ASSERT currently ties into JS, and while we no longer "really" care about keeping necko independent from other parts of the browser, this seems like a silly place to start breaking that interface.  So I'll stick with NS_ABORT_IF_FALSE.
Comment 10 Josh Matthews [:jdm] 2011-06-09 12:34:22 PDT
MOZ_ASSERT is a nop in release builds. And while it technically ties into the JS engine right now, that's going to change in the nearish future, once mfbt/ turns into a real library of its own right.
Comment 11 Jason Duell [:jduell] (needinfo? me) 2011-06-11 13:44:16 PDT
Created attachment 538715 [details] [diff] [review]
fixes issue with test_rentrancy_wrap.js

test_reentrancy_wrap.js shows that it is possible to call MaybeFlushQueue from within a critical section (for instance, if you're in OnStartRequest for a while and something outside suspends/resumes you).   Fixed by checking for !mForced.

Pushed to try.
Comment 12 Jason Duell [:jduell] (needinfo? me) 2011-06-11 18:43:04 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/4a33bc1f772d
Comment 13 Mounir Lamouri (:mounir) 2011-06-12 08:30:50 PDT
Pushed:
http://hg.mozilla.org/mozilla-central/rev/4a33bc1f772d

Note You need to log in before you can comment on or make changes to this bug.