Last Comment Bug 637339 - Improve error handling in HttpChannelChild
: Improve error handling in HttpChannelChild
Status: RESOLVED FIXED
: crash
Product: Core
Classification: Components
Component: Networking (show other bugs)
: Trunk
: All All
: -- critical (vote)
: mozilla7
Assigned To: Jason Duell [:jduell] (needinfo me)
:
Mentors:
: 637336 (view as bug list)
Depends on: 660774
Blocks: 1060788
  Show dependency treegraph
 
Reported: 2011-02-28 08:31 PST by Josh Matthews [:jdm]
Modified: 2014-08-30 06:09 PDT (History)
9 users (show)
mounir: in‑testsuite?
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch v1 (3.88 KB, patch)
2011-03-24 13:53 PDT, Nicholas Hurley [:nwgh][:hurley]
jduell.mcbugs: review-
Details | Diff | Splinter Review
patch v2 (8.27 KB, patch)
2011-03-25 14:17 PDT, Nicholas Hurley [:nwgh][:hurley]
jduell.mcbugs: review+
Details | Diff | Splinter Review
Patch for asynchronous FailedAsyncOpen (4.93 KB, patch)
2011-03-28 13:11 PDT, Nicholas Hurley [:nwgh][:hurley]
jduell.mcbugs: review-
Details | Diff | Splinter Review
Patch for asynchronous FailedAsyncOpen (28.19 KB, patch)
2011-04-01 14:27 PDT, Nicholas Hurley [:nwgh][:hurley]
no flags Details | Diff | Splinter Review
Patch for asynchronous FailedAsyncOpen (32.80 KB, patch)
2011-04-28 09:11 PDT, Nicholas Hurley [:nwgh][:hurley]
no flags Details | Diff | Splinter Review
part1 v3 (8.06 KB, patch)
2011-05-04 23:37 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+
Details | Diff | Splinter Review
part 2 (asynchronous FailedAsyncOpen) v4 (31.56 KB, patch)
2011-05-04 23:41 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+
Details | Diff | Splinter Review
part1 v3.1 (corrected patch hg metadata) (8.04 KB, patch)
2011-05-04 23:53 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+
Details | Diff | Splinter Review
part 2 (asynchronous FailedAsyncOpen) v4.1 (just hg meta-data fix) (31.55 KB, patch)
2011-05-04 23:54 PDT, Jason Duell [:jduell] (needinfo me)
no flags Details | Diff | Splinter Review
part 2 (asynchronous FailedAsyncOpen) v4.2 (async fix) (32.56 KB, patch)
2011-05-06 01:26 PDT, Jason Duell [:jduell] (needinfo me)
no flags Details | Diff | Splinter Review
part 2 (asynchronous FailedAsyncOpen) v4.3 (removed dead code) (31.43 KB, patch)
2011-05-09 21:50 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+
Details | Diff | Splinter Review
part3: Avoid casting base/derived member function ptrs by using template class. (27.61 KB, patch)
2011-05-09 22:27 PDT, Jason Duell [:jduell] (needinfo me)
honzab.moz: review+
Details | Diff | Splinter Review
part3 - update: move mCallOnResume to the base class (5.37 KB, patch)
2011-05-13 10:46 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
One last fix... (1.79 KB, patch)
2011-05-13 17:27 PDT, Jason Duell [:jduell] (needinfo me)
no flags Details | Diff | Splinter Review
bitrot fix: still needs queueing logic repaired (35.69 KB, patch)
2011-05-24 03:59 PDT, Jason Duell [:jduell] (needinfo me)
jduell.mcbugs: review+
Details | Diff | Splinter Review

Description Josh Matthews [:jdm] 2011-02-28 08:31:36 PST
It looks to me like there's a possibility that while the redirect callback is pending, the parent could cancel the channel, which would destroy the IPC actor.  This would explain crashes like bp-eb3ccbf1-f475-40d2-9cf4-d644a2110131.
Comment 1 Honza Bambas (:mayhemer) 2011-02-28 15:58:05 PST
A good catch, along with bug 637336.  I believe there should be just a single patch for both of them, no need to stack patches separately.
Comment 2 Jason Duell [:jduell] (needinfo me) 2011-03-23 10:44:04 PDT
What, we're throwing the new guy at e10s redirect bugs?  What is this, a hazing ritual? :)

The stack trace jdm points to definitely indicates that the IPDL protocol of the existing (as opposed to the new, redirected-to) channel has been deleted.  So I suppose we're going to want to guard SendRedirect2Verify with a check for mIPCOpen.   But I'd like to understand how we can actually get to this point, because at the moment I don't.

>  the parent could cancel the channel, which would destroy the IPC actor.

Actually, AFAICT cancelling the original channel does nothing more than cause the nsHttpChannel to get cancelled, which means it will send OnStart/OnStopRequest with error codes, but won't close the IPDL channel.  In fact, it doesn't even cancel the redirect if it's already starting to happen (which I think may be a bug: I just filed bug 644185 for it).

We only delete the IPDL protocol from the child side, and we only do so in four places, none of which seem possible to me (which means I'm probably missing something):

1) OnCancel. (which we should rename CancelEarly):  This only gets called if AsyncOpen doesn't complete successfully, in which case we never get to redirects

2) We get to OnStopRequest on the child, and this is not a document channel.  We should never get to OnStop in the middle of a redirect.

3) DeleteSelf():   this only gets sent to the old channel (i.e the one we're seeing closed here) by the parent if we've successfully completed a redirect, so it shouldn't happen before the child has called SendRedirect2Verify.

4) The HttpChannelChild's refcount hits 1 and mKeptAlive is true.  This only happens if we've already hit OnStopRequest, which shouldn't be the case during a redirect.

So I really don't understand what's happening here.  Perhaps an IPDL bug?

I suppose we should probably add the check for mIPCOpen to stop the crashing.  But it makes me queasy to not understand what's going on here.
Comment 3 Josh Matthews [:jdm] 2011-03-23 11:51:19 PDT
You're right, I misread OnCancel as being the general cancellation instead of the specific pre-AsyncOpen succeeding cancel.  My mistake.  It definitely could use a rename.
Comment 4 Josh Matthews [:jdm] 2011-03-23 11:57:57 PDT
I admit that I'm having difficulty following all of the redirect code.  Assuming I'm not crazy and it's possible for a redirect to be initiated and then have the old channel canceled, would it be possible for the OnStopRequest to run before the redirect callback?  Would it also be possible for the cancelled channel to be a non-document one?  That seems like it would set the stage for the observed crash, but I don't have a sense of whether all these things can actually occur together.
Comment 5 Honza Bambas (:mayhemer) 2011-03-23 12:01:00 PDT
If some redirect sink on the parent process cancels the redirect sooner that any sink returns some result on the child process, then there is a chance we get OnStart/OnStop also before child sinks answers.

The channel can be an image/js/csss load.  Part of the load group but not a document load.

I also agree on renaming OnCancel to CancelEarly.
Comment 6 Honza Bambas (:mayhemer) 2011-03-23 12:02:11 PDT
(In reply to comment #5)
> I also agree on renaming OnCancel to CancelEarly.

Or even OnFailedAsyncOpen ?
Comment 7 Nicholas Hurley [:nwgh][:hurley] 2011-03-24 13:53:20 PDT
Created attachment 521625 [details] [diff] [review]
patch v1

Fixes this AND bug 637336 (plus does the rename of OnCancel->CancelEarly)
Comment 8 Jason Duell [:jduell] (needinfo me) 2011-03-25 07:05:36 PDT
Comment on attachment 521625 [details] [diff] [review]
patch v1

Let's go with Honza's suggestion and call it "FailedAsyncOpen": it's the most descriptive (no need to call in "OnFailedAsyncOpen': the "On" is only there for OnStartRequest, etc., because that's the name of the IDL function).  This will mean changing "CancelEarly" to "FailedAsyncOpen" in PHttpChannel.ipdl and also in the code that sends the msg in HttpChannelParent.  Also change the name of CancelEvent to FailedAsyncOpenEvent (make sure to also change the friend class declaration in the .h file).

I also just noticed that the "if (mCanceled) return" check in OnCancel is incorrect.  If the client calls AsyncOpen, and we send the msg to do AsyncOpen on the parent (and return NS_OK back to the client on the child), but AsyncOpen fails on the parent and we thus send CancelEarly back to the child, we need to make sure that we get to the OnStart/OnStop calls in OnCancel, even if the client has called cancel after AsyncOpen returned (otherwise we break the channel's contract that OnStart/Stop are always called if AsyncOpen returned NS_OK).   So remove the "if (mCanceled)" check, and also get rid of the hack in HttpChannelChild::AsyncOpen where we set mCanceled to false before calling OnCancel.  

Otherwise looks good!  One more rev and I think we're done.
Comment 9 Honza Bambas (:mayhemer) 2011-03-25 08:03:13 PDT
Agree on the bad OnCancel condition.  if (mCanceled) return; can just be removed from that method.

One more update (doesn't necessarily have to part of this bug as we live with it a long time, but wouldn't be bad to have):

call to OnCancel/FailedAsyncOpenEvent from HttpChannelChild::AsyncOpen should be made asynchronously, the way nsHttpChannel::AsyncAbort is used.
Comment 10 Nicholas Hurley [:nwgh][:hurley] 2011-03-25 14:17:24 PDT
Created attachment 521948 [details] [diff] [review]
patch v2

Just like the old one, but renames OnCancel/CancelEarly (and associated methods & classes) to FailedAsyncOpen, etc. If we want, I can work the asynchronicity of it into another rev of this patch, or we can get this in and I'll file another bug for the bit in Honza's last comment.
Comment 11 Jason Duell [:jduell] (needinfo me) 2011-03-26 06:02:20 PDT
Comment on attachment 521948 [details] [diff] [review]
patch v2

Looks good.

Go ahead and fix the async FailedAsyncOpenEvent issue in another patch (i.e. on top of this patch in your mq patch queue, assuming you're using mq) and attach it to this bug.  That'll make it easier to review, and then we can get approval for both patches in one fell swoop.

By the way, the reason we want to have the FailedAsyncOpenEvent be called async is that otherwise the client gets OnStart and OnStop called while they're still in AsyncOpen, which is not what their code may expect.
Comment 12 Nicholas Hurley [:nwgh][:hurley] 2011-03-28 13:11:45 PDT
Created attachment 522443 [details] [diff] [review]
Patch for asynchronous FailedAsyncOpen

Asynchronously calls FailedAsyncOpen from HttpChannelChild::AsyncOpen as Honza requested.
Comment 13 Jason Duell [:jduell] (needinfo me) 2011-03-29 06:02:37 PDT
Comment on attachment 522443 [details] [diff] [review]
Patch for asynchronous FailedAsyncOpen

Honza/bz: I could use your advice on #4 and #5 below.

>+nsresult
>+HttpChannelChild::AsyncCall(nsAsyncCallback funcPtr,
>+                         nsRunnableMethod<HttpChannelChild> **retval)
>+{
>+    nsresult rv;
>+
>+    nsRefPtr<nsRunnableMethod<HttpChannelChild> > event =
>+        NS_NewRunnableMethod(this, funcPtr);
>+    rv = NS_DispatchToCurrentThread(event);
>+    if (NS_SUCCEEDED(rv) && retval) {
>+        *retval = event;
>+    }
>+
>+    return rv;
>+}

Aha!  Why this looks like cut-and-paste from nsHttpChannel::AsyncCall.  We should move it into the base class, HttpBaseChannel.  And actually, as I look over the nsHttpChannel::AsyncAbort implemention, I see it's handling some things that this patch does not, such as making sure the OnStart/OnStop aren't delivered while the channel is suspended.  Which is important! :)

Time to move this entire call chain into HttpBaseChannel, I think.   

1) move the following functions from nsHttpChannel into HttpBaseChannel:

    AsyncAbort
    AsyncCall
    HandleAsyncNotifyListener()
    DoNotifyListener();

There's a minor bug in AsyncAbort right now:  it dispatches HandleAsyncNotifyListener, then "finally" removes the channel from the load group.  Except that the load group will actually probably get removed before the async callback runs.  Let's change the name of HandleAsyncNotifyListener to "HandleAsyncAbort" and move the code that removes the channel from the loadgroup from AsyncAbort to it (call it after DoNotifyListener is called);

2) We'll also need to add a new function (lets call it DoNotifyListenerCleanup) to handle slight differences between nsHttpChannel/HttpChannelChild at the end of the logic.  Make it virtual in the base class.  nsHttpChannel will use it to call CleanRedirectCacheChainIfNecessary() (replace the call to it that's in DoNotifyListener now with a call to DoNotifyListenerCleanup).  HttpChannelChild's implementation will be 

  if (mIPCOpen)
    PHttpChannelChild::Send__delete__(this);

3) Change HttpChannelChild::FailedAsyncOpen to take an "bool async" parameter.  When called from IPDL, pass 'false'.  When called from AsyncOpen, pass 'true'.  It can then look like

    { 
      if (async) { 
        AsyncAbort(status); 
      } else { 
        mStatus = status; 
        mIsPending = PR_FALSE;
        // We're already being called from an incoming IPDL msg, and are thus
        // already "async" from client's perspective.
        HandleAsyncAbort(); 
      }   
    }
   
4) Copy the logic in nsHttpChannel::Resume that checks mPendingAsyncCallOnResume and calls AsyncCall to HttpChannelChild::Resume.  The only thing funky here is the existing NS_ENSURE_SUCCESS call that we do after AsyncCall.  I'm not really sure how useful that logic is (is it really useful to return an error from Resume if we can't restart mPendingAsyncCallOnResume?  Perhaps just so the client can abort somehow?  Unless they write their logic to no longer rely on onStart/Stop getting called if Resume fails, they're still screwed, and the IDL contract for nsIChannel says nothing about failure for resume.  Honza?  bz?  thoughts? I suspect we'd be better off calling mPendingAsyncCallOnResume synchronously if async launching fails.  Or perhaps we can just asssume that this never actually happens, and/or things are Very Bad anyway if it does?).   Anyway, for now you can just save rv and return it at the end of the function:
    
  if (!mSuspendCount) {

    if (mPendingAsyncCallOnResume) {
        nsresult rv = AsyncCall(mPendingAsyncCallOnResume);
        mPendingAsyncCallOnResume = nsnull;
    }

    // If we were suspended outside of an event handler (bug 595972) we'll
    // consider ourselves unqueued. This is a legal state of affairs but
    // FlushEventQueue() can't easily ensure this fact, so we'll do some
    // fudging to set the invariants correctly.    
    if (mQueuePhase == PHASE_UNQUEUED)
      mQueuePhase = PHASE_FINISHED_QUEUEING;
    FlushEventQueue();
    }
    return rv;
  }

5) add "if (mChannel)" checks to HttpChannelParent::RecvResume, RecvSuspend, and RecvSetPriority (Honza: do we need it for RecvUpdateAssociatedContentSecurity or RecvMarkOfflineCacheEntryAsForeign?  I assume not) to avoid segfaulting if the client calls any of those functions on a "channel" that never got created on the parent (ie RecvAsyncOpen called SendAsyncOpenFailed).  In all cases do nothing and just return true if !mChannel.

6) Understand all this as best you can, and make sure I haven't missed any subtle differences between the nsHttpChannel and HttpChannelChild cases.  The only things I see are that we currently set mCanceled = 1 in AsyncOpenFailed, while we wouldn't in the new logic, which is fine (we're not actually really canceling the channel per se).  Also the current FailedAsyncOpen logic removes the channel from the loadgroup before calling onStart/stop, which is a minor bug, so moving to the nsHttpChannel logic is fine there too.
Comment 14 Nicholas Hurley [:nwgh][:hurley] 2011-04-01 14:27:45 PDT
Created attachment 523681 [details] [diff] [review]
Patch for asynchronous FailedAsyncOpen

A less cut-and-pasty version of the original, with the other fixes, too.
Comment 15 Honza Bambas (:mayhemer) 2011-04-25 06:02:47 PDT
(In reply to comment #13)
> Comment on attachment 522443 [details] [diff] [review]
> Patch for asynchronous FailedAsyncOpen
> 
> Honza/bz: I could use your advice on #4 and #5 below.

> 4) I'm not really sure how useful that logic is

AFAICT, dispatch might fail only when the thread we are called on doesn't have an event queue or have already shut the queue down.  It might happen during shutdown, so the channel will be closed anyway somehow (I believe).

> mPendingAsyncCallOnResume synchronously if async launching fails. 

Not a bad idea, but not for this bug.  Let's just add an assertion or NS_ERROR that this has failed to watch for incidents.  If occurs, fix it.

> Anyway, for now you can just save rv and return it
> at the end of the function:

Agree, the behavior should be preserved.

> 5) add "if (mChannel)" checks to HttpChannelParent::RecvResume, RecvSuspend,
> and RecvSetPriority

Agree.

> (Honza: do we need it for
> RecvUpdateAssociatedContentSecurity or RecvMarkOfflineCacheEntryAsForeign?  I
> assume not) 

We don't.
Comment 16 Honza Bambas (:mayhemer) 2011-04-25 06:31:31 PDT
Just an idea: couldn't HttpBaseChannel::AsyncCall be template method with the channel type as a template argument?  Something like template<class T> nsresult HttpBaseChannel::AsyncCall(T channel, nsAsyncCallback funcPtr, ..); ?  I'm not a big fun of the new do_ functions.

Also I'm not sure where did all the code from nsHttpChannel::AsyncAbort go.  But maybe just the new code is unreadable for me..
Comment 17 Nicholas Hurley [:nwgh][:hurley] 2011-04-27 10:48:45 PDT
(In reply to comment #16)
> Just an idea: couldn't HttpBaseChannel::AsyncCall be template method with the
> channel type as a template argument?  Something like template<class T> nsresult
> HttpBaseChannel::AsyncCall(T channel, nsAsyncCallback funcPtr, ..); ?  I'm not
> a big fun of the new do_ functions.

It looks like this should work. I've done a cursory test on mac and linux (building on win7 right now), and the somewhat odd casting I've had to do seems ok so far (no access to try, or I'd push there to give it a shot). Once I've run it on win7, I'll put up a new version of the patch like this. The code is generally cleaner (I wasn't a fan of the do_ functions, either, tbqh), so that's a win.
 
> Also I'm not sure where did all the code from nsHttpChannel::AsyncAbort go. 
> But maybe just the new code is unreadable for me..

This was moved around in accordance with item 1 from comment 13 from Jason
Comment 18 Nicholas Hurley [:nwgh][:hurley] 2011-04-28 09:11:53 PDT
Created attachment 528865 [details] [diff] [review]
Patch for asynchronous FailedAsyncOpen

Here's the patch updated to make AsyncCall a templated function. I've done cursory tests on win/linux/mac, but if someone would like to push this to try (strange c++ still scares me a bit), I wouldn't mind.
Comment 19 Jason Duell [:jduell] (needinfo me) 2011-05-04 23:37:25 PDT
Created attachment 530244 [details] [diff] [review]
part1 v3

Fix bitrot
Comment 20 Jason Duell [:jduell] (needinfo me) 2011-05-04 23:41:59 PDT
Created attachment 530245 [details] [diff] [review]
part 2 (asynchronous FailedAsyncOpen) v4

+r of Nick's patch, plus some fixes I've made:

I removed the 'PRBool sync' parameter from FailedAsyncOpen (by the way, just use 'bool' now, unless the parameter is part of an XPCOM interface), and just have Child::AsyncOpen call 'AsyncAbort' directly.  Simpler.

I moved the body of 'AsyncCall' outside the class declaration (to avoid clutter, we generally only want one- or two-line functions to be inlined directly in the class decl).  I also made AsyncCall and HandleAsyncAbort protected, not public.

The following functions didn't need to be made 'public': I've kept them private:  HandleAsyncRedirect, HandleAsyncNotModified, HandleAsyncFallback, HandleAsyncReplaceWithProxy, HandleAsyncRedirectChannelToHttps 

no newline for 'else' in BaseChannel.cpp DoNotifyListener

I changed HttpBaseChannel::DoNotifyListenerCleanup to be a pure virtual function, rather than have a do-nothing implementation in HttpBaseChannel.

Some line breaks were needed: generally we avoid 80+ lines, even in files that already have some long lines (except function decls in .h files with long lines: then I'm fine adding more such long line function decls).  

As part of making the lines shorter, I changed AsyncCall so that it doesn't take a generic nsAsyncCallback type, but instead just takes a member function to the template class T.   This avoids the need to cast each & every function passed to AsyncCall into an nsAsyncCallback and back, which avoids at least some of the reinterpret_cast<> verbiage.  God C++ is a mess :)

Changed name of mPendingAsyncCallOnResume to mCallOnResume, and nsAsyncCallback to onResumeFunc.

pushing to try...
Comment 21 Jason Duell [:jduell] (needinfo me) 2011-05-04 23:45:46 PDT
*** Bug 637336 has been marked as a duplicate of this bug. ***
Comment 22 Jason Duell [:jduell] (needinfo me) 2011-05-04 23:47:39 PDT
This now covers a lot of stuff besides the original description.
Comment 23 Jason Duell [:jduell] (needinfo me) 2011-05-04 23:53:08 PDT
Created attachment 530246 [details] [diff] [review]
part1 v3.1 (corrected patch hg metadata)
Comment 24 Jason Duell [:jduell] (needinfo me) 2011-05-04 23:54:03 PDT
Created attachment 530247 [details] [diff] [review]
part 2 (asynchronous FailedAsyncOpen) v4.1 (just hg meta-data fix)
Comment 25 Honza Bambas (:mayhemer) 2011-05-05 09:30:11 PDT
(In reply to comment #20)
> As part of making the lines shorter, I changed AsyncCall so that it doesn't
> take a generic nsAsyncCallback type, but instead just takes a member function
> to the template class T.   This avoids the need to cast each & every function
> passed to AsyncCall into an nsAsyncCallback and back, which avoids at least
> some of the reinterpret_cast<> verbiage.  God C++ is a mess :)

Excellent!  I new it was possible somehow.  However, are we sure |this| pointer is passeded correctly to the target method?  Because you call HttpBaseChannel::*mCallOnResume that is assigned a pointer to an nsHttpChannel::method or HttpChannelChild::method, method of a derived class.  It is not quit clear what happens on what compiler.  As I read some articles, it seems that the derived class must first derive from the base class (the method pointer is created for), in this case HttpBaseChannel.  See [1] - "Member Function Pointers - why are they so complex?".


However, for me the two patches are not compilable:
d:/mozilla/mozilla-central/netwerk/protocol/http/nsHttpHandler.cpp(1467) : error C2259: 'mozilla::net::HttpChannelChild' : cannot instantiate abstract class due to following members:
        'bool mozilla::net::PHttpChannelChild::RecvCancelEarly(const nsresult &)' : is abstract
D:\mozilla\mozilla-central\_obj-browser-debug\ipc\ipdl\_ipdlheaders\mozilla/net/PHttpChannelChild.h(75) : see declaration of 'mozilla::net::PHttpChannelChild::RecvCancelEarly'


[1] http://www.codeproject.com/KB/cpp/FastDelegate.aspx
Comment 26 Honza Bambas (:mayhemer) 2011-05-05 10:49:08 PDT
What we exactly do:
- have mCallOnResume member of type pointer to a method of HttpBaseChannel
- in nsHttpChannel we assign it a method of nsHttpChannel through reinterpret_cast ; on windows, this cannot be compiled, it needs static_cast and warns with "warning C4407: cast between different pointer to member representations, compiler may generate incorrect code", BTW
- in nsHttpChannel::Resume we "upcast" mCallOnResume to pointer to a method of nsHttpChannel when we passing it to AsyncCall template method as an argument ; this is legal, no warning
- then nsRunnableMethod calls (mObject->*mMethod)() where mObject is of nsHttpChannel* type (correctly assigned) and mMethod is of (nsHttpChannel::*)() type ; in my test program, |this| pointer is broken for such scenario thanks the static_cast
Comment 27 Honza Bambas (:mayhemer) 2011-05-05 11:54:40 PDT
(In reply to comment #25)
> However, for me the two patches are not compilable:
Sorry for spam: ipdl change.
Comment 28 Jason Duell [:jduell] (needinfo me) 2011-05-06 01:26:14 PDT
Created attachment 530569 [details] [diff] [review]
part 2 (asynchronous FailedAsyncOpen) v4.2 (async fix)

Another fix: we should actually use AsyncCall when we flush the event queue in HttpChannelChild::Resume.  Otherwise we could win up calling OnDataAvailable (or whatever else is triggered by the msgs in the queue) before the client's Resume calls returns.  Fixed.

Regarding the safety of our C++ member function casting:  

> in nsHttpChannel we assign it a method of nsHttpChannel through
> reinterpret_cast ; on windows, this cannot be compiled,  it needs static_cast
> and warns..

Honza, are you still seeing this?  We seem to pass the Windows try server just fine.

The article you linked to is very interesting.  And worrisome.  It sounds like what we're doing is "undefined" in C++:

"According to the Standard (section 5.2.10/9), you can use reinterpret_cast to store a member function for one class inside a member function pointer for an unrelated class. The result of invoking the casted member function is undefined.  The only thing you can do with it is cast it back to the class that it came from. I'll discuss this at length later in the article, because it's an area where the Standard bears little resemblance to real compilers."

This patch is currently casting nsHttpChannel functions into HttpBaseChannel ones, and then casting mCallOnResume "back" to an nsHttpChannel function in Resume.  When mCallOnResume is indeed an nsHttpChannel function, that's fine.  But in the case where mCallOnResume is &HttpBaseChannel::HandleAsyncAbort, we're in undefined territory (I'm unclear on whether all our compilers of interest will handle it correctly or not), since we're taking a base class member function, casting it to derived, then using it.  (The HttpChannelChild code simply calls mCallOnResume without casting it, i.e. as a HttpBaseChannel function: this works for now because its only use case is &HttpBaseChannel::HandleAsyncAbort.)

Elsewhere the article mentions that to be safe in practice across various compilers, "you can safely cast a member function pointer from a derived class to its first base class only!".  Right now HttpBaseClass is the first class in nsHttpChannel's inheritance list, but second for HttpChannelChild.  But that can be easily fixed. 

I'm not sure how to verify that we're "safe" here or not.  Perhaps we need to call in one of our Mozilla C++ gurus (I'm going to email a few of them).   The "delegates" code in the article could probably also work, but it sounds like a lot of complex and hacky code (18k worth).   

Perhaps we should just switch from using member function pointers to using a enum state variable and a function that does a switch() on it to call the correct function.  That's a "gross" C way of doing it, but it's actually quite simple.  To be precise:  We could have 'int mCallOnResume', plus a couple of #defined states in HttpBaseClass.h  (ONRESUME_NOTHING, ONRESUME_HANDLEASYNCABORT), and perhaps a virtual function "bool CallOnResume()" that does a switch(mCallOnResume) and returns true if the value is one of those two #defines (and calls HandleAsyncAbort if its ONRESUME_HANDLEASYNCABORT).   Then the derived classes could have their own #defines for the member functions they use, and an implementaton of CallOnResume() that 1) first calls HttpBaseClass::CallOnResume(), and 2) if it returns false, does a switch statement to call one of its own functions.
Comment 29 Jason Duell [:jduell] (needinfo me) 2011-05-06 01:31:56 PDT
cc-ing some C++ gurus for advice on this member function ptr issue.  Reading starting from comment 25 (or maybe just 28) should be enough to get the gist of the issue.  Advice appreciated!
Comment 30 Jason Duell [:jduell] (needinfo me) 2011-05-06 03:06:36 PDT
Try is green for latest patches, including windows.  (But I doubt we test the HandleAsyncAbort code path).
Comment 31 Benjamin Smedberg [:bsmedberg] 2011-05-06 06:40:20 PDT
Casting pointer-to-member is almost always a bad idea. If it can be done with enums and virtual functions, that is much preferable.
Comment 32 Zack Weinberg (:zwol) 2011-05-06 07:39:24 PDT
Do I understand correctly that the situation is: an HttpBaseChannel method wishes to call one of several methods depending on circumstances; all of those methods have the same signature; however, not all of them are implemented in the same class: some are HttpBaseChannel, some are only visible on one of its derived classes (HttpChannelChild, nsHttpChannel) ?

If so, you could avoid all the casting of pointer-to-member-functions by providing stub definitions in HttpBaseChannel of all the methods that might possibly need to be called this way.  If all the stubs have exactly the same signature -- including virtualness -- I think you can even just make the ones that should be on the subclasses be abstract virtuals, and it'll Just Work.  (Unless HttpBaseChannel ever needs instantiating by itself.)
Comment 33 Jason Duell [:jduell] (needinfo me) 2011-05-06 09:14:45 PDT
Zack,

Close:  it's the derived classes that are calling the member function ptr (which is a base class member function ptr, and needs to be, since in one case it's the base class that assigns it one of its functions).   I suppose maybe if we had the base class implement all the derived classes' function we could just pass "&Base::Foo" in all cases, and that might work (taking the addr of a pure virtual seems unadvisable :).   

But at this point the enum solution honestly seems cleaner than jamming all the derived class function info into the base class.
Comment 34 Honza Bambas (:mayhemer) 2011-05-06 09:37:32 PDT
I can build both patches, no warnings nor errors.

I tried to run test test_redirect-caching_passing.js, that invokes call of nsHttpChannel::HandleAsyncRedirect.  The members (in a debug build) seems to be perfectly correct.

I added a testing method to HttpChannelChild that is called like this:

  mCallOnResume = reinterpret_cast<onResumeFunc>(&HttpChannelChild::HandleAsyncVoid);
  AsyncCall<HttpChannelChild>(this, mCallOnResume);
  mCallOnResume = nsnull;

|this| at HttpChannelChild::HandleAsyncVoid is clearly shifted.

[ I wrote a little test program, trying to have a simple stub of what the Jason's patch does.  But I have very different results even during build (I don't understand why!), so let's don't take it into account as a valid source of information for this case.  Just note that I experience |this| shift there as well. ]

I turned HandleAsyncVoid to be pure virtual in HttpBaseChannel, defined in both nsHttpChannel and HttpChannelChild.  Using the code:

  mCallOnResume = static_cast<onResumeFunc>(&HttpChannelChild::HandleAsyncVoid);
  AsyncCall<HttpChannelChild>(this, mCallOnResume);
  mCallOnResume = nsnull;

...I get the correct |this| pointer.

This is only tested on windows.
Comment 35 Honza Bambas (:mayhemer) 2011-05-06 09:51:56 PDT
If we don't have any common methods (like HttpBaseChannel::HandleAsyncAbort) that we are trying to AsyncCall in both derived classes, we could have a template base class (template<class T> class HttpBaseAsyncCall) that implements the common code (<T>AsyncCall, <T>mCallOnResume) for both derived classes individually.  Then nsHttpChannel::HandleAsyncAbort can simply call HttpBaseChannel::HandleAsyncAbort, same for HttpChannelChild.

Otherwise, let's leave the code split as it was before rather then do anything even more complicated.
Comment 36 Jason Duell [:jduell] (needinfo me) 2011-05-09 21:50:42 PDT
Created attachment 531244 [details] [diff] [review]
part 2 (asynchronous FailedAsyncOpen) v4.3 (removed dead code)

This is just v2, minus some functions I added the the EventQueue API that we don't wind up using.
Comment 37 Jason Duell [:jduell] (needinfo me) 2011-05-09 22:27:44 PDT
Created attachment 531248 [details] [diff] [review]
part3: Avoid casting base/derived member function ptrs by using template class.

This implements Honza's idea, as near as I understand it.

Uses template class to avoid casting member function ptrs to/from base class, while still sharing code.   For bonus points I threw AsyncCall() into the class, which cleans up the call sites considerably.

The template implementation is a little odd: the class needs to be a base class for nsHttpChannel/HttpChannelChild (so it can get at 'this' in the resume function, which must take void).   But it also needs to touch private fields in the subclasses, so it's also a friend of each.   It's not likely to win any design awards.  But then, neither is C++ ;)

Note: we could have rolled all the template logic into HttpBaseClass, instead of having another base class.  But then we'd need to change every HttpBaseChannel function to "template <class T>" and move all code from the .cpp file into the .h file, and I didn't want to deal with that.

DON'T CHECK THIS IN AS A SEPARATE PATCH: we should qfold it into the part 2 patch. It reverts all the AsyncCall calls back to the original (i.e. untouched by this bug), and we might as well preserve the hg blame history.  I don't think that'll happen if we check in part2 (which modifies the call sites) then this (which reverts them back).  I'm keeping it a separate patch for now so it's easier to review.

Pushed to try...
Comment 38 Jason Duell [:jduell] (needinfo me) 2011-05-10 15:28:20 PDT
try is green.
Comment 39 Nicholas Hurley [:nwgh][:hurley] 2011-05-10 15:37:18 PDT
FWIW, the latest part 3 (and revisions to part 2) look good to me (as good as C++ can, anyway). (C++)-- for all this.
Comment 40 Honza Bambas (:mayhemer) 2011-05-13 10:45:05 PDT
Comment on attachment 531248 [details] [diff] [review]
part3: Avoid casting base/derived member function ptrs by using template class.

Review of attachment 531248 [details] [diff] [review]:
-----------------------------------------------------------------

Excelent!  You can move mCallOnResume to HttpAsyncAborter as well.  I have a working and compilable patch for it.

r=honzab with that.
Comment 41 Honza Bambas (:mayhemer) 2011-05-13 10:46:29 PDT
Created attachment 532275 [details] [diff] [review]
part3 - update: move mCallOnResume to the base class
Comment 42 Jason Duell [:jduell] (needinfo me) 2011-05-13 17:27:59 PDT
Created attachment 532393 [details] [diff] [review]
One last fix...

Nice catch Honza.

I found one more issue:  we need to keep the logic that sets PHASE_FINISHED_QUEUEING in Resume (not CompleteResume).  Otherwise there could be messages in the queue (yes, right now it's possible to have PHASE_UNQUEUED and have messages in the queue, if the channel was suspended outside of OnStartRequest/OnDataAvailable/etc.), and if the IPDL thread has an incoming msg posted to the main thread that gets processed before CompleteResume, it will see both PHASE_UNQUEUED and !suspended, which will cause it to get processed immediately before the msgs in the queue, resulting in out-of-order delivery (OnStopRequest before OnDataAvail, etc.).  

I'm 99% certain this is correct, so I don't think it needs review. (are those famous last words?).

I'm going to open a separate bug to hopefully do some cleanup on the EventQueue state logic, which I consider unintuitive right now.
Comment 43 Jason Duell [:jduell] (needinfo me) 2011-05-13 18:28:09 PDT
Was going to land this on cedar, but a bad cut-and-paste job into my hgrc made it land on m-c instead :)

http://hg.mozilla.org/mozilla-central/rev/93604789da99
Comment 44 Phil Ringnalda (:philor) 2011-05-13 21:50:28 PDT
Backed out in http://hg.mozilla.org/mozilla-central/rev/0ded19f796d8 - something caused timeouts in both flavors of Linux and Windows debug xpcshells,

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305341349.1305343032.12226.gz

TEST-INFO | /home/cltbld/talos-slave/test/build/xpcshell/tests/netwerk/test/unit_ipc/test_httpsuspend_wrap.js | running test ...

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1533.543692

and since this was conveniently at the tip, I took it out first.
Comment 45 Phil Ringnalda (:philor) 2011-05-14 00:07:54 PDT
'twas this causing them.
Comment 46 Jason Duell [:jduell] (needinfo me) 2011-05-24 03:59:45 PDT
Created attachment 534716 [details] [diff] [review]
bitrot fix: still needs queueing logic repaired

fixes bitrot.   The problem is with the ChannelEventQueue.  I'll look at it when I'm fully conscious again.
Comment 47 Jason Duell [:jduell] (needinfo me) 2011-05-31 03:00:58 PDT
Filed bug 660774 for the ChannelEventQueue work we need to get this working.
Comment 48 Jason Duell [:jduell] (needinfo me) 2011-06-11 18:47:49 PDT
At last--even looks like it'll stick this time:

http://hg.mozilla.org/integration/mozilla-inbound/rev/6558faa0b314
Comment 49 Mounir Lamouri (:mounir) 2011-06-12 08:29:48 PDT
Pushed:
http://hg.mozilla.org/mozilla-central/rev/6558faa0b314

Note You need to log in before you can comment on or make changes to this bug.