559200 - e10s HTTP: Buffer incoming IPDL necko messages

Reporter

Description

•

14 years ago

Dougt is seeing the following error when loading

    http://mobile.support.mozilla.com/en-US/kb/

Possibly something wrong with our IPDL serialization logic causing RecvOnStartRequest to return before calling the listener?

Jae-Seong, you wrote the serialization code, so I'm giving to you.  This is blocking browsing with fennec, though, so if you can't get to it very soon, please let me know and we'll find someone else.  

---

WARNING: Positioned frame that does not handle positioned kids; looking further
up the parent chain: file
/home/dougt/builds/e10s/electrolysis/layout/base/nsCSSFrameConstructor.cpp, line
5511

WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x804B000A: file
/home/dougt/builds/e10s/electrolysis/netwerk/base/src/nsIOService.cpp, line 583

++DOMWINDOW == 3 (0xb1e4dcd0) [serial = 3] [outer = 0xb1e4ce20]

###!!! ASSERTION: Error: OnStartRequest() must be called before
OnDataAvailable(): '(eOnStart == mParserContext->mStreamListenerState ||
eOnDataAvail == mParserContext->mStreamListenerState)', file
/home/dougt/builds/e10s/electrolysis/parser/htmlparser/src/nsParser.cpp, line
2905

###!!! [Child][AsyncChannel] Error: Value error: message was deserialized, but
contained an illegal value

###!!! ASSERTION: Error: OnStartRequest() must be called before
OnDataAvailable(): '(eOnStart == mParserContext->mStreamListenerState ||
eOnDataAvail == mParserContext->mStreamListenerState)', file
/home/dougt/builds/e10s/electrolysis/parser/htmlparser/src/nsParser.cpp, line
2905

###!!! [Child][AsyncChannel] Error: Value error: message was deserialized, but
contained an illegal value

###!!! ASSERTION: Error: OnStartRequest() must be called before
OnDataAvailable(): '(eOnStart == mParserContext->mStreamListenerState ||
eOnDataAvail == mParserContext->mStreamListenerState)', file
/home/dougt/builds/e10s/electrolysis/parser/htmlparser/src/nsParser.cpp, line
2905

###!!! [Child][AsyncChannel] Error: Value error: message was deserialized, but
contained an illegal value

###!!! ASSERTION: nsExpatDriver didn't get an nsIExpatSink: 'Error', file
/home/dougt/builds/e10s/electrolysis/parser/htmlparser/src/nsExpatDriver.cpp,
line 1206

###!!! ASSERTION: DoContent returned no listener?: 'abort ||
m_targetStreamListener', file
/home/dougt/builds/e10s/electrolysis/uriloader/base/nsURILoader.cpp, line 755

Jae-Seong Lee-Russo

Comment 1

•

14 years ago

I would like to say that it is not necessarily the serialization code that causes the "Error: Value error: " message.

https://wiki.mozilla.org/Content_Processes/FAQ suggests MOZ_IPC_MESSAGE_LOG=1.

Jae-Seong Lee-Russo

Comment 2

•

14 years ago

PHttpChannelParent sent Msg_OnStartRequest(), but the child did not receive it, HttpChannelChild::RecvOnStartRequest not executed, thus the assertion error - OnStartRequest() must be called before OnDataAvailable().

This happens on fennec, but not on firefox.

Jae-Seong Lee-Russo

Comment 3

•

14 years ago

(In reply to comment #2)
> but the child did not receive it,
> HttpChannelChild::RecvOnStartRequest not executed,

I was wrong.  HttpChannelChild::OnStartRequest is called, but I don't know what is happening.

----------------------------------------------------------------------------
I put printfs in RecvOnStartRequest, RecvOnDataAvailable & RecvOnStopRequest.

bool 
HttpChannelChild::RecvOnStartRequest(const nsHttpResponseHead& responseHead)
{
  LOG(("HttpChannelChild::RecvOnStartRequest [this=%x]\n", this));
+ fprintf(stderr, "\nHttpChannelChild::OnStartRequest b\n");
  mState = HCC_ONSTART;

  mResponseHead = new nsHttpResponseHead(responseHead);
+ fprintf(stderr, "\nStatus: %d\n", mResponseHead->Status());
+ fprintf(stderr, "\nVersion: %d\n", mResponseHead->Version());
  nsresult rv = mListener->OnStartRequest(this, mListenerContext);
+ fprintf(stderr, "\nHttpChannelChild::OnStartRequest 1\n");
  if (!NS_SUCCEEDED(rv)) {
    // TODO: Cancel request:
    //  - Send Cancel msg to parent 
    //  - drop any in flight OnDataAvail msgs we receive
    //  - make sure we do call OnStopRequest eventually
    //  - return true here, not false
    return false;  
  }
+ fprintf(stderr, "\nHttpChannelChild::OnStartRequest e\n");
  return true;
}

http://mxr.mozilla.org/projects-central/source/electrolysis/netwerk/protocol/http/src/HttpChannelChild.cpp#94

----------------------------------------------------------------------------
HttpChannelChild::OnStartRequest b

Status: 200

Version: 11
++DOMWINDOW == 3 (045B7860) [serial = 3] [outer = 03EB8628]

HttpChannelChild::OnDataAvailable b

###!!! ASSERTION: Error: OnStartRequest() must be called before OnDataAvailable(): '(eOnStart == mParserContext->mStreamListenerState || eOnDataAvail == mParserContext->mStreamListenerState)', file c:/mozilla-build/e10s-debug/parser/htmlparser/src/nsParser.cpp, line 2905
----------------------------------------------------------------------------

Jae-Seong Lee-Russo

Comment 4

•

14 years ago

Attached file stack trace until the first ASSERTION — Details

Jae-Seong Lee-Russo

Comment 5

•

14 years ago

Jason, I don't think it is the serialization code.

###!!! [Child][AsyncChannel] Error: Value error: message was deserialized, but
contained an illegal value

That message was generated because RecvOnDataAvailable returned false (NS_PREDONDITION failed).  I can confirm that using debugger.

Assignee: lusian → nobody

Jae-Seong Lee-Russo

Comment 6

•

14 years ago

Is there a way to force OnDataAvailable wait until OnStartRequest completes?

This may be wrong, but here is my idea:

 1 NS_InvokeByIndex_P
   ...
17 mozilla::net::HttpChannelChild::RecvOnStartRequest
18 mozilla::net::PHttpChannelChild::OnMessageReceived
19 mozilla::dom::PContentProcessChild::OnMessageReceived
20 mozilla::ipc::AsyncChannel::OnDispatchMessage

HttpChannelChild received RecvOnStartRequest (17) and made the following calls (16 to 1).  NS_InvokeByIndex_P calls functions (such as nsStandardURL::GetSpec), and periodically calls TabChild::SendSyncMessage, which subsequently calls

nsFrameMessageManager::SendSyncMessage -> PIFrameEmbeddingChild::CallsendSyncMessageToParent -> RPCChannel::Call -> AsyncChannel::OnDispatchMessage

and AsyncChannel::OnDispatchMessage calls HttpChannelChild::RecvOnDataAvailable although HttpChannelChild::RecvOnStartRequest has not completed yet.  The following code sees "OnStartRequest() must be called before
OnDataAvailable()" is false.

So, we need to do something to make sure OnStartRequest finishes before OnDataAvailable starts.

Jason Duell

Reporter

Comment 7

•

14 years ago

Comment on attachment 439077 [details]
stack trace until the first ASSERTION

cjones looked at this stack trace, and apparently the problem is the following:

RecvOnStartRequest is triggering a DOM event that's making an RPC call
to the parent (the confusingly named rpc method "sendSyncMessageToParent").

Since rpc msgs need to be re-entrant, this causes pending IPDL msgs in the queue (in this case, RecvOnDataAvailable) to get delivered, when it would otherwise not be dispatched until RecvOnStartRequest has finished.

I don't think this issue can be fixed in necko, or should be.  The rpc msg probably shouldn't be getting sent, and/or shouldn't be an rpc msg (I thought we were planning to use rpc only when needed--for plugins--precisely to avoid this sort of mess).

Jason Duell

Reporter

Comment 8

•

14 years ago

Renaming and (per hg blame), assigning to smaug.

This is blocking fennec using necko/e10s, so would be nice to fix ASAP.

Assignee: nobody → Olli.Pettay

Component: Networking: HTTP → DOM: Events

QA Contact: networking.http → events

Summary: HttpChannelChild calling OnDataAvail before OnStartRequest due to serialization error? → DOM "SendSyncMessageToParent" rpc call breaks necko/e10s

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 9

•

14 years ago

To unblock, I would suggest finding a possibly quick 'n hacky way of preventing the OnStart event from being forwarded from child->parent, I don't see how that would ever be useful.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 10

•

14 years ago

Fwiw, this frame in the backtrace

xul.dll!nsDocLoader::FireOnLocationChange

suggests that we're forwarding onlocationchange to chrome.  If that's right, comment 9 is incorrect.  I do however think that we shouldn't forward cross-process a notification that chrome can (can?|might?) listen to.

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

14 years ago

Component: DOM: Events → IPC

QA Contact: events → ipc

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 11

•

14 years ago

I don't know what is confusing with the name "sendSyncMessageToParent"
It is sending a synchronous messageManager message from child to parent.

And yes, it needs to be 'rpc' because otherwise CPOW handling doesn't
work. It used to be 'sync' , but after adding support for CPOW sending, it was
changed to rpc. But in any case it must be something
which can return some value to child process synchronously.

Could we somehow postpone sendOnDataAvailable? Could RecvOnStartRequest
send some message to chrome when it is "done", and only after that
chrome sends OnDataAvailable?

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 12

•

14 years ago

I don't really know anything in messageManager message handling which
could help here.

Assignee: Olli.Pettay → nobody

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 13

•

14 years ago

And note, if sendSyncMessageToParent can be called, then also createWindow
which also must be rpc. It just doesn't happen in this case, but writing
a testcase for that shouldn't be hard.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 14

•

14 years ago

A possible case for OnStartRequest to send createWindow to chrome is
using XMLHttpRequest and having readystatechange event listener to
calls window.open() in JS.

So to me it sounds like necko/e10s really must not send on OnDataAvailable
before it knows that OnStartRequest is fully handled.
Or the client side part of necko/e10s should cache/postpone/batch the handling of
OnDataAvailable messages.

Josh Matthews [:jdm]

Assignee

Comment 15

•

14 years ago

Attached patch Quick fix (obsolete) — Details — Splinter Review

This is untested, but I suspect it should fix the problem in question.  We just buffer up all data received until the child sends a notification that OnStartRequest has finished, then the parent can send all buffered notifications.  It gets a bit more complicated because of thread-safety (I suspect there's the potential for more OnDataAvailable notifications to be delivered while sending the buffered IPDL messages), but I think it's a workable short-term solution.

Attachment #439883 - Flags: review?(jduell.mcbugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 16

•

14 years ago

Are those geolocation changes relevant?

Josh Matthews [:jdm]

Assignee

Comment 17

•

14 years ago

I'm sorry, there were some mq problems.  I'll put up a correct one in a few minutes.

Josh Matthews [:jdm]

Assignee

Comment 18

•

14 years ago

Attached patch Quick fix (obsolete) — Details — Splinter Review

Now with 100% fewer mq screwups.

Attachment #439883 - Attachment is obsolete: true

Attachment #439884 - Flags: review?(jduell.mcbugs)

Attachment #439883 - Flags: review?(jduell.mcbugs)

Josh Matthews [:jdm]

Assignee

Comment 19

•

14 years ago

Bleah, it occurs to me that there's still a race present in my code:

>  else {
>    MutexAutoLock lock(mBufferLock);
>    shouldBuffer = !!mBufferedData.Length();
>  }
>  if (shouldBuffer) {
>    DataBuffer buffer = {
>      data, aOffset, bytesRead
>    };
>    MutexAutoLock lock(mBufferLock);
>    mBufferedData.AppendElement(buffer);
>    return NS_OK;
>  }

If the parent receives StartRequestCompleted in between shouldBuffer being set and the AppendElement call, we could end up buffering data that will never be sent, which will then cause all subsequent data to be buffered as well.  I'm not really sure of the best way to fix this without duplicating the appending to share the first lock.

Josh Matthews [:jdm]

Assignee

Comment 20

•

14 years ago

I guess the parent could just grab the lock unconditionally.

Josh Matthews [:jdm]

Assignee

Comment 21

•

14 years ago

Attached patch Quick fix v2 (obsolete) — Details — Splinter Review

Race fixed, explanatory comments.  Sorry for the bugspam.

Attachment #439884 - Attachment is obsolete: true

Attachment #439886 - Flags: review?(jduell.mcbugs)

Attachment #439884 - Flags: review?(jduell.mcbugs)

Jason Duell

Reporter

Comment 22

•

14 years ago

Comment on attachment 439886 [details] [diff] [review]
Quick fix v2

After chatting with bsmedberg, the preferred way to handle this is to cache the data on the child side, not the parent.  That way we can fire off OnDataAvail as soon as OnStart completes, rather than waiting for an IPC round trip.

See BrowserStreamChild::RecvWrite() for an example.

Thanks for picking this bug up jdm!

Attachment #439886 - Flags: review?(jduell.mcbugs) → review-

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 23

•

14 years ago

(In reply to comment #10)
> Fwiw, this frame in the backtrace
> 
> xul.dll!nsDocLoader::FireOnLocationChange
> 
> suggests that we're forwarding onlocationchange to chrome.  If that's right,
> comment 9 is incorrect.  I do however think that we shouldn't forward
> cross-process a notification that chrome can (can?|might?) listen to.

Is forwarding onlocationchange to chrome indeed what was causing this bug?  If so, why are we doing that?  Can't chrome determine for itself when the location has changed?

(In reply to comment #11)
> I don't know what is confusing with the name "sendSyncMessageToParent"
> It is sending a synchronous messageManager message from child to parent.
> 

'send': for a 'sync' message Foo() in a protocol spec, the IPDL compiler generate a C++ method called SendFoo(); for an 'rpc' message Bar(), it generates a method CallBar().  Also, prevailing style uses upper-case MessageName().

'Sync': IPDL uses the 'sync' keyword to mean "synchronous, non-reentrant"; 'rpc' is "synchronous, reentrant."

'ToParent': is implied

This is all documented here https://developer.mozilla.org/en/IPDL.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 24

•

14 years ago

(In reply to comment #23)
> Is forwarding onlocationchange to chrome indeed what was causing this bug?  If
> so, why are we doing that?  Can't chrome determine for itself when the location
> has changed?
> 

(FWIW I don't disagree with the proposed fix, I just think it would be propitious to smoke out unnecessary event forwarding.)

Josh Matthews [:jdm]

Assignee

Comment 25

•

14 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

Still untested.  If there's a testcase for this, it would be nice to prove to myself that this works.

Assignee: nobody → josh

Attachment #439886 - Attachment is obsolete: true

Attachment #440068 - Flags: review?(jduell.mcbugs)

Josh Matthews [:jdm]

Assignee

Comment 26

•

14 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

One day mq and I will get along without incident.

Attachment #440068 - Attachment is obsolete: true

Attachment #440069 - Flags: review?(jduell.mcbugs)

Attachment #440068 - Flags: review?(jduell.mcbugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 27

•

14 years ago

(In reply to comment #23) 
> 'ToParent': is implied
Because IPDL doesn't allow to use the same message name for
child->parent and parent->child, I couldn't figure out anything better naming for this.

(And yes, this is off topic :) )

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 28

•

14 years ago

(In reply to comment #27)
> (In reply to comment #23) 
> > 'ToParent': is implied
> Because IPDL doesn't allow to use the same message name for
> child->parent and parent->child, I couldn't figure out anything better naming
> for this.
> 

Nope, https://developer.mozilla.org/en/IPDL/Tutorial#Direction .

Jae-Seong Lee-Russo

Comment 29

•

14 years ago

Attached file stack trace after applying jdm's patch (obsolete) — Details

###!!! ASSERTION: nsExpatDriver didn't get an nsIExpatSink: 'Error', file c:/mozilla-build/e10s-debug/parser/htmlparser/src/nsExpatDriver.cpp, line 1206

Now, HttpChannelChild::RecvOnStopRequest runs before HttpChannelChild::RecvOnStartRequest finishes.

Josh Matthews [:jdm]

Assignee

Comment 30

•

14 years ago

I suppose I could just make it a queue of nsIRunnables and add buffering code to every message that could be received.  Dunno if there's a better way to address this.

Josh Matthews [:jdm]

Assignee

Comment 31

•

14 years ago

Attached patch Patch v2 (obsolete) — Details — Splinter Review

Now buffering all callbacks.

Attachment #440069 - Attachment is obsolete: true

Attachment #440174 - Flags: review?(jduell.mcbugs)

Attachment #440069 - Flags: review?(jduell.mcbugs)

Josh Matthews [:jdm]

Assignee

Comment 32

•

14 years ago

Comment on attachment 440174 [details] [diff] [review]
Patch v2

Bleah, patch is busted.  New version that doesn't segfault coming soon.

Attachment #440174 - Attachment is obsolete: true

Attachment #440174 - Flags: review?(jduell.mcbugs)

Josh Matthews [:jdm]

Assignee

Comment 33

•

14 years ago

Attached patch Patch v2 (obsolete) — Details — Splinter Review

It's new and improved!

Attachment #440183 - Flags: review?(jduell.mcbugs)

Jae-Seong Lee-Russo

Comment 34

•

14 years ago

Comment on attachment 440092 [details]
stack trace after applying jdm's patch

No more errors!

Attachment #440092 - Attachment is obsolete: true

Jason Duell

Reporter

Updated

•

14 years ago

Blocks: 558801

Josh Matthews [:jdm]

Assignee

Comment 35

•

14 years ago

Attached patch Patch v2.5 (obsolete) — Details — Splinter Review

The patches, they never end!  This just factors the buffering logic out into one place, and makes it generally nicer to add more buffered callbacks (such as the forthcoming redirect, I assume).

Attachment #440183 - Attachment is obsolete: true

Attachment #440474 - Flags: review?(jduell.mcbugs)

Attachment #440183 - Flags: review?(jduell.mcbugs)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 36

•

14 years ago

Smaug, can we now agree that the sendSyncMessageToParent() message should have IPDL/'sync' semantics, not 'rpc'?  This would allow jduell to dodge this bullet for a while longer.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 37

•

14 years ago

Attached patch Make the sendSyncMessage use 'sync' semantics (obsolete) — Details — Splinter Review

This kind of a band-aid, dodge-the-bullet kind of patch, but I don't repro the
original bug with it applied.  If anything tried to re-enter sendSyncMessage (a
CPOW, say), the content process would blow up.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 38

•

14 years ago

(In reply to comment #36)
> Smaug, can we now agree that the sendSyncMessageToParent() message should have
> IPDL/'sync' semantics, not 'rpc'? 
As I said no. If we want to keep support for CPOW sending, then sendSyncMessageToParent needs to be rpc. But if we want to remove support for
CPOWs there, then we can make it back to sync which it was before Bug 554941.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 39

•

14 years ago

And actually even removing CPOW support from sync message listeners doesn't
help at all, since in chrome message listener could still access
the CPOW for the content process DOMWindow.

Josh Matthews [:jdm]

Assignee

Updated

•

14 years ago

Depends on: 564382

Josh Matthews [:jdm]

Assignee

Updated

•

14 years ago

Depends on: 567058

Josh Matthews [:jdm]

Assignee

Updated

•

14 years ago

Attachment #440474 - Flags: review?(jduell.mcbugs)

Josh Matthews [:jdm]

Assignee

Comment 41

•

14 years ago

The blocking bugs are tracking the steps being taken to fix this problem; jduell suggested that we should add some state flags to the http channel child that let us check for re-entrancy so we can avoid problems like this in the future.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 42

•

14 years ago

There are two issues here --- first, once this bug's blockers are fixed, we can summarily remove 'rpc' from the PContent tree, and statically prevent re-entrancy at the IPDL level.

Second is re-entrancy caused by nested event loops, which our IPC code doesn't pay much attention to.  It's not clear to me what problems might arise there.

Boris Zbarsky [:bzbarsky]

Updated

•

14 years ago

Blocks: 581279

Alon Zakai (:azakai)

Comment 43

•

14 years ago

It looks like various event loops currently exist in the code, for example in window.alert()/notify/confirm, as noted in bug 573635. And there might be additional sequences of messages aside from HttpChannel::onStartRequest/onDataAvailable/etc. that are sensitive to this problem (for example, FTP channels). So perhaps it makes sense to think about a more general solution?

One possible direction: Separate messages into multiple logical streams (e.g., 'channels' in the sense of the networking library ENet if anyone's heard of that, http://enet.bespin.org/Features.html ). So we might have networking messages in one stream, GUI messages in another, etc., and the order would be guaranteed to be preserved within a stream. So when responding to a network event, if you run an event loop, it will let you process additional messages but *not* messages from the same stream, which it will buffer. So we would basically write the buffering code once in a general way as opposed to specifically for HTTP channels.

Jason Duell

Reporter

Comment 44

•

14 years ago

OK, it looks like sync XHR is going to require that we actually take this buffering approach.  If an XHR gets sent out during OnStartRequest, for instance, we'll spin the loop and get OnDataAvailable from IPDL while we're still in OnStart.  Buffering is the most obvious way to fix this. 

> One possible direction: Separate messages into multiple logical streams (e.g.,
'channels' in the sense of the networking library ENet if anyone's heard of
that, http://enet.bespin.org/Features.html ). So we might have networking
messages in one stream, GUI messages in another, etc.

Won't work--this is re-entrancy within necko itself.  Time to bite the bullet and buffer.  (hey, that's kinda catchy).

jdm is putting this high on his stack--we need a working solution for fennec.  Should just be a matter of fixing up his previous patch.

tracking-fennec: --- → ?

Summary: DOM "SendSyncMessageToParent" rpc call breaks necko/e10s → e10s HTTP: Buffer incoming IPDL necko messages

Josh Matthews [:jdm]

Assignee

Comment 46

•

14 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

Rebased patch that fixes the crash in bug 581279.

Attachment #440474 - Attachment is obsolete: true

Attachment #441125 - Attachment is obsolete: true

Attachment #461703 - Flags: review?(jduell.mcbugs)

Josh Matthews [:jdm]

Assignee

Comment 47

•

14 years ago

Attached patch Patch — Details — Splinter Review

Updated to not leak callbacks.

Attachment #461703 - Attachment is obsolete: true

Attachment #461993 - Flags: review?(jduell.mcbugs)

Attachment #461703 - Flags: review?(jduell.mcbugs)

Stuart Parmenter

Updated

•

14 years ago

tracking-fennec: ? → 2.0a1+

Doug Turner (:dougt)

Comment 48

•

14 years ago

Comment on attachment 461993 [details] [diff] [review]
Patch

double check lock.  neat.

Attachment #461993 - Flags: review?(jduell.mcbugs) → review+

Doug Turner (:dougt)

Comment 49

•

14 years ago

also -- this approach is probably fine for now.  I much rather see a solution that doesn't require every protocol to do this buffering.  Could we get a follow up bug to do this work, please?

Josh Matthews [:jdm]

Assignee

Comment 50

•

14 years ago

Filed bug 584441 for further investigations into a more widespread fix.

OS: Linux → Windows CE

Josh Matthews [:jdm]

Assignee

Comment 51

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/40aea46474a9

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 52

•

14 years ago

I happened to catch the use of the lock here while unrotting a patch.  It looks totally unnecessary to me (this is all main-thread only), and additionally,

+  MutexAutoLock lock(mBufferLock);
+  bool ret = true;
+  nsCOMPtr<nsIHttpChannel> kungFuDeathGrip(this);
+  for (PRUint32 i = 0; i < mBufferedCallbacks.Length(); i++) {
+    ret = mBufferedCallbacks[i]->Run();

is dispatching callbacks while holding the lock, which is going to deadlock if we need to buffer from a nested event loop.  Plz to be fixing?

Josh Matthews [:jdm]

Assignee

Comment 53

•

14 years ago

Attached patch Followup — Details — Splinter Review

You make excellent points.

Attachment #462968 - Flags: review?(jones.chris.g)

Doug Turner (:dougt)

Updated

•

14 years ago

Attachment #462968 - Flags: review?(jones.chris.g) → review+

Doug Turner (:dougt)

Comment 54

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/b22036de463a

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 55

•

14 years ago

Comment on attachment 462968 [details] [diff] [review]
Followup

There are some other things I think we should fix in the original
patch.  Let me list those first.

First, the invariant here is somewhat nonobvious, took me a bit to
figure out: "Don't dispatch necko events until after the OnStart
notification finishes, and then only after queued events have been
dispatched in the order in which they were received."  We need a
comment describing this.  It's also tricky that we might enqueue new
Callbacks while flushing the buffer; that is, mutate
mBufferedCallbacks while in the dispatch loop.  This is OK because of
the way the loop iteration is written and because we don't *remove*
items until done dispatching.

We also need assertions to check this invariant.  This code implements
a three-phase process: first, before the OnStart notification
finishes, we queue events (say, PHASE_ONSTART).  Second, after OnStart
finishes, we flush the queue, but still enqueue any events received
from a nested context so as to preserve in-order delivery (say,
PHASE_FLUSHING_BUFFER).  And third, no queuing (say, PHASE_UNBUFFERED).  I
would recommend a state variable for these instead of a bool
mShouldBuffer.  Then, in the functions where we deliver the
notifications, we can |NS_ABORT_IF_FALSE(mPhase != PHASE_ONSTART)|.
To decide when to buffer, we can check |mPhase != PHASE_UNBUFFER|.

I like the |Callback| and |BufferOrDispatch| implementation, it's nice
and clean.  However, it incurs a cost --- here we always alloc the
Callback, and always copy the delivered params.  Considering that the
param to OnData is a buffer of possibly very large size, I think we
should only pay this price when we need to, since it should be
relatively uncommon.  So I think this style

>+  DataAvailableEvent* event = new DataAvailableEvent(this, data, offset, count);
>+  return BufferOrDispatch(event);

ought instead to be

  if (ShouldBuffer()) {
    return AppendToBuffer(new DataAvailableEvent(this, data, offset, count));
  }
  return OnDataAvailable(data, offset, count);

This is more verbose, but saves what could potentially be a fair
amount of unnecessary memory traffic.

>diff --git a/netwerk/protocol/http/HttpChannelChild.cpp b/netwerk/protocol/http/HttpChannelChild.cpp
>--- a/netwerk/protocol/http/HttpChannelChild.cpp
>+++ b/netwerk/protocol/http/HttpChannelChild.cpp
>@@ -44,18 +44,16 @@
> #include "mozilla/net/NeckoChild.h"
> #include "mozilla/net/HttpChannelChild.h"
> 
> #include "nsStringStream.h"
> #include "nsHttpHandler.h"
> #include "nsMimeTypes.h"
> #include "nsNetUtil.h"
> 
>-using mozilla::MutexAutoLock;
>-
> class Callback
> {
>  public:
>   virtual bool Run() = 0;
> };
> 
> namespace mozilla {
> namespace net {
>@@ -63,17 +61,16 @@ namespace net {
> // C++ file contents
> HttpChannelChild::HttpChannelChild()
>   : mIsFromCache(PR_FALSE)
>   , mCacheEntryAvailable(PR_FALSE)
>   , mCacheExpirationTime(nsICache::NO_EXPIRATION_TIME)
>   , mState(HCC_NEW)
>   , mIPCOpen(false)
>   , mShouldBuffer(true)
>-  , mBufferLock("mozilla.net.HttpChannelChild.mBufferLock")
> {
>   LOG(("Creating HttpChannelChild @%x\n", this));
> }
> 
> HttpChannelChild::~HttpChannelChild()
> {
>   LOG(("Destroying HttpChannelChild @%x\n", this));
> }
>@@ -151,17 +148,16 @@ HttpChannelChild::RecvOnStartRequest(con
>     //  - make sure we do call OnStopRequest eventually
>     //  - return true here, not false
>     return false;  
>   }
> 
>   if (mResponseHead)
>     SetCookie(mResponseHead->PeekHeader(nsHttp::Set_Cookie));
> 
>-  MutexAutoLock lock(mBufferLock);
>   bool ret = true;
>   nsCOMPtr<nsIHttpChannel> kungFuDeathGrip(this);
>   for (PRUint32 i = 0; i < mBufferedCallbacks.Length(); i++) {
>     ret = mBufferedCallbacks[i]->Run();
>     if (!ret)
>       break;
>   }
>   mBufferedCallbacks.Clear();
>@@ -384,24 +380,18 @@ HttpChannelChild::OnStatus(const nsresul
> 
>   return true;
> }
> 
> bool
> HttpChannelChild::BufferOrDispatch(Callback* callback)
> {
>   if (mShouldBuffer) {
>-    MutexAutoLock lock(mBufferLock);
>-    // If we can't grab the lock immediately, that means we're currently
>-    // emptying the buffer. Therefore, the following condition should now
>-    // be false, and we can resume immediate message processing.
>-    if (mShouldBuffer) {
>       mBufferedCallbacks.AppendElement(callback);
>       return true;
>-    }
>   }
> 
>   bool result = callback->Run();
>   delete callback;
>   return result;
> }
> 

FWIW, this patch looks fine to me for fixing the unnecessary locking and potential deadlock ;).

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 56

•

14 years ago

Also, it's possible to do sync XHR in xpcshell, right?  Would be nice to have a test for this.

Jason Duell

Reporter

Comment 57

•

14 years ago

Please don't check in something this significant and subtle to necko w/o getting a review or signoff from a necko peer (me, bz, biesi, or darin if you're really lucky).

I'm confused here, and 95% sure that this code is at least a little bit broken.

1) Some OnStatus (and I believe OnProgress) messages arrive and should be dispatched *before* OnStartRequest.   This patch looks like it breaks that.  To know for sure (and to add test coverage), please modify test_progress.js so that it also handles OnStart/Data/Stop.  Have all of them (inc onStatus/Progress) print when they're called.   Run in unit, and then make a wrapper in unit_ipc and run.  See if the order of calls differ.  If it does, change the test so that it keeps a state variable to enforce that e10s calls all the notifications/callbacks in the same order as non-e10s (I'm guessing we'll want to guarantee that OnStatus and/or OnProgress have been called before OnStartReq.  If we want to get fancy, OnStatus and OnProgress should also be called again afterward, before each OnDataAvail.)

2) Why are we only buffering until OnStartRequest completes?  I'm guessing it's quite possible for a sync XHR request to happen within OnDataAvailable (if HTML parser hits JS that does one before the page has finished loading?  Not my department--anyone know?).  Or an OnProgress/Status, or even OnStop.  There's nothing in the API that forbids that. 

I was figuring we'd have a mechanism that buffered incoming msgs if *any* other message is still being dispatched (or if the channel's suspended).  I.e. just guarantee that all necko IPDL messages are dispatched serially.  

nit: rename "Callback" something more descriptive and less likely to cause namespace and/or mxr clash?  How about "ChannelChildEvent"?  

> Considering that the param to OnData is a buffer of possibly very large size,
> I think we should only pay this price when we need to, since it should be
> relatively uncommon.

+1

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Jason Duell

Reporter

Comment 58

•

14 years ago

str = "guarantee that all necko IPDL messages are dispatched serially";

str += "on a per-channel basis";  

We obviously aren't trying to queue msgs for other channels (like the sync XHR one).

Jason Duell

Reporter

Comment 59

•

14 years ago

BTW in case it's not clear, I don't think we need to back this out--dougt doesn't think anything in fennec would break from the onStatus/progress msgs coming later than OnStart.

(Does this mean we should file a new bug?)

Doug Turner (:dougt)

Comment 60

•

14 years ago

cjones - great suggestions.  I created a follow up bug to address both parts.  See bug 584604.

jduell - it is odd that status messages happen before start happens, but right not I do not think that anything depends on this behavior in fennec or dependent code.

I still would like to treat this solution as a work around.  I really hate to see http do so much work to deal with ipdl necko messages.  Other protocols are going to have to do the same thing which means making a new protocol in mozilla a real pain in the ass.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Comment 61

•

14 years ago

(In reply to comment #60)
> I still would like to treat this solution as a work around.  I really hate to
> see http do so much work to deal with ipdl necko messages.  Other protocols are
> going to have to do the same thing which means making a new protocol in mozilla
> a real pain in the ass.

The work done here is to hack around necko notifying observers that call out into other things and so on that end up spinning a nested XPCOM loop, and then the observer asserting when it's re-entered in the nested loop.  The same problem exists for single-process necko, does anyone know how it's avoided?  I don't.

The reason I suspect there's not a general IPDL-level solution is that some consumers need messages delivered in the nested context and some don't, and some might sometimes and not others.  Regardless, *not* delivering in a nested loop breaks in-order delivery, which is a fundamental guarantee.  I'm all for C++ helpers to make this problem easier, where it arises and a general helper makes sense, but let's not overgeneralize based on this instance.  We had a similar problem with plugin streams, and the solution we needed there was different.

Johnny Stenback (:jst)

Comment 62

•

14 years ago

(In reply to comment #56)
> Also, it's possible to do sync XHR in xpcshell, right?  Would be nice to have a
> test for this.

Seems so, once bug 564351 is fixed.

Josh Matthews [:jdm]

Assignee

Comment 63

•

14 years ago

Calling this fixed for now.  I'll reproduce all important comments here in bug 584604.

Josh Matthews [:jdm]

Assignee

Updated

•

14 years ago

Status: REOPENED → RESOLVED

Closed: 14 years ago → 14 years ago

Resolution: --- → FIXED

stack trace until the first ASSERTION 14 years ago Jae-Seong Lee-Russo 6.33 KB, text/plain		Details
Quick fix 14 years ago Josh Matthews [:jdm] 7.66 KB, patch		Details \| Diff \| Splinter Review
Quick fix 14 years ago Josh Matthews [:jdm] 6.26 KB, patch		Details \| Diff \| Splinter Review
Quick fix v2 14 years ago Josh Matthews [:jdm] 6.34 KB, patch	jduell.mcbugs : review-	Details \| Diff \| Splinter Review
Patch 14 years ago Josh Matthews [:jdm] 4.58 KB, patch		Details \| Diff \| Splinter Review
Patch 14 years ago Josh Matthews [:jdm] 4.58 KB, patch		Details \| Diff \| Splinter Review
stack trace after applying jdm's patch 14 years ago Jae-Seong Lee-Russo 6.52 KB, text/plain		Details
Patch v2 14 years ago Josh Matthews [:jdm] 6.81 KB, patch		Details \| Diff \| Splinter Review
Patch v2 14 years ago Josh Matthews [:jdm] 6.79 KB, patch		Details \| Diff \| Splinter Review
Patch v2.5 14 years ago Josh Matthews [:jdm] 7.16 KB, patch		Details \| Diff \| Splinter Review
Make the sendSyncMessage use 'sync' semantics 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 5.68 KB, patch		Details \| Diff \| Splinter Review
Patch 14 years ago Josh Matthews [:jdm] 9.87 KB, patch		Details \| Diff \| Splinter Review
Patch 14 years ago Josh Matthews [:jdm] 9.92 KB, patch	dougt : review+	Details \| Diff \| Splinter Review
Followup 14 years ago Josh Matthews [:jdm] 3.32 KB, patch	dougt : review+	Details \| Diff \| Splinter Review