e10s HTTP: nytimes.com causes use-after-delete of IPDL actor?

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
9 years ago
9 years ago

People

(Reporter: jduell.mcbugs, Assigned: jdm)

Tracking

unspecified
x86_64
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(fennec2.0+)

Details

Attachments

(1 attachment, 3 obsolete attachments)

cjones reports he saw

    - WARNING: NS_ENSURE_TRUE(mSuspendCount > 0) failed: file
      /netwerk/base/src/nsInputStreamPump.cpp, line 241
      WARNING: NS_ENSURE_TRUE(aOriginalURI) failed: file
      /netwerk/protocol/http/HttpBaseChannel.cpp, line 246
      ###!!! [Parent][AsyncChannel] Error: Route error: message sent to unknown
      actor ID

Haven't had time to verify
I saw it too.  I had to load nytimes.com, go back to the homepage and then load it again.
(gdb) p msg
$2 = (const TupleTraits<mozilla::ipc::AsyncChannel::Message*>::Message &) @0xbfe84410: {
  <Pickle> = {
    static kPayloadUnit = 64, 
    header_ = 0x86dda88, 
    header_size_ = 28, 
    capacity_ = 64, 
    variable_buffer_offset_ = 0
  }, 
  members of IPC::Message: 
  _vptr.Message = 0x313c5c8, 
  file_descriptor_set_ = {
    ptr_ = 0x0
  }, 
  name_ = 0x2fa0fc1 "???", 
  received_time_ = 0, 
  output_params_ = Traceback (most recent call last):
  File "/usr/lib/../share/gdb/python/libstdcxx/v6/printers.py", line 557, in to_string
    return self.val['_M_dataplus']['_M_p'].string (encoding, length = len)
LookupError: unknown encoding: UCS-4
, 
  log_data_ = 0x0, 
  dont_log_ = false
}
(gdb) p *msg.header()
$4 = {
  <Pickle::Header> = {
    payload_size = 4
  }, 
  members of IPC::Message::Header: 
  routing = -491, 
  type = 12289, 
  flags = 2, 
  num_fds = 0, 
  rpc_remote_stack_depth_guess = 4294967295, 
  rpc_local_stack_depth = 4294967295, 
  seqno = 0
}
According to the script at http://pastebin.mozilla.org/781741, this is a SetPriority call gone wrong.
The plot thickens. I tracked this down to an HttpChannelChild actor with a positive id, which should be impossible.  It's not overflow, as other channels are only hanging around -200.  The channel's not cached, it's loading a 1x1 gif and calling SetPriority(-1).  What's also strange is that a subsequent call of SetPriority(0) doesn't trigger the same error.
We've got a known race with SetPriority and redirects (bug 587171).  We don't remove the old channel from the loadgroup until the child receives Redirect3Complete(), and we call Send__delete in the parent right after we sent that.  That might be the window where the SetPriority arrives?   Perhaps we need to move the _delete to the child's RecvRedirect3Complete.

(Actually I'm puzzled as to why we delete the protocol here in the first place: the old channel gets canceled, and I would have thought we could just deleted it the usual way, after OnStopRequest.)

None of this explains the child actor with the positive ID, though, unless it's released memory that's been scribbled over.
tracking-fennec: --- → ?
Assignee: nobody → josh
I suspect it is the race mentioned in comment 6.  The NSPR log for the relevant 1x1 image that causes the error shows this:


-1231029392[9a02c08]: http response [
-1231029392[9a02c08]:   HTTP/1.1 302 Moved Temporarily
-1231029392[9a02c08]:   Content-Length: 0
-1231029392[9a02c08]:   Date: Tue, 14 Sep 2010 21:33:32 GMT
-1231029392[9a02c08]:   Location: http://s0.2mdn.net/viewad/666472/130-1x1.gif
-1231029392[9a02c08]:   Cache-Control: no-cache
-1231029392[9a02c08]:   Pragma: no-cache
-1231029392[9a02c08]:   Server: GFE/2.0
-1231029392[9a02c08]:   Content-Type: text/html
-1231029392[9a02c08]: ]
Ok, the positive ID is a red herring.  I mistook it to mean that actors each side share positive/negative versions of the same ID, but really it means that actors share the same ID, period.  If an actor is created in the parent, both will have a positive ID, while if it's created in the child they'll both have a negative one.
FWIW, the actor complaining here is the newly-constructed channel after the redirect, not the redirected one.  Still not sure what the underlying cause is, but I guess there's mysteriously no longer a corresponding parent channel actor.
Note to self - this is the stack that leads to the error-causing SetPriority call:


#0  RescheduleRequest (aRequest=<value optimized out>, delta=-1) at /home/t_mattjo/src/firefox/mobilebase/netwerk/base/src/nsLoadGroup.cpp:122
#1  0x00fa5e7c in nsLoadGroup::AddRequest (this=0x9aadd18, request=0x9e35e18, ctxt=0x0) at /home/t_mattjo/src/firefox/mobilebase/netwerk/base/src/nsLoadGroup.cpp:578
#2  0x01092fb9 in mozilla::net::HttpChannelChild::CompleteRedirectSetup (this=0x9e35dd0, listener=0x9aae228, aContext=0x0) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpChannelChild.cpp:732
#3  0x010930bc in mozilla::net::HttpChannelChild::Redirect3Complete (this=0x9aae968) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpChannelChild.cpp:708
#4  0x01093149 in mozilla::net::HttpChannelChild::RecvRedirect3Complete (this=0x9aae968) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpChannelChild.cpp:692
#5  0x0259cb23 in mozilla::net::PHttpChannelChild::OnMessageReceived (this=0x9aae968, __msg=...) at PHttpChannelChild.cpp:579
#6  0x02527c30 in mozilla::dom::PContentChild::OnMessageReceived (this=0x9577d60, __msg=...) at PContentChild.cpp:840
#7  0x02427635 in mozilla::ipc::AsyncChannel::OnDispatchMessage (this=0x9577d68, msg=...) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/AsyncChannel.cpp:262
#8  0x024300f8 in mozilla::ipc::RPCChannel::OnMaybeDequeueOne (this=0x9577d68) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/RPCChannel.cpp:438
#9  0x02431686 in DispatchToMethod<mozilla::ipc::RPCChannel, bool (mozilla::ipc::RPCChannel::*)()> (this=0x9578c50) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/tuple.h:383
#10 RunnableMethod<mozilla::ipc::RPCChannel, bool (mozilla::ipc::RPCChannel::*)(), Tuple0>::Run (this=0x9578c50) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/task.h:307
#11 0x024316d9 in Run (this=0xb6c0d208) at ../../dist/include/mozilla/ipc/RPCChannel.h:448
#12 mozilla::ipc::RPCChannel::DequeueTask::Run (this=0xb6c0d208) at ../../dist/include/mozilla/ipc/RPCChannel.h:473
#13 0x0267d76c in MessageLoop::RunTask (this=0xbfea2c20, task=0xb6c0d208) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:343
#14 0x0267de58 in MessageLoop::DeferOrRunPendingTask (this=0xbfea2c20, pending_task=...) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:351
#15 0x0267ef07 in MessageLoop::DoWork (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:451
#16 0x0242ba5c in mozilla::ipc::MessagePump::Run (this=0x9576ee0, aDelegate=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/MessagePump.cpp:114
#17 0x0242bd6d in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x9576ee0, aDelegate=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/MessagePump.cpp:229
#18 0x0267dc85 in MessageLoop::RunInternal (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:219
#19 0x0267dca1 in MessageLoop::RunHandler (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:202
#20 0x0267dd0c in MessageLoop::Run (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:176
#21 0x022bf2d7 in nsBaseAppShell::Run (this=0x97ad088) at /home/t_mattjo/src/firefox/mobilebase/widget/src/xpwidgets/nsBaseAppShell.cpp:180
#22 0x00f79c9b in XRE_RunAppShell () at /home/t_mattjo/src/firefox/mobilebase/toolkit/xre/nsEmbedFunctions.cpp:629
#23 0x0242bcc6 in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x9576ee0, aDelegate=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/MessagePump.cpp:215
#24 0x0267dc85 in MessageLoop::RunInternal (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:219
#25 0x0267dca1 in MessageLoop::RunHandler (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:202
#26 0x0267dd0c in MessageLoop::Run (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:176
#27 0x00f7990a in XRE_InitChildProcess (aArgc=1, aArgv=0xbfea2e44, aProcess=GeckoProcessType_Content) at /home/t_mattjo/src/firefox/mobilebase/toolkit/xre/nsEmbedFunctions.cpp:506
#28 0x08048651 in main (argc=2, argv=0xbfea2e44) at /home/t_mattjo/src/firefox/mobilebase/ipc/app/MozillaRuntimeMain.cpp:67
I do believe I was wrong, and we're seeing the redirect/SetPriority race on the old channel, not the new one.
Note to self - I believe this is the stack causing the error for realz this time:

#0  mozilla::net::HttpChannelChild::SetPriority (this=0xa4d51b8, aPriority=10) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpChannelChild.cpp:1044
#1  0x01069671 in mozilla::net::HttpBaseChannel::AdjustPriority (this=0xa4d51d0, delta=1) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpBaseChannel.cpp:967
#2  0x00fa4e4c in RescheduleRequest (aRequest=<value optimized out>, delta=1) at /home/t_mattjo/src/firefox/mobilebase/netwerk/base/src/nsLoadGroup.cpp:122
#3  0x00fa733b in nsLoadGroup::RemoveRequest (this=0xa80a620, request=0xa4d5200, ctxt=0x0, aStatus=2152398850) at /home/t_mattjo/src/firefox/mobilebase/netwerk/base/src/nsLoadGroup.cpp:664
#4  0x01093091 in mozilla::net::HttpChannelChild::Redirect3Complete (this=0xa4d51b8) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpChannelChild.cpp:704
#5  0x01093149 in mozilla::net::HttpChannelChild::RecvRedirect3Complete (this=0xa4d51b8) at /home/t_mattjo/src/firefox/mobilebase/netwerk/protocol/http/HttpChannelChild.cpp:692
#6  0x0259cb23 in mozilla::net::PHttpChannelChild::OnMessageReceived (this=0xa4d51b8, __msg=...) at PHttpChannelChild.cpp:579
#7  0x02527c30 in mozilla::dom::PContentChild::OnMessageReceived (this=0x9577d60, __msg=...) at PContentChild.cpp:840
#8  0x02427635 in mozilla::ipc::AsyncChannel::OnDispatchMessage (this=0x9577d68, msg=...) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/AsyncChannel.cpp:262
#9  0x024300f8 in mozilla::ipc::RPCChannel::OnMaybeDequeueOne (this=0x9577d68) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/RPCChannel.cpp:438
#10 0x02431686 in DispatchToMethod<mozilla::ipc::RPCChannel, bool (mozilla::ipc::RPCChannel::*)()> (this=0x9578c50) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/tuple.h:383
#11 RunnableMethod<mozilla::ipc::RPCChannel, bool (mozilla::ipc::RPCChannel::*)(), Tuple0>::Run (this=0x9578c50) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/task.h:307
#12 0x024316d9 in Run (this=0xb6c352d8) at ../../dist/include/mozilla/ipc/RPCChannel.h:448
#13 mozilla::ipc::RPCChannel::DequeueTask::Run (this=0xb6c352d8) at ../../dist/include/mozilla/ipc/RPCChannel.h:473
#14 0x0267d76c in MessageLoop::RunTask (this=0xbfea2c20, task=0xb6c352d8) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:343
#15 0x0267de58 in MessageLoop::DeferOrRunPendingTask (this=0xbfea2c20, pending_task=...) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:351
#16 0x0267ef07 in MessageLoop::DoWork (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:451
#17 0x0242ba5c in mozilla::ipc::MessagePump::Run (this=0x9576ee0, aDelegate=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/MessagePump.cpp:114
#18 0x0242bd6d in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x9576ee0, aDelegate=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/MessagePump.cpp:229
#19 0x0267dc85 in MessageLoop::RunInternal (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:219
#20 0x0267dca1 in MessageLoop::RunHandler (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:202
#21 0x0267dd0c in MessageLoop::Run (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:176
#22 0x022bf2d7 in nsBaseAppShell::Run (this=0x97ad088) at /home/t_mattjo/src/firefox/mobilebase/widget/src/xpwidgets/nsBaseAppShell.cpp:180
#23 0x00f79c9b in XRE_RunAppShell () at /home/t_mattjo/src/firefox/mobilebase/toolkit/xre/nsEmbedFunctions.cpp:629
#24 0x0242bcc6 in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x9576ee0, aDelegate=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/glue/MessagePump.cpp:215
#25 0x0267dc85 in MessageLoop::RunInternal (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:219
#26 0x0267dca1 in MessageLoop::RunHandler (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:202
#27 0x0267dd0c in MessageLoop::Run (this=0xbfea2c20) at /home/t_mattjo/src/firefox/mobilebase/ipc/chromium/src/base/message_loop.cc:176
#28 0x00f7990a in XRE_InitChildProcess (aArgc=1, aArgv=0xbfea2e44, aProcess=GeckoProcessType_Content) at /home/t_mattjo/src/firefox/mobilebase/toolkit/xre/nsEmbedFunctions.cpp:506
#29 0x08048651 in main (argc=2, argv=0xbfea2e44) at /home/t_mattjo/src/firefox/mobilebase/ipc/app/MozillaRuntimeMain.cpp:67
I suspect we may have to move the __Delete_ of the old channel to some place where we're guaranteed that all child actors (including the load group) have been notified about the redirect.  That may mean moving the _delete_ to the child (though TBH I'm not sure why we have to delete in the redirect code, as opposed to just letting the old channel go through OnStop, etc. normally).

Naturally, it'd also be really nice if we could forward any late-arriving SetPriority to the new channel.
The fundamental issue here is that the __delete__ message comes out of the blue.  When the child deletes, we are in a known state in which the parent will not be sending any further messages.  That is not the case in the redirection code in the parent, so we could either avoid any __delete__ from parent->child, move the __delete__ to another point when the child is guaranteed not to send any further messages, or we could switch to an acknowledged __delete__ strategy when deleting from the parent (ie. parent sends a DeleteSelf message, child responds with __delete__).

I tried removing the channel deletion in the case of successful redirection, and I can no longer reproduce this bug.  I'll attach a patch, but I still haven't reasoned through whether the deletion in the case of unsuccessful redirection is race-free.
Patch to remove channel deletion on redirection success which makes the problem unreproducible.
Attachment #475595 - Attachment description: Initiate all PHttpChannel protocol deletion from the child to avoid races. → Remove channel deletion on redirection success.
(In reply to comment #0)
> cjones reports he saw
> 
>     - WARNING: NS_ENSURE_TRUE(mSuspendCount > 0) failed: file
>       /netwerk/base/src/nsInputStreamPump.cpp, line 241

This one particular warning is related to bug 513008.  See bug 594882 for explanation.
(In reply to comment #6)
> (Actually I'm puzzled as to why we delete the protocol here in the first place:
> the old channel gets canceled, and I would have thought we could just deleted
> it the usual way, after OnStopRequest.)

You mean delete from HttpChannelChild::OnStopRequest?  We cannot do that, because we won't get OnStopRequest call for the old child channel.  That's how redirects work: you do AsyncOpen on a channel, it succeeds, so you expect OnStartRequest/OnStopRequest.  You will get it, but only from a new channel after the redirect is made (if it is not another redirect).  This switch is transparent to the channel consumer.
If that's the case, we should probably switch to using a request/delete system from the parent.  I'll throw together a patch.
Comment on attachment 475595 [details] [diff] [review]
Remove channel deletion on redirection success.

This is a good start.  But we will leak probably one of the channel (the old or the new one depending on the redirect result).

The correct discovery here is that the channel must not be deleted from the parent side.

We have either to move the deletion to the child: in case of a successful redirect, do it in HttpChannelChild::Redirect3Complete (on *this) after we get success result from mRedirectChannelChild->CompleteRedirectSetup ; in case of a failed redirect, we can delete the channel from the parent or we have to introduce Redirect3Failed event that will delete the new channel (the redirect target channel) on the child side.

Or as a general solution also for future, do what you suggests in comment 14 - DeleteSelf message.
Attachment #475595 - Attachment is obsolete: true
Attachment #476266 - Flags: review?(jduell.mcbugs)
Attachment #476266 - Attachment is obsolete: true
Attachment #476266 - Flags: review?(jduell.mcbugs)
Depends on: 598076
Attachment #478854 - Flags: review?(jduell.mcbugs)
Attachment #478854 - Flags: review?(jduell.mcbugs)
can we get a review on this?
tracking-fennec: ? → 2.0+
I can steal it.
Attachment #478854 - Attachment is obsolete: true
Attachment #478854 - Flags: review?(jduell.mcbugs)
Attachment #481289 - Flags: review?(jduell.mcbugs)
Attachment #481289 - Flags: review?(jduell.mcbugs) → review?(honzab.moz)
Comment on attachment 481289 [details] [diff] [review]
Initiate all PHttpChannel protocol deletion from the child to avoid races.

r=honzab if the netwerk xpc-shell tests are passing with this patch, w/o leaks.
Attachment #481289 - Flags: review?(honzab.moz) → review+
http://hg.mozilla.org/mozilla-central/rev/8e118657e810
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.