Closed Bug 1261070 Opened 8 years ago Closed 8 years ago

Use after free crash in HttpChannelChild::FailedAsyncOpen during shutdown

Categories

(Core :: Networking, defect)

45 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox45 --- disabled
firefox46 --- disabled
firefox47 + fixed
firefox48 + fixed
firefox-esr38 --- unaffected
firefox-esr45 --- disabled

People

(Reporter: nical, Assigned: dragana)

References

Details

(4 keywords, Whiteboard: [necko-active][post-critsmash-triage])

Attachments

(1 file, 1 obsolete file)

I am running into this a lot as I am investigating shutdown gfx issues.


(#5 and below: the segfault handler stuff...) 
#6  0x00007f3e19d045e5 in mozilla::net::HttpBaseChannel::DoNotifyListener() (this=0x7f3de3422030) at /netwerk/protocol/http/HttpBaseChannel.cpp:2634
#7  0x00007f3e19d337e2 in mozilla::net::HttpAsyncAborter<mozilla::net::HttpChannelChild>::HandleAsyncAbort() (this=0x7f3de34224a8) at /netwerk/protocol/http/HttpBaseChannel.h:575
#8  0x00007f3e19d0b8ea in mozilla::net::HttpChannelChild::HandleAsyncAbort() (this=0x7f3de3422000) at /netwerk/protocol/http/HttpChannelChild.cpp:1080
#9  0x00007f3e19d0b96d in mozilla::net::HttpChannelChild::FailedAsyncOpen(nsresult const&) (this=0x7f3de3422000, status=@0x7f3de08fe750: -2147221231) at /netwerk/protocol/http/HttpChannelChild.cpp:1091
#10 0x00007f3e19d2f795 in mozilla::net::FailedAsyncOpenEvent::Run() (this=0x7f3de08fe740) at /netwerk/protocol/http/HttpChannelChild.cpp:1060
#11 0x00007f3e19cb7976 in mozilla::net::ChannelEventQueue::RunOrEnqueue(mozilla::net::ChannelEvent*, bool) (this=0x7f3ded7fb1f0, aCallback=0x7f3de08fe740, aAssertionWhenNotQueued=false) at /home/nical/dev/mozilla/objdir/dbg2/dist/include/mozilla/net/ChannelEventQueue.h:133
#12 0x00007f3e19d0b8c0 in mozilla::net::HttpChannelChild::RecvFailedAsyncOpen(nsresult const&) (this=0x7f3de3422000, status=@0x7ffefb3b3634: -2147221231) at /netwerk/protocol/http/HttpChannelChild.cpp:1070
#13 0x00007f3e1a0f29f5 in mozilla::net::PHttpChannelChild::OnMessageReceived(IPC::Message const&) (this=0x7f3de3422000, msg__=...) at /home/nical/dev/mozilla/objdir/dbg2/ipc/ipdl/PHttpChannelChild.cpp:718
#14 0x00007f3e1a5a096d in mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) (this=0x7f3e0b0c5030, msg__=...) at /home/nical/dev/mozilla/objdir/dbg2/ipc/ipdl/PContentChild.cpp:6243
#15 0x00007f3e19f8f22e in mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (this=0x7f3e0b0c5098, aMsg=...) at /ipc/glue/MessageChannel.cpp:1630
#16 0x00007f3e19f8ed26 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message const&) (this=0x7f3e0b0c5098, aMsg=...) at /ipc/glue/MessageChannel.cpp:1568
#17 0x00007f3e19f8ea7f in mozilla::ipc::MessageChannel::OnMaybeDequeueOne() (this=0x7f3e0b0c5098) at /ipc/glue/MessageChannel.cpp:1535
#18 0x00007f3e19fa7831 in details::CallMethod<, mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)()>(mozilla::IndexSequence<>, mozilla::ipc::MessageChannel*, bool (mozilla::ipc::MessageChannel::*)(), mozilla::Tuple<>&) (obj=0x7f3e0b0c5098, method=(bool (mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel * const)) 0x7f3e19f8e95c <mozilla::ipc::MessageChannel::OnMaybeDequeueOne()>, arg=...) at /ipc/chromium/src/base/task.h:28
#19 0x00007f3e19fa74b7 in DispatchTupleToMethod<mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)()>(mozilla::ipc::MessageChannel*, bool (mozilla::ipc::MessageChannel::*)(), mozilla::Tuple<>&) (obj=0x7f3e0b0c5098, method=(bool (mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel * const)) 0x7f3e19f8e95c <mozilla::ipc::MessageChannel::OnMaybeDequeueOne()>, arg=...) at /ipc/chromium/src/base/task.h:46
#20 0x00007f3e19fa6f26 in RunnableMethod<mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)(), mozilla::Tuple<> >::Run() (this=0x7f3e0b042a80) at /ipc/chromium/src/base/task.h:307
#21 0x00007f3e19f9878b in mozilla::ipc::MessageChannel::RefCountedTask::Run() (this=0x7f3e0b002710) at /home/nical/dev/mozilla/objdir/dbg2/dist/include/mozilla/ipc/MessageChannel.h:475
#22 0x00007f3e19f98978 in mozilla::ipc::MessageChannel::DequeueTask::Run() (this=0x7f3de39ca0c0) at /home/nical/dev/mozilla/objdir/dbg2/dist/include/mozilla/ipc/MessageChannel.h:492
#23 0x00007f3e19ee6b83 in MessageLoop::RunTask(Task*) (this=0x7ffefb3b66b0, task=0x7f3de39ca0c0) at /ipc/chromium/src/base/message_loop.cc:364
#24 0x00007f3e19ee6bfd in MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) (this=0x7ffefb3b66b0, pending_task=...) at /ipc/chromium/src/base/message_loop.cc:372
#25 0x00007f3e19ee704f in MessageLoop::DoWork() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:459
#26 0x00007f3e19f93172 in mozilla::ipc::DoWorkRunnable::Run() (this=0x7f3e0b0b3850) at /ipc/glue/MessagePump.cpp:222
#27 0x00007f3e199cee5a in nsThread::ProcessNextEvent(bool, bool*) (this=0x7f3e0ac2d100, aMayWait=false, aResult=0x7ffefb3b641f) at /xpcom/threads/nsThread.cpp:994
#28 0x00007f3e19a35c22 in NS_ProcessNextEvent(nsIThread*, bool) (aThread=0x7f3e0ac2d100, aMayWait=false) at /xpcom/glue/nsThreadUtils.cpp:297
#29 0x00007f3e19f92b15 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (this=0x7f3e0b04cfb0, aDelegate=0x7ffefb3b66b0) at /ipc/glue/MessagePump.cpp:97
#30 0x00007f3e19f93478 in mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) (this=0x7f3e0b04cfb0, aDelegate=0x7ffefb3b66b0) at /ipc/glue/MessagePump.cpp:295
#31 0x00007f3e19ee661b in MessageLoop::RunInternal() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:234
#32 0x00007f3e19ee65ae in MessageLoop::RunHandler() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:227
#33 0x00007f3e19ee653d in MessageLoop::Run() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:201
#34 0x00007f3e1cf38088 in nsBaseAppShell::Run() (this=0x7f3e01dd7d60) at /widget/nsBaseAppShell.cpp:156
#35 0x00007f3e1df111af in XRE_RunAppShell() () at /toolkit/xre/nsEmbedFunctions.cpp:801
#36 0x00007f3e19f9330d in mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) (this=0x7f3e0b04cfb0, aDelegate=0x7ffefb3b66b0) at /ipc/glue/MessagePump.cpp:263
#37 0x00007f3e19ee661b in MessageLoop::RunInternal() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:234
#38 0x00007f3e19ee65ae in MessageLoop::RunHandler() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:227
#39 0x00007f3e19ee653d in MessageLoop::Run() (this=0x7ffefb3b66b0) at /ipc/chromium/src/base/message_loop.cc:201
#40 0x00007f3e1df10bf6 in XRE_InitChildProcess(int, char**, mozilla::gmp::GMPLoader*) (aArgc=3, aArgv=0x7ffefb3b7d08, aGMPLoader=0x0) at /toolkit/xre/nsEmbedFunctions.cpp:637
#41 0x0000000000427351 in content_process_main(int, char**) (argc=5, argv=0x7ffefb3b7d08) at /ipc/app/../contentproc/plugin-container.cpp:237
#42 0x000000000042741a in main(int, char**) (argc=6, argv=0x7ffefb3b7d08) at /ipc/app/MozillaRuntimeMain.cpp:11

FailedAsyncOpenEvent holds a raw pointer to an HttpChannelChild which is deleted by the time Run() is called.

I assume this belongs to the network component because the cashing code is in the netwerk directory (not sure though).
Could you attach more of the report? Usually a use-after-free ASAN report includes where the object was allocated and where it was freed in addition to where the re-use was. Doesn't need to be a comment, you can stuff it in an attachment to avoid bug clutter.
Group: core-security → layout-core-security
Flags: needinfo?(nical.bugzilla)
The relevant necko code seems to use raw pointer member variables in few places, and that is always suspicious, but I'm not familiar with the code at all, so perhaps this isn't about those cases.
Dragana, haven't you recently seen/fixed something like this?
Flags: needinfo?(dd.mozilla)
Group: layout-core-security → network-core-security
This should be fixed with bug 1256336. It fails couple of the first line of the stack to be 100% sure. I am 99% sure the stack is the same.
What version are you using?
Bug 1256336 landed on 24.March so probably it is in Nightly since 25-26. March.
Flags: needinfo?(dd.mozilla)
what is the hg cset you're working off of for the reported stack?
I am working off a patch queue on top of 290738:d5d53a3b4e50 (from March 29th). My patch queue enables parts of the gfx shutdown that are currently disabled for child processes with e10s, although I doubt anything in gfx interacts with this area of the code at all.
The steps I use to test my patches (and with which I tend to run into this) is to start firefox, go to a youtube video, watch a second or two of the video and close the browser while the video plays. I don't know if these steps have an impact on the reproducibility of this particular crash, though.

I don't have an asan build handy, I'll look into that when I have some spare cycles.
This is another bug, it does not have anything to do with bug 1256336.

in function mozilla::net::HttpBaseChannel::DoNotifyListener() we call DoNotifyLisenerCleanup which does a really good cleanup and cleans up/destroys HttpChannelChild(HttpBaseChannel) and then DoNotifyListener() tries to access its members.

This is not a gfx problem. Thanks for the report.
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

This code is for along time here. 

It needs e10s so 45 and 45esr are not affected. I am not sure if the e10s experiment on 46 is still running.
The problem was introduced by bug 1215140 in ff45 (and uplifted to 44)
Attached patch bug_1261070_v1.patch (obsolete) — Splinter Review
I have moved Send__delete to FailedAsyncOpen, it is less likely(because there is no similar function in the non-e10s version) that someone is going to put stuff into this function and produce similar bugs.
Attachment #8737159 - Flags: review?(jduell.mcbugs)
Whiteboard: [necko-active]
Comment on attachment 8737159 [details] [diff] [review]
bug_1261070_v1.patch

Review of attachment 8737159 [details] [diff] [review]:
-----------------------------------------------------------------

There are days when shoving common code into templates really doesn't seem worth it...  (and I was the one who did it :)

So while this patch seems safe, since it moves the Send__delete "later" in the codepath, it's actually not--if the channel is suspended, we'll wind up calling Send_Delete() (which very possibly deletes the channel) while the channel is still suspended.  Very Bad Things will result.

Also, comment 7 identifies one use-after-free case:

1) HttpBaseChannel::DoNotifyListener() is referring to 'this' (via mLoadInfo) after DoNotifyListenerCleanup (the docshell flush code)

But I'm also seeing another one here:

2) HttpAsyncAborter<T>::HandleAsyncAbort() is referring to this->mLoadGroup (in order to remove channel from loadGroup) after DoNotifyListener{Cleanup}

If the relative ordering of the docshell flush, the loadGroup removal, and the mInterceptListener cleanup is not important (I don't know the answer to that, but I'd suspect it's OK to reorder them), I think we could fix both of these issues by moving the call to DoNotifyListenerCleanup to the end of HttpAsyncAborter<T>::HandleAsyncAbort().  Please take a look and see if that seems right to you.

One other thing I really recommend we do is to copy-and-paste the scary warning we have in one place for Send__delete to all other call sites in the code: 

 // This calls NeckoChild::DeallocPHttpChannelChild(), which deletes |this| if IPDL
 // holds the last reference.  Don't rely on |this| existing after here!

Maybe we should change it to be bolder and actually put it *after* the call
sites:

 // WARNING:  DO NOT RELY ON |THIS| EXISTING ANY MORE! 
 // 
 // NeckoChild::DeallocPHttpChannelChild() may have been called, which deletes 
 // |this| if IPDL holds the last reference.  

I'd also suggest we put a similar comment at the DoNotifyListenerCleanup() callsite.  We really need to make sure no one adds code after these calls.
Attachment #8737159 - Flags: review?(jduell.mcbugs) → review-
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #13)

Please also see bug 1260527 if you are going to move with IPC lifetime here.
> Please also see bug 1260527 if you are going to move with IPC lifetime here.

We're not really moving IPC lifetime here.  And we should land the fix for this bug on beta.  So I think that bug will be unrelated to this.
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #13)
> Comment on attachment 8737159 [details] [diff] [review]
> bug_1261070_v1.patch
> 
> Review of attachment 8737159 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> There are days when shoving common code into templates really doesn't seem
> worth it...  (and I was the one who did it :)
> 
> So while this patch seems safe, since it moves the Send__delete "later" in
> the codepath, it's actually not--if the channel is suspended, we'll wind up
> calling Send_Delete() (which very possibly deletes the channel) while the
> channel is still suspended.  Very Bad Things will result.
> 

No we are not. 
HttpChannelChild have a different suspend logic than nsHttpChannel.
HttpChannelChild will not call FailedAsyncOpen if it is suspended, it will queue it. 
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpChannelChild.cpp#1070

so the channel is not suspended when  HandleAsyncAbort() is called.

> Also, comment 7 identifies one use-after-free case:
> 
> 1) HttpBaseChannel::DoNotifyListener() is referring to 'this' (via
> mLoadInfo) after DoNotifyListenerCleanup (the docshell flush code)
> 
> But I'm also seeing another one here:
> 
> 2) HttpAsyncAborter<T>::HandleAsyncAbort() is referring to this->mLoadGroup
> (in order to remove channel from loadGroup) after DoNotifyListener{Cleanup}
> 

Yes I notice that.

> If the relative ordering of the docshell flush, the loadGroup removal, and
> the mInterceptListener cleanup is not important (I don't know the answer to
> that, but I'd suspect it's OK to reorder them), I think we could fix both of
> these issues by moving the call to DoNotifyListenerCleanup to the end of
> HttpAsyncAborter<T>::HandleAsyncAbort().  Please take a look and see if that
> seems right to you.
> 
> One other thing I really recommend we do is to copy-and-paste the scary
> warning we have in one place for Send__delete to all other call sites in the
> code: 
> 
>  // This calls NeckoChild::DeallocPHttpChannelChild(), which deletes |this|
> if IPDL
>  // holds the last reference.  Don't rely on |this| existing after here!
> 
> Maybe we should change it to be bolder and actually put it *after* the call
> sites:
> 
>  // WARNING:  DO NOT RELY ON |THIS| EXISTING ANY MORE! 
>  // 
>  // NeckoChild::DeallocPHttpChannelChild() may have been called, which
> deletes 
>  // |this| if IPDL holds the last reference.  
> 
> I'd also suggest we put a similar comment at the DoNotifyListenerCleanup()
> callsite.  We really need to make sure no one adds code after these calls.

Yes you are right. I will add comments.
For current code it is not need, but I could put Send__delete on mEventQ so it will not be executed if channel is suspended.

In current code we do not suspend channel between FailAsyncOpne call and Send__delete call, so channel is for sure not suspended when Send__delete is call. 

For the future, a possible change, I could put it to the event queue (this adds unnecessary function call)
Attachment #8737159 - Attachment is obsolete: true
Attachment #8738155 - Flags: review?(jduell.mcbugs)
I'm going to mark this sec-critical. It may be hard to trigger because it is related to shutting down the networking channel, but on the other hand at least two separate people have stumble across it in their regular builds, so it is probably easy to find, and maybe a clever exploit could do something nasty with this.
Keywords: crash
Comment on attachment 8738155 [details] [diff] [review]
bug_1261070_v1.patch

Review of attachment 8738155 [details] [diff] [review]:
-----------------------------------------------------------------

> HttpChannelChild will not call FailedAsyncOpen if it is suspended, it will queue 
> it... so the channel is not suspended when  HandleAsyncAbort() is called.

Ah, that's a good point.  I think my approach requires less subtle thinking, and might be less likely to silently become a bug again if we ever try to switch to PBackground, or allow off-main Suspend, etc. (putting put Send__delete on mEventQ might be even simpler and more failproof--I dont' care about a single queueing/function overhead per channel lifetime). But I'm not going to quibble when we need a patch here up to at least aurora ASAP, so let's take this.
Attachment #8738155 - Flags: review?(jduell.mcbugs) → review+
Comment on attachment 8738155 [details] [diff] [review]
bug_1261070_v1.patch

Approval Request Comment
[Feature/regressing bug #]: long time
[User impact if declined]: security/crashes: see comment 19
[Describe test coverage new/current, TreeHerder]:  should be plenty of coverage in existing tests (though it's a race condition, so repro is probabilistic)
[Risks and why]: low: fairly small change, looked over carefully.
[String/UUID change made/needed]: none

We definitely want this for any e10s desktop builds we put out there.
Attachment #8738155 - Flags: approval-mozilla-aurora?
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8738155 [details] [diff] [review]
bug_1261070_v1.patch

[Security approval request comment]
How easily could an exploit be constructed based on the patch? From the patch it can be seen that we were using after freeing. I am not sure how easy is to trigger this, all pointers to a channel must be freed, i expect that it is not that hard.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? yes

Which older supported branches are affected by this flaw? all

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? Yes, it is easy and safe to backport.

How likely is this patch to cause regressions; how much testing does it need? it is unlikely to cause regression.
Attachment #8738155 - Flags: sec-approval?
sec-approval+ and approval for Aurora.

Beta (46) is having e10s disabled so I am marking it as "disabled" now for status.
Attachment #8738155 - Flags: sec-approval?
Attachment #8738155 - Flags: sec-approval+
Attachment #8738155 - Flags: approval-mozilla-aurora?
Attachment #8738155 - Flags: approval-mozilla-aurora+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/2984386471ba
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Group: network-core-security → core-security-release
Whiteboard: [necko-active] → [necko-active][post-critsmash-triage]
Version: unspecified → 45 Branch
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: