Closed Bug 1405818 Opened 7 years ago Closed 6 years ago

Crash in mozilla::net::HttpChannelChild::Release

Categories

(Core :: Networking, defect, P1)

58 Branch
x86
Windows
defect

Tracking

()

RESOLVED DUPLICATE of bug 1401459
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- affected

People

(Reporter: philipp, Assigned: schien)

Details

(4 keywords, Whiteboard: [necko-triaged])

Crash Data

Attachments

(3 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-745012d3-4275-4f2c-8a81-282540170929.
=============================================================
Crashing Thread (9), Name: Socket Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::net::HttpChannelChild::Release() 	netwerk/protocol/http/HttpChannelChild.cpp:239
1 	xul.dll 	RefPtr<mozilla::net::HttpChannelChild>::assign_assuming_AddRef(mozilla::net::HttpChannelChild*) 	obj-firefox/dist/include/mozilla/RefPtr.h:65
2 	xul.dll 	mozilla::net::HttpBackgroundChannelChild::OnChannelClosed() 	netwerk/protocol/http/HttpBackgroundChannelChild.cpp:121
3 	xul.dll 	mozilla::detail::RunnableMethodImpl<nsGlobalWindow* const, void ( nsGlobalWindow::*)(void), 1, 0>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:1172
4 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1446
5 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:480
6 	xul.dll 	mozilla::net::nsSocketTransportService::Run() 	netwerk/base/nsSocketTransportService2.cpp:976
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1446
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:480
9 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:339
10 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
11 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
12 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:506
13 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
14 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
15 	ucrtbase.dll 	_o__CIpow 	
16 	kernel32.dll 	BaseThreadInitThunk 	
17 	mozglue.dll 	patched_BaseThreadInitThunk 	mozglue/build/WindowsDllBlocklist.cpp:824
18 	ntdll.dll 	__RtlUserThreadStart 	
19 	ntdll.dll 	_RtlUserThreadStart

crash reports in the content process with this signature are somewhat rising since firefox 56 on windows. a small portion of them seems to indicate a UAF siatuation, so marking the bug as security sensitive as a precaution.
Group: core-security → network-core-security
The UAF crashes with this signature seem to be _WRITE access violations. The bulk of the crashes with this signature are _READ violations. Might be different bugs.
Jason, can you take a look or help find someone to investigate? Thanks.
Flags: needinfo?(jduell.mcbugs)
SC: this is a refcount bug for HttpChannelChild on the socket transport thread.  I'm wondering if this is somehow related to PBackgroundHttp?  It's hard to tell since we've only got a crashing Release() event.
Flags: needinfo?(schien)
Not sure about the exact scenario that triggers this refcount bug. By looking at the backtrace and source code, I do find some member variables not well protected against multi-thread access.

https://searchfox.org/mozilla-central/rev/1a4a26905f923458679a59a4be1e455ebc53c333/netwerk/ipc/ChannelEventQueue.h#157
https://searchfox.org/mozilla-central/rev/1a4a26905f923458679a59a4be1e455ebc53c333/netwerk/protocol/http/HttpChannelChild.h#306
Assignee: nobody → schien
Flags: needinfo?(schien)
Tracing down the path of HttpChannelChild::Release, I also found HttpChannelChild::TrySendDeletingChannel has fundamental logic error and it is definitely introduced by my PBg-Http patch.

https://searchfox.org/mozilla-central/rev/1a4a26905f923458679a59a4be1e455ebc53c333/netwerk/protocol/http/HttpChannelChild.cpp#3399
Flags: needinfo?(jduell.mcbugs)
Priority: -- → P1
Whiteboard: [necko-triaged]
Attachment #8917790 - Flags: feedback?(honzab.moz)
Attachment #8917791 - Flags: feedback?(honzab.moz)
These three patches addressed my find in comment #4 and comment #5. Still trying to figure out if any of these fixes have relation to the crash report but worth to get feedback first.
Comment on attachment 8917791 [details] [diff] [review]
make HttpChannelChild::mKeptAlive atomic

Review of attachment 8917791 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +253,5 @@
>    // Normally we Send_delete in OnStopRequest, but when we need to retain the
>    // remote channel for security info IPDL itself holds 1 reference, so we
>    // Send_delete when refCnt==1.  But if !mIPCOpen, then there's nobody to send
>    // to, so we fall through.
> +  if (count == 1 && mIPCOpen && mKeptAlive.compareExchange(true, false)) {

so, this fix suggests that:

object state: mRefCnt == 2, mIPCOpen == true, mKeptAlive == true

T1: Release() { 
  --mRefCnt -> 1
  mKeptAlive ?== true
  count ?== 1

context switch
T2: AddRef() {
  ++mRefCnt -> 2
}
T2: Release() {
  --mRefCnt -> 1
  mKeptAlive ?== true
  count ?== 1
  mIPCOpen ?== true
  yes, all true!

context switch:
T1: continues Release() ->
  mIPCOpen ?== true
  yes, all true!
  call TrySendDeletingChannel();
}
  
context switch:
T2: continues Release() ->
  call TrySendDeletingChannel();
}


that's possible, yes, but...  if you fix this with making keptalive atomic and use cmpxchg, then there is something else wrong.  since if would mean that access to the object and call to release() can be done on multiple threads w/o locking from above.  That's bad on its own.  Release() and AddRef() should always be considered as implicitly fully atomic because they must always be called under some lock provided by the caller, see [1].

hence, if this particular patch fixes the crashes, then we need to find the place where the locking is not used properly over access to this object.

[1] https://www.janbambas.cz/illusion-atomic-reference-counting/
Attachment #8917791 - Flags: feedback?(honzab.moz) → feedback-
Comment on attachment 8917790 [details] [diff] [review]
make ChannelEventQueue::mOwner atomic

Review of attachment 8917790 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't sound like a proper solution, please see the feedback for the " make HttpChannelChild::mKeptAlive atomic " patch, I think the same applies for this patch...
Attachment #8917790 - Flags: feedback?(honzab.moz) → feedback-
Comment on attachment 8917792 [details] [diff] [review]
fix logic of reentrant HttpChannelChild::TrySendDeletingChannel

Review of attachment 8917792 [details] [diff] [review]:
-----------------------------------------------------------------

so, this is just a clean up after the " make HttpChannelChild::mKeptAlive atomic " patch?
(In reply to Honza Bambas (:mayhemer) from comment #10)
> [1] https://www.janbambas.cz/illusion-atomic-reference-counting/

To elaborate a bit more on this, if there is a member or global that we can call both Release() and AddRef() on on two threads, we have the problem.  

I kinda think that having non-threadsafe refcounter and rather use a lock in case of such complicated Release() (and of course AddRef()) implementation may be a better solution here, both improving performance and safety (multiple atomics are expensive, and can be more expensive than one lock!)  Also, from experience, when there are multiple state variables we change and make decisions with, it's better to make the changes and decisions totally atomic (with a lock).

If we avoid the concurrent unlocked member access - if there is such, we are OK then.
(In reply to Honza Bambas (:mayhemer) from comment #13)
> (In reply to Honza Bambas (:mayhemer) from comment #10)
> > [1] https://www.janbambas.cz/illusion-atomic-reference-counting/
> 
> To elaborate a bit more on this, if there is a member or global that we can
> call both Release() and AddRef() on on two threads, we have the problem.  
> 
> I kinda think that having non-threadsafe refcounter and rather use a lock in
> case of such complicated Release() (and of course AddRef()) implementation
> may be a better solution here, both improving performance and safety
> (multiple atomics are expensive, and can be more expensive than one lock!) 
> Also, from experience, when there are multiple state variables we change and
> make decisions with, it's better to make the changes and decisions totally
> atomic (with a lock).
> 
> If we avoid the concurrent unlocked member access - if there is such, we are
> OK then.
Thanks for sharing this. HttpChannelChild can be hold by many objects and each of them works on different threads:
1) caller/creator of HttpChannelChild, typically main thread
2) listner of HttpChannelChild, main thread, and target thread if ODA is enabled
3) IPDL, extra refcount hold/release on main thread
4) ChannelEventQueue, main thread, and target thread if ODA is enabled
5) HttpBackgroundChannelChild, main thread and STS thread

Not all the references to the HttpChannelChild is controlled by Necko, therefore it would be nearly impossible to use lock on every AddRef/Release to a HttpChannelChild object. This leads to two choices: 1) make HttpChannelChild::AddRef/Release atomic, or 2) make them reliable for multi-thread access at the same time.
(In reply to Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) from comment #14)
> (In reply to Honza Bambas (:mayhemer) from comment #13)
> > (In reply to Honza Bambas (:mayhemer) from comment #10)
> > > [1] https://www.janbambas.cz/illusion-atomic-reference-counting/
> > 
> > To elaborate a bit more on this, if there is a member or global that we can
> > call both Release() and AddRef() on on two threads, we have the problem.  
> > 
> > I kinda think that having non-threadsafe refcounter and rather use a lock in
> > case of such complicated Release() (and of course AddRef()) implementation
> > may be a better solution here, both improving performance and safety
> > (multiple atomics are expensive, and can be more expensive than one lock!) 
> > Also, from experience, when there are multiple state variables we change and
> > make decisions with, it's better to make the changes and decisions totally
> > atomic (with a lock).
> > 
> > If we avoid the concurrent unlocked member access - if there is such, we are
> > OK then.
> Thanks for sharing this. HttpChannelChild can be hold by many objects and
> each of them works on different threads:
> 1) caller/creator of HttpChannelChild, typically main thread
> 2) listner of HttpChannelChild, main thread, and target thread if ODA is
> enabled
> 3) IPDL, extra refcount hold/release on main thread
> 4) ChannelEventQueue, main thread, and target thread if ODA is enabled
> 5) HttpBackgroundChannelChild, main thread and STS thread
> 
> Not all the references to the HttpChannelChild is controlled by Necko,
> therefore it would be nearly impossible to use lock on every AddRef/Release
> to a HttpChannelChild object. 

That is not what I mean.  Let's explain on an example.  A class nsClass1 has a RefPtr<HttpChannelChild> mPtr member.  It's assigned an instance of a valid object.  A class nsClass2 has it's own RefPtr member too, pointing to the same http child.  So, there are two individual objects keeping ref to the single child channel.  The channel's refcnt is then at least 2.

If we access and addref() (or release()) nsClass1->mRawPtr on T1 w/o a global lock and also access and addref() (or release()) nsClass2->mRawPtr on T2, again w/o locking, there is no problem.  If both release()s, only one of the releases will delete the object, correctly.  If one release()s and the other addref()s, the object will still be live - the refcnt will never drop to 0.

*The problem* exists when nsClass1->mRawPtr - a single instance of the *RefPtr* aka a single memory cell keeping the address to the http child object - is accessed on T1 and T2 w/o locking!  Then there is the race as I describe it in the article.

Hence, you don't need one giant lock to access HttpChannelChild references all over the code.  You need to care if access to HttpChannelChild *within* a single class (object) that refers it is atomic (done under a lock specific to the class with the mPtr member).

> This leads to two choices: 1) make
> HttpChannelChild::AddRef/Release atomic, 

Won't fix *the problem.*

> or 2) make them reliable for
> multi-thread access at the same time.

They mostly are, I believe.  But simplification (use of a lock rather than few atomics) would be good, just to eliminate the problem happen to be in release() imperfect atomicity.
Comment on attachment 8917792 [details] [diff] [review]
fix logic of reentrant HttpChannelChild::TrySendDeletingChannel

comment 15 and above, dropping f to move forward here.  also, feel free to argue with me that I'm wrong :)  these things are complicated.
Attachment #8917792 - Flags: feedback?(honzab.moz) → feedback-
Hi Shih-Chiang, are you still planning to continue investigating this sec bug?
Flags: needinfo?(schien)
This bug might be a dup to bug 1401459 with different symptom.
Flags: needinfo?(schien)
Double check the code and crash stack, it's highly likely that the patch for bug 1401459 will fix this issue as well.

The root cause is in HttpChannelChild::Release that we implicitly use |this| outside of critical section. Two Release runs in different threads might interleaving like this:

>                T1                                                         T2
>nsrefcnt count = --mRefCnt; //count = 1
>                                                               nsrefcnt count = --mRefCnt; //count = 0
>                                                               if (mKeptAlive && count == 1 && mIPCOpen) { ... }
>                                                               if (count == 0) {
>                                                                 mRefCnt = 1; /* stabilize */
>                                                                 delete this;
>                                                                 return 0;
>                                                               }
>if (mKeptAlive && count == 1 && mIPCOpen) {...}
>// access `this` after `delete this` on T2

Thus we'll see UAF in HttpChannelChild::Release. After bug 1401459 landed, HttpChannelChild::Release will always runs on main thread. Therefore, Release will not interleaving with each other anymore.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
Attachment #8917790 - Attachment is obsolete: true
Attachment #8917791 - Attachment is obsolete: true
Attachment #8917792 - Attachment is obsolete: true
Group: network-core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: