Closed Bug 1059081 Opened 10 years ago Closed 9 years ago

TCPSocket crash in nsMultiplexInputStream::ReadSegments on "Socket Thread" due to apparent racey use of nsMultiplexInputStream

Categories

(Core :: Networking, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking


VERIFIED FIXED
mozilla41
Tracking Status
firefox38 --- wontfix
firefox38.0.5 --- wontfix
firefox39 --- fixed
firefox40 --- fixed
firefox41 --- fixed
firefox-esr31 --- wontfix
firefox-esr38 --- wontfix
b2g-v2.0 --- wontfix
b2g-v2.0M --- wontfix
b2g-v2.1 --- fixed
b2g-v2.1S --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: asuth, Assigned: jdm)

References

Details

(Keywords: sec-low, Whiteboard: [adv-main39+][b2g-adv-main2.2+])

Attachments

(1 file)

The Gaia email app uses TCPSocket in our back-end tests, which run under b2g-desktop in non-OOP mode.  I have observed intermittent crashes in nsMultiplexInputStream::ReadSegments.  I'm not entirely clear on what's going on, but the fact that TCPSocket.js calls removeStream(0) and appendStream(...) on the main thread while nsMultiplexInputStream::ReadSegments is doing work on the "Socket Thread", without any apparent synchronization, seems like it could explain things.
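To make the hazard concrete, here is a minimal standalone C++ analogy (not Gecko code, just an illustration of the pattern): one thread walks a growable array of owned objects while another thread appends and removes entries with no synchronization.  Build with g++ -std=c++11 -pthread; running it under TSan or ASan flags the race / use-after-free.

#include <thread>
#include <vector>

int main() {
  std::vector<int*> streams;              // stands in for mStreams
  streams.push_back(new int(0));

  std::thread reader([&streams] {         // stands in for the "Socket Thread"
    for (int i = 0; i < 1000000; ++i) {
      if (!streams.empty()) {
        volatile int v = *streams[0];     // may hit a freed or stale slot
        (void)v;
      }
    }
  });

  std::thread writer([&streams] {         // stands in for the main thread
    for (int i = 0; i < 1000000; ++i) {
      streams.push_back(new int(i));      // appendStream(...): may reallocate
      delete streams.front();             // the reader may still be using this
      streams.erase(streams.begin());     // removeStream(0)
    }
  });

  reader.join();
  writer.join();
  for (int* p : streams) delete p;
  return 0;
}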

The call stack on an -O1 self-built trunk build from earlier today looks like this:
#0  nsMultiplexInputStream::ReadSegments (this=0x2aaaca7bb740, aWriter=<optimized out>, aClosure=<optimized out>, aCount=65536, aResult=0x2aaabeaffbcc)

318	    rv = mStreams[mCurrentStream]->ReadSegments(ReadSegCb, &state, aCount, &read);

#1  0x00002aaaac9dae65 in nsStreamCopierIB::DoCopy (this=<optimized out>, aSourceCondition=0x2aaabeaffc08, aSinkCondition=0x2aaabeaffc0c) at xpcom/io/nsMultiplexInputStream.cpp:318
#2  0x00002aaaac9db093 in Process (this=0x2aaad01f0500) at xpcom/io/nsStreamUtils.cpp:291
#3  nsAStreamCopier::Run (this=0x2aaad01f0500) at xpcom/io/nsStreamUtils.cpp:414
#4  0x00002aaaac9e7588 in nsThread::ProcessNextEvent (this=0x2aaabe8263c0, aMayWait=<optimized out>, aResult=0x2aaabeaffcef)
#5  0x00002aaaaca053ee in NS_ProcessNextEvent (aThread=<optimized out>, aMayWait=<optimized out>)
#6  0x00002aaaaca37c3e in nsSocketTransportService::Run (this=0x2aaaabd8f600) at netwerk/base/src/nsSocketTransportService2.cpp:744
#7  0x00002aaaac9e7588 in nsThread::ProcessNextEvent (this=0x2aaabe8263c0, aMayWait=<optimized out>, aResult=0x2aaabeaffdcf)
#8  0x00002aaaaca053ee in NS_ProcessNextEvent (aThread=<optimized out>, aMayWait=<optimized out>)
#9  0x00002aaaacc56927 in mozilla::ipc::MessagePumpForNonMainThreads::Run (this=0x2aaabe82d040, aDelegate=0x2aaaabd67060)
#10 0x00002aaaacc3532a in MessageLoop::RunInternal (this=this@entry=0x2aaaabd67060)
#11 0x00002aaaacc35598 in RunHandler (this=0x2aaaabd67060)
#12 MessageLoop::Run (this=this@entry=0x2aaaabd67060)
#13 0x00002aaaac9ebee3 in nsThread::ThreadFunc (aArg=0x2aaabe8263c0)
#14 0x00002aaaabfd2fb8 in _pt_root (arg=0x2aaaabd37920)
#15 0x00002aaaaacd8182 in start_thread (arg=0x2aaabeb00700)
#16 0x00002aaaab7f738d in clone ()

(gdb) p *mStreams.mHdr
$20 = {
  static sEmptyHdr = {
    static sEmptyHdr = <same as static member of an already seen type>, 
    mLength = 0, 
    mCapacity = 0, 
    mIsAutoArray = 0
  }, 
  mLength = 2, 
  mCapacity = 3, 
  mIsAutoArray = 0
}

(gdb) x/4g mStreams.mHdr
0x2aaad0f0ee20:	0x0000000300000002	0x00002aaac153a400
0x2aaad0f0ee30:	0x00002aaac153a4c0	0x5a5a5a5a5a5a5a5a

(gdb) p mCurrentStream
$21 = 1

Unfortunately, Elements() seems to have been optimized out.

The disassembly around the crash point, starting from the while-loop comparator, is:
   0x00002aaaac9ccc58 <+102>:	cmp    %r12d,%edx
   0x00002aaaac9ccc5b <+105>:	jae    0x2aaaac9ccce6 <nsMultiplexInputStream::ReadSegments(tag_nsresult (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*)+244>
   0x00002aaaac9ccc61 <+111>:	jmp    0x2aaaac9cccd5 <nsMultiplexInputStream::ReadSegments(tag_nsresult (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*)+227>
   0x00002aaaac9ccc63 <+113>:	lea    0x1c(%rsp),%r15
   0x00002aaaac9ccc68 <+118>:	lea    0x20(%rsp),%r14
   0x00002aaaac9ccc6d <+123>:	lea    -0x3e4(%rip),%r13        # 0x2aaaac9cc890 <nsMultiplexInputStream::ReadSegCb(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)>
   0x00002aaaac9ccc74 <+130>:	mov    %edx,%edx
// get mStreams
   0x00002aaaac9ccc76 <+132>:	mov    0x28(%rbx),%rax
// get mStreams.mHdr.element[1] (not literally, but)
   0x00002aaaac9ccc7a <+136>:	mov    0x8(%rax,%rdx,8),%rdi 
=> 0x00002aaaac9ccc7f <+141>:	mov    (%rdi),%rax
   0x00002aaaac9ccc82 <+144>:	mov    %r15,%r8
   0x00002aaaac9ccc85 <+147>:	mov    %ebp,%ecx
   0x00002aaaac9ccc87 <+149>:	mov    %r14,%rdx
   0x00002aaaac9ccc8a <+152>:	mov    %r13,%rsi
   0x00002aaaac9ccc8d <+155>:	callq  *0x30(%rax)

(gdb) info registers rax rbx rdx rdi
rax            0x2aaaca642500	46913028367616
rbx            0x2aaaca7bb740	46913029912384
rdx            0x1	1
rdi            0x100000001	4294967297

// just confirming offsets here
(gdb) p this
$27 = (nsMultiplexInputStream * const) 0x2aaaca7bb740
(gdb) p &mStreams
$26 = (nsTArray<nsCOMPtr<nsIInputStream> > *) 0x2aaaca7bb768

(gdb) x 0x100000001
0x100000001:	Cannot access memory at address 0x100000001

I'm guessing that subscripting the nsTArray raced a main-thread call to appendStream and this resulted in a corrupt/crazy value.  Although 0x100000001 is a pretty bad-ass pointer.

I think this is probably concerning for B2G in both non-OOP and OOP modes, since, given the remoting currently in use, all the same risky business would happen in the parent process.  I'm not marking this as a security bug, though, since TCPSocket is a privileged-only API, within the circle of trust, if you will.

Cc'ing other affected email people; leaving this up to the Core :: Networking folks for assessment, etc.
Looks like jdm and donovan are the likely people to understand this?
Flags: needinfo?(josh)
I have the same problem as you.
I encountered the problem in XULRunner 31.
At this point I have compiled XULRunner 32, and TCPSocket doesn't seem to work.
For now, I have to disable the optimization flags to make things work.
I'm working on MS Windows.
Yeah, this class says it's threadsafe and does absolutely nothing to back that up. Whee.
Flags: needinfo?(josh)
I suspect the original intention of nsMultiplexInputStream was that you would set it up just so, then you would process it, and then you would drop your references to it whenever you stop caring about it. We're using it as an ongoing queue of streams, however, so that's a bit of a problem. My original thought was just to throw a lock into the mix and move on, but there's a lot of shared, mutating state that looks hairy to lock efficiently. I'm going to ponder a mechanism more suitable to our requirements and whether that would be less work and/or more correct.
Assignee: nobody → josh
If the core of TCPSocket weren't implemented in JS, we could always do the append on the necko network thread via runnable dispatch.  (Or the multiplex stream could do that itself once it has been activated, whenever the call isn't already coming from the network thread.)

Another possible partial mitigation for the worst of this would be to call SetCapacity on the multiplex stream with a large enough number up front that it should never actually need to dynamically resize the array.  There would likely still be races on the concurrent array access, but at least they wouldn't involve new or freed memory allocations in the nsTArray itself.
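To sketch the runnable-dispatch idea from the first paragraph above (illustrative only; the helper is hypothetical, not from any patch here, and uses present-day Gecko thread utilities), something along these lines would marshal the append onto the socket transport service thread so it is serialized with the reads happening there:

#include "nsCOMPtr.h"
#include "nsIEventTarget.h"
#include "nsIMultiplexInputStream.h"
#include "nsServiceManagerUtils.h"
#include "nsThreadUtils.h"

// Hypothetical helper: perform the append on the socket thread rather than
// mutating the stream array from the main thread.
static nsresult
AppendStreamOnSocketThread(nsIMultiplexInputStream* aMultiplex,
                           nsIInputStream* aData)
{
  nsresult rv;
  nsCOMPtr<nsIEventTarget> sts =
    do_GetService("@mozilla.org/network/socket-transport-service;1", &rv);
  if (NS_FAILED(rv)) {
    return rv;
  }
  nsCOMPtr<nsIMultiplexInputStream> multiplex = aMultiplex;
  nsCOMPtr<nsIInputStream> data = aData;
  return sts->Dispatch(
    NS_NewRunnableFunction("AppendStreamOnSocketThread",
                           [multiplex, data]() {
                             multiplex->AppendStream(data);
                           }),
    NS_DISPATCH_NORMAL);
}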
Blocks: 1086906
It just occurred to me that because the actual data sending happens in the parent process in e10s mode, a segfault will take the root b2g process down, and that could constitute a DoS security bug.  (Originally I wasn't thinking things through: I believed we were safe because the tcp-socket API is privileged and the subprocesses were the ones at risk.)  So marking to get triaged.

The specific attack would be something like this.

Prereqs:

- The user is using a legitimate app that uses TCPSocket, for example the built-in email app or the loqui IM app for Firefox OS (https://github.com/loqui/im).  These apps tend to have some combination of auto-connection (ex: periodic email sync) and auto-reconnection (ex: email has exponential backoff with one immediate retry on connection reuse, primarily to deal with stale network connections and to compensate for the network stack's crappy TCP keep-alive settings as they relate to TCPSocket).

- The attacker can passively observe the TCP connections and has some ability to reset a connection (via TCP reset or by knocking the device off wi-fi), causing the device to reconnect.


Attack:

- The attacker keeps resetting the TCP connections to try to hit the nsMultiplexInputStream vulnerability window and crash the root b2g process.

- If mozAlarms is used to launch the vulnerable app, this could put the phone back in the same situation again soon, depending on the alarm interval; mozAlarms fires all alarms whose scheduled time has already passed when it starts up.  The email app supports a sync interval as short as every 5 minutes; other apps, like IM apps, may be more aggressive.
Group: core-security
This problem is really annoying for my XULRunner application (on Microsoft Windows).
At this point, I can't use the optimized version.
Cc["@mozilla.org/tcp-socket;1"].createInstance(Ci.nsIDOMTCPSocket)
is no longer reliable for my work.
I agree we need to resolve this.  As a temporary workaround, you might be able to issue multiple .send() calls in rapid succession when you first open the socket (possibly even zero-length?), forcing the array to perform all of its reallocations up-front.  Of course, this may also just immediately crash the application instead.  In a non-e10s situation this may turn out okay, but the latency of the messages in the e10s case may increase the probability of collision/explosion too much.
I'm also hitting this very reliably while working on an add-on that relies on ADB Helper to pull lots of files from a device.
Any hope of investigating this?  It's badly blocking a new tool I'm building that makes intensive use of adb pull, and I have a 100% reproducible crash in my case (though for now it needs two add-ons and specific devices).
Flags: needinfo?(josh)
Flags: needinfo?(bugmail)
My best idea so far is to create a new C++ class like this:

class nsThreadsafeMultiplexInputStream : public nsIInputStream
                                       , public nsIMultiplexInputStream
                                       , public nsISeekableStream
                                       , public nsIIPCSerializableInputStream
{
  Mutex mLock;
  nsCOMPtr<nsIMultiplexInputStream> mStream;
public:
  NS_DECL_THREADSAFE_ISUPPORTS
  NS_DECL_NSIINPUTSTREAM
  NS_DECL_NSIMULTIPLEXINPUTSTREAM
  NS_DECL_NSISEEKABLESTREAM
  NS_DECL_NSIIPCSERIALIZABLEINPUTSTREAM
};

Every method would first lock mLock before calling through to mStream.  This wouldn't guarantee perfect threadsafety (due to the single use of the `this` pointer in nsMultiplexInputStream::ReadSegments), but it should be enough to avoid the crash exhibited here.
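Roughly, each forwarding method would look like the sketch below (illustrative only, not the actual patch; ReadSegments goes through a QI because nsIMultiplexInputStream doesn't inherit nsIInputStream):

// Sketch: take the lock, then forward to the wrapped stream.
NS_IMETHODIMP
nsThreadsafeMultiplexInputStream::AppendStream(nsIInputStream* aStream)
{
  mozilla::MutexAutoLock lock(mLock);
  return mStream->AppendStream(aStream);
}

NS_IMETHODIMP
nsThreadsafeMultiplexInputStream::RemoveStream(uint32_t aIndex)
{
  mozilla::MutexAutoLock lock(mLock);
  return mStream->RemoveStream(aIndex);
}

NS_IMETHODIMP
nsThreadsafeMultiplexInputStream::ReadSegments(nsWriteSegmentFun aWriter,
                                               void* aClosure, uint32_t aCount,
                                               uint32_t* aResult)
{
  mozilla::MutexAutoLock lock(mLock);
  nsCOMPtr<nsIInputStream> inner = do_QueryInterface(mStream);
  return inner ? inner->ReadSegments(aWriter, aClosure, aCount, aResult)
               : NS_ERROR_UNEXPECTED;
}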
Flags: needinfo?(josh)
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

Does this help?
Attachment #8604631 - Flags: feedback?(lissyx+mozillians)
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

Review of attachment 8604631 [details] [diff] [review]:
-----------------------------------------------------------------

And trust me, if I could put more '+' on this 'f', I would.  This has fixed the issue for me so far: while the code was crashing nearly instantaneously when doing my set of adb pulls (850+ files to get from the device), I have now been able to complete more than 5 consecutive full runs of pulling without hitting a single crash.  Thank you so much for the fast workaround!
Attachment #8604631 - Flags: feedback?(lissyx+mozillians) → feedback+
Yay!
Flags: needinfo?(bugmail)
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

Nathan, what do you think about this solution?  It looks like TCPSocket is using nsMultiplexInputStream in a different way than was originally intended (a dedicated, ongoing queue of streams vs. a fire-and-forget collection of multiple streams), hence the threading issues that we ran into.  This is the most naive locking behaviour possible, because the ReadSegments code is pretty hairy to pull apart.
Attachment #8604631 - Flags: review?(nfroyd)
To reword comment 16, the theory is that nsMultiplexInputStream was meant to be used like:

1. make nsMultiplexInputStream
2. appendStream a bunch of things
3. drain the stream entirely

But this different usage is constantly calling appendStream and removeStream as new streams show up or are emptied?  Is that correct?
Flags: needinfo?(josh)
Oh, also, do we have any tests for tcpsocket in the tree that might show issues under something like TSan?
(In reply to Nathan Froyd [:froydnj] [:nfroyd] from comment #18)
> Oh, also, do we have any tests for tcpsocket in the tree that might show
> issues under something like TSan?

Just calling send a bunch of times in a for loop should trigger the problem blatantly enough to crash stuff.  Smart tooling that understands locks might already detect stuff with the mochitest:

https://dxr.mozilla.org/mozilla-central/source/dom/network/tests/test_tcpsocket_client_and_server_basics.html
Yes.
Flags: needinfo?(josh)
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

Review of attachment 8604631 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, this is probably as good a solution as any.

::: xpcom/io/nsMultiplexInputStream.cpp
@@ +58,5 @@
>    static NS_METHOD ReadSegCb(nsIInputStream* aIn, void* aClosure,
>                               const char* aFromRawSegment, uint32_t aToOffset,
>                               uint32_t aCount, uint32_t* aWriteCount);
>  
> +  Mutex mLock;

Can you add a comment here that specifies the mutex protects all of the members below?
Attachment #8604631 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/48a159459034
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Blocks: 1168562, 1165830
Seems like we want this on Aurora/Beta/b2g37 at a minimum?  Not sure about b2g34.
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Original TCPSocket implementation.
User impact if declined: Intermittent crashes in certain b2g apps such as the email client.
Testing completed: Manual testing; easily-reproduced crashes no longer occur.
Risk to taking this patch (and alternatives if risky): Should be none. This merely serializes some code that could run simultaneously on multiple threads before.
String or UUID changes made by this patch: None
Flags: needinfo?(josh)
Attachment #8604631 - Flags: approval-mozilla-beta?
Attachment #8604631 - Flags: approval-mozilla-b2g37?
Attachment #8604631 - Flags: approval-mozilla-aurora?
(In reply to Josh Matthews [:jdm] from comment #27)
> User impact if declined: Intermittent crashes in certain b2g apps such as
> the email client.

For clarity, the crash will occur in the root/parent b2g process, since that's where the actual crashy network code is running.  This is notable because it's much worse than just the app process crashing.
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

This fix has been on mozilla-central for over a week and has not had a negative impact that we know of. Approving for uplift to Beta and Aurora.
Attachment #8604631 - Flags: approval-mozilla-beta?
Attachment #8604631 - Flags: approval-mozilla-beta+
Attachment #8604631 - Flags: approval-mozilla-aurora?
Attachment #8604631 - Flags: approval-mozilla-aurora+
Andrew, can you please verify that this is fixed? Thanks!
Flags: needinfo?(bugmail)
It would definitely be good to verify that this is fixed once it lands in 39 and 40.
Flags: qe-verify+
(In reply to Ritu Kothari (:ritu) from comment #30)
> Andrew, can you please verify that this is fixed? Thanks!

This fixed the crashes we were encountering, yes.  I am marking as verified for trunk.
Status: RESOLVED → VERIFIED
Flags: needinfo?(bugmail)
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

Approving for b2g37.  Please NI me if this causes any side effects.
Attachment #8604631 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/65aab9ab5a05

Josh, what do you think about this for v2.1?
Flags: needinfo?(jocheng)
This is hitting non-unified bustage on b2g37:
https://treeherder.mozilla.org/logviewer.html#?job_id=133699&repo=mozilla-b2g37_v2_2

Josh, easy fix?
Flags: needinfo?(josh)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #35)
> https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/65aab9ab5a05
> 
> Josh, what do you think about this for v2.1?

Hmm, let's wait until Josh fixes the b2g37 issues.  Thanks.
Flags: needinfo?(jocheng)
Hi Josh,
Please also request approval for b2g34 for better stability.  Thanks!
Follow-up fix from jdm for the non-unified bustage:
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/57fa202d3421
Comment on attachment 8604631 [details] [diff] [review]
Add a threadsafe wrapper for persistent nsMultiplexStream queues

See earlier approval request.
Flags: needinfo?(josh)
Attachment #8604631 - Flags: approval-mozilla-b2g34?
Attachment #8604631 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
Whiteboard: [adv-main39+]
Whiteboard: [adv-main39+] → [adv-main39+][b2g-adv-main2.2+]
Group: core-security → core-security-release
Group: core-security-release
No longer blocks: 1347743