Closed Bug 938857 Opened 6 years ago Closed 6 years ago

Crash Report [@ nr_socket_sendto ]

Categories

(Core :: WebRTC: Networking, defect, critical)

x86_64
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla28
Tracking Status
firefox26 --- unaffected
firefox27 + fixed
firefox28 + fixed
firefox-esr24 --- unaffected
b2g18 --- unaffected
b2g-v1.1hd --- unaffected
b2g-v1.2 --- ?

People

(Reporter: mreavy, Assigned: bwc)

Details

(Keywords: crash, csectype-uaf, sec-critical, Whiteboard: [qa-])

Attachments

(5 files, 3 obsolete files)

https://crash-stats.mozilla.com/report/index/57bc0978-d383-4c36-b97f-0eae82131114

I was setting up a WebRTC call with Byron using apprtc.  I started the call and when Byron clicked on the URL to join, my Firefox crashed.  I was running Firefox 27.0a2 (Aurora).

Address: read 0x62616c26 ("bal*" doesn't seem like a good address; almost certainly a sec bug). Aurora 27, Win, BuildID 20131111004004. In nr_socket_sendto() called from (indirectly) TransportLayerDTLS::Handshake() (stack in the crash report).

Since this is likely a sec bug in the DTLS or NSS networking code, I'm assigning it to Ekr for better analysis and triage.
I agree this doesn't look awesome.
Here appear to be some others:

https://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&reason_type=contains&date=2013-11-14&range_value=28&range_unit=days&hang_type=any&process_type=any&signature=nr_socket_sendto

I note that this is only in Aurora and Nightly, so that gives us some
time limit on when the problem was introduced.


nr_socket_sendto is trivial:
http://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/net/nr_socket.c?from=nr_socket_sendto#79

We really need to catch this in a debugger or add some instrumentation. Do you have
any idea how to repro.
Given the low number of reports I imagine it will be hard or impossible to repro on demand.  Barring extreme luck, we'll have to figure out some other way to resolve this.

More info may be available (line numbers, who submitted it) if you have a Socorro login and the right permissions.

It seems possible the problem starting in early October in 27, but it may just be that other changes caused timing differences or otherwise exposed a latent bug or one waiting for the right timings (and causable in theory in older builds).  Won't know until we have a cause.
Severity: normal → critical
Jesup, can you please see if you can get a stack trace from Maire's report. I understand on Windows we can see local variables? That would be helpful.
The line numbers are hidden in the URLS for each entry (hover them or click).  I'm working to try to get crash-stats access.  No idea if I'll be able to get local vars.

You should also consider if there are any deadfall traps/asserts you want to put in both to block security issues and to provide useful crash data pointing to the problem going forward.
I have the line numbers.

I understand from JST that one can get local variables.

Yes, I had already thought about asserts; that's the "instrumentation"
I listed above, but it's going to take a while to reproduce this, so
it would be nice to see if we could gather more from the traces we have
now.
I have crash dump access. In theory you can get local variables but they're often optimized out. Sometimes you can recover them with some stack digging. 

Are there any particular locals you're interested in?
The things I am most interested in seeing are (from nr_ice_media_stream_send) comp->active->local->type and comp->active->local->state. If we can get those, I'll know where the next place to look is. If it is easy to get the entirety of comp->active->local, or comp->active, that would be even better.
All I can see is that comp == 0x093af82c. I can't see any further inside that object; minidumps only contain a very small amount of the heap, unfortunately.

I disassembled nr_ice_media_stream_send to see if there were any remnants of that dereference chain sitting around in registers, but it seems they've all been reused.
Crap, was afraid of that. I can try and find something on the stack that might give us a clue, but I'm not terribly optimistic.
Are you able to look at the nr_socket_send frame and verify my suspicion that were crashing because |sock| is null?
The debugger claims processed_indication is 0, and I think we can trust that value because other nearby locals look reasonable (e.g. |len| is a small number, |string| is a real string, etc.)

The actual crash is because |sock| is 0x62616c26, which is in an inaccessible page. I confirmed that that's the real value of sock by dumping the stack as well as tracing through the disassembly.

Those hex values look vaguely like text. I also see another crash report from a different build and different machine with the same bad address. That can't be a coincidence. Maybe some corruption somewhere caused the code to read a pointer out of a string buffer? Then again there's also another crash report with a different bad address that doesn't look like text.
Ok, if that same string shows up twice, that's a great big hint. Will give it a look.
(or rather, |sock| is 0x62616c22, but trying to read its vtbl member at an offset of +4 leads to a bad read at 0x62616c26)
So, the only place I can find that has bal" as a string literal is here:

http://dxr.mozilla.org/mozilla-central/source/js/src/shell/js.cpp?from=js.cpp#947

Might be a total red herring.
Interesting.

Can we compare the values of the nr_ice_media_stream argument at:

nr_ice_media_stream_send (frame #2)

mozilla::TransportLayerIce::IcePacketReceived (frame #24)

They should be the same, since we should be sending data out on the same stream it
went in on.
01 047bbcc0 59f7cc5b xul!nr_ice_media_stream_send(
			struct nr_ice_peer_ctx_ * pctx = 0x055c794c, 
			struct nr_ice_media_stream_ * str = 0x055c7a6c, 
			int component = 0n1, 
			unsigned char * data = 0x1f22d000
			int len = 0n834)+0x4e 

The parameter at IcePacketReceived is a slightly different type (mozilla::NrIceMediaStream *) but I can show you msg_recvd:

1a 047bd3bc 5a0d2524 xul!mozilla::NrIceCtx::msg_recvd(
			void * obj = 0x0bffbdc0, 
			struct nr_ice_peer_ctx_ * pctx = 0x055c794c, 
			struct nr_ice_media_stream_ * stream = 0x055c7a6c, 
			int component_id = 0n1, 
			unsigned char * msg = 0x047bd730
			int len = 0n894)+0x28
(In reply to Byron Campen [:bwc] from comment #15)
> So, the only place I can find that has bal" as a string literal is here:
> 
> http://dxr.mozilla.org/mozilla-central/source/js/src/shell/js.cpp?from=js.
> cpp#947
> 
> Might be a total red herring.

I think that string only goes into the js.exe shell. I can't find it in mozjs.dll.
Actually, this cannot be the case. Disregard.
(In reply to David Major [:dmajor] from comment #17)
> 01 047bbcc0 59f7cc5b xul!nr_ice_media_stream_send(
> 			struct nr_ice_peer_ctx_ * pctx = 0x055c794c, 
> 			struct nr_ice_media_stream_ * str = 0x055c7a6c, 
> 			int component = 0n1, 
> 			unsigned char * data = 0x1f22d000
> 			int len = 0n834)+0x4e 
> 
> The parameter at IcePacketReceived is a slightly different type
> (mozilla::NrIceMediaStream *) but I can show you msg_recvd:
> 
> 1a 047bd3bc 5a0d2524 xul!mozilla::NrIceCtx::msg_recvd(
> 			void * obj = 0x0bffbdc0, 
> 			struct nr_ice_peer_ctx_ * pctx = 0x055c794c, 
> 			struct nr_ice_media_stream_ * stream = 0x055c7a6c, 
> 			int component_id = 0n1, 
> 			unsigned char * msg = 0x047bd730
> 			int len = 0n894)+0x28

So this looks like not total insanity.

But here's the thing:

When nr_socket_prsock() self-destructs, it is supposed to deregister
from the STS:

http://dxr.mozilla.org/mozilla-central/source/media/mtransport/nr_socket_prsock.cpp#1012
http://dxr.mozilla.org/mozilla-central/source/media/mtransport/nr_socket_prsock.cpp#567

Now maybe I'm confused about the impact of setting mCondition, and I need to
study it more, but:

http://dxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#778

So, I'm trying to figure out how this can be a dangling reference, modulo a thread error.
We could try making the ASSERT_ON_THREAD here crash in opt....
I'm pretty sure the socket isn't really the thing that's dangling here; the candidate is dangling. Its osock pointer is garbage, because the candidate has been stomped on. I've looked pretty hard, and haven't found a case where a local peer candidate could be left dangling, absent a thread error.
Byron: can we change all the nrtransport/ice/etc ASSERT_ON_THREAD or equivalents to opt-time asserts/MOZ_CRASHes, and then measure the impact on perf?  If the impact is low (<1%), we should look at taking this at least for now.  If >1%, let's look at how much and where the hit is coming.
A patch that I think addresses the crash; needs testing.
Assignee: ekr → docfaraday
Status: NEW → ASSIGNED
Attached file crash.txt
JS logging from my browser on the call that crashed Maire's browser. The patch fixes an extremely bizarre case where we could prune an already-paired relayed candidate if a second identical relayed candidate with equal or greater priority is initted afterwards. The JS logs show that we have the same turn server twice, and that Maire's browser trickled exactly the same relayed candidate twice.
Attachment #8336214 - Attachment is obsolete: true
The test case is about as close as we can get to the real-world flow without getting network IO involved. However, this requires quite a lot of access to internal API in nICEr, and therefore makes a pretty lousy test for keeping around long term.

We could write a test case that simply inserts stuff into the candidate and check lists directly, and sees what the pruning logic does, but such a test case doesn't do such a great job of showing the bug can actually happen, nor does it do a great job of showing that the fix works properly.
Attachment #8337167 - Flags: review?(ekr)
I should mention; I have run the test case against the mozilla codebase prior to trickle landing, and it is fine.
Comment on attachment 8337167 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

Review of attachment 8337167 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm but this should get a secondary review from Adam/

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +368,5 @@
>  
>               Since this algorithmis run whenever a new candidate
>               is initialized, there should at most one duplicate.
>             */
> +          if (c1->priority <= c2->priority || nr_ice_any_peer_paired(c2)) {

Please add parenthes around the first <= expr.
Attachment #8337167 - Flags: review?(adam)
I agree we shouldn't land the test, if only because it really explains
the problem and how to make it happen. I suggest we sit on the test
for a while and then once Nils is spun up and we have more test
infrastructure, maybe see if he can do a non-unit test.

Either that, or maybe I will think of something cleverer how to test
it.
Comment on attachment 8337167 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

Review of attachment 8337167 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm but this should get a secondary review from Adam/

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +368,5 @@
>  
>               Since this algorithmis run whenever a new candidate
>               is initialized, there should at most one duplicate.
>             */
> +          if (c1->priority <= c2->priority || nr_ice_any_peer_paired(c2)) {

Please add parenthes around the first <= expr.
Attachment #8337167 - Flags: review?(ekr) → review+
Comment on attachment 8337167 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

Review of attachment 8337167 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense to me.
Attachment #8337167 - Flags: review?(adam) → review+
Attachment #8337167 - Attachment is obsolete: true
Comment on attachment 8343407 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

Review of attachment 8343407 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r+ from ekr.
Attachment #8343407 - Flags: review+
Attachment #8343407 - Flags: review?(adam)
Comment on attachment 8343407 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

Review of attachment 8343407 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like I missed the r+ from abr somehow.
Attachment #8343407 - Flags: review?(adam)
byron: can we ask for s-a so we can get this in?
Flags: needinfo?(docfaraday)
Comment on attachment 8343407 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

   It is not super obvious that we're guarding against a sec bug here, but someone clever might be able to figure out how to trigger the bug. Once they realize how, it is very easy to trigger the bug (just stand up a fake TURN server that always returns the same allocation, write a web app that tells the browser about this server twice, and drop the first allocate request so the second candidate inits first). From here, there is a long-lifetime dangling pointer that an attacker could try to exploit (and, the traces suggest that an attacker could overwrite a function pointer that we end up calling).

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

   The commit message does for sure. Can fix.

Which older supported branches are affected by this flaw?

   Affected back to 26.

If not all supported branches, which bug introduced the flaw?

   Bug 842549   

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

   The backports should be reasonably easy; we may need to rebase to land on 28 now, since bug 906968 has landed.

How likely is this patch to cause regressions; how much testing does it need?

   Very unlikely.
Attachment #8343407 - Flags: sec-approval?
Flags: needinfo?(docfaraday)
Attached patch Patch for 28.Splinter Review
Unbitrot, and make commit message less sensitive.
I(In reply to Byron Campen [:bwc] from comment #38)

> Do comments in the patch, the check-in comment, or tests included in the
> patch paint a bulls-eye on the security problem?
> 
>    The commit message does for sure. Can fix.
> 
> Which older supported branches are affected by this flaw?
> 
>    Affected back to 26.
>

I believe this is incorrect: as indicated in c29, this is a trickle
issue and trickle landed 9/19, with m-c at the time being
FF 27 (https://wiki.mozilla.org/RapidRelease/Calendar). So, 
I believe this only applies to FF 27.

Double checking, look at beta:
http://hg.mozilla.org/releases/mozilla-beta/file/697b60cb4d70/media/mtransport/nricectx.cpp

And notice that there is no trickle_cb whereas there is one in
aurora.. (http://hg.mozilla.org/releases/mozilla-aurora/file/37f2d83153a5/media/mtransport/nricectx.cpp#l309o)
A correction; 26 is not affected.
Patch for 27.
Attached patch Patch for 27.Splinter Review
Patch for 27.
Attachment #8344137 - Attachment is obsolete: true
Comment on attachment 8343407 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

[Approval Request Comment]
See s-a request

String or IDL/UUID changes made by this patch: none
Attachment #8343407 - Flags: approval-mozilla-aurora?
Comment on attachment 8343407 [details] [diff] [review]
Fix bug where an already-paired relayed candidate could be destroyed by the pruning logic, leaving the pointer dangling.

sec-approval+ for trunk and I'm giving Aurora approval here as discussed.
Attachment #8343407 - Flags: sec-approval?
Attachment #8343407 - Flags: sec-approval+
Attachment #8343407 - Flags: approval-mozilla-aurora?
Attachment #8343407 - Flags: approval-mozilla-aurora+
landed on https://hg.mozilla.org/mozilla-central/rev/7850b778acfa
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Maire, can you please advise if/how QA should verify this is fixed? Do you have reproducible steps?
Flags: needinfo?(mreavy)
This would be tricky to reproduce at will. For starters you would need to deploy a TURN server that was subtly broken in a specific and reliable way (Google's TURN servers seem to be _intermittently_ broken in this way). Once you had that, you would then need some additional setup to force the use of TURN for media. After that, you would be able to write something like a mochitest.
Thanks Byron. I'm marking this [qa-] based on that. If there's value in this type of testing, long-term, I can add it to QA's WebRTC testing laundry list.
Flags: needinfo?(mreavy)
Whiteboard: [qa-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.