Closed Bug 1315248 Opened 8 years ago Closed 7 years ago

Crash in mozilla::dom::UDPSocketParent::ConnectInternal

Categories

(Core :: DOM: Core & HTML, defect, P3)

52 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 --- unaffected
firefox51 --- wontfix
firefox52 + fixed
firefox-esr52 52+ fixed
firefox53 + fixed
firefox54 + fixed

People

(Reporter: calixte, Assigned: drno)

References

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [post-critsmash-triage][adv-main52+])

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-782e6b03-4a79-4938-8f94-9417d2161031.
=============================================================

This crash (7 crashes) appeared on nightly the 2016-10-31 just after the patch [1] to fix bug 1198381 which modified the file xpcom/threads/nsThread.cpp which is in the backtrace.

[1] https://hg.mozilla.org/mozilla-central/rev?node=17b19eb241db07356004d211c4d23a7646fd8b09
Flags: needinfo?(afarre)
It really doesn't feel like it's related to bug 1198381. The changes to nsThread all happen before a runnable is executed, while this call stack is in the middle of the execution of a runnable. I'll look a bit closer on Monday though and ask around for ideas.
Flags: needinfo?(afarre)
Priority: -- → P3
this signature is increasing in frequency after firefox 51 was released - it's also quite common on os x.
OS: Windows 10 → All
Hardware: Unspecified → All
See Also: → 1194259
To add some background: WebRTC attempts a fake connect of a UDP socket to find out the IP address of the default interface on a machine.

I failed to find any code change in WebRTC which could explain this. Now up to version 52 that feature only gets triggered if user has set a userpref to not disclose all of his addresses. From version 52 onward that same feature is also being used as a fallback method if other means of gathering the local IP addresses have failed.
So the most logic explanation for the spike is that more users appear to set this pref (maybe via an add-on)?!

From looking at the code it looks like |mSocket| is null. Which either means the socket got closed, before we call connect() on it, or the socket wasn't even successfully created before we call connect() on it. Neither of which really makes sense to me.
I spotted bug 1336586 while looking at our code. Potentially that could cause this, but I'm not 100% convinced.
I apparently hit this crash just now on Firefox Nightly on Windows 10:
https://crash-stats.mozilla.com/report/index/08ac33be-79af-46ae-9a26-c93502170209

I had reopened Firefox with a full set of tabs from my previous session, and I was clicking through and closing a bunch of tabs in series. Some of the tabs were closed as they were starting to load, so it's possible there's a race with something.
Does anything guarantee that DoConnect() is processed before RecvClose (in case child explicitly closes the connect very early) or ActorDestroy (in case child process crashes or something)?
DoConnect() is called asynchronously. 

The relevant code might be coming from bug 1194259
Flags: needinfo?(rjesup)
Oh, relevant! I have dom.ipc.processCount set to 16.
experienced this issue on Linux/amd64: f64d4244-a80f-4fa1-af4b-aab1c2170209
drno: can you answer Olli's question?

I suspect the answer is no
Flags: needinfo?(rjesup) → needinfo?(drno)
Comment on attachment 8836279 [details]
Bug 1315248: ensure socket is available for connect and send.

https://reviewboard.mozilla.org/r/111754/#review113056

::: dom/network/UDPSocketParent.cpp:342
(Diff revision 1)
>  UDPSocketParent::RecvOutgoingData(const UDPData& aData,
>                                    const UDPSocketAddr& aAddr)
>  {
> -  MOZ_ASSERT(mSocket);
> +  if (!mSocket) {
> +    MOZ_ASSERT(false, "socket is closed");
> +    return IPC_FAIL(this, "Attempted to send data on closed socket");

curious, is this really a reason to kill the child process?
WebRTC's networking API's pretty much guarantee that close() can not be called before we have received the result for a connect() call, as nICEr (our ICE stack) assumes to operate on a sync socket API. Further more I assume that IPC calls in the parent get executed in the order which they were received. That's why I failed to see how we actually end up hitting this crash.

But ActorDestroy() is probably the answer how this can happen.
Flags: needinfo?(drno)
Comment on attachment 8836279 [details]
Bug 1315248: ensure socket is available for connect and send.

https://reviewboard.mozilla.org/r/111754/#review113056

> curious, is this really a reason to kill the child process?

No. You are right. I forgot what IPC_FAIL() means.
Comment on attachment 8836279 [details]
Bug 1315248: ensure socket is available for connect and send.

https://reviewboard.mozilla.org/r/111754/#review113180

`RecvConnect`, `RecvJoinMulticast` and `RecvLeaveMulticast` should also do a nullptr check on `mSocket` as well.

::: dom/network/UDPSocketParent.cpp:341
(Diff revision 2)
>  mozilla::ipc::IPCResult
>  UDPSocketParent::RecvOutgoingData(const UDPData& aData,
>                                    const UDPSocketAddr& aAddr)
>  {
> -  MOZ_ASSERT(mSocket);
> +  if (!mSocket) {
> +    MOZ_ASSERT(false, "socket is closed");

We should use NS_WARN_IF instead of MOZ_ASSERT, since there might be close or IPC actor error happened simultaneously. Aserting on chrome process seems overkill to me.
Attachment #8836279 - Flags: review?(schien) → review+
Check back on try result
Flags: needinfo?(drno)
Flags: needinfo?(drno)
Pushed by drno@ohlmeier.org:
https://hg.mozilla.org/integration/autoland/rev/6a3ba8bfb34c
ensure socket is available for connect and send. r=schien
https://hg.mozilla.org/mozilla-central/rev/6a3ba8bfb34c
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Please request Aurora/Beta approval on this when you get a chance.
Assignee: nobody → drno
Flags: needinfo?(drno)
Comment on attachment 8836279 [details]
Bug 1315248: ensure socket is available for connect and send.

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1194259 plus probably some timing related changes in later releases.
[User impact if declined]: Crashes on pages using webrtc if the page gets closed (?)
[Is this code covered by automated tests?]: No, as we don't know exactly how to reproduce it and it is most likely highly timing sensitive.
[Has the fix been verified in Nightly?]: No, as we don't know how to reproduce it.
[Needs manual test from QE? If yes, steps to reproduce]: Not clear how to reproduce.
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: No
[Why is the change risky/not risky?]: The patch adds a safeguard against a null pointer reference.
[String changes made/needed]: N/A
Flags: needinfo?(drno)
Attachment #8836279 - Flags: approval-mozilla-beta?
Attachment #8836279 - Flags: approval-mozilla-aurora?
Comment on attachment 8836279 [details]
Bug 1315248: ensure socket is available for connect and send.

avoid null deref in udp code, beta52+, aurora53+
Attachment #8836279 - Flags: approval-mozilla-beta?
Attachment #8836279 - Flags: approval-mozilla-beta+
Attachment #8836279 - Flags: approval-mozilla-aurora?
Attachment #8836279 - Flags: approval-mozilla-aurora+
had to push https://hg.mozilla.org/releases/mozilla-beta/rev/d0d5f1187d755b0d1fc65baaecd0ca2e5d16d842 as follow up to fix a bustage (thanks to ryanvm to point me to the right direction)
(In reply to Carsten Book [:Tomcat] from comment #26)
> had to push
> https://hg.mozilla.org/releases/mozilla-beta/rev/
> d0d5f1187d755b0d1fc65baaecd0ca2e5d16d842 as follow up to fix a bustage
> (thanks to ryanvm to point me to the right direction)

Ups. Sorry for that. And thanks for fixing it!
hi, there are still crashes on 52.0b7: https://crash-stats.mozilla.com/report/index/ea0d347b-0be1-4cf9-bc43-57eba2170217
do we need to reopen?
Flags: needinfo?(drno)
After verifying that my patch actually has made it into beta 7 I have to admit that I'm puzzled on what is going on here.

Actually the list of crash reports shows a few crash addresses like this 0xffffffffe5e5e5e5, which if I'm not mistaken indicates user after free.

While looking around in the code I noticed that UDPSocketParent::RecvConnect() dispatches to the STS thread. But none of the other functions in UDPSocketParent do the same dispatch. So my best guess right now is that we are seeing some threading issue here.
In fact nsUDPSocket::InitWithAddress() and nsUDPSocket::Connect() show the same pattern: InitWithAddress() appears to be running on any thread, but Connect() checks for STS.

schien: do you have any thoughts on this?
Status: RESOLVED → REOPENED
Flags: needinfo?(drno) → needinfo?(schien)
Resolution: FIXED → ---
marking this as security-sensitive as precaution as well...
Group: core-security
Group: core-security → dom-core-security
Thanks philipp - was was about to do the same :-)

This smacks of something involving threading or messages; perhaps passing a 'this' pointer to a WrapRunnable without ensuring a lifetime long enough that the runnable is certain to have been destroyed.  It also could be thread coherence - accessing 'this' members (and changing them) from the STS thread and the PBackground thread.

Something like RecvConnect() or sendConnectResponse() -- if 'this' has gone away before the runnable runs, boom.  You can do WrapRunnable(RefPtr<USPSocketParent>(this), ...
Agree with @jesup, this looks fishy when I skim through UDPSocketParent.cpp. We should try hold additional refcount for that WrapRunnable.
Flags: needinfo?(schien)
Flags: needinfo?(drno)
(In reply to Nils Ohlmeier [:drno] from comment #30)
> Actually the list of crash reports shows a few crash addresses like this
> 0xffffffffe5e5e5e5, which if I'm not mistaken indicates user after free.

FYI crash addresses in Breakpad are always treated as 64-bit even on dumps from 32-bit CPUs, so this is just 0xe5e5e5e5 sign-extended to 64-bit, which is almost certainly some sort of poison pattern.
This is a sec-high in 52, which is also an ESR release.  Any chance of a fix while we're still in beta?
Flags: needinfo?(drno)
Flags: needinfo?(drno)
Attachment #8840079 - Flags: review?(rjesup)
Attachment #8840079 - Flags: review?(rjesup) → review+
Comment on attachment 8840079 [details] [diff] [review]
bug1315248_part2.patch

[Security approval request comment]
How easily could an exploit be constructed based on the patch? It's probably pretty hard as it appears to be some race condition between getting a PeerConnection to gather IP addresses and shutting it down at the right moment. What makes it easier is that you can try that in a loop without user interaction until your code hits it.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No.

Which older supported branches are affected by this flaw? All releases up to Firefox 46.

If not all supported branches, which bug introduced the flaw? Bug 1194259 added support for UDP connect.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? The patch applies to aurora. I have a patch for beta.

How likely is this patch to cause regressions; how much testing does it need? I think the likelihood of regressions is pretty low, since it only holds a handle to an object to ensure the object still exists when the runable gets executed. As we don't know how to reproduce the problem we can really test the problem itself.
Attachment #8840079 - Flags: sec-approval?
Comment on attachment 8840079 [details] [diff] [review]
bug1315248_part2.patch

sec-approval+ for trunk. We'll want this on branches ASAP as well. Can you please nominate branch patches?
Attachment #8840079 - Flags: sec-approval? → sec-approval+
Comment on attachment 8840079 [details] [diff] [review]
bug1315248_part2.patch

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1194259 in Fx 46
[User impact if declined]: Crashes the browser if WebRTC is used and tab gets closed.
[Is this code covered by automated tests?]: No, we don't know how to reproduce it.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No STR known.
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: Low risk
[Why is the change risky/not risky?]: I think it is not risky as it only adds a refpointer to keep the object alive until the runnables are executed.
[String changes made/needed]: N/A
Attachment #8840079 - Flags: approval-mozilla-aurora?
This is just a separate patch for beta (because attachment 8840079 [details] [diff] [review] does not apply cleanly).
Attachment #8840187 - Flags: review+
Comment on attachment 8840187 [details] [diff] [review]
bug1315248_part2_beta.patch

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1194259 in Fx 46
[User impact if declined]: Crashes the browser if WebRTC is used and tab gets closed.
[Is this code covered by automated tests?]: No, we don't know how to reproduce it.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No STR known.
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: Low risk
[Why is the change risky/not risky?]: I think it is not risky as it only adds a refpointer to keep the object alive until the runnables are executed.
[String changes made/needed]: N/A
Attachment #8840187 - Flags: approval-mozilla-beta?
Comment on attachment 8840079 [details] [diff] [review]
bug1315248_part2.patch

Fix a crash. Aurora53+.
Attachment #8840079 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Keywords: checkin-needed
Comment on attachment 8840187 [details] [diff] [review]
bug1315248_part2_beta.patch

uaf fix, should be in 52.0b9
Attachment #8840187 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/mozilla-central/rev/4fff02e2a6a2
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Check back on crash stats once this has been in the public for some time.
Flags: needinfo?(drno)
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main52+]
Since the landing of the first patched lowered the frequency of the crashes in beta to a really low volume I guess we need to wait for 52 to become release to judge if it is really fixed now with both patches.
Updating tracking flags for 52/53/54 and esr52.
Group: dom-core-security → core-security-release
Since 52 becoming release I only see crashes for 51.0.1 or for older beta's of 52. So it looks good.
I'll check back in two weeks to see if this really disappeared from crash stats for good this time.
No crashes in 52 since release. Now I'm convinced we fixed it properly.
Flags: needinfo?(drno)
Group: core-security-release
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.