Closed Bug 1624405 Opened 4 years ago Closed 4 years ago

Crash in [@ nr_ice_component_check_if_failed]

Categories

(Core :: WebRTC: Networking, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox-esr68 --- fixed
firefox-esr78 --- fixed
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- fixed
firefox79 --- fixed

People

(Reporter: bwc, Assigned: bwc)

References

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: fix in bug 1634145[sec-survey])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-803e5bf2-55ff-40ea-85b2-aee430200319.

Top 10 frames of crashing thread:

0 xul.dll nr_ice_component_check_if_failed media/mtransport/third_party/nICEr/src/ice/ice_component.c:1602
1 xul.dll nr_ice_candidate_pair_stun_cb media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c
2 xul.dll nr_stun_client_process_response media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:822
3 xul.dll nr_ice_socket_readable_cb media/mtransport/third_party/nICEr/src/ice/ice_socket.c:114
4 xul.dll mozilla::NrUdpSocketIpc::recv_callback_s media/mtransport/nr_socket_prsock.cpp:1593
5 xul.dll mozilla::runnable_args_memfn<RefPtr<mozilla::NrUdpSocketIpc>, void  media/mtransport/runnable_utils.h:122
6 xul.dll mozilla::detail::runnable_args_base<mozilla::detail::NoResult>::Run media/mtransport/runnable_utils.h:40
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1220
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:481
9 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:1135

Probably our most frequent crash right now.

Looks like a nullptr crash. The stack is incomplete, unfortunately.

nr_stun_client_process_response seems to be here:

https://searchfox.org/mozilla-central/rev/202a285024f174c2d2bf2152d9cba90a03723eab/media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c#822

nr_stun_client_fire_finished_cb is defined here:

https://searchfox.org/mozilla-central/rev/202a285024f174c2d2bf2152d9cba90a03723eab/media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c#115

That is how we're ending up in nr_ice_candidate_pair_stun_cb.

nr_ice_component_check_if_failed is called from two places, but one of those is called from a timer, which does not match this stack. The other is here:

https://searchfox.org/mozilla-central/rev/202a285024f174c2d2bf2152d9cba90a03723eab/media/mtransport/third_party/nICEr/src/ice/ice_component.c#1597

This is called from one place:

https://searchfox.org/mozilla-central/rev/202a285024f174c2d2bf2152d9cba90a03723eab/media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c#585

This, in turn, is called from several places in nr_ice_candidate_pair_stun_cb with parameters that are consistent with the stack. So, the stack should basically look like this:

0 xul.dll nr_ice_component_check_if_failed media/mtransport/third_party/nICEr/src/ice/ice_component.c:1602
0.1 xul.dll nr_ice_component_failed_pair media/mtransport/third_party/nICEr/src/ice/ice_component.c:1597
0.2 xul.dll nr_ice_candidate_pair_set_state media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c:585
1 xul.dll nr_ice_candidate_pair_stun_cb media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c (possible line numbers: 221, 229, 236, 249, 304)
1.1 xul.dll nr_stun_client_fire_finished_cb media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:123
2 xul.dll nr_stun_client_process_response media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:822
3 xul.dll nr_ice_socket_readable_cb media/mtransport/third_party/nICEr/src/ice/ice_socket.c:114
4 xul.dll mozilla::NrUdpSocketIpc::recv_callback_s media/mtransport/nr_socket_prsock.cpp:1593
5 xul.dll mozilla::runnable_args_memfn<RefPtr<mozilla::NrUdpSocketIpc>, void media/mtransport/runnable_utils.h:122
6 xul.dll mozilla::detail::runnable_args_base<mozilla::detail::NoResult>::Run media/mtransport/runnable_utils.h:40
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1220
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:481
9 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:1135

Group: core-security

I see some reports on crash-stats with poisoning. So, this looks like it might be a UAF bug.

I wonder if this is some flaw in nICEr where we aren't canceling callbacks for sockets when we tear down. Maybe we should cancel these inside NrSocket::close()?

Group: core-security → media-core-security

Byron, feel free to re-assing, but since this is sec-high we need an owner for now. If we don't know how to proceed lets mark as stalled please.

Assignee: nobody → docfaraday
Priority: P2 → P1

Bug 1624458 might help with this some, but I see a 76 beta crash still, so this is likely not fixed.

Nope, bug 1624458 does not appear to have helped. Need to dig into this more.

Ok, so it seems that here, pair->remote->component is a dangling (or otherwise invalid) pointer:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c#585

At the start, I will note that the only code that destroys components is here:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_media_stream.c#108

So, nr_ice_media_stream.components is the sole owner of these, meaning that if the nr_ice_candidate is holding onto a destroyed component pointer, the stream must also have been destroyed.

As for the pair itself, from the stack trace we can see that it is coming from here:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c#185

That callback is used in a couple of places, but the stack trace tells us this is the one we're interested in:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c#123

This means this pair is coming from nr_stun_client_ctx.cb_arg. This implies we have an nr_stun_client_ctx outliving the component (and therefore the stream). We also have an nr_ice_socket outliving the component/stream, which is where our reference to the nr_stun_client_ctx lives, but it does not own the stun ctx.

Leaving these notes here, will get back to them later today.

nr_ice_sockets are owned solely by nr_ice_component:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#169
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#425
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#1670

However, I see some places where we could create an nr_ice_socket, but never hand ownership of it to an nr_ice_component.

It appears that the nr_ice_socket created here will be leaked if there is any subsequent failure in the function (and there are multiple places that could occur):
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#235

There is a similar flaw here:
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#600

This code looks ok though:
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#400

These are definitely flaws that need fixing, and might be the source of this bug, because nr_ice_socket_readable_cb (which is in our crash stack) is registered immediately in nr_ice_socket_create:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_socket.c#221-236

The question is whether the rest of what we observe is consistent with this happening.

Calling nr_ice_socket_create, and then throwing away the pointer, does in fact register the callback we see in the crash stack, so that's consistent. And, we will certainly see callbacks if data arrives on the socket. The nr_ice_socket has references to the nr_socket, the nr_ice_ctx, and the nr_ice_component. It does not have any nr_ice_stun_ctx, however.

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_socket.c#219

Our crash stack depends on there being at least one entry in |stun_ctxs| that is an nr_stun_client_ctx, which is registered here:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_socket.c#312

CASE 1: nr_ice_component_initialize_udp fails at nr_ice_component_create_stun_server_ctx:

In nr_ice_component_initialize_udp, we have this:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#339

Which calls this:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#190

Which registers a stun server on the nr_ice_socket here:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_socket.c#332

If we then see a failure here (a big if; only looks possible due to OOM, and the crash reports don't look like OOM is likely):

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#195

That will result in the failure here, and the leak of the nr_ice_socket which has stuff in |stun_ctxs|, but it won't be an nr_stun_client_ctx:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#340

This is not the scenario we're seeing, and the error that leads to it probably isn't happening in practice anyway, but we should check crash-stats and see.

CASE 2: nr_ice_component_initialize_udp fails in nr_ice_candidate_create

There are four places this can be called. If the first one fails, I do not think we would see the crash we are seeing. However, if one does succeed and a later one fails, the candidate that was successfully created will be added to the nr_ice_component, which will then allow that candidate to be paired, and we're off to the races! Are we likely to see an error in nr_ice_candidate_create, however?

OOM (not likely): https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#140

Looks possible if there's some failure in NrSocket::getaddr or one of its subclasses: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#154

Doesn't look possible, requires nr_ice_stun_server.type to be garbage: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#163,168

Doesn't look possible, requires calling nr_ice_candidate_create with a garbage candidate type: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#177

OOM (not likely): https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#191,194

Very complicated, but looks like it could happen: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#196

Looks plausible.

CASE 3: nr_ice_component_initialize_udp fails in nr_socket_turn_create

Only looks possible if OOM.

Of these possibilities, an error in nr_socket_getaddr looks the most likely:

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/nr_socket_prsock.cpp#1407

I'm still not convinced this whole scenario is our culprit, however. If this were a case of a simple leak of nr_ice_socket, while the rest of the ICE stack has been torn down, why do we never see failures here on crash-stats?

https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_socket.c#93

Is it possible this is only happening in cases where a single stream has been torn down (but not the whole ICE ctx), due to ICE restart or something?

I think maybe the thing to do is fix the bug, and see if it stops the crashes.

Try pushes here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd060774810e91ec4fa1de726283847dba222734

https://treeherder.mozilla.org/#/jobs?repo=try&revision=933e87852c18444668ea827c4c5cc7363beb05b5

Since this looks like a leak-fix, I will open a cover bug once it looks like this is working, and put the patches there.

Have filed bug 1634145.

See Also: → 1634145
See Also: 1634145

Ok, while we may be hiding sec bug dependency links from users that aren't cleared for sec-bug access, we do not hide see also. That's something we need to fix.

Can someone go in and scrub that see also history from bug 1634145?

^

Flags: needinfo?(dveditz)

You'll have to file a BMO::Administration bug to get someone with low-level DB access to change bug history. I don't think it's terrible in this case to leave it. setting and immediately clearing a see-also can look like you made a typo in the bug you linked.

comment 16 is bug 1628759 -- definitely would be nice to fix.

Flags: needinfo?(dveditz)

Bug 1634145 is ready to land, do I need to request approval for a cover bug like this?

Flags: needinfo?(dveditz)

Technically yes, but it can be logistically confusing. Could make a dummy attachment here just for the request. I don't have concerns about the patch itself, but would still be good to get some of the approval request questions answered. In particular do we know why this started spiking in the affected releases, and importantly do we need this patch on ESR68? (that is -- is ESR68 actually safe or has it just gotten lucky in maybe some timing issue?)

At the very least please set the "status firefox" boxes for the affected and unaffected (if any) firefox releases.

Flags: needinfo?(dveditz) → needinfo?(docfaraday)
Whiteboard: fix in bug 1634145
Flags: needinfo?(docfaraday)

Comment on attachment 9145503 [details]
Bug 1624405: (Copy of patch on cover bug) Avoid leaking nr_ice_sockets due to various kinds of failure.

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: I imagine it would be somewhat difficult, since it would require noticing that the patch fixes UAF problems, and then finding a way to trigger the specific rare failures that can cause the dangling pointers.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: I think the patch should apply cleanly across the board. That file has not changed much.
  • How likely is this patch to cause regressions; how much testing does it need?: Very unlikely.
Attachment #9145503 - Flags: sec-approval?

(In reply to Daniel Veditz [:dveditz] from comment #21)

Technically yes, but it can be logistically confusing. Could make a dummy attachment here just for the request. I don't have concerns about the patch itself, but would still be good to get some of the approval request questions answered. In particular do we know why this started spiking in the affected releases, and importantly do we need this patch on ESR68? (that is -- is ESR68 actually safe or has it just gotten lucky in maybe some timing issue?)

At the very least please set the "status firefox" boxes for the affected and unaffected (if any) firefox releases.

This crash seems to be happening on ESR68, and the flaw fixed in bug 1634145 is present in ESR68 also. An ESR uplift probably makes sense.

I am not sure why we're seeing a spike on 74. There have been a lot of timing changes in the webrtc code recently, including bug 1591199, which landed on 74.

Comment on attachment 9145503 [details]
Bug 1624405: (Copy of patch on cover bug) Avoid leaking nr_ice_sockets due to various kinds of failure.

sec-approval+ for landing on nightly. If we're going to use the cover bug effectively you should request beta and esr uplift in that bug .

Attachment #9145503 - Flags: sec-approval? → sec-approval+

"depends on" hiding works so we can tie the bugs that way.

Depends on: 1634145

Fixing bug 1634145 does not seem to have stopped this crash. Need to look some more.

Another place we might be running into trouble is here:

https://searchfox.org/mozilla-central/rev/df4c90d4b8c92c99f76334acfe4813c573c12661/media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c#105-139

If there is a failure anywhere in that code block, we will not put the nr_ice_media_stream into pctx->peer_streams, so it will leak. However, the question is whether anything is left holding a reference to it.

From here: https://searchfox.org/mozilla-central/rev/df4c90d4b8c92c99f76334acfe4813c573c12661/media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c#118

It looks like nr_ice_peer_ctx_parse_stream_attributes_int will eventually end up here, so we have a candidate with a reference:

https://searchfox.org/mozilla-central/rev/df4c90d4b8c92c99f76334acfe4813c573c12661/media/mtransport/third_party/nICEr/src/ice/ice_parser.c#133

If this code then succeeds, this candidate is registered with the component:

https://searchfox.org/mozilla-central/rev/df4c90d4b8c92c99f76334acfe4813c573c12661/media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c#182-226

So it seems that leaking the nr_ice_media_stream could be dangerous here. Probably needs a fix. It only looks like this will happen on OOM, so maybe this isn't our culprit.

Depends on: 1639916

Comment on attachment 9151177 [details]
Bug 1624405: (Copy of patch on cover bug) Clean up pstream if we don't end up using it.

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Not very easily.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: Trivial, probably.
  • How likely is this patch to cause regressions; how much testing does it need?: Cleaning up this leak could expose other issues, but it looks like it should be fine. We should probably wait a bit before uplifting it.
Attachment #9151177 - Flags: sec-approval?

Comment on attachment 9151177 [details]
Bug 1624405: (Copy of patch on cover bug) Clean up pstream if we don't end up using it.

sec-approval+

Attachment #9151177 - Flags: sec-approval? → sec-approval+

Rate seems to be way down on 77; I'm guessing bug 1634145 took care of most of this. No failures on 78 so far, but given the low base rate on 77 we wouldn't necessarily expect to see failures yet.

Seeing a crash on 78 from this weekend. We still have not completely fixed this yet it seems.

Maybe bug 1644477 helps with this; let's see if the uplift helps 78.

78 is looking really good since bug 1644477 was uplifted. Base rate was really low, but I think we may have squashed this one. Let's see how 78 does on release.

Took multiple fixes on other bugs, but this seems fine now.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Depends on: 1644477
Target Milestone: --- → mozilla79
Group: media-core-security → core-security-release

As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.

Please visit this google form to reply.

Flags: needinfo?(docfaraday)
Whiteboard: fix in bug 1634145 → fix in bug 1634145[sec-survey]
Flags: needinfo?(docfaraday)

Ryan: you marked this bug "status-esr68: fixed" but it was fixed by bug 1634145 which is marked "wontfix" on esr-68. Which bug is correct? I don't see an ESR landing link in either bug, but comment 20 says fixing this on esr-68 was a good idea so maybe it was somewhere else?

Flags: needinfo?(ryanvm)

Bug 1644477 is what I was going off per comment 35.

Flags: needinfo?(ryanvm)
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: