Crash in [@ nr_ice_component_check_if_failed]
Categories: Core :: WebRTC: Networking, defect, P1
People: Reporter: bwc | Assigned: bwc
Keywords: crash, csectype-uaf, sec-high | Whiteboard: fix in bug 1634145 [sec-survey]
Attachments (2 files):
- 47 bytes, text/x-phabricator-request | dveditz: sec-approval+ | Details | Review
- 47 bytes, text/x-phabricator-request | dveditz: sec-approval+ | Details | Review
This bug is for crash report bp-803e5bf2-55ff-40ea-85b2-aee430200319.
Top 10 frames of crashing thread:
0 xul.dll nr_ice_component_check_if_failed media/mtransport/third_party/nICEr/src/ice/ice_component.c:1602
1 xul.dll nr_ice_candidate_pair_stun_cb media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c
2 xul.dll nr_stun_client_process_response media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:822
3 xul.dll nr_ice_socket_readable_cb media/mtransport/third_party/nICEr/src/ice/ice_socket.c:114
4 xul.dll mozilla::NrUdpSocketIpc::recv_callback_s media/mtransport/nr_socket_prsock.cpp:1593
5 xul.dll mozilla::runnable_args_memfn<RefPtr<mozilla::NrUdpSocketIpc>, void media/mtransport/runnable_utils.h:122
6 xul.dll mozilla::detail::runnable_args_base<mozilla::detail::NoResult>::Run media/mtransport/runnable_utils.h:40
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1220
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:481
9 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:1135
Probably our most frequent crash right now.
Comment 1 • 4 years ago (Assignee)
Looks like a nullptr crash. The stack is incomplete, unfortunately.
nr_stun_client_process_response seems to be here:
nr_stun_client_fire_finished_cb is defined here:
That is how we're ending up in nr_ice_candidate_pair_stun_cb.
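The chain from nr_stun_client_process_response down into nr_ice_candidate_pair_stun_cb can be sketched as below. The types and function names are simplified stand-ins, not the real nICEr declarations; the point is only to show how the stun client ctx's cb_arg carries the candidate pair into the callback when a response is processed:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the nICEr types; only the callback plumbing
 * is modeled here. */
typedef void (*finished_cb_t)(void *cb_arg);

typedef struct {
    finished_cb_t finished_cb; /* ~ nr_ice_candidate_pair_stun_cb */
    void *cb_arg;              /* ~ the candidate pair */
    int done;
} stun_ctx_model;

static int pair_cb_fired;

/* stands in for nr_ice_candidate_pair_stun_cb */
static void pair_stun_cb(void *cb_arg) {
    (void)cb_arg;
    pair_cb_fired = 1;
}

/* stands in for nr_stun_client_fire_finished_cb */
static void fire_finished_cb(stun_ctx_model *ctx) {
    if (ctx->finished_cb)
        ctx->finished_cb(ctx->cb_arg);
}

/* stands in for nr_stun_client_process_response: once the response is
 * accepted, the finished callback fires synchronously */
static void process_response(stun_ctx_model *ctx) {
    ctx->done = 1;
    fire_finished_cb(ctx);
}
```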
Comment 2 • 4 years ago (Assignee)
nr_ice_component_check_if_failed is called from two places, but one of those is called from a timer, which does not match this stack. The other is here:
This is called from one place:
This, in turn, is called from several places in nr_ice_candidate_pair_stun_cb with parameters that are consistent with the stack. So, the stack should basically look like this:
0 xul.dll nr_ice_component_check_if_failed media/mtransport/third_party/nICEr/src/ice/ice_component.c:1602
0.1 xul.dll nr_ice_component_failed_pair media/mtransport/third_party/nICEr/src/ice/ice_component.c:1597
0.2 xul.dll nr_ice_candidate_pair_set_state media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c:585
1 xul.dll nr_ice_candidate_pair_stun_cb media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c (possible line numbers: 221, 229, 236, 249, 304)
1.1 xul.dll nr_stun_client_fire_finished_cb media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:123
2 xul.dll nr_stun_client_process_response media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c:822
3 xul.dll nr_ice_socket_readable_cb media/mtransport/third_party/nICEr/src/ice/ice_socket.c:114
4 xul.dll mozilla::NrUdpSocketIpc::recv_callback_s media/mtransport/nr_socket_prsock.cpp:1593
5 xul.dll mozilla::runnable_args_memfn<RefPtr<mozilla::NrUdpSocketIpc>, void media/mtransport/runnable_utils.h:122
6 xul.dll mozilla::detail::runnable_args_base<mozilla::detail::NoResult>::Run media/mtransport/runnable_utils.h:40
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1220
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:481
9 xul.dll mozilla::net::nsSocketTransportService::Run netwerk/base/nsSocketTransportService2.cpp:1135
Updated • 4 years ago (Assignee)
Comment 3 • 4 years ago (Assignee)
I see some reports on crash-stats with poisoning. So, this looks like it might be a UAF bug.
Comment 4 • 4 years ago (Assignee)
I wonder if this is some flaw in nICEr where we aren't canceling callbacks for sockets when we tear down. Maybe we should cancel these inside NrSocket::close()?
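A minimal sketch of that idea, with invented types (this is not the real NrSocket interface): if close() cleared any registered readable callback, a packet arriving after teardown would be a no-op instead of a call into freed ICE state.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*readable_cb_t)(void *arg);

/* illustrative socket model, not the real NrSocket */
typedef struct {
    readable_cb_t cb;
    void *cb_arg;
    int closed;
} sock_model;

static int cb_fired;
static void some_ice_cb(void *arg) { (void)arg; cb_fired = 1; }

static void sock_register_cb(sock_model *s, readable_cb_t cb, void *arg) {
    s->cb = cb;
    s->cb_arg = arg;
}

/* The idea floated above: cancel callbacks as part of close() */
static void sock_close(sock_model *s) {
    s->closed = 1;
    s->cb = NULL;
    s->cb_arg = NULL;
}

/* Late data delivery no longer reaches torn-down ICE state */
static void sock_data_arrived(sock_model *s) {
    if (s->cb)
        s->cb(s->cb_arg);
}
```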
Comment hidden (obsolete)
Comment hidden (obsolete)
Updated • 4 years ago
Comment hidden (obsolete)
Updated • 4 years ago
Comment 8 • 4 years ago
Byron, feel free to re-assign, but since this is sec-high we need an owner for now. If we don't know how to proceed, let's mark it as stalled, please.
Comment 9 • 4 years ago (Assignee)
Bug 1624458 might help with this somewhat, but I still see a 76 beta crash, so this is likely not fixed.
Comment 10 • 4 years ago (Assignee)
Nope, bug 1624458 does not appear to have helped. Need to dig into this more.
Comment 11 • 4 years ago (Assignee)
Ok, so it seems that here, pair->remote->component is a dangling (or otherwise invalid) pointer:
At the start, I will note that the only code that destroys components is here:
So, nr_ice_media_stream.components is the sole owner of these, meaning that if the nr_ice_candidate is holding onto a destroyed component pointer, the stream must also have been destroyed.
As for the pair itself, from the stack trace we can see that it is coming from here:
That callback is used in a couple of places, but the stack trace tells us this is the one we're interested in:
This means this pair is coming from nr_stun_client_ctx.cb_arg. This implies we have an nr_stun_client_ctx outliving the component (and therefore the stream). We also have an nr_ice_socket outliving the component/stream, which is where our reference to the nr_stun_client_ctx lives, but it does not own the stun ctx.
Leaving these notes here, will get back to them later today.
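The ownership relationships above can be modeled as follows. Rather than actually freeing memory (demonstrating a real UAF is undefined behavior), this sketch marks the component dead so the dangling reach-through is observable; all types are simplified stand-ins for the nICEr structs:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int alive; } component_m;

/* the stream is the sole owner of its components */
typedef struct { component_m comp; } media_stream_m;

typedef struct { component_m *component; } candidate_m; /* ~ pair->remote */
typedef struct { candidate_m *remote; } cand_pair_m;
typedef struct { void *cb_arg; } stun_ctx_m;            /* holds the pair */

static void stream_destroy_m(media_stream_m *s) {
    /* in the real code, the component memory is freed here */
    s->comp.alive = 0;
}
```

If an nr_stun_client_ctx (and the nr_ice_socket referencing it) outlives the stream, cb_arg still reaches the destroyed component, which matches the crash.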
Comment 12 • 4 years ago (Assignee)
nr_ice_sockets are owned solely by nr_ice_component:
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#169
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#425
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#1670
However, I see some places where we could create an nr_ice_socket, but never hand ownership of it to an nr_ice_component.
It appears that the nr_ice_socket created here will be leaked if there is any subsequent failure in the function (and there are multiple places where that could occur):
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#235
There is a similar flaw here:
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#600
This code looks ok though:
https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_component.c#400
These are definitely flaws that need fixing, and might be the source of this bug, because nr_ice_socket_readable_cb (which is in our crash stack) is registered immediately in nr_ice_socket_create:
The question is whether the rest of what we observe is consistent with this happening.
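The failure-path leak can be sketched in the nICEr error-handling idiom (a status code plus a cleanup label). Everything below is illustrative, not the real function bodies; the point is that the abort path must destroy the socket whenever ownership was never transferred to the component:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int cb_registered; } ice_socket_m;

static int sockets_created, sockets_destroyed;

static int ice_socket_create_m(ice_socket_m **sockp) {
    ice_socket_m *s = calloc(1, sizeof(*s));
    if (!s)
        return 1; /* ~ R_NO_MEMORY */
    s->cb_registered = 1; /* readable cb is registered immediately */
    sockets_created++;
    *sockp = s;
    return 0;
}

static void ice_socket_destroy_m(ice_socket_m **sockp) {
    if (*sockp) {
        free(*sockp);
        *sockp = NULL;
        sockets_destroyed++;
    }
}

/* Buggy shape: a later failure abandons isock with its callback live. */
static int initialize_udp_buggy(int later_step_fails, ice_socket_m **owned) {
    int _status;
    ice_socket_m *isock = NULL;
    if (ice_socket_create_m(&isock)) { _status = 1; goto abort_; }
    if (later_step_fails)            { _status = 1; goto abort_; } /* leak */
    *owned = isock; /* ownership passes to the component */
    _status = 0;
abort_:
    return _status;
}

/* Fixed shape: the abort path releases the socket if it was never owned. */
static int initialize_udp_fixed(int later_step_fails, ice_socket_m **owned) {
    int _status;
    ice_socket_m *isock = NULL;
    if (ice_socket_create_m(&isock)) { _status = 1; goto abort_; }
    if (later_step_fails)            { _status = 1; goto abort_; }
    *owned = isock;
    isock = NULL;
    _status = 0;
abort_:
    ice_socket_destroy_m(&isock); /* no-op on success, cleanup on failure */
    return _status;
}
```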
Comment 13 • 4 years ago (Assignee)
Calling nr_ice_socket_create, and then throwing away the pointer, does in fact register the callback we see in the crash stack, so that's consistent. And, we will certainly see callbacks if data arrives on the socket. The nr_ice_socket has references to the nr_socket, the nr_ice_ctx, and the nr_ice_component. It does not have any nr_ice_stun_ctx, however.
Our crash stack depends on there being at least one entry in |stun_ctxs| that is an nr_stun_client_ctx, which is registered here:
CASE 1: nr_ice_component_initialize_udp fails at nr_ice_component_create_stun_server_ctx:
In nr_ice_component_initialize_udp, we have this:
Which calls this:
Which registers a stun server on the nr_ice_socket here:
If we then see a failure here (a big if; only looks possible due to OOM, and the crash reports don't look like OOM is likely):
That will result in the failure here, and the leak of the nr_ice_socket, which has entries in |stun_ctxs|, but they won't be nr_stun_client_ctxs:
This is not the scenario we're seeing, and the error that leads to it probably isn't happening in practice anyway, but we should check crash-stats and see.
CASE 2: nr_ice_component_initialize_udp fails in nr_ice_candidate_create
There are four places this can be called. If the first one fails, I do not think we would see the crash we are seeing. However, if one does succeed and a later one fails, the candidate that was successfully created will be added to the nr_ice_component, which will then allow that candidate to be paired, and we're off to the races! Are we likely to see an error in nr_ice_candidate_create, however?
OOM (not likely): https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#140
Looks possible if there's some failure in NrSocket::getaddr or one of its subclasses: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#154
Doesn't look possible, requires nr_ice_stun_server.type to be garbage: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#163,168
Doesn't look possible, requires calling nr_ice_candidate_create with a garbage candidate type: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#177
OOM (not likely): https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#191,194
Very complicated, but looks like it could happen: https://searchfox.org/mozilla-central/rev/158bac3df3a1890da55bdb6ffdaf9a7ffc0bfb0a/media/mtransport/third_party/nICEr/src/ice/ice_candidate.c#196
Looks plausible.
CASE 3: nr_ice_component_initialize_udp fails in nr_socket_turn_create
Only looks possible if OOM.
Of these possibilities, an error in nr_socket_getaddr looks the most likely:
I'm still not convinced this whole scenario is our culprit, however. If this were a case of a simple leak of nr_ice_socket, while the rest of the ICE stack has been torn down, why do we never see failures here on crash-stats?
Is it possible this is only happening in cases where a single stream has been torn down (but not the whole ICE ctx), due to ICE restart or something?
I think maybe the thing to do is fix the bug, and see if it stops the crashes.
Comment 14 • 4 years ago (Assignee)
Try pushes here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd060774810e91ec4fa1de726283847dba222734
https://treeherder.mozilla.org/#/jobs?repo=try&revision=933e87852c18444668ea827c4c5cc7363beb05b5
Since this looks like a leak-fix, I will open a cover bug once it looks like this is working, and put the patches there.
Comment 16 • 4 years ago (Assignee)
Ok, so while we may hide sec-bug dependency links from users who aren't cleared for sec-bug access, we do not hide see-also links. That's something we need to fix.
Comment 17 • 4 years ago (Assignee)
Can someone go in and scrub that see also history from bug 1634145?
Comment 19 • 4 years ago
You'll have to file a BMO::Administration bug to get someone with low-level DB access to change bug history. I don't think it's terrible in this case to leave it: setting and immediately clearing a see-also can look like you made a typo in the bug you linked.
comment 16 is bug 1628759 -- definitely would be nice to fix.
Comment 20 • 4 years ago (Assignee)
Bug 1634145 is ready to land; do I need to request approval for a cover bug like this?
Comment 21 • 4 years ago
Technically yes, but it can be logistically confusing. You could make a dummy attachment here just for the request. I don't have concerns about the patch itself, but it would still be good to get some of the approval-request questions answered. In particular, do we know why this started spiking in the affected releases, and, importantly, do we need this patch on ESR68? (That is, is ESR68 actually safe, or has it just gotten lucky with some timing issue?)
At the very least please set the "status firefox" boxes for the affected and unaffected (if any) firefox releases.
Comment 22 • 4 years ago (Assignee)
Updated • 4 years ago (Assignee)
Comment 23 • 4 years ago (Assignee)
Comment on attachment 9145503 [details]
Bug 1624405: (Copy of patch on cover bug) Avoid leaking nr_ice_sockets due to various kinds of failure.
Security Approval Request
- How easily could an exploit be constructed based on the patch?: I imagine it would be somewhat difficult, since it would require noticing that the patch fixes UAF problems, and then finding a way to trigger the specific rare failures that can cause the dangling pointers.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: Yes
- If not, how different, hard to create, and risky will they be?: I think the patch should apply cleanly across the board. That file has not changed much.
- How likely is this patch to cause regressions; how much testing does it need?: Very unlikely.
Comment 24 • 4 years ago (Assignee)
(In reply to Daniel Veditz [:dveditz] from comment #21)
Technically yes, but it can be logistically confusing. Could make a dummy attachment here just for the request. I don't have concerns about the patch itself, but would still be good to get some of the approval request questions answered. In particular do we know why this started spiking in the affected releases, and importantly do we need this patch on ESR68? (that is -- is ESR68 actually safe or has it just gotten lucky in maybe some timing issue?)
At the very least please set the "status firefox" boxes for the affected and unaffected (if any) firefox releases.
This crash seems to be happening on ESR68, and the flaw fixed in bug 1634145 is present in ESR68 also. An ESR uplift probably makes sense.
Comment 25 • 4 years ago (Assignee)
I am not sure why we're seeing a spike on 74. There have been a lot of timing changes in the webrtc code recently, including bug 1591199, which landed on 74.
Comment 26 • 4 years ago
Comment on attachment 9145503 [details]
Bug 1624405: (Copy of patch on cover bug) Avoid leaking nr_ice_sockets due to various kinds of failure.
sec-approval+ for landing on nightly. If we're going to use the cover bug effectively, you should request beta and ESR uplift in that bug.
Comment 27 • 4 years ago
"depends on" hiding works so we can tie the bugs that way.
Comment 28 • 4 years ago (Assignee)
Fixing bug 1634145 does not seem to have stopped this crash. Need to look some more.
Comment 29 • 4 years ago (Assignee)
Another place we might be running into trouble is here:
If there is a failure anywhere in that code block, we will not put the nr_ice_media_stream into pctx->peer_streams, so it will leak. However, the question is whether anything is left holding a reference to it.
It looks like nr_ice_peer_ctx_parse_stream_attributes_int will eventually end up here, so we have a candidate with a reference:
If this code then succeeds, this candidate is registered with the component:
So it seems that leaking the nr_ice_media_stream could be dangerous here; this probably needs a fix. It looks like this will only happen on OOM, though, so maybe this isn't our culprit.
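A sketch of the fix direction, again with illustrative types rather than the real nICEr declarations: if the parsed stream was never inserted into pctx->peer_streams, nothing else owns it yet, so the error path must destroy it before returning.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct peer_stream_m {
    struct peer_stream_m *next; /* intrusive list link, ~ STAILQ in nICEr */
} peer_stream_m;

typedef struct { peer_stream_m *peer_streams; } peer_ctx_m;

static int streams_destroyed;

static void stream_destroy_local(peer_stream_m **sp) {
    if (*sp) {
        free(*sp);
        *sp = NULL;
        streams_destroyed++;
    }
}

/* ~ nr_ice_peer_ctx_parse_stream_attributes: on any failure before the
 * stream is inserted into pctx->peer_streams, destroy it locally. */
static int parse_stream_attributes_m(peer_ctx_m *pctx, int parse_fails) {
    int _status;
    peer_stream_m *pstream = calloc(1, sizeof(*pstream));
    if (!pstream) { _status = 1; goto abort_; }
    if (parse_fails) { _status = 1; goto abort_; }
    pstream->next = pctx->peer_streams; /* ownership moves to the ctx */
    pctx->peer_streams = pstream;
    pstream = NULL;
    _status = 0;
abort_:
    stream_destroy_local(&pstream); /* only frees if ownership never moved */
    return _status;
}
```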
Comment 30 • 4 years ago (Assignee)
Comment 31 • 4 years ago (Assignee)
Comment on attachment 9151177 [details]
Bug 1624405: (Copy of patch on cover bug) Clean up pstream if we don't end up using it.
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Not very easily.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: Trivial, probably.
- How likely is this patch to cause regressions; how much testing does it need?: Cleaning up this leak could expose other issues, but it looks like it should be fine. We should probably wait a bit before uplifting it.
Updated • 4 years ago
Comment 32 • 4 years ago
Comment on attachment 9151177 [details]
Bug 1624405: (Copy of patch on cover bug) Clean up pstream if we don't end up using it.
sec-approval+
Comment 33 • 4 years ago (Assignee)
Rate seems to be way down on 77; I'm guessing bug 1634145 took care of most of this. No failures on 78 so far, but given the low base rate on 77 we wouldn't necessarily expect to see failures yet.
Comment 34 • 4 years ago (Assignee)
Seeing a crash on 78 from this weekend; it seems we still have not completely fixed this.
Comment 35 • 4 years ago (Assignee)
Maybe bug 1644477 helps with this; let's see if the uplift helps 78.
Comment 36 • 4 years ago (Assignee)
78 is looking really good since bug 1644477 was uplifted. Base rate was really low, but I think we may have squashed this one. Let's see how 78 does on release.
Comment 37 • 4 years ago (Assignee)
Took multiple fixes on other bugs, but this seems fine now.
Updated • 4 years ago
Updated • 4 years ago
Comment 38 • 4 years ago
As part of a security bug pattern analysis, we are requesting your help with a high level analysis of this bug. It is our hope to develop static analysis (or potentially runtime/dynamic analysis) in the future to identify classes of bugs.
Please visit this Google form to reply.
Updated • 4 years ago (Assignee)
Comment 39 • 4 years ago
Ryan: you marked this bug "status-esr68: fixed" but it was fixed by bug 1634145 which is marked "wontfix" on esr-68. Which bug is correct? I don't see an ESR landing link in either bug, but comment 20 says fixing this on esr-68 was a good idea so maybe it was somewhere else?
Comment 40 • 4 years ago
Bug 1644477 is what I was going off of, per comment 35.
Updated • 3 years ago