UAF in sftk_FreeSession due to improper refcounting
Categories
(NSS :: Libraries, defect, P1)
Tracking
(firefox-esr60 wontfix, firefox-esr68 wontfix, firefox64 wontfix, firefox65 wontfix, firefox66 wontfix, firefox67 wontfix, firefox69 wontfix, firefox70+ wontfix, firefox71+ fixed)
People
(Reporter: marcia, Assigned: jcj)
References
Details
(4 keywords, Whiteboard: [post-critsmash-triage][adv-main71+])
Crash Data
Attachments
(2 files, 1 obsolete file)
47 bytes,
text/x-phabricator-request
|
abillings
:
sec-approval+
|
Details | Review |
182 bytes,
text/plain
|
Details |
Reporter | ||
Comment 1•6 years ago
|
||
Comment 2•6 years ago
|
||
Updated•6 years ago
|
Comment 3•6 years ago
|
||
Comment 4•6 years ago
|
||
Low crash rate and unclear steps forward. Marking as wontfix for 65.
Comment 5•6 years ago
|
||
sec-high, can we get an owner and priority assigned for investigation?
Updated•6 years ago
|
Comment 6•6 years ago
|
||
Can we get an update and ETA here? This is currently a sec-high bug.
Comment 7•6 years ago
|
||
I'll update this before the end of this week
Comment 8•6 years ago
|
||
By checking those crashes with SafeBrowsing call stacks, I didn't find any suspicious behavior and pointers passed to NSS are also valid.
The way SafeBrowsing uses NSS is very straightforward:
When there is a SafeBrowsing update, the worker thread Init the lib, update the buffer with SHA256 hash and then call Finish().
Since there are also crashes with a similar signature from different components, I guess this probably is not a SafeBrowsing bug.
I checked minidumps from different components, as mentioned in Comment 2, it looks like the pointer used in |sftk_FreeSession| is corrupt.
I am not familiar with NSS stuff, so I'll need some help.
Hi J.C.
Do you know who might be able to help on this bug? Thanks!
Assignee | ||
Comment 9•6 years ago
|
||
That's pretty much me. I'll get started on this as soon as I can, no later than Monday the 18th, and update the bug. (Leaving the NI intact)
Assignee | ||
Comment 10•6 years ago
|
||
I was a week late here due to coming down very ill. I've begun looking into this today. So far it appears to be a race condition, but I haven't yet nailed it down.
Assignee | ||
Comment 11•6 years ago
|
||
So I still don't have a solid lead on this, but it's clearly an NSS race condition, so let's start by moving components.
I'm pretty confident this is not going to be remotely exploitable. We can probably lower this to a sec-moderate, as I don't see how to trigger this, nor in a way to prompt useful corruption.
The stacks indicate this is a crash from a double-free on a given SFTKSession
, which would be a ref-miscount in sftk_FreeSession
, probably caused by a race on the refCount during destruction itself.
SFTKSession
intends to be a smart-pointer with a handy thread-safety lock, but since it cannot track its own state, upon deletion,
further uses of itself are bogus and will be uncaught. So it's a smart-pointer that can't be relied upon to fail-safe once it is
deallocated. Hurray.
Because of how this works, methods like sftk_SessionFromHandle find the session for a given handle, but as far as I can tell right now, the handle can long outlive the session. Destroying the session doesn't remove it from the handle. Perhaps that's the simplest remediation.
[Marking wontfix for 66, affecting 67 (though we haven't seen a crash yet, it's rare), and I don't think this is a regression.]
Assignee | ||
Comment 12•6 years ago
|
||
My summary update was improper; much earlier I thought this was a mutex collision, but it's actually a UAF from improper refcounting. Specifically, that once the refcount indicates we can destroy the object, we destroy the refcount, causing future accesses to be invalid.
The most obvious options here are:
- as mentioned in comment 11, to return result from sftk_FreeSession that obligates callers to zero their session pointers.
- add sentinel values
- build an indirection type that can be more like a real smart pointer; this is likely very hard for this type
Comment 13•5 years ago
|
||
Is this still something you may work on? Or, that we might find more help to investigate further?
I'm going through some untouched-for-a-while sec-high issues to make sure we've investigated all we can before marking them stalled.
Assignee | ||
Comment 14•5 years ago
|
||
kjacobs and I spent another 8 hours or so (each) on this in August. This bug is a perfect example of why even clever C is dangerous.
Where we left off is that we're going to need to build something to help us find this bug. And this is not the only UAF caused by this improper refcounting in NSS, so it'd be nice to get the whole class of bugs fixed.
I don't know if we should mark this stalled or not. When I have time to work on NSS again, I intend to return to working on this bug.
Assignee | ||
Comment 15•5 years ago
|
||
sftk_SessionFromHandle
reads SFTKSession objects from the slot, and it's
possible that during slot session closure/shutdown that a given session may have
already had its refcount subtracted. In that case, just don't return the
session.
Assignee | ||
Comment 16•5 years ago
|
||
Every instance of sftk_FreeSession
is correctly guarded to ensure it is never passed a null pointer, which further implicates that the whole structure has been freed or in some other way clobbered before the crash. So the actual crashing code is probably one of these two lines:
PZLock *lock = SFTK_SESSION_LOCK(slot, session->handle);
PZ_Lock(lock);
And it's probably SFTK_SESSION_LOCK
, since as a macro would put sfk_FreeSession
be the top of the crashing stack frames, which it is. Which would mean that (session->)slot
or slot->sessionLock
are null. Since session
isn't null, that implies its contents are no longer consistent.
This is all to say, my initial analysis seems accurate - sftk_FreeSession
is being passed a non-null session that has already been deallocated, but it can't tell because upon deallocation the memory can be overwritten. (There could be wider heap corruption, too, but this seems more likely.)
In practically all cases, sftk_FreeSession
is being passed a session from sftk_SessionFromHandle
which is getting it from the slot via sftk_SlotFromSessionHandle
.
The only way to remove a session from the slot is NSC_CloseSession, which is called indirectly via pk11_CloseSession
. If the refcounting works correctly, it's not possible to have a deallocated session until pk11_CloseSession
is called, as the slot holds a reference effectively itself, complete with an assertion in NSC_CloseSession
.
There are locks on the refcounts, and PR_Lock is used to enforce the critical sections. The locks are weird, in that they use a hash function to assign multiple sessions to a single lock, but they are consistent in the practice, so that's probably not the issue.
It is notable however that the lock belongs to the slot, not the session, yet we exit the critical section before destroying the session, which could be a race condition, but one which is still depending on some kind of refcount error elsewhere.
My only reasonable-level suggestion here is to check the session refcount in sftk_SessionFromHandle and return NULL
if it's zero, which is patch D47010. This will need some testing.
Comment 17•5 years ago
•
|
||
Eek. I don't like seeing refcounted objects that are accessible when their refcount is zero. That's a violation of an important invariant.
This is really hard to think through. I'm not sure that my sequence of events here is right either, but I don't think that this change addresses the core problem.
Looking at NSC_CloseSession
, say that two threads attempt to call that NSC_CloseSession
at the same time. Both could pull a valid pointer from sftk_SessionFromHandle if the two calls happen at around same time. The lock is released several times in this process. I think that the lock is consistent in covering both the refCount
and the reference to the session from the slot, but the interaction of the two could be a problem.
Assuming that refCount == 1 to start, this looks like a problematic sequence of events. The order of the first two steps doesn't matter.
A: sftk_SessionFromHandle => refCount++
B: sftk_SessionFromHandle => refCount++
A: sftkqueue_delete and refCount--
B: skip the delete block because the session is no longer attached to the slot
B: sftk_FreeSession => refCount--
B: refCount was reduced to 1, call free()
A: sftk_FreeSession => UAF on sftk_SlotFromSession as it attempts to read session->slot
All of those steps, aside from the last two, are independently covered by the lock. The scary part is that covering the free() with the lock wouldn't help. The point is that A is holding a reference to an object that has been freed.
The problem is that there is an interaction between reference counting and the ownership here. Given that this object is pinned to the slot and only freed on shutdown, I have a more dramatic simplification to suggest: remove the reference count entirely. Then use the lock to govern its insertion and removal to and from the slot. Only delete the object if it was removed from the slot.
That doesn't address the problem of sessions that are in active use during shutdown. Notice that NSC_CloseSession
doesn't assert that the reference count is 1, it asserts that it is greater than zero, which suggests that this code might not be careful enough about tracking outstanding uses before hitting the shutdown code. Changing that assertion to be a ==1 and running tests would be illuminating. Other than that check, I would advocate for dumping the reference count entirely.
Assignee | ||
Comment 18•5 years ago
|
||
Thank you for the further analysis, Martin - running a concrete example is, to use your words, illuminating.
Your hypothesis that asserting refcount==1
after the sftkqueue_delete
in NSC_CloseSession
seems to work fine in NSS ASAN builds on MacOS. I'm currently testing webcrypto tests in m-c and working on a patch to try removing the refcount entirely.
That said, these kind of fixes we'd probably want to bury in a big try-push against mozilla-central to fully evaluate on all platforms.
Updated•5 years ago
|
Assignee | ||
Comment 19•5 years ago
|
||
Updated•5 years ago
|
Assignee | ||
Comment 20•5 years ago
|
||
D47184 is a mozilla-central uplift with a collection of NSS changes and the patch from D47010. I do not intend to land D47184, but am planning to run a full all/all try run against mozilla-central using it, once we have some initial thoughts on the refcount removal patch.
Assignee | ||
Comment 21•5 years ago
|
||
I have an all-all try push running against mozilla-central ( https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fc5474b750a6225b5b10b7608d67c1ddbe85ea8&selectedJob=268655312 ) and it has only one semi-suspicious crash, with retriggers in-progress. Overall, this change looks like it works for Firefox.
The nature of the change - particularly to PKCS11_CloseAllSessions - is that consumers of NSS that do a lot of module unloading are going to be most likely to run into any issues from this change. That would be users unplugging smart cards, most likely.
I'll see if I can dig some up for testing.
Assignee | ||
Comment 22•5 years ago
|
||
Comment on attachment 9095024 [details]
Bug 1508776 - Remove unneeded refcounting from SFTKSession r?mt,kjacobs
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Not easily. This might be impossible to trigger.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
- Which older supported branches are affected by this flaw?: All
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: Yes
- If not, how different, hard to create, and risky will they be?: This would be easy to add to the NSS uplifts, but three days of testing in nightly might not be enough. Still, I don't know how this could be triggered (except at shutdown or by removing a smart card repeatedly), so I think maybe it could land in nightly and uplift later...maybe?
- How likely is this patch to cause regressions; how much testing does it need?: It's unlikely to cause problems for the majority of Firefox users, but those using smart cards could see random issues. We're going to need some testing to feel confident in it before going to a wide audience.
Comment 23•5 years ago
|
||
Comment on attachment 9095024 [details]
Bug 1508776 - Remove unneeded refcounting from SFTKSession r?mt,kjacobs
sec-approval+ for mozilla-central.
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 24•5 years ago
|
||
Assignee | ||
Comment 25•5 years ago
|
||
I feel this bug is too risky to uplift into ESR or even probably Beta, and while it's a wildptr, there's no known exploit path. I'd like to not uplift this, and let it instead ride the trains.
Tom, would that be you now to give a blessing thus?
Assignee | ||
Updated•5 years ago
|
Comment 28•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 30•5 years ago
|
||
Comment 31•5 years ago
|
||
Since this flaw is already mentioned in the release notes at:
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/NSS_3.47_release_notes
Can we please make this bug public?
Comment 32•5 years ago
|
||
Please do not rush making these bugs open to the public. We are still in the process of adjusting our code.
Assignee | ||
Comment 33•5 years ago
|
||
Marking security bugs as public is the responsibility of the security management team. Overall we are generally more conservative for NSS, and more conservative still for sec-high bugs, but the marking-public operation is done in bulk periodically based on bugqueries.
It there a specific reason you need this sec-high disclosed early, Huzaifa? The description of removing the locks is much less useful than the descriptions in this bug and within the patch for generating a PoC. The early opening would need to be justified to Tom Ritter, not me.
Comment 34•5 years ago
|
||
(In reply to J.C. Jones [:jcj] (he/him) from comment #33)
Marking security bugs as public is the responsibility of the security management team. Overall we are generally more conservative for NSS, and more conservative still for sec-high bugs, but the marking-public operation is done in bulk periodically based on bugqueries.
It there a specific reason you need this sec-high disclosed early, Huzaifa? The description of removing the locks is much less useful than the descriptions in this bug and within the patch for generating a PoC. The early opening would need to be justified to Tom Ritter, not me.
Making the summary and the patch known while keeping the bug private just seems odd to me and hence the request. fwiw this bug is already disclosed via the patch and the entry in the release notes
Comment 35•5 years ago
|
||
(In reply to Huzaifa Sidhpurwala from comment #34)
Making the summary and the patch known while keeping the bug private just seems odd to me and hence the request. fwiw this bug is already disclosed via the patch and the entry in the release notes
Ah, but there is a distinct difference between having a summary (describing the generics of the bug) and code patch (for which efforts are made to not paint a bulls-eye on the vulnerability) public, and disclosing a bug that generally has a ready proof-of-concept and detailed discussion of the vulnerability and an analysis of exactly why it occurs.
From the former, it's extremely difficult to come to a way to exploit a vulnerability. From the latter, it'll be relatively (very) easy. This is also why tests, that expose a way to trigger the problem (similar to a PoC) tend to be landed (much) later than the patches for the problem.
While you are right that having the patch public might make it obvious where the problem lies, it won't hand anything on a silver platter; but it's required for Open Source development -- so this is the best compromise.
Comment 36•4 years ago
|
||
Removing employee no longer with company from CC list of private bugs.
Updated•4 years ago
|
Description
•