Closed Bug 1508776 (CVE-2019-11756) Opened 6 years ago Closed 5 years ago

UAF in sftk_FreeSession due to improper refcounting

Categories

(NSS :: Libraries, defect, P1)

Unspecified
Windows
defect

Tracking

(firefox-esr60 wontfix, firefox-esr68 wontfix, firefox64 wontfix, firefox65 wontfix, firefox66 wontfix, firefox67 wontfix, firefox69 wontfix, firefox70+ wontfix, firefox71+ fixed)

RESOLVED FIXED
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox69 --- wontfix
firefox70 + wontfix
firefox71 + fixed

People

(Reporter: marcia, Assigned: jcj)

References

Details

(4 keywords, Whiteboard: [post-critsmash-triage][adv-main71+])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-ec6e8cfb-9c0b-4e68-9917-977e80181120. ============================================================= Seen while looking at nightly crash stats. 1 crash so far, but there are several other similar signatures in the same build ID: 20181120100045. Some of the crash reports have safebrowing stuff in the stack. *https://crash-stats.mozilla.com/report/index/5e48d3a0-ac12-4463-84ee-fb62f0181120 - [@ sftk_ObjectFromHandle] * https://crash-stats.mozilla.com/report/index/aa4a675d-1dae-4b22-9698-355130181120 - [@ sftk_SessionFromHandle ] * https://crash-stats.mozilla.com/report/index/18de7f1b-8ec5-4fc1-8784-42ac90181120 * https://crash-stats.mozilla.com/report/index/7b929956-332a-4de6-8f53-6c5100181120 Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e44bb5b4bc79be613d29b3f95d7b508e68e3d128&tochange=eeddcefcdad847bf8a5737153079e9619ee5aa66 Top 10 frames of crashing thread: 0 @0x1cdb1e67a80 1 softokn3.dll sftk_FreeSession security/nss/lib/softoken/pkcs11u.c:1868 2 softokn3.dll NSC_DigestUpdate security/nss/lib/softoken/pkcs11c.c:1794 3 nss3.dll PK11_DigestOp security/nss/lib/pk11wrap/pk11cxt.c:783 4 xul.dll nsCryptoHash::Update security/manager/ssl/nsCryptoHash.cpp:116 5 xul.dll nsresult mozilla::safebrowsing::LookupCacheV4::VerifyChecksum toolkit/components/url-classifier/LookupCacheV4.cpp:416 6 xul.dll mozilla::safebrowsing::LookupCacheV4::LoadFromFile toolkit/components/url-classifier/LookupCacheV4.cpp:205 7 xul.dll nsresult mozilla::safebrowsing::LookupCache::LoadPrefixSet toolkit/components/url-classifier/LookupCache.cpp:478 8 xul.dll mozilla::safebrowsing::LookupCache::Open toolkit/components/url-classifier/LookupCache.cpp:83 9 xul.dll class RefPtr<mozilla::safebrowsing::LookupCache> mozilla::safebrowsing::Classifier::GetLookupCache toolkit/components/url-classifier/Classifier.cpp:1526 =============================================================
No further crashes on 65 nightly since I originally filed this. sftk_ObjectFromHandle signature and sftk_SessionFromHandle also doesn't appear to have any further crashes on nightly.
2 total crashes in 65; volume is low enough that nightly isn't a great marker. Crashes are a mix of bad pointers and execs of bad pointers -> sec-high. At least one is a clear e5 UAF Crashes seem to go back to FF 52 or earlier
I don't know if this is a bug in the NSS implementation or a bug in how Safe Browsing is using it.
Group: core-security → core-security-release
Component: Security → Safe Browsing
Product: Core → Toolkit

Low crash rate and unclear steps forward. Marking as wontfix for 65.

sec-high, can we get an owner and priority assigned for investigation?

Flags: needinfo?(dlee)
Assignee: nobody → dlee
Status: NEW → ASSIGNED
Flags: needinfo?(dlee)
Priority: -- → P2

Can we get an update and ETA here? This is currently a sec-high bug.

Flags: needinfo?(ettseng)
Flags: needinfo?(dlee)

I'll update this before the end of this week

Flags: needinfo?(ettseng)
Flags: needinfo?(dlee)
Priority: P2 → P1

By checking those crashes with SafeBrowsing call stacks, I didn't find any suspicious behavior and pointers passed to NSS are also valid.
The way SafeBrowsing uses NSS is very straightforward:
When there is a SafeBrowsing update, the worker thread Init the lib, update the buffer with SHA256 hash and then call Finish().
Since there are also crashes with a similar signature from different components, I guess this probably is not a SafeBrowsing bug.

I checked minidumps from different components, as mentioned in Comment 2, it looks like the pointer used in |sftk_FreeSession| is corrupt.

I am not familiar with NSS stuff, so I'll need some help.

Hi J.C.
Do you know who might be able to help on this bug? Thanks!

Flags: needinfo?(jjones)

That's pretty much me. I'll get started on this as soon as I can, no later than Monday the 18th, and update the bug. (Leaving the NI intact)

I was a week late here due to coming down very ill. I've begun looking into this today. So far it appears to be a race condition, but I haven't yet nailed it down.

Flags: needinfo?(jjones)

So I still don't have a solid lead on this, but it's clearly an NSS race condition, so let's start by moving components.

I'm pretty confident this is not going to be remotely exploitable. We can probably lower this to a sec-moderate, as I don't see how to trigger this, nor in a way to prompt useful corruption.

The stacks indicate this is a crash from a double-free on a given SFTKSession, which would be a ref-miscount in sftk_FreeSession, probably caused by a race on the refCount during destruction itself.

SFTKSession intends to be a smart-pointer with a handy thread-safety lock, but since it cannot track its own state, upon deletion,
further uses of itself are bogus and will be uncaught. So it's a smart-pointer that can't be relied upon to fail-safe once it is
deallocated. Hurray.

Because of how this works, methods like sftk_SessionFromHandle find the session for a given handle, but as far as I can tell right now, the handle can long outlive the session. Destroying the session doesn't remove it from the handle. Perhaps that's the simplest remediation.

[Marking wontfix for 66, affecting 67 (though we haven't seen a crash yet, it's rare), and I don't think this is a regression.]

Assignee: dlee → jjones
Group: core-security-release → crypto-core-security
Severity: critical → normal
Component: Safe Browsing → Libraries
Keywords: regression
OS: Windows 10 → Windows
Product: Toolkit → NSS
QA Contact: jjones
Summary: Crash in sftk_FreeSession → Mutex collision in sftk_FreeSession
Version: Trunk → other

My summary update was improper; much earlier I thought this was a mutex collision, but it's actually a UAF from improper refcounting. Specifically, that once the refcount indicates we can destroy the object, we destroy the refcount, causing future accesses to be invalid.

The most obvious options here are:

  1. as mentioned in comment 11, to return result from sftk_FreeSession that obligates callers to zero their session pointers.
  2. add sentinel values
  3. build an indirection type that can be more like a real smart pointer; this is likely very hard for this type
Summary: Mutex collision in sftk_FreeSession → UAF in sftk_FreeSession due to improper refcounting
See Also: → 1433571
See Also: → 1572164

Is this still something you may work on? Or, that we might find more help to investigate further?
I'm going through some untouched-for-a-while sec-high issues to make sure we've investigated all we can before marking them stalled.

Flags: needinfo?(jjones)

kjacobs and I spent another 8 hours or so (each) on this in August. This bug is a perfect example of why even clever C is dangerous.

Where we left off is that we're going to need to build something to help us find this bug. And this is not the only UAF caused by this improper refcounting in NSS, so it'd be nice to get the whole class of bugs fixed.

I don't know if we should mark this stalled or not. When I have time to work on NSS again, I intend to return to working on this bug.

Flags: needinfo?(jjones)

sftk_SessionFromHandle reads SFTKSession objects from the slot, and it's
possible that during slot session closure/shutdown that a given session may have
already had its refcount subtracted. In that case, just don't return the
session.

Every instance of sftk_FreeSession is correctly guarded to ensure it is never passed a null pointer, which further implicates that the whole structure has been freed or in some other way clobbered before the crash. So the actual crashing code is probably one of these two lines:

    PZLock *lock = SFTK_SESSION_LOCK(slot, session->handle);

    PZ_Lock(lock);

And it's probably SFTK_SESSION_LOCK, since as a macro would put sfk_FreeSession be the top of the crashing stack frames, which it is. Which would mean that (session->)slot or slot->sessionLock are null. Since session isn't null, that implies its contents are no longer consistent.

This is all to say, my initial analysis seems accurate - sftk_FreeSession is being passed a non-null session that has already been deallocated, but it can't tell because upon deallocation the memory can be overwritten. (There could be wider heap corruption, too, but this seems more likely.)

In practically all cases, sftk_FreeSession is being passed a session from sftk_SessionFromHandle which is getting it from the slot via sftk_SlotFromSessionHandle.

The only way to remove a session from the slot is NSC_CloseSession, which is called indirectly via pk11_CloseSession. If the refcounting works correctly, it's not possible to have a deallocated session until pk11_CloseSession is called, as the slot holds a reference effectively itself, complete with an assertion in NSC_CloseSession.

There are locks on the refcounts, and PR_Lock is used to enforce the critical sections. The locks are weird, in that they use a hash function to assign multiple sessions to a single lock, but they are consistent in the practice, so that's probably not the issue.

It is notable however that the lock belongs to the slot, not the session, yet we exit the critical section before destroying the session, which could be a race condition, but one which is still depending on some kind of refcount error elsewhere.

My only reasonable-level suggestion here is to check the session refcount in sftk_SessionFromHandle and return NULL if it's zero, which is patch D47010. This will need some testing.

Eek. I don't like seeing refcounted objects that are accessible when their refcount is zero. That's a violation of an important invariant.

This is really hard to think through. I'm not sure that my sequence of events here is right either, but I don't think that this change addresses the core problem.

Looking at NSC_CloseSession, say that two threads attempt to call that NSC_CloseSession at the same time. Both could pull a valid pointer from sftk_SessionFromHandle if the two calls happen at around same time. The lock is released several times in this process. I think that the lock is consistent in covering both the refCount and the reference to the session from the slot, but the interaction of the two could be a problem.

Assuming that refCount == 1 to start, this looks like a problematic sequence of events. The order of the first two steps doesn't matter.

A: sftk_SessionFromHandle => refCount++
B: sftk_SessionFromHandle => refCount++
A: sftkqueue_delete and refCount--
B: skip the delete block because the session is no longer attached to the slot
B: sftk_FreeSession => refCount--
B: refCount was reduced to 1, call free()
A: sftk_FreeSession => UAF on sftk_SlotFromSession as it attempts to read session->slot

All of those steps, aside from the last two, are independently covered by the lock. The scary part is that covering the free() with the lock wouldn't help. The point is that A is holding a reference to an object that has been freed.

The problem is that there is an interaction between reference counting and the ownership here. Given that this object is pinned to the slot and only freed on shutdown, I have a more dramatic simplification to suggest: remove the reference count entirely. Then use the lock to govern its insertion and removal to and from the slot. Only delete the object if it was removed from the slot.

That doesn't address the problem of sessions that are in active use during shutdown. Notice that NSC_CloseSession doesn't assert that the reference count is 1, it asserts that it is greater than zero, which suggests that this code might not be careful enough about tracking outstanding uses before hitting the shutdown code. Changing that assertion to be a ==1 and running tests would be illuminating. Other than that check, I would advocate for dumping the reference count entirely.

Thank you for the further analysis, Martin - running a concrete example is, to use your words, illuminating.

Your hypothesis that asserting refcount==1 after the sftkqueue_delete in NSC_CloseSession seems to work fine in NSS ASAN builds on MacOS. I'm currently testing webcrypto tests in m-c and working on a patch to try removing the refcount entirely.

That said, these kind of fixes we'd probably want to bury in a big try-push against mozilla-central to fully evaluate on all platforms.

Attachment #9095024 - Attachment description: Bug 1508776 - Catch corner case for sftksession objects r?mt,kjacobs → Bug 1508776 - Remove unneeded refcounting from SFTKSession r?mt,kjacobs
Attachment #9095363 - Attachment description: Bug 1508776 - land NSS e022f47937ca UPGRADE_NSS_RELEASE → Bug 1577822 - land NSS e022f47937ca UPGRADE_NSS_RELEASE

D47184 is a mozilla-central uplift with a collection of NSS changes and the patch from D47010. I do not intend to land D47184, but am planning to run a full all/all try run against mozilla-central using it, once we have some initial thoughts on the refcount removal patch.

I have an all-all try push running against mozilla-central ( https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fc5474b750a6225b5b10b7608d67c1ddbe85ea8&selectedJob=268655312 ) and it has only one semi-suspicious crash, with retriggers in-progress. Overall, this change looks like it works for Firefox.

The nature of the change - particularly to PKCS11_CloseAllSessions - is that consumers of NSS that do a lot of module unloading are going to be most likely to run into any issues from this change. That would be users unplugging smart cards, most likely.

I'll see if I can dig some up for testing.

Comment on attachment 9095024 [details]
Bug 1508776 - Remove unneeded refcounting from SFTKSession r?mt,kjacobs

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Not easily. This might be impossible to trigger.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
  • Which older supported branches are affected by this flaw?: All
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?: This would be easy to add to the NSS uplifts, but three days of testing in nightly might not be enough. Still, I don't know how this could be triggered (except at shutdown or by removing a smart card repeatedly), so I think maybe it could land in nightly and uplift later...maybe?
  • How likely is this patch to cause regressions; how much testing does it need?: It's unlikely to cause problems for the majority of Firefox users, but those using smart cards could see random issues. We're going to need some testing to feel confident in it before going to a wide audience.
Attachment #9095024 - Flags: sec-approval?

Comment on attachment 9095024 [details]
Bug 1508776 - Remove unneeded refcounting from SFTKSession r?mt,kjacobs

sec-approval+ for mozilla-central.

Attachment #9095024 - Flags: sec-approval? → sec-approval+
Blocks: 1566873
Attachment #9095363 - Attachment is obsolete: true
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.47

I feel this bug is too risky to uplift into ESR or even probably Beta, and while it's a wildptr, there's no known exploit path. I'd like to not uplift this, and let it instead ride the trains.

Tom, would that be you now to give a blessing thus?

Flags: needinfo?(tom)

I think so; that plan sounds ok to me.

Flags: needinfo?(tom)

Thanks, Tom.

No longer blocks: 1566873
Group: crypto-core-security → core-security-release
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]

Can you assign a CVE to this, Tom? Thanks.

Flags: needinfo?(tom)
Alias: CVE-2019-11756
Flags: needinfo?(tom)
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main71+]
Attached file advisory.txt

Since this flaw is already mentioned in the release notes at:
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/NSS_3.47_release_notes

Can we please make this bug public?

Flags: needinfo?(jjones)

Please do not rush making these bugs open to the public. We are still in the process of adjusting our code.

Marking security bugs as public is the responsibility of the security management team. Overall we are generally more conservative for NSS, and more conservative still for sec-high bugs, but the marking-public operation is done in bulk periodically based on bugqueries.

It there a specific reason you need this sec-high disclosed early, Huzaifa? The description of removing the locks is much less useful than the descriptions in this bug and within the patch for generating a PoC. The early opening would need to be justified to Tom Ritter, not me.

Flags: needinfo?(jjones) → needinfo?(huzaifas)

(In reply to J.C. Jones [:jcj] (he/him) from comment #33)

Marking security bugs as public is the responsibility of the security management team. Overall we are generally more conservative for NSS, and more conservative still for sec-high bugs, but the marking-public operation is done in bulk periodically based on bugqueries.

It there a specific reason you need this sec-high disclosed early, Huzaifa? The description of removing the locks is much less useful than the descriptions in this bug and within the patch for generating a PoC. The early opening would need to be justified to Tom Ritter, not me.

Making the summary and the patch known while keeping the bug private just seems odd to me and hence the request. fwiw this bug is already disclosed via the patch and the entry in the release notes

Flags: needinfo?(huzaifas)

(In reply to Huzaifa Sidhpurwala from comment #34)

Making the summary and the patch known while keeping the bug private just seems odd to me and hence the request. fwiw this bug is already disclosed via the patch and the entry in the release notes

Ah, but there is a distinct difference between having a summary (describing the generics of the bug) and code patch (for which efforts are made to not paint a bulls-eye on the vulnerability) public, and disclosing a bug that generally has a ready proof-of-concept and detailed discussion of the vulnerability and an analysis of exactly why it occurs.

From the former, it's extremely difficult to come to a way to exploit a vulnerability. From the latter, it'll be relatively (very) easy. This is also why tests, that expose a way to trigger the problem (similar to a PoC) tend to be landed (much) later than the patches for the problem.

While you are right that having the patch public might make it obvious where the problem lies, it won't hand anything on a silver platter; but it's required for Open Source development -- so this is the best compromise.

Removing employee no longer with company from CC list of private bugs.

Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: