Closed Bug 1607449 Opened 5 years ago Closed 4 years ago

ThreadSanitizer: data race [@ fill_CERTCertificateFields] vs. [@ CERT_DestroyCertificate] with potential use-after-free

Categories

(NSS :: Libraries, defect, P1)

x86_64
Linux
defect

Tracking

(firefox-esr68 wontfix, firefox-esr78 wontfix, firefox72 wontfix, firefox73 wontfix, firefox74 wontfix, firefox82 wontfix, firefox83 wontfix, firefox84 fixed)

RESOLVED FIXED

People

(Reporter: decoder, Assigned: kjacobs)

References

(Blocks 1 open bug)

Details

(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main84+r])

Crash Data

Attachments

(3 files, 1 obsolete file)

The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision bc5880b621d5.

In this race it looks like one thread is about to destroy an NSSCertificate while the main thread is still working on it. Filing this as a security bug because this can lead to a potential use-after-free. I don't know who exactly is responsible for the proper synchronization, but since PSM seems to be the owner of the freeing thread, filing it here.
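
For illustration only (not NSS code): a minimal, deliberately racy reduction of the pattern TSan is flagging here -- one thread still filling in fields of a shared certificate-like object while another thread frees it. All names in the sketch are hypothetical stand-ins; building with -fsanitize=thread produces a report of the same shape.

  /* Hypothetical stand-in, NOT NSS code: reduces the reported race to its shape.
   * Thread A keeps filling in fields of a shared object (like
   * fill_CERTCertificateFields), while thread B frees it (like
   * CERT_DestroyCertificate). Compile with:
   *   cc -g -fsanitize=thread race.c -lpthread
   */
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>

  struct fake_cert {          /* stand-in for CERTCertificate */
    char subject[64];
    int trust;
  };

  static struct fake_cert *shared;   /* unsynchronized shared pointer */

  static void *filler(void *arg) {
    (void)arg;
    /* races with free() below: no lock held, no reference held */
    strcpy(shared->subject, "CN=example");
    shared->trust = 1;
    return NULL;
  }

  static void *destroyer(void *arg) {
    (void)arg;
    free(shared);                    /* races with the writes above */
    return NULL;
  }

  int main(void) {
    pthread_t a, b;
    shared = calloc(1, sizeof(*shared));
    pthread_create(&a, NULL, filler, NULL);
    pthread_create(&b, NULL, destroyer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
  }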

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or to note this in the bug. That makes it much easier for us to assess the actual impact of these reports and whether they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix, provided it is easily doable and does not regress performance. Every race that we cannot fix has to remain on the suppression list, which slows down overall TSan performance. Also note that seemingly benign races can still be harmful (depending on the compiler, the optimizations, and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. Suppressions match on the full stack, so the suppressed frame should be chosen such that it is unique to this particular race. This bug's number should also be included, so we have some documentation on why the suppression was added.
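
For reference, a sketch of what such an entry might look like, assuming TsanOptions.cpp still builds its suppression list out of concatenated "race:<frame>" strings (the exact surrounding code in that file may differ):

  /* Sketch only -- the surrounding structure of mozglue/build/TsanOptions.cpp
   * may differ. The key parts are the documented "race:<frame>" TSan
   * suppression syntax, a frame unique to this report, and a pointer back to
   * this bug. */

    // Bug 1607449 - data race between fill_CERTCertificateFields and
    // CERT_DestroyCertificate. Remove once the NSS fix lands.
    "race:fill_CERTCertificateFields\n"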

Group: core-security → crypto-core-security
Assignee: nobody → nobody
Component: Security: PSM → Libraries
Product: Core → NSS
QA Contact: jjones
Version: Trunk → other

Kevin, please triage.

Flags: needinfo?(kjacobs.bugzilla)

Could be sec-high if this race could be forced to happen reliably in practice.

This race involves two simultaneous PSM operations:

  1. Fetching all client auth certificates, using ClientAuthDataRunnable in nsNSSIOLayer.cpp, and
  2. Asynchronously verifying the server certificate that we're connected to using SSLServerCertVerificationJob in SSLServerCertVerification.cpp, which is the start of the NSS callback that ultimately runs mozilla::pkix

From inspection, it appears to me that the obvious way to trigger this would be to connect to a server that is using a certificate which Firefox also has configured as a client cert (and holds the private key for). The client auth code might choose the server cert from the cache, which then gets deleted when moz::pkix finishes.

Unfortunately, this happens early enough in cert verification that the server certificate doesn't actually need to be trusted in any way. It does, however, require Firefox to be configured with such a cert in the first place, and I don't believe it's possible to set a server cert for WebRTC, which would be the obvious way to meet these minimum requirements.

Thanks, I didn't get a chance to dig into this until today.

Could an attacker interrogate the client for a legitimate client cert in one connection, then subsequently present it as a server cert (that will fail SSLServerCertVerificationJob, but still trigger the free)? It appears so, as long as https://searchfox.org/mozilla-central/source/security/nss/lib/pki/tdcache.c#718 can be fooled.

Edit: For the first connection you'd need a legitimate server cert chain (since we don't send the client cert until verification is complete), or simply a captured TLS 1.2 handshake...

Flags: needinfo?(kjacobs.bugzilla)

The priority flag is not set for this bug.
:jcj, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jjones)

Something doesn't mesh here. The TSan log shows the server-auth callback running at the same time as the client-cert search; how can that be if they're serial in nature?

Flags: needinfo?(jjones) → needinfo?(kjacobs.bugzilla)

We prevent sending of the client cert until the server chain has been verified. So the Edit only applies to getting a client cert to later reflect and trigger the UAF.

Flags: needinfo?(kjacobs.bugzilla)

The priority flag is not set for this bug.
:jcj, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jjones)

After further conversation, I understand Kevin's comment 8 about how one would go about writing an exploit to reach this UAF. It's possible that this could be provoked in a straightforward manner, which would warrant promotion to sec-high.

Assigning to Kevin.

Dana, we need to talk through how client cert caching works -- and whether we need to use the cache at all there, because that would be the most straightforward fix.

Assignee: nobody → kjacobs.bugzilla
Status: NEW → ASSIGNED
Flags: needinfo?(jjones) → needinfo?(dkeeler)
Priority: -- → P1

I had questioned why these use cases were sharing a single cache (as it would seem to provide very little benefit, if any). Looking back into it though, it makes sense from an NSS perspective and isn't easily changed from PSM like I imagined it might be.

Dana, please correct me if you have a different opinion, but I think NSS just needs to synchronize accesses to these fields.

This cache is entirely an NSS mechanism, so I'm not sure where PSM would come into it.
That said, what is this cache speeding up? Certificate decoding? Perhaps we could remove it entirely.
At the very least, if we need to lock e.g. the trust field as in https://searchfox.org/mozilla-central/rev/ad6234148892bc29bf08d736604ac71925138040/security/nss/lib/pk11wrap/pk11cert.c#381-383, it makes sense we would need to lock other fields as well (incidentally, I noticed that NSS currently uses a global lock for that field, which probably can be improved upon).
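
A minimal sketch of the kind of synchronization being discussed, assuming a single NSPR lock guarding the lazily-filled CERTCertificate fields (analogous to the existing global trust lock). All names below are hypothetical illustrations, not the actual NSS fix:

  /* Sketch only, with hypothetical names -- not the actual NSS change.
   * The idea: bracket every access to the lazily-populated fields with the
   * same lock, so fill_CERTCertificateFields cannot race with readers or
   * with teardown. */
  #include "prlock.h"
  #include "cert.h"

  static PRLock *cert_fields_lock;   /* hypothetical; created during init */

  void hypothetical_init(void) {
    cert_fields_lock = PR_NewLock();
  }

  /* writer side, e.g. inside fill_CERTCertificateFields */
  void hypothetical_fill(CERTCertificate *cc) {
    PR_Lock(cert_fields_lock);
    /* ... populate cc's derived fields here ... */
    PR_Unlock(cert_fields_lock);
  }

  /* Any reader -- and the destruction path -- would take the same lock
   * before touching those fields. */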

Flags: needinfo?(dkeeler)

What specific test triggered this race? I wrote a test for the more trivial exploit, but only managed to trigger bug 1615569.

Flags: needinfo?(choller)

(In reply to Kevin Jacobs [:kjacobs] from comment #13)

What specific test triggered this race? I wrote a test for the more trivial exploit, but only managed to trigger bug 1615569.

As far as I can tell, this happened in or around dom/base/test/test_bug466080.html. The problem is that TSan was running without halt_on_error at the time (to speed up the try runs), so it could also have been one of the tests right before that. The race was also intermittent; I only discovered it pretty late.

If that is not sufficient, we could remove the suppression for a try run and retrigger the relevant test chunks a few times to get more precise information.

Flags: needinfo?(choller)

This has already recurred on CI and was filed under bug 1673960.

I need to revisit my patch w.r.t. Dana's concern in https://phabricator.services.mozilla.com/D64233#1963138. FWIW, my [local] TSAN also stopped reporting this issue shortly after I made that patch.

I'll re-prioritize this, but depending on the report frequency, we might want to re-suppress this one.

Yes, absolutely, just revert my patch in the meantime.

Still an issue, just rare.

Attachment #9184023 - Attachment is obsolete: true
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 3.59
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]

I can't tell if this issue is fixed, and should be included in the internally reported advisories, or just suppressed...

Flags: needinfo?(kjacobs.bugzilla)
Flags: needinfo?(kjacobs.bugzilla)
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main84+r]
Group: core-security-release
