ThreadSanitizer: data race [@ fill_CERTCertificateFields] vs. [@ CERT_DestroyCertificate] with potential use-after-free
Categories
(NSS :: Libraries, defect, P1)
Tracking
(firefox-esr68 wontfix, firefox-esr78 wontfix, firefox72 wontfix, firefox73 wontfix, firefox74 wontfix, firefox82 wontfix, firefox83 wontfix, firefox84 fixed)
People
(Reporter: decoder, Assigned: kjacobs)
References
(Blocks 1 open bug)
Details
(Keywords: csectype-race, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main84+r])
Crash Data
Attachments
(3 files, 1 obsolete file)
The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision bc5880b621d5.
In this race it looks like one thread is about to destroy an NSSCertificate
while the main thread is still working on it. Filing this as a security bug because this can lead to a potential use-after-free. I don't know who exactly is responsible for the proper synchronization, but since PSM seems to be the owner of the freeing thread, filing it here.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp
. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Reporter | ||
Comment 1•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
![]() |
||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 3•5 years ago
|
||
Could be sec-high if this race could be forced to happen reliably in practice.
Comment 4•5 years ago
|
||
This race involves two simultaneous PSM operations:
- Fetching all client auth certificates, using
ClientAuthDataRunnable
innsNSSIOLayer.cpp
, and - Asynchronously verifying the server certificate that we're connected to using
SSLServerCertVerificationJob
inSSLServerCertVerification.cpp
, which is the start of the NSS callback that ultimately runsmozilla::pkix
From inspection, it appears to me the obvious way to trigger this would be to connect to a server which is using a certificate that Firefox also has as a client cert, and has the private key for. The client auth code might choose the server cert from the cache, which then gets deleted when moz::pkix finishes.
Unfortunately, this happens early enough in cert verification that there's no need for the server certificate to actually be trusted in any way, but it does require Firefox to be configured with such a cert in the first place, and I don't believe it's possible to set a server cert for WebRTC, the obvious way to meet this minimums.
Assignee | ||
Comment 5•5 years ago
•
|
||
Thanks, I didn't get a chance to dig into this until today.
Could an attacker interrogate the client for a legitimate client cert in one connection, then subsequently present it as a server cert (that will fail SSLServerCertVerificationJob
, but still trigger the free)? It appears so, as long as https://searchfox.org/mozilla-central/source/security/nss/lib/pki/tdcache.c#718 can be fooled.
Edit: For the first connection you'd need a legitimate server cert chain (since we don't send the client cert until verification is complete), or simply a captured TLS 1.2 handshake...
Comment 6•5 years ago
|
||
The priority flag is not set for this bug.
:jcj, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 7•5 years ago
|
||
Something doesn't mesh here. The TSAN log shows server-auth callback is running at the same time as the client-cert search is occurring; how can that be if they're serial in nature?
Assignee | ||
Comment 8•5 years ago
•
|
||
We prevent sending of the client cert until the server chain has been verified. So the Edit only applies to getting a client cert to later reflect and trigger the UAF.
Comment 9•5 years ago
|
||
The priority flag is not set for this bug.
:jcj, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 10•5 years ago
|
||
After further conversation, I understand Kevin's comment 8 about how to go about writing an exploit to reach this UAF. It's possible that this could be provoked in a straightforward manner, which would warrant promotion to sec-high.
Assigning to Kevin.
Dana, we need to talk through how client cert caching works -- and whether we need to use the cache at all there, because that would be the most straightforward fix.
Assignee | ||
Comment 11•5 years ago
•
|
||
I had questioned why these use cases were sharing a single cache (as it would seem to provide very little benefit, if any). Looking back into it though, it makes sense from an NSS perspective and isn't easily changed from PSM like I imagined it might be.
Dana, please correct me if you have a different opinion, but I think NSS just needs to synchronize accesses to these fields.
![]() |
||
Comment 12•5 years ago
|
||
This cache is entirely an NSS mechanism, so I'm not sure where PSM would come into it.
That said, what is this cache speeding up? Certificate decoding? Perhaps we could remove it entirely.
At the very least, if we need to lock e.g. the trust
field as in https://searchfox.org/mozilla-central/rev/ad6234148892bc29bf08d736604ac71925138040/security/nss/lib/pk11wrap/pk11cert.c#381-383, it makes sense we would need to lock other fields as well (incidentally, I noticed that NSS currently uses a global lock for that field, which probably can be improved upon).
Assignee | ||
Comment 13•5 years ago
|
||
What specific test triggered this race? I wrote a test for the more trivial exploit, but only managed to trigger bug 1615569.
Reporter | ||
Comment 14•5 years ago
|
||
(In reply to Kevin Jacobs [:kjacobs] from comment #13)
What specific test triggered this race? I wrote a test for the more trivial exploit, but only managed to trigger bug 1615569.
As far as I can tell, this happened in or around dom/base/test/test_bug466080.html
. The problem is that TSan was running without halt_on_error
at that time to speed up the try runs, so it could also have been one of the tests right before that. The race was also intermittent, I discovered it pretty late.
If that is not sufficient, we could remove the suppression for a try run and retrigger the relevant test chunks a few times to get more precise information.
Assignee | ||
Comment 15•5 years ago
|
||
Comment 16•4 years ago
|
||
![]() |
||
Comment 17•4 years ago
|
||
Remove supression for seemingly fixed issue. r=decoder
https://hg.mozilla.org/mozilla-central/rev/c5d8b4f34c06
Assignee | ||
Comment 18•4 years ago
|
||
This already recurred on CI and, filed under bug 1673960.
I need to revisit my patch w.r.t. Dana's concern in https://phabricator.services.mozilla.com/D64233#1963138. FWIW, my [local] TSAN also stopped reporting this issue shortly after I made that patch.
I'll re-prioritize this, but depending on the report frequency, we might want to re-suppress this one.
Comment 19•4 years ago
|
||
Yes, absolutely just revert my patch in the mean-time.
Comment 20•4 years ago
|
||
Still an issue, just rare.
Updated•4 years ago
|
![]() |
||
Comment 22•4 years ago
|
||
Backed out changeset c5d8b4f34c06 r=decoder
https://hg.mozilla.org/integration/autoland/rev/790a8aef5dff53355605d69376f2974d92a050d9
https://hg.mozilla.org/mozilla-central/rev/790a8aef5dff
Comment 24•4 years ago
|
||
Comment 25•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Comment 26•4 years ago
|
||
I can't tell if this issue is fixed, and should be included in the internally reported advisories, or just suppressed...
Assignee | ||
Comment 27•4 years ago
|
||
The issue is fixed. Suppression removed in https://hg.mozilla.org/mozilla-central/rev/4836d57847c7bf330c7c89101bcb144778607c3f
Updated•4 years ago
|
Updated•3 years ago
|
Description
•