Closed Bug 1278965 Opened 4 years ago Closed 3 years ago

TSan: data race security/nss/lib/certdb/stanpcertdb.c on CERTCertificate::{istemp, isperm}

Categories

(NSS :: Libraries, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jseward, Assigned: franziskus)

Details

(Keywords: csectype-race, sec-moderate)

Attachments

(4 files, 9 obsolete files)

18.74 KB, text/plain
Details
2.21 KB, application/octet-stream
Details
4.39 KB, text/plain
Details
14.25 KB, patch
ttaubert
: review+
Details | Diff | Splinter Review
This is reported with a TSan-enabled build of Firefox, after some
general surfing with my working daily profile.  I don't have better
STR right now; sorry.

My impression from peering at this a bit is that "Thread T4 'Socket
Thread'" allocates a SECKEYPublicKey object (function
seckey_ExtractPublicKey, security/nss/lib/cryptohi/seckey.c:565).
A pointer to it eventually winds up in the hands of both "Thread T78
'SSL Cert #9'" and "Thread T79 'SSL Cert #10'", who both modify it at
security/nss/lib/certdb/stanpcertdb.c:441 and :442, function
CERT_NewTempCertificate):

    cc->istemp = PR_TRUE;
    cc->isperm = PR_FALSE;

and neither thread holds any locks at that point.
Attached file TSan complaints (obsolete) —
Because the code at the race point contains two separate assignments
to the object

    cert->istemp = PR_FALSE;
    cert->isperm = PR_TRUE;

TSan issues two complaints that are completely identical, except one is
for |istemp| and the other for |isperm|.  The attached log contains both.
Could be prod:nss, but more likely our use of it so I'll guess PSM.
Group: core-security → crypto-core-security
Component: Security → Security: PSM
I'm a little confused as to how a SECKEYPublicKey ends up as a CERTCertificate. Is it possible that some code in NSS is not using the right allocation/free functions and is thus confusing the analysis? Any thoughts Julian?
Flags: needinfo?(jseward)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #3)
Ah, you ask a good question.  I think now that the allocation stack is
bogus.  What has happened is that NSS (or some part of this library, at
least) is using an arena allocator, which takes a block of 2048 bytes
from malloc, and hands out parts of it on request.  Hence TSan records,
for the block, the requestor that caused the call to malloc, which in
this case is indeed a request for a SECKEYPublicKey.  But that request
is entirely unrelated to the request to allocate the CERTCertificate,
because the latter is the second-or-later allocation request for this
block.

Makes sense?  So, to progress this, I guess we need to figure out if
it is possible to disable the arena allocator temporarily, and have *all*
allocations done via malloc, so we can see where the allocation really happened.
Flags: needinfo?(jseward) → needinfo?(dkeeler)
After some chasing around the forest, it appears the allocation point
is actually in CERT_DecodeDERCertificate (security/nss/lib/certdb/certdb.c).

Applying the following kludge allows TSan to log the real allocation stack
and thread.  Results in the next comment.  In short, it seems the block is
allocated by "Thread T53 'SSL Cert #1'" but is used by both that thread and
"Thread T54 'SSL Cert #2'", so the question is now, how did that happen?


diff --git a/security/nss/lib/certdb/certdb.c b/security/nss/lib/certdb/certdb.c
--- a/security/nss/lib/certdb/certdb.c
+++ b/security/nss/lib/certdb/certdb.c
@@ -702,17 +702,18 @@ CERT_DecodeDERCertificate(SECItem *derSi
     /* make a new arena */
     arena = PORT_NewArena(DER_DEFAULT_CHUNKSIZE);
 
     if (!arena) {
         return 0;
     }
 
     /* allocate the certificate structure */
-    cert = (CERTCertificate *)PORT_ArenaZAlloc(arena, sizeof(CERTCertificate));
+    //cert = (CERTCertificate *)PORT_ArenaZAlloc(arena, sizeof(CERTCertificate));
+    cert = (CERTCertificate *)calloc(1, sizeof(CERTCertificate));
 
     if (!cert) {
         goto loser;
     }
 
     cert->arena = arena;
 
     if (copyDER) {
Attachment #8761299 - Attachment is obsolete: true
Thanks - that makes a lot more sense. It looks like CERT_NewTempCertificate is racy. I wrote a small testcase to reproduce the issue. It consists of two threads that repeatedly decode and destroy a certificate. I'll attach the files to reproduce the issue shortly.
Assignee: nobody → nobody
Component: Security: PSM → Libraries
Flags: needinfo?(dkeeler)
Product: Core → NSS
Version: unspecified → trunk
(In reply to Julian Seward [:jseward] from comment #4)
> ...  So, to progress this, I guess we need to figure out if
> it is possible to disable the arena allocator temporarily, and have *all*
> allocations done via malloc, so we can see where the allocation really
> happened.

Hi Julian: you can do that by setting the environment variable
NSS_DISABLE_ARENA_FREE_LIST to 1.

I recommend you set NSS_DISABLE_ARENA_FREE_LIST when running valgrind,
ASan, TSan, etc.
(In reply to David Keeler [:keeler] (use needinfo?) from comment #8)
> Created attachment 8764392 [details]
> tmpcert.tar.gz

Good to have a small testcase.  But I can't figure out how to build the
relevant nss standalone.  Probably simplest if we confer on irc.
Here's what worked for me on Fedora Linux:

(Probably best to do this in an empty directory)

hg clone https://hg.mozilla.org/projects/nss
hg clone https://hg.mozilla.org/projects/nspr
cd nss
sed -i s/sanitize=address/sanitize=thread/ coreconf/sanitizers.mk
USE_ASAN=1 USE_64=1 make nss_build_all (this should build NSS (as well as NSPR) and put some things in ../dist/)
cd ..
tar xzf tmpcert.tar.gz
cd tmpcert
make (you might have to modify your Makefile depending on the contents of ../dist/)
certutil -d . -N (don't set a password)
LD_LIBRARY_PATH=../dist/Linux4.5_x86_64_cc_glibc_PTH_64_ASAN_DBG.OBJ/lib/ ./tmpcert (again, depending on what's in ../dist)
Thanks.  In the end I managed to reproduce this with Helgrind without
the need for a special TSan build.  With 10 threads I can get the race
to show within 70 seconds, moderately reliably.

I found it pretty difficult to follow the trail of casting and
blocks-within-blocks.  Looking at the race reports, I got the
impression that this is a case of (effectively) a global variable
being handed out to all threads.  Given that one of the threads
actually allocates the block on the heap, though, what I went looking
for is some kind of caching arrangement in which the "first one in"
allocates the block and all subsequent calls use the cached value.

And I think I found it in stan_GetCERTCertificate.  But I am not 100%
sure of my analysis.  Can you check it?


First thread to stan_GetCERTCertificate finds that c->decoding == NULL

therefore calls dc = nssDecodedPKIXCertificate_Create(NULL, &c->encoding);
which allocates dc and dc->data

and then stores dc back in c

  c->decoding = dc;

Subsequent threads calling stan_GetCERTCertificate find c->decoding is
not NULL.  They therefore go direct to

    cc = (CERTCertificate *)dc->data;

and hence all threads get the same |dc| and |cc| value.  The |cc| value
is returned via STAN_GetCERTCertificate back to CERT_NewTempCertificate
where it is written to, uncoordinated, by all threads:

    cc->istemp = PR_TRUE;
    cc->isperm = PR_FALSE;

Is this plausible?  It does assume that multiple threads call stan_GetCERTCertificate
passing it the same |NSSCertificate *c| value.
Flags: needinfo?(dkeeler)
If the comment 12 analysis is correct, it is not clear to me where the
error really lies.  Should it be that stan_GetCERTCertificate should
not cache values it obtains from nssDecodedPKIXCertificate_Create?  Or
should CERT_NewTempCertificate not call STAN_GetCERTCertificate(c)
with the same |c| value on multiple threads?  Etc.
That analysis seems correct, but I don't have a clear idea of what the error is either. It is my understanding that the underlying data structures are supposed to be shared at some level, but I'm not exactly sure how that fits in here.
Flags: needinfo?(dkeeler)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #14)
Ok .. so who can help us here?  Is there an upstream author that knows
how this is supposed to work?
Flags: needinfo?(dkeeler)
Attached file TSan complaint 2
running tmpcert with more threads produces this stack trace for me.
Assignee: nobody → franziskuskiefer
this fixes TSan complaint 2
According to comment 12 this might fix the race. I can't reproduce the race, so no idea if this actually fixes it.
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #18)
> Created attachment 8766246 [details] [diff] [review]
> race-in-CERT_NewTempCertificate.patch

With this in place, I can no longer detect the race.  So, +1 from me.
Attachment #8766246 - Flags: feedback+
Attachment #8766246 - Flags: review?(ttaubert)
Hey Julian, can you please check this patch as well? Franziskus and I just talked about it and this might be a lock-free solution that removes some probably unneeded code.
Attachment #8766341 - Flags: feedback?(jseward)
Looks like Tim and Franziskus are on it.
Flags: needinfo?(dkeeler)
(In reply to Tim Taubert [:ttaubert] from comment #20)
> Created attachment 8766341 [details] [diff] [review]
> 0001-Bug-1278965-Try-to-fix-TSan-race-in-CERT_NewTempCert.patch
> Hey Julian, can you please check this patch as well?

Yes, this patch tests OK (race-free) too.
Attachment #8766341 - Flags: feedback?(jseward) → feedback+
Comment on attachment 8766341 [details] [diff] [review]
0001-Bug-1278965-Try-to-fix-TSan-race-in-CERT_NewTempCert.patch

Wan-Teh, do you know if there's a reason for this tempCert? Removing it and moving up the istemp and isperm assignments resolves the race and we can't really see a reason for this tempCert procedure.
Attachment #8766341 - Flags: review?(wtc)
Comment on attachment 8766246 [details] [diff] [review]
race-in-CERT_NewTempCertificate.patch

Review of attachment 8766246 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/certdb/stanpcertdb.c
@@ +442,3 @@
>      cc->istemp = PR_TRUE;
>      cc->isperm = PR_FALSE;
> +    CERT_UnlockCertTrust(cc);

The main problem with this patch is that CERT_LockCertTrust(cc)
is intended to protect cc->trust only. So we would be using
the lock for a new purpose. This at least needs to be
documented.  Also, other accesses to cc->istemp and cc->isperm
should also be protected by CERT_LockCertTrust(cc). So this
patch is incomplete even if we want to take this approach.
Attachment #8766246 - Flags: review-
Attachment #8766246 - Flags: review?(ttaubert)
Comment on attachment 8766341 [details] [diff] [review]
0001-Bug-1278965-Try-to-fix-TSan-race-in-CERT_NewTempCert.patch

Review of attachment 8766341 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/certdb/stanpcertdb.c
@@ -431,4 @@
>      NSSCertificate_Destroy(c);
> -    /* and use the stored entry */
> -    c = tempCert;
> -    cc = STAN_GetCERTCertificateOrRelease(c);

It is risky to delete this STAN_GetCERTCertificateOrRelease() call
because its input argument is actually |tempCert| and its return
value may be different from the old value of |cc|.

It seems that ideally cc->istemp and cc->isperm should be set
by STAN_GetCERTCertificateOrRelease() internally, if necessary.
I noticed the following code in fill_CERTCertificateFields:

    /* istemp and isperm are supported in NSS 3.4 */
    cc->istemp = PR_FALSE; /* CERT_NewTemp will override this */
    cc->isperm = PR_TRUE;  /* by default */
    /* pointer back */
    cc->nssCertificate = c;

Maybe it is better to just set cc->istemp and cc->isperm to the
right values there.

Alternatively, we can add a lock or use an existing lock to protect
all accesses to cc->istemp and cc->isperm, and require all accesses
to use getter and setter functions.
Attachment #8766341 - Flags: review?(wtc) → review-
Attached patch istemp.patch (obsolete) — Splinter Review
Per discussion of last week I added a new lock for istemp/isperm.
Seeting istemp/isperm correctly for a new temp certificate doesn't solve the entire problem because pk11cert [1] makes a temp cert permanent again. So at least here we have to change both values. And I don't see why we couldn't run in a similar race at this point when calling PK11_ImportCert multiple times with the same cert.

Julian could you check if this solves the race? I only get other races with the test case :(

Wan-Teh can you review? I added a new lock because I think it's cleaner than renaming the trust lock.

[1] http://searchfox.org/nss/rev/0feb87cf59fba52f407a6b9925f30f3486618b5b/lib/pk11wrap/pk11cert.c#973
Attachment #8766246 - Attachment is obsolete: true
Attachment #8766341 - Attachment is obsolete: true
Attachment #8768691 - Flags: review?(wtc)
Attachment #8768691 - Flags: feedback?(jseward)
Comment on attachment 8768691 [details] [diff] [review]
istemp.patch

Review of attachment 8768691 [details] [diff] [review]:
-----------------------------------------------------------------

Verified -- I can't reproduce the race any more with this patch.
Attachment #8768691 - Flags: feedback?(jseward) → feedback+
Comment on attachment 8768691 [details] [diff] [review]
istemp.patch

Review of attachment 8768691 [details] [diff] [review]:
-----------------------------------------------------------------

The reads of cert->istemp and cert->isperm should
also be protected by this lock, right?
Attached patch istemp.patch (obsolete) — Splinter Review
yes, my search was incomplete ...
According to [1] and [2] I think I caught all references of istemp and isperm now.

[1] http://searchfox.org/nss/search?q=-%3Eistemp&case=false&regexp=false&path=
[2] http://searchfox.org/nss/search?q=-%3Eisperm&case=false&regexp=false&path=
Attachment #8768691 - Attachment is obsolete: true
Attachment #8768691 - Flags: review?(wtc)
Attachment #8769075 - Flags: review?(wtc)
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #29)
> Created attachment 8769075 [details] [diff] [review]
> istemp.patch

This tests OK for me.
Comment on attachment 8769075 [details] [diff] [review]
istemp.patch

Review of attachment 8769075 [details] [diff] [review]:
-----------------------------------------------------------------

Bob: please review this patch.

Franziskus: I suggest some changes.

::: lib/certdb/cert.h
@@ +1427,5 @@
> + */
> +void CERT_LockCertTempPerm(const CERTCertificate *cert);
> +
> +/*
> + * Free the temp/perm lock

Free => Release

"Free" may be misinterpreted as "delete" or "destroy".

Please also fix the comment for CERT_UnlockCertTrust().

@@ +1429,5 @@
> +
> +/*
> + * Free the temp/perm lock
> + */
> +void CERT_UnlockCertTempPerm(const CERTCertificate *cert);

We should also add getter functions for cert->istemp and cert->isperm
(similar to CERT_LockCertTrust()) and export the getter functions in
lib/nss/nss.def. Some code outside NSS uses cert->istemp and cert->isperm
directly and should be changed to use the new getter functions. See

https://dxr.mozilla.org/mozilla-central/search?q=%2Bvar-ref%3ACERTCertificateStr%3A%3Aisperm

https://dxr.mozilla.org/mozilla-central/search?q=%2Bvar-ref%3ACERTCertificateStr%3A%3Aistemp

(It seems that Mozilla only uses cert->isperm.)

CERT_LockCertTempPerm() and CERT_UnlockCertTempPerm() are only used
by NSS itself, so we should declare them in an internal header, such
as lib/certdb/certi.h.
Attachment #8769075 - Flags: superreview?(rrelyea)
Attachment #8769075 - Flags: review?(wtc)
Attachment #8769075 - Flags: review+
Comment on attachment 8769075 [details] [diff] [review]
istemp.patch

Franziskus: please wait for Bob's approval.
Attached patch istemp.patch (obsolete) — Splinter Review
Fixed nits, moved internal functions to certi.h, and added CERT_GetCertIsTemp and CERT_GetCertIsPerm functions.

Bob, can you review?
Attachment #8769075 - Attachment is obsolete: true
Attachment #8769075 - Flags: superreview?(rrelyea)
Attachment #8770926 - Flags: review?(rrelyea)
Bob, FYI, there's a review waiting for you, moderate issue.
Flags: needinfo?(rrelyea)
Attached patch tsan-race.patch (obsolete) — Splinter Review
rebased, let's get this landed.
Attachment #8770926 - Attachment is obsolete: true
Attachment #8770926 - Flags: review?(rrelyea)
Attachment #8834814 - Flags: review?(ttaubert)
Attachment #8766240 - Attachment is obsolete: true
Comment on attachment 8834814 [details] [diff] [review]
tsan-race.patch

Review of attachment 8834814 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/certdb/certdb.c
@@ +2877,5 @@
> +CERT_LockCertTempPerm(const CERTCertificate *cert)
> +{
> +    PORT_Assert(certTempPermLock != NULL);
> +    PZ_Lock(certTempPermLock);
> +    return;

nit: no need to leave the return here

::: lib/certdb/stanpcertdb.c
@@ +942,5 @@
>      }
>  
> +    CERT_LockCertTempPerm(cert);
> +    isperm = cert->isperm;
> +    CERT_UnlockCertTempPerm(cert);

Should use CERT_GetCertIsPerm().

@@ +998,5 @@
>      return rvItem;
>  }
>  
> +SECStatus
> +CERT_GetCertIsPerm(const CERTCertificate *cert, PRBool *istemp)

Wrong arg name.

@@ +1005,5 @@
> +    CERT_LockCertTempPerm(cert);
> +    if (cert == NULL) {
> +        rv = SECFailure;
> +    } else {
> +        *istemp = cert->istemp;

This reads the wrong field.

@@ +1013,5 @@
> +    return rv;
> +}
> +
> +SECStatus
> +CERT_GetCertIsTemp(const CERTCertificate *cert, PRBool *isperm)

Wrong arg name here too.

@@ +1020,5 @@
> +    CERT_LockCertTempPerm(cert);
> +    if (cert == NULL) {
> +        rv = SECFailure;
> +    } else {
> +        *isperm = cert->isperm;

This also reads the wrong field.

::: lib/certhigh/certhigh.c
@@ +952,5 @@
>  
>          /* if the cert is temp, make it perm; otherwise we're done */
> +        CERT_LockCertTempPerm(cert);
> +        istemp = cert->istemp;
> +        CERT_UnlockCertTempPerm(cert);

Should use CERT_GetCertIsTemp().
Attachment #8834814 - Flags: review?(ttaubert)
Attached patch tsan-race.patch (obsolete) — Splinter Review
Attachment #8834814 - Attachment is obsolete: true
Attachment #8839409 - Flags: review?(ttaubert)
Comment on attachment 8839409 [details] [diff] [review]
tsan-race.patch

Review of attachment 8839409 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/certdb/stanpcertdb.c
@@ +940,5 @@
>              return SECFailure;
>          }
>      }
>  
> +    CERT_GetCertIsPerm(cert, &isperm);

Should check the return value.

@@ +1001,5 @@
> +{
> +    SECStatus rv;
> +    CERT_LockCertTempPerm(cert);
> +    if (cert == NULL) {
> +        rv = SECFailure;

I think we should bail out with SECFailure when cert == NULL before calling CERT_LockCertTempPerm(). I know we don't have per-cert locks currently but this is cleaner should we ever get around to do this properly.

@@ +1016,5 @@
> +{
> +    SECStatus rv;
> +    CERT_LockCertTempPerm(cert);
> +    if (cert == NULL) {
> +        rv = SECFailure;

Same as above.

::: lib/certhigh/certhigh.c
@@ +950,5 @@
>              goto loser;
>          }
>  
>          /* if the cert is temp, make it perm; otherwise we're done */
> +        CERT_GetCertIsTemp(cert, &istemp);

Should check the return value here too.
Attachment #8839409 - Flags: review?(ttaubert)
Attached patch tsan-race.patchSplinter Review
* checking return values
* early return in CERT_GetCertIs*
Attachment #8839409 - Attachment is obsolete: true
Attachment #8840427 - Flags: review?(ttaubert)
Comment on attachment 8840427 [details] [diff] [review]
tsan-race.patch

Review of attachment 8840427 [details] [diff] [review]:
-----------------------------------------------------------------

r+ but as said below this should probably wait until 3.31.

::: lib/certdb/stanpcertdb.c
@@ +1008,5 @@
> +    }
> +
> +    CERT_LockCertTempPerm(cert);
> +    *isperm = cert->isperm;
> +    rv = SECSuccess;

nit: I'd move this out of the locked area, or maybe just `return SECSuccess`

@@ +1023,5 @@
> +    }
> +
> +    CERT_LockCertTempPerm(cert);
> +    *istemp = cert->istemp;
> +    rv = SECSuccess;

nit: I'd move this out of the locked area, or maybe just `return SECSuccess`

::: lib/certhigh/certhigh.c
@@ +952,5 @@
>  
>          /* if the cert is temp, make it perm; otherwise we're done */
> +        rv = CERT_GetCertIsTemp(cert, &istemp);
> +        if (rv != SECSuccess) {
> +            goto loser;

We'd leak `cert` here without a call to `CERT_DestroyCertificate(cert)`. You could change this to:

rv = CERT_GetCertIsTemp(cert, &istemp);
if (rv == SECSuccess && istemp) {
    ...
}

CERT_DestroyCertificate(cert);
cert = NULL;

if (rv != SECSuccess) {
    goto loser;
}

::: lib/nss/nss.def
@@ +1101,5 @@
>  ;+    global:
>  CERT_CompareAVA;
>  PK11_HasAttributeSet;
> +CERT_GetCertIsPerm;
> +CERT_GetCertIsTemp;

You've probably seen the thread on nss-dev, RH would prefer us to hold this until NSS 3.31.
Attachment #8840427 - Flags: review?(ttaubert) → review+
https://hg.mozilla.org/projects/nss/rev/545e059dbb174a644b4f5e457a7ad2c816b49357
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(rrelyea)
Resolution: --- → FIXED
Target Milestone: --- → 3.31
This needs another try. It's deadlocking :( https://hg.mozilla.org/projects/nss/rev/14784c8d7dbac574f54bb120f304942b89811fac
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
sigh, the pki code tends to call back into the NSS code for services. stan_GetCERTCertificate tends to act like a 'high level' function called from a low level. Adding locks around it could potentially wind up in deadlock conditions. One way to work around this (which is ugly) is to use monitors rather than locks.
The deadlock was a result of a stupid c/p mistake of mine... External pki code has to make sure to use the new CERT_GetCertIsPerm/CERT_GetCertIsTemp functions that do the locking. I don't think this should make any problems. I re-landed it (try looks good https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=896e3eb3a79933a51886949c7adb67ef37b721c0&filter-tier=1&filter-tier=2&filter-tier=3).

https://hg.mozilla.org/projects/nss/rev/896e3eb3a79933a51886949c7adb67ef37b721c0
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Group: crypto-core-security → core-security-release
In bug 1366761 I've added automatic detection of ABI changes.

To approve that the API additions from this bug are permissible ABI changes, I've made the following commit, which adds these new APIs as expected items in the abidiff report:
https://hg.mozilla.org/projects/nss/rev/713b15751ad2
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.