Closed Bug 1278965 Opened 9 years ago Closed 8 years ago

TSan: data race security/nss/lib/certdb/stanpcertdb.c on CERTCertificate::{istemp, isperm}

Categories

(NSS :: Libraries, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jseward, Assigned: franziskus)

Details

(Keywords: csectype-race, sec-moderate)

Attachments

(4 files, 9 obsolete files)

18.74 KB, text/plain
Details
2.21 KB, application/octet-stream
Details
4.39 KB, text/plain
Details
14.25 KB, patch
ttaubert
: review+
Details | Diff | Splinter Review
This is reported with a TSan-enabled build of Firefox, after some general surfing with my working daily profile. I don't have better STR right now; sorry. My impression from peering at this a bit is that "Thread T4 'Socket Thread'" allocates a SECKEYPublicKey object (function seckey_ExtractPublicKey, security/nss/lib/cryptohi/seckey.c:565). A pointer to it eventually winds up in the hands of both "Thread T78 'SSL Cert #9'" and "Thread T79 'SSL Cert #10'", who both modify it at security/nss/lib/certdb/stanpcertdb.c:441 and :442, function CERT_NewTempCertificate): cc->istemp = PR_TRUE; cc->isperm = PR_FALSE; and neither thread holds any locks at that point.
Attached file TSan complaints (obsolete) —
Because the code at the race point contains two separate assignments to the object cert->istemp = PR_FALSE; cert->isperm = PR_TRUE; TSan issues two complaints that are completely identical, except one is for |istemp| and the other for |isperm|. The attached log contains both.
Could be prod:nss, but more likely our use of it so I'll guess PSM.
Group: core-security → crypto-core-security
Component: Security → Security: PSM
I'm a little confused as to how a SECKEYPublicKey ends up as a CERTCertificate. Is it possible that some code in NSS is not using the right allocation/free functions and is thus confusing the analysis? Any thoughts Julian?
Flags: needinfo?(jseward)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #3) Ah, you ask a good question. I think now that the allocation stack is bogus. What has happened is that NSS (or some part of this library, at least) is using an arena allocator, which takes a block of 2048 bytes from malloc, and hands out parts of it on request. Hence TSan records, for the block, the requestor that caused the call to malloc, which in this case is indeed a request for a SECKEYPublicKey. But that request is entirely unrelated to the request to allocate the CERTCertificate, because the latter is the second-or-later allocation request for this block. Makes sense? So, to progress this, I guess we need to figure out if it is possible to disable the arena allocator temporarily, and have *all* allocations done via malloc, so we can see where the allocation really happened.
Flags: needinfo?(jseward) → needinfo?(dkeeler)
After some chasing around the forest, it appears the allocation point is actually in CERT_DecodeDERCertificate (security/nss/lib/certdb/certdb.c). Applying the following kludge allows TSan to log the real allocation stack and thread. Results in the next comment. In short, it seems the block is allocated by "Thread T53 'SSL Cert #1'" but is used by both that thread and "Thread T54 'SSL Cert #2'", so the question is now, how did that happen? diff --git a/security/nss/lib/certdb/certdb.c b/security/nss/lib/certdb/certdb.c --- a/security/nss/lib/certdb/certdb.c +++ b/security/nss/lib/certdb/certdb.c @@ -702,17 +702,18 @@ CERT_DecodeDERCertificate(SECItem *derSi /* make a new arena */ arena = PORT_NewArena(DER_DEFAULT_CHUNKSIZE); if (!arena) { return 0; } /* allocate the certificate structure */ - cert = (CERTCertificate *)PORT_ArenaZAlloc(arena, sizeof(CERTCertificate)); + //cert = (CERTCertificate *)PORT_ArenaZAlloc(arena, sizeof(CERTCertificate)); + cert = (CERTCertificate *)calloc(1, sizeof(CERTCertificate)); if (!cert) { goto loser; } cert->arena = arena; if (copyDER) {
Attachment #8761299 - Attachment is obsolete: true
Thanks - that makes a lot more sense. It looks like CERT_NewTempCertificate is racy. I wrote a small testcase to reproduce the issue. It consists of two threads that repeatedly decode and destroy a certificate. I'll attach the files to reproduce the issue shortly.
Assignee: nobody → nobody
Component: Security: PSM → Libraries
Flags: needinfo?(dkeeler)
Product: Core → NSS
Version: unspecified → trunk
(In reply to Julian Seward [:jseward] from comment #4) > ... So, to progress this, I guess we need to figure out if > it is possible to disable the arena allocator temporarily, and have *all* > allocations done via malloc, so we can see where the allocation really > happened. Hi Julian: you can do that by setting the environment variable NSS_DISABLE_ARENA_FREE_LIST to 1. I recommend you set NSS_DISABLE_ARENA_FREE_LIST when running valgrind, ASan, TSan, etc.
(In reply to David Keeler [:keeler] (use needinfo?) from comment #8) > Created attachment 8764392 [details] > tmpcert.tar.gz Good to have a small testcase. But I can't figure out how to build the relevant nss standalone. Probably simplest if we confer on irc.
Here's what worked for me on Fedora Linux: (Probably best to do this in an empty directory) hg clone https://hg.mozilla.org/projects/nss hg clone https://hg.mozilla.org/projects/nspr cd nss sed -i s/sanitize=address/sanitize=thread/ coreconf/sanitizers.mk USE_ASAN=1 USE_64=1 make nss_build_all (this should build NSS (as well as NSPR) and put some things in ../dist/) cd .. tar xzf tmpcert.tar.gz cd tmpcert make (you might have to modify your Makefile depending on the contents of ../dist/) certutil -d . -N (don't set a password) LD_LIBRARY_PATH=../dist/Linux4.5_x86_64_cc_glibc_PTH_64_ASAN_DBG.OBJ/lib/ ./tmpcert (again, depending on what's in ../dist)
Thanks. In the end I managed to reproduce this with Helgrind without the need for a special TSan build. With 10 threads I can get the race to show within 70 seconds, moderately reliably. I found it pretty difficult to follow the trail of casting and blocks-within-blocks. Looking at the race reports, I got the impression that this is a case of (effectively) a global variable being handed out to all threads. Given that one of the threads actually allocates the block on the heap, though, what I went looking for is some kind of caching arrangement in which the "first one in" allocates the block and all subsequent calls use the cached value. And I think I found it in stan_GetCERTCertificate. But I am not 100% sure of my analysis. Can you check it? First thread to stan_GetCERTCertificate finds that c->decoding == NULL therefore calls dc = nssDecodedPKIXCertificate_Create(NULL, &c->encoding); which allocates dc and dc->data and then stores dc back in c c->decoding = dc; Subsequent threads calling stan_GetCERTCertificate find c->decoding is not NULL. They therefore go direct to cc = (CERTCertificate *)dc->data; and hence all threads get the same |dc| and |cc| value. The |cc| value is returned via STAN_GetCERTCertificate back to CERT_NewTempCertificate where it is written to, uncoordinated, by all threads: cc->istemp = PR_TRUE; cc->isperm = PR_FALSE; Is this plausible? It does assume that multiple threads call stan_GetCERTCertificate passing it the same |NSSCertificate *c| value.
Flags: needinfo?(dkeeler)
If the comment 12 analysis is correct, it is not clear to me where the error really lies. Should it be that stan_GetCERTCertificate should not cache values it obtains from nssDecodedPKIXCertificate_Create? Or should CERT_NewTempCertificate not call STAN_GetCERTCertificate(c) with the same |c| value on multiple threads? Etc.
That analysis seems correct, but I don't have a clear idea of what the error is either. It is my understanding that the underlying data structures are supposed to be shared at some level, but I'm not exactly sure how that fits in here.
Flags: needinfo?(dkeeler)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #14) Ok .. so who can help us here? Is there an upstream author that knows how this is supposed to work?
Flags: needinfo?(dkeeler)
Attached file TSan complaint 2
running tmpcert with more threads produces this stack trace for me.
Assignee: nobody → franziskuskiefer
this fixes TSan complaint 2
According to comment 12 this might fix the race. I can't reproduce the race, so no idea if this actually fixes it.
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #18) > Created attachment 8766246 [details] [diff] [review] > race-in-CERT_NewTempCertificate.patch With this in place, I can no longer detect the race. So, +1 from me.
Attachment #8766246 - Flags: feedback+
Attachment #8766246 - Flags: review?(ttaubert)
Hey Julian, can you please check this patch as well? Franziskus and I just talked about it and this might be a lock-free solution that removes some probably unneeded code.
Attachment #8766341 - Flags: feedback?(jseward)
Looks like Tim and Franziskus are on it.
Flags: needinfo?(dkeeler)
(In reply to Tim Taubert [:ttaubert] from comment #20) > Created attachment 8766341 [details] [diff] [review] > 0001-Bug-1278965-Try-to-fix-TSan-race-in-CERT_NewTempCert.patch > Hey Julian, can you please check this patch as well? Yes, this patch tests OK (race-free) too.
Attachment #8766341 - Flags: feedback?(jseward) → feedback+
Comment on attachment 8766341 [details] [diff] [review] 0001-Bug-1278965-Try-to-fix-TSan-race-in-CERT_NewTempCert.patch Wan-Teh, do you know if there's a reason for this tempCert? Removing it and moving up the istemp and isperm assignments resolves the race and we can't really see a reason for this tempCert procedure.
Attachment #8766341 - Flags: review?(wtc)
Comment on attachment 8766246 [details] [diff] [review] race-in-CERT_NewTempCertificate.patch Review of attachment 8766246 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/certdb/stanpcertdb.c @@ +442,3 @@ > cc->istemp = PR_TRUE; > cc->isperm = PR_FALSE; > + CERT_UnlockCertTrust(cc); The main problem with this patch is that CERT_LockCertTrust(cc) is intended to protect cc->trust only. So we would be using the lock for a new purpose. This at least needs to be documented. Also, other accesses to cc->istemp and cc->isperm should also be protected by CERT_LockCertTrust(cc). So this patch is incomplete even if we want to take this approach.
Attachment #8766246 - Flags: review-
Attachment #8766246 - Flags: review?(ttaubert)
Comment on attachment 8766341 [details] [diff] [review] 0001-Bug-1278965-Try-to-fix-TSan-race-in-CERT_NewTempCert.patch Review of attachment 8766341 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/certdb/stanpcertdb.c @@ -431,4 @@ > NSSCertificate_Destroy(c); > - /* and use the stored entry */ > - c = tempCert; > - cc = STAN_GetCERTCertificateOrRelease(c); It is risky to delete this STAN_GetCERTCertificateOrRelease() call because its input argument is actually |tempCert| and its return value may be different from the old value of |cc|. It seems that ideally cc->istemp and cc->isperm should be set by STAN_GetCERTCertificateOrRelease() internally, if necessary. I noticed the following code in fill_CERTCertificateFields: /* istemp and isperm are supported in NSS 3.4 */ cc->istemp = PR_FALSE; /* CERT_NewTemp will override this */ cc->isperm = PR_TRUE; /* by default */ /* pointer back */ cc->nssCertificate = c; Maybe it is better to just set cc->istemp and cc->isperm to the right values there. Alternatively, we can add a lock or use an existing lock to protect all accesses to cc->istemp and cc->isperm, and require all accesses to use getter and setter functions.
Attachment #8766341 - Flags: review?(wtc) → review-
Attached patch istemp.patch (obsolete) — Splinter Review
Per discussion of last week I added a new lock for istemp/isperm. Seeting istemp/isperm correctly for a new temp certificate doesn't solve the entire problem because pk11cert [1] makes a temp cert permanent again. So at least here we have to change both values. And I don't see why we couldn't run in a similar race at this point when calling PK11_ImportCert multiple times with the same cert. Julian could you check if this solves the race? I only get other races with the test case :( Wan-Teh can you review? I added a new lock because I think it's cleaner than renaming the trust lock. [1] http://searchfox.org/nss/rev/0feb87cf59fba52f407a6b9925f30f3486618b5b/lib/pk11wrap/pk11cert.c#973
Attachment #8766246 - Attachment is obsolete: true
Attachment #8766341 - Attachment is obsolete: true
Attachment #8768691 - Flags: review?(wtc)
Attachment #8768691 - Flags: feedback?(jseward)
Comment on attachment 8768691 [details] [diff] [review] istemp.patch Review of attachment 8768691 [details] [diff] [review]: ----------------------------------------------------------------- Verified -- I can't reproduce the race any more with this patch.
Attachment #8768691 - Flags: feedback?(jseward) → feedback+
Comment on attachment 8768691 [details] [diff] [review] istemp.patch Review of attachment 8768691 [details] [diff] [review]: ----------------------------------------------------------------- The reads of cert->istemp and cert->isperm should also be protected by this lock, right?
Attached patch istemp.patch (obsolete) — Splinter Review
yes, my search was incomplete ... According to [1] and [2] I think I caught all references of istemp and isperm now. [1] http://searchfox.org/nss/search?q=-%3Eistemp&case=false&regexp=false&path= [2] http://searchfox.org/nss/search?q=-%3Eisperm&case=false&regexp=false&path=
Attachment #8768691 - Attachment is obsolete: true
Attachment #8768691 - Flags: review?(wtc)
Attachment #8769075 - Flags: review?(wtc)
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #29) > Created attachment 8769075 [details] [diff] [review] > istemp.patch This tests OK for me.
Comment on attachment 8769075 [details] [diff] [review] istemp.patch Review of attachment 8769075 [details] [diff] [review]: ----------------------------------------------------------------- Bob: please review this patch. Franziskus: I suggest some changes. ::: lib/certdb/cert.h @@ +1427,5 @@ > + */ > +void CERT_LockCertTempPerm(const CERTCertificate *cert); > + > +/* > + * Free the temp/perm lock Free => Release "Free" may be misinterpreted as "delete" or "destroy". Please also fix the comment for CERT_UnlockCertTrust(). @@ +1429,5 @@ > + > +/* > + * Free the temp/perm lock > + */ > +void CERT_UnlockCertTempPerm(const CERTCertificate *cert); We should also add getter functions for cert->istemp and cert->isperm (similar to CERT_LockCertTrust()) and export the getter functions in lib/nss/nss.def. Some code outside NSS uses cert->istemp and cert->isperm directly and should be changed to use the new getter functions. See https://dxr.mozilla.org/mozilla-central/search?q=%2Bvar-ref%3ACERTCertificateStr%3A%3Aisperm https://dxr.mozilla.org/mozilla-central/search?q=%2Bvar-ref%3ACERTCertificateStr%3A%3Aistemp (It seems that Mozilla only uses cert->isperm.) CERT_LockCertTempPerm() and CERT_UnlockCertTempPerm() are only used by NSS itself, so we should declare them in an internal header, such as lib/certdb/certi.h.
Attachment #8769075 - Flags: superreview?(rrelyea)
Attachment #8769075 - Flags: review?(wtc)
Attachment #8769075 - Flags: review+
Comment on attachment 8769075 [details] [diff] [review] istemp.patch Franziskus: please wait for Bob's approval.
Attached patch istemp.patch (obsolete) — Splinter Review
Fixed nits, moved internal functions to certi.h, and added CERT_GetCertIsTemp and CERT_GetCertIsPerm functions. Bob, can you review?
Attachment #8769075 - Attachment is obsolete: true
Attachment #8769075 - Flags: superreview?(rrelyea)
Attachment #8770926 - Flags: review?(rrelyea)
Bob, FYI, there's a review waiting for you, moderate issue.
Flags: needinfo?(rrelyea)
Attached patch tsan-race.patch (obsolete) — Splinter Review
rebased, let's get this landed.
Attachment #8770926 - Attachment is obsolete: true
Attachment #8770926 - Flags: review?(rrelyea)
Attachment #8834814 - Flags: review?(ttaubert)
Attachment #8766240 - Attachment is obsolete: true
Comment on attachment 8834814 [details] [diff] [review] tsan-race.patch Review of attachment 8834814 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/certdb/certdb.c @@ +2877,5 @@ > +CERT_LockCertTempPerm(const CERTCertificate *cert) > +{ > + PORT_Assert(certTempPermLock != NULL); > + PZ_Lock(certTempPermLock); > + return; nit: no need to leave the return here ::: lib/certdb/stanpcertdb.c @@ +942,5 @@ > } > > + CERT_LockCertTempPerm(cert); > + isperm = cert->isperm; > + CERT_UnlockCertTempPerm(cert); Should use CERT_GetCertIsPerm(). @@ +998,5 @@ > return rvItem; > } > > +SECStatus > +CERT_GetCertIsPerm(const CERTCertificate *cert, PRBool *istemp) Wrong arg name. @@ +1005,5 @@ > + CERT_LockCertTempPerm(cert); > + if (cert == NULL) { > + rv = SECFailure; > + } else { > + *istemp = cert->istemp; This reads the wrong field. @@ +1013,5 @@ > + return rv; > +} > + > +SECStatus > +CERT_GetCertIsTemp(const CERTCertificate *cert, PRBool *isperm) Wrong arg name here too. @@ +1020,5 @@ > + CERT_LockCertTempPerm(cert); > + if (cert == NULL) { > + rv = SECFailure; > + } else { > + *isperm = cert->isperm; This also reads the wrong field. ::: lib/certhigh/certhigh.c @@ +952,5 @@ > > /* if the cert is temp, make it perm; otherwise we're done */ > + CERT_LockCertTempPerm(cert); > + istemp = cert->istemp; > + CERT_UnlockCertTempPerm(cert); Should use CERT_GetCertIsTemp().
Attachment #8834814 - Flags: review?(ttaubert)
Attached patch tsan-race.patch (obsolete) — Splinter Review
Attachment #8834814 - Attachment is obsolete: true
Attachment #8839409 - Flags: review?(ttaubert)
Comment on attachment 8839409 [details] [diff] [review] tsan-race.patch Review of attachment 8839409 [details] [diff] [review]: ----------------------------------------------------------------- ::: lib/certdb/stanpcertdb.c @@ +940,5 @@ > return SECFailure; > } > } > > + CERT_GetCertIsPerm(cert, &isperm); Should check the return value. @@ +1001,5 @@ > +{ > + SECStatus rv; > + CERT_LockCertTempPerm(cert); > + if (cert == NULL) { > + rv = SECFailure; I think we should bail out with SECFailure when cert == NULL before calling CERT_LockCertTempPerm(). I know we don't have per-cert locks currently but this is cleaner should we ever get around to do this properly. @@ +1016,5 @@ > +{ > + SECStatus rv; > + CERT_LockCertTempPerm(cert); > + if (cert == NULL) { > + rv = SECFailure; Same as above. ::: lib/certhigh/certhigh.c @@ +950,5 @@ > goto loser; > } > > /* if the cert is temp, make it perm; otherwise we're done */ > + CERT_GetCertIsTemp(cert, &istemp); Should check the return value here too.
Attachment #8839409 - Flags: review?(ttaubert)
Attached patch tsan-race.patchSplinter Review
* checking return values * early return in CERT_GetCertIs*
Attachment #8839409 - Attachment is obsolete: true
Attachment #8840427 - Flags: review?(ttaubert)
Comment on attachment 8840427 [details] [diff] [review] tsan-race.patch Review of attachment 8840427 [details] [diff] [review]: ----------------------------------------------------------------- r+ but as said below this should probably wait until 3.31. ::: lib/certdb/stanpcertdb.c @@ +1008,5 @@ > + } > + > + CERT_LockCertTempPerm(cert); > + *isperm = cert->isperm; > + rv = SECSuccess; nit: I'd move this out of the locked area, or maybe just `return SECSuccess` @@ +1023,5 @@ > + } > + > + CERT_LockCertTempPerm(cert); > + *istemp = cert->istemp; > + rv = SECSuccess; nit: I'd move this out of the locked area, or maybe just `return SECSuccess` ::: lib/certhigh/certhigh.c @@ +952,5 @@ > > /* if the cert is temp, make it perm; otherwise we're done */ > + rv = CERT_GetCertIsTemp(cert, &istemp); > + if (rv != SECSuccess) { > + goto loser; We'd leak `cert` here without a call to `CERT_DestroyCertificate(cert)`. You could change this to: rv = CERT_GetCertIsTemp(cert, &istemp); if (rv == SECSuccess && istemp) { ... } CERT_DestroyCertificate(cert); cert = NULL; if (rv != SECSuccess) { goto loser; } ::: lib/nss/nss.def @@ +1101,5 @@ > ;+ global: > CERT_CompareAVA; > PK11_HasAttributeSet; > +CERT_GetCertIsPerm; > +CERT_GetCertIsTemp; You've probably seen the thread on nss-dev, RH would prefer us to hold this until NSS 3.31.
Attachment #8840427 - Flags: review?(ttaubert) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(rrelyea)
Resolution: --- → FIXED
Target Milestone: --- → 3.31
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
sigh, the pki code tends to call back into the NSS code for services. stan_GetCERTCertificate tends to act like a 'high level' function called from a low level. Adding locks around it could potentially wind up in deadlock conditions. One way to work around this (which is ugly) is to use monitors rather than locks.
The deadlock was a result of a stupid c/p mistake of mine... External pki code has to make sure to use the new CERT_GetCertIsPerm/CERT_GetCertIsTemp functions that do the locking. I don't think this should make any problems. I re-landed it (try looks good https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=896e3eb3a79933a51886949c7adb67ef37b721c0&filter-tier=1&filter-tier=2&filter-tier=3). https://hg.mozilla.org/projects/nss/rev/896e3eb3a79933a51886949c7adb67ef37b721c0
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Group: crypto-core-security → core-security-release
In bug 1366761 I've added automatic detection of ABI changes. To approve that the API additions from this bug are permissible ABI changes, I've made the following commit, which adds these new APIs as expected items in the abidiff report: https://hg.mozilla.org/projects/nss/rev/713b15751ad2
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: