pkix_Build_GatherCerts() + pkix_CacheCert_Add() can corrupt "cachedCertTable"
Categories
(NSS :: Libraries, defect, P3)
Tracking
(Not tracked)
People
(Reporter: andrew.cagney, Unassigned)
Details
(Whiteboard: [nss-fx])
Attachments
(3 files)
1.09 KB,
text/plain
|
Details | |
1.59 KB,
patch
|
Details | Diff | Splinter Review | |
1.14 KB,
patch
|
Details | Diff | Splinter Review |
(this bug is old)
pkix_Build_GatherCerts() has two code paths for creating the list "certsFound":
-
pkix_CacheCert_Lookup()
this sets "certsFound" to a new list
"certsFound" and "cachedCertTable" share items but not the list -
pkix_CacheCert_Add(pkix_pl_Pk11CertStore_CertQuery())
this sets "certsFound" to a new list; and then adds the list to "cachedCertTable"
"certsFound" and "cachedCertTable" share a linked list
Because the latter doesn't create a separate list, deleting list elements from "certsFound" can also delete list elements from within "cacheCertTable". And if this happens while pkix_CacheCert_Lookup() is trying to update the same element's reference, a core dump can result.
In detail (note that reference counts may occasionally seem off by 1, its because data is being captured before function local variables release their reference):
pkix_Build_GatherCerts() calls pkix_pl_Pk11CertStore_CertQuery() (via a pointer) to sets "certsFound":
PKIX_CHECK(getCerts
(certStore,
state->certSel,
state->verifyNode,
&nbioContext,
&certsFound,
plContext),
PKIX_GETCERTSFAILED);
it then calls:
PKIX_CHECK(pkix_CacheCert_Add
(certStore,
certSelParams,
certsFound,
plContext),
PKIX_CACHECERTADDFAILED);
vis:
pkix_CacheCert_Add(): store: 0x7ffff4539fb8 ...
certs:
list: 0x7ffff433dfd8 references: 1 type: PKIX_LIST_TYPE
element 0: 0x7ffff4353fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff4339ef8 references: 1 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA
*_Add() then constructs the following structure and stores it in the hash table:
... pkix_CacheCert_Add(): cachedValues: ...
list: 0x7ffff435bfd8 references: 2 type: PKIX_LIST_TYPE
element 0: 0x7ffff4363fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff435fff8 references: 2 type: PKIX_DATE_TYPE
element 1: 0x7ffff4367fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff433dfd8 references: 2 type: PKIX_LIST_TYPE
element 0: 0x7ffff4353fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff4339ef8 references: 1 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA
notice how element 1's item: 0x7ffff433dfd8 - with two references - is the same address as "certsFound" that was passed in. i.e, "certsFound" is sharing a list with "cachedCertTable".
pkix_Build_GatherCerts() then continues, because I've only one cert, the next part is skipped (could this de-share certsFound? I've not looked):
if (certsFound && certsFound->length > 1) {
PKIX_List *sorted = NULL;
/* sort Certs to try to optimize search */
PKIX_CHECK(pkix_Build_SortCandidateCerts
(certsFound, &sorted, plContext),
PKIX_BUILDSORTCANDIDATECERTSFAILED);
PKIX_DECREF(certsFound);
certsFound = sorted;
}
the trust chain is then created in "trustedCertList":
PKIX_CHECK(
pkix_Build_SelectCertsFromTrustAnchors(
state->buildConstants.anchors,
certSelParams, &trustedCertList,
plContext),
PKIX_FAILTOSELECTCERTSFROMANCHORS);
trustedCertList:
list: 0x7ffff4375fd8 references: 1 type: PKIX_LIST_TYPE
element 0: 0x7ffff4379fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff456def8 references: 2 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA
in prep for mering "certsFound" and "trustedCertList", pkix_Build_GatherCerts() calls:
PKIX_CHECK(
pkix_Build_RemoveDupUntrustedCerts(trustedCertList,
certsFound,
plContext),
PKIX_REMOVEDUPUNTRUSTEDCERTSFAILED);
in my case "cert: 0x7ffff46e9820" is common. Consequently, this call will remove that cert from "certsFound" vis:
PKIX_List_DeleteItem(): list: 0x7ffff433dfd8 length: 1 index: 0
list: 0x7ffff433dfd8 references: 2 type: PKIX_LIST_TYPE
-element 0: 0x7ffff4353fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff4339ef8 references: 2 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA
this, however, means that the cert has also been removed from "cacheCertTable" (item: 0x7ffff4339ef8 has two references because of a local variable).
If a second thread, simultaneous to this, is in pkix_CacheCert_Lookup() trying to walk this same structure as part of adding a reference to item: 0x7ffff4339ef8 it will likely core dump.
Fix?
I suspect *_Add() should be duplicating the "certsFound" list.
Reporter | ||
Comment 1•4 years ago
|
||
The attached patch changes pkix_CacheCert_Add() to insert a copy of the passed in cert list in the cache.
Without this, the CertCache and the caller end up sharing a list leading to races/crashes.
Comment 2•4 years ago
|
||
Thanks for the report and detailed analysis. I agree that this is a problem but unfortunately, the patch causes a number of tests to break [1]:
Assertion failure: numLeakedObjects == 0, at ../../lib/libpkix/pkix_pl_nss/system/pkix_pl_lifecycle.c:266
./cert.sh: line 95: 9175 Aborted ${PROFTOOL} ${BINDIR}/certutil $*
cert.sh: #391: Verify LeafChainedToDistrustedCA Cert for Email recipient (134=255) - FAILED
cert.sh ERROR: Verify LeafChainedToDistrustedCA Cert for Email recipient failed 134
I would have assumed that this is just the leak and assertion causing each test to fail, but opt builds (without assertions) fail as well:
certutil: certificate is invalid: Peer's certificate has been marked as not trusted by the user.
cert.sh: #388: Verify LeafChainedToDistrustedCA Cert for SSL Server (1=255) - FAILED
cert.sh ERROR: Verify LeafChainedToDistrustedCA Cert for SSL Server failed 1
[1] https://treeherder.mozilla.org/jobs?repo=nss-try&revision=c9e05bdab3fa72255ab694d0a97592a507651cc3
Reporter | ||
Comment 3•4 years ago
|
||
Is there an easy way to both trace the transactions and dump the cache with[out] the patch?
Without the change, a cache lookup would most likely have hit on the key but then returned list with some certs missing. Now they are there and that could easily change behaviour exposing a latent bug.
Reporter | ||
Comment 4•4 years ago
|
||
To part answer my own question:
scratch build:
./build.sh -cc ; rm -rf ../tests_results/ ../dest ; ./build.sh --enable-libpkix
test:
( export USE_64=1 ; export NSS_ENABLE_PKIX_VERIFY=1 ; export DOMSUF=lan ; cd tests/cert/ && ./cert.sh ) 2>&1 | tee cert.log
...
Failed: 18
debug:
NSS_ENABLE_PKIX_VERIFY=1 LD_LIBRARY_PATH=$(cd ../dist/Debug/lib && pwd) gdb --args $(cd ../dist/Debug/bin && pwd)/certutil -V -n PasswordCert -u S -d ../tests_results/security/bernard.1/dbpass
(gdb) break PKIX_Shutdown
(gdb) break cert_VerifyCertChainPkix
Reporter | ||
Comment 5•4 years ago
|
||
Missing PKIX_DECREF()s added.
Reporter | ||
Updated•4 years ago
|
Comment 6•4 years ago
|
||
Comment 7•4 years ago
|
||
(In reply to Ryan Sleevi from comment #6)
Thanks for the thorough investigation! We've seen some strange crashes in
the field in ChromeOS with some of these callstacks, so I believe you've
possibly found something here. That said, I'm hoping you can point out where
in pkix_CacheCert_Lookup the race occurs - since libpkix should be
appropriately incrementing/decrementing the reference counts, and the
reference counts themselves are thread-safe, it wasn't clear to me where the
unsafe race was.
And of course, seconds after asking this, I see it's https://searchfox.org/nss/rev/fac0bade04aded89825cba0cb5b72f486444615a/lib/libpkix/pkix/util/pkix_tools.c#1024-1083 that would cause this, so yeah, just focus on the nits :)
Reporter | ||
Comment 8•4 years ago
|
||
::: lib/libpkix/pkix/util/pkix_tools.c
@@ +1221,5 @@PKIX_LISTAPPENDITEMFAILED);
PKIX_CHECK(PKIX_List_Create
(&cachedCerts, plContext),
PKIX_LISTCREATEFAILED);
It seems this is manually re-implementing the
PKIX_DUPLICATE
macro.Would it make sense to just call
PKIX_DUPLICATE(certs, &cachedCerts, plContext, PKIX_OBJECTDUPLICATELISTFAILED);
It's a shallow copy, I found DUPLICATE confusing - it uses recursion leading me to suspect it is making a deep copy but I'm not sure how deep
Comment 9•4 years ago
|
||
(In reply to Andrew Cagney from comment #8)
It's a shallow copy, I found DUPLICATE confusing - it uses recursion leading me to suspect it is making a deep copy but I'm not sure how deep
I'm not sure I understand your concern here. Are you suggesting there's an intentional behaviour difference you're wanting to preserve?
It's "deep" in the sense that it uses the appropriate PKIX_PL_Object_Duplicate function, which is the correct function you want for independence of objects. e.g. it's the same function used for handling user input, and will either merely increase the refcount (if the object is immutable) or perform a deep copy of mutable fields. That's... what we want here, is it not?
Reporter | ||
Comment 10•4 years ago
|
||
Per suggestion, use PKIX_DUPLICATE() to duplicate the certificate list.
Comment 11•4 years ago
|
||
https://treeherder.mozilla.org/jobs?repo=nss-try&revision=a882f579289854b2b9d4e9f19f025cc84449b78d
For patch v3, kevin if you want to r+ the patch, I'll go ahead and push it.
bob
Comment 12•4 years ago
•
|
||
r=me.
In reviewing this, I notice an unrelated issue: pkix_CacheCert_Lookup
never actually returns a hit. Internally, *pFound
is set to PKIX_FALSE at the outset, then only set (again) to PKIX_FALSE. We still maintain the table and evict old entries, but even under a hit *pFound
doesn't become true, so we don't return the result in pCerts
. That should probably be a separate bug...
Updated•4 years ago
|
Updated•4 years ago
|
Reporter | ||
Comment 13•4 years ago
|
||
(In reply to Kevin Jacobs [:kjacobs] from comment #12)
r=me.
In reviewing this, I notice an unrelated issue:
pkix_CacheCert_Lookup
never actually returns a hit. Internally,*pFound
is set to PKIX_FALSE at the outset, then only set (again) to PKIX_FALSE. We still maintain the table and evict old entries, but even under a hit*pFound
doesn't become true, so we don't return the result inpCerts
. That should probably be a separate bug...
Is there a bug?
Comment 14•4 years ago
|
||
(In reply to Andrew Cagney from comment #13)
Is there a bug?
* FUNCTION: pkix_CacheCert_Lookup
* DESCRIPTION:
*
* Look up Cert Hash Table for a cached item based on "store" and Subject in
* "certSelParams" as the hash keys and returns values Certs in "pCerts".
It doesn't do that. This isn't a bug in your patch, though.
Reporter | ||
Comment 15•4 years ago
|
||
(In reply to Kevin Jacobs [:kjacobs] from comment #14)
(In reply to Andrew Cagney from comment #13)
Is there a bug?
Sorry, what I meant to ask is, was a new bug filed?
Comment 16•4 years ago
|
||
(In reply to Andrew Cagney from comment #15)
(In reply to Kevin Jacobs [:kjacobs] from comment #14)
(In reply to Andrew Cagney from comment #13)
Is there a bug?
Sorry, what I meant to ask is, was a new bug filed?
Ah, okay. I went ahead and filed bug 1692132.
Description
•