Open Bug 1682044 Opened 4 years ago Updated 4 years ago

pkix_Build_GatherCerts() + pkix_CacheCert_Add() can corrupt "cachedCertTable"

Categories

(NSS :: Libraries, defect, P3)

Tracking

(Not tracked)

UNCONFIRMED

People

(Reporter: andrew.cagney, Unassigned)

Details

(Whiteboard: [nss-fx])

Attachments

(3 files)

(this bug is old)

pkix_Build_GatherCerts() has two code paths for creating the list "certsFound":

  • pkix_CacheCert_Lookup()
    this sets "certsFound" to a new list
    "certsFound" and "cachedCertTable" share items but not the list

  • pkix_CacheCert_Add(pkix_pl_Pk11CertStore_CertQuery())
    this sets "certsFound" to a new list; and then adds the list to "cachedCertTable"
    "certsFound" and "cachedCertTable" share a linked list

Because the latter doesn't create a separate list, deleting list elements from "certsFound" can also delete list elements from within "cacheCertTable". And if this happens while pkix_CacheCert_Lookup() is trying to update the same element's reference, a core dump can result.

In detail (note that reference counts may occasionally seem off by 1, its because data is being captured before function local variables release their reference):

pkix_Build_GatherCerts() calls pkix_pl_Pk11CertStore_CertQuery() (via a pointer) to sets "certsFound":

                           PKIX_CHECK(getCerts
                                    (certStore,
                                    state->certSel,
                                    state->verifyNode,
                                    &nbioContext,
                                    &certsFound,
                                    plContext),
                                    PKIX_GETCERTSFAILED);

it then calls:

                            PKIX_CHECK(pkix_CacheCert_Add
                                    (certStore,
                                    certSelParams,
                                    certsFound,
                                    plContext),
                                    PKIX_CACHECERTADDFAILED);

vis:

pkix_CacheCert_Add(): store: 0x7ffff4539fb8 ...
certs:
list: 0x7ffff433dfd8 references: 1 type: PKIX_LIST_TYPE
element 0: 0x7ffff4353fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff4339ef8 references: 1 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA

*_Add() then constructs the following structure and stores it in the hash table:

... pkix_CacheCert_Add(): cachedValues: ...
list: 0x7ffff435bfd8 references: 2 type: PKIX_LIST_TYPE
element 0: 0x7ffff4363fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff435fff8 references: 2 type: PKIX_DATE_TYPE
element 1: 0x7ffff4367fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff433dfd8 references: 2 type: PKIX_LIST_TYPE
element 0: 0x7ffff4353fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff4339ef8 references: 1 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA

notice how element 1's item: 0x7ffff433dfd8 - with two references - is the same address as "certsFound" that was passed in. i.e, "certsFound" is sharing a list with "cachedCertTable".

pkix_Build_GatherCerts() then continues, because I've only one cert, the next part is skipped (could this de-share certsFound? I've not looked):

    if (certsFound && certsFound->length > 1) {
        PKIX_List *sorted = NULL;
        
        /* sort Certs to try to optimize search */
        PKIX_CHECK(pkix_Build_SortCandidateCerts
                   (certsFound, &sorted, plContext),
                   PKIX_BUILDSORTCANDIDATECERTSFAILED);
        PKIX_DECREF(certsFound);
        certsFound = sorted;
    }

the trust chain is then created in "trustedCertList":

    PKIX_CHECK(
        pkix_Build_SelectCertsFromTrustAnchors(
            state->buildConstants.anchors,
            certSelParams, &trustedCertList,
            plContext),
        PKIX_FAILTOSELECTCERTSFROMANCHORS);

trustedCertList:
list: 0x7ffff4375fd8 references: 1 type: PKIX_LIST_TYPE
element 0: 0x7ffff4379fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff456def8 references: 2 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA

in prep for mering "certsFound" and "trustedCertList", pkix_Build_GatherCerts() calls:

    PKIX_CHECK(
        pkix_Build_RemoveDupUntrustedCerts(trustedCertList,
                                           certsFound,
                                           plContext),
        PKIX_REMOVEDUPUNTRUSTEDCERTSFAILED);

in my case "cert: 0x7ffff46e9820" is common. Consequently, this call will remove that cert from "certsFound" vis:

PKIX_List_DeleteItem(): list: 0x7ffff433dfd8 length: 1 index: 0
list: 0x7ffff433dfd8 references: 2 type: PKIX_LIST_TYPE
-element 0: 0x7ffff4353fd8 references: 1 type: PKIX_LIST_TYPE
item: 0x7ffff4339ef8 references: 2 type: PKIX_CERT_TYPE
cert: 0x7ffff46e9820 E=testing@libreswan.org,CN=Libreswan test CA for mainca,OU=Test Department,O=Libreswan,L=Toronto,ST=Ontario,C=CA

this, however, means that the cert has also been removed from "cacheCertTable" (item: 0x7ffff4339ef8 has two references because of a local variable).

If a second thread, simultaneous to this, is in pkix_CacheCert_Lookup() trying to walk this same structure as part of adding a reference to item: 0x7ffff4339ef8 it will likely core dump.

Fix?

I suspect *_Add() should be duplicating the "certsFound" list.

The attached patch changes pkix_CacheCert_Add() to insert a copy of the passed in cert list in the cache.

Without this, the CertCache and the caller end up sharing a list leading to races/crashes.

Thanks for the report and detailed analysis. I agree that this is a problem but unfortunately, the patch causes a number of tests to break [1]:

Assertion failure: numLeakedObjects == 0, at ../../lib/libpkix/pkix_pl_nss/system/pkix_pl_lifecycle.c:266
./cert.sh: line 95:  9175 Aborted                 ${PROFTOOL} ${BINDIR}/certutil $*
cert.sh: #391: Verify LeafChainedToDistrustedCA Cert for Email recipient (134=255)  - FAILED
cert.sh ERROR: Verify LeafChainedToDistrustedCA Cert for Email recipient failed 134

I would have assumed that this is just the leak and assertion causing each test to fail, but opt builds (without assertions) fail as well:

certutil: certificate is invalid: Peer's certificate has been marked as not trusted by the user.
cert.sh: #388: Verify LeafChainedToDistrustedCA Cert for SSL Server (1=255)  - FAILED
cert.sh ERROR: Verify LeafChainedToDistrustedCA Cert for SSL Server failed 1

[1] https://treeherder.mozilla.org/jobs?repo=nss-try&revision=c9e05bdab3fa72255ab694d0a97592a507651cc3

Is there an easy way to both trace the transactions and dump the cache with[out] the patch?

Without the change, a cache lookup would most likely have hit on the key but then returned list with some certs missing. Now they are there and that could easily change behaviour exposing a latent bug.

To part answer my own question:

scratch build:

./build.sh -cc ; rm -rf ../tests_results/ ../dest ; ./build.sh --enable-libpkix

test:

( export USE_64=1 ; export NSS_ENABLE_PKIX_VERIFY=1 ; export DOMSUF=lan ; cd tests/cert/ && ./cert.sh ) 2>&1 | tee cert.log
...
Failed: 18

debug:

NSS_ENABLE_PKIX_VERIFY=1 LD_LIBRARY_PATH=$(cd ../dist/Debug/lib && pwd) gdb --args $(cd ../dist/Debug/bin && pwd)/certutil -V -n PasswordCert -u S -d ../tests_results/security/bernard.1/dbpass
(gdb) break PKIX_Shutdown
(gdb) break cert_VerifyCertChainPkix

Missing PKIX_DECREF()s added.

Flags: needinfo?(kjacobs.bugzilla)
Comment on attachment 9194499 [details] [diff] [review] v2 patch to duplicate cert list when adding to cache Review of attachment 9194499 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for the thorough investigation! We've seen some strange crashes in the field in ChromeOS with some of these callstacks, so I believe you've possibly found something here. That said, I'm hoping you can point out where in pkix_CacheCert_Lookup the race occurs - since libpkix should be appropriately incrementing/decrementing the reference counts, and the reference counts themselves are thread-safe, it wasn't clear to me where the unsafe race was. Below is mostly small nits for idiomatic libpkix and simplification. ::: lib/libpkix/pkix/util/pkix_tools.c @@ +1221,5 @@ > PKIX_LISTAPPENDITEMFAILED); > > + PKIX_CHECK(PKIX_List_Create > + (&cachedCerts, plContext), > + PKIX_LISTCREATEFAILED); It seems this is manually re-implementing the `PKIX_DUPLICATE` macro. Would it make sense to just call `PKIX_DUPLICATE(certs, &cachedCerts, plContext, PKIX_OBJECTDUPLICATELISTFAILED);` To understand what will happen, is it will call `PKIX_PL_Object_Duplicate`, which will then make it into https://searchfox.org/nss/rev/fac0bade04aded89825cba0cb5b72f486444615a/lib/libpkix/pkix/util/pkix_list.c#429 and perform the same deep-duplication of the objects. @@ +1225,5 @@ > + PKIX_LISTCREATEFAILED); > + PKIX_CHECK(PKIX_List_GetLength > + (certs, &numCerts, plContext), > + PKIX_LISTGETLENGTHFAILED); > + for (unsigned i = 0; i < numCerts; i++) { Small nit: Per NSS style, this should be a PKIX_UInt32. I believe this also requires declaring this variable at the start of the scope (i.e. line 1171), because of C89; I don't believe C89 support was dropped yet (unfortunately). e.g. https://searchfox.org/nss/source/lib/libpkix/pkix/checker/pkix_policychecker.c#567 is an example of this sort of iteration approach.

(In reply to Ryan Sleevi from comment #6)

Thanks for the thorough investigation! We've seen some strange crashes in
the field in ChromeOS with some of these callstacks, so I believe you've
possibly found something here. That said, I'm hoping you can point out where
in pkix_CacheCert_Lookup the race occurs - since libpkix should be
appropriately incrementing/decrementing the reference counts, and the
reference counts themselves are thread-safe, it wasn't clear to me where the
unsafe race was.

And of course, seconds after asking this, I see it's https://searchfox.org/nss/rev/fac0bade04aded89825cba0cb5b72f486444615a/lib/libpkix/pkix/util/pkix_tools.c#1024-1083 that would cause this, so yeah, just focus on the nits :)

::: lib/libpkix/pkix/util/pkix_tools.c
@@ +1221,5 @@

             PKIX_LISTAPPENDITEMFAILED);
  •    PKIX_CHECK(PKIX_List_Create
    
  •   (&cachedCerts, plContext),
    
  •            PKIX_LISTCREATEFAILED);
    

It seems this is manually re-implementing the PKIX_DUPLICATE macro.

Would it make sense to just call

PKIX_DUPLICATE(certs, &cachedCerts, plContext, PKIX_OBJECTDUPLICATELISTFAILED);

It's a shallow copy, I found DUPLICATE confusing - it uses recursion leading me to suspect it is making a deep copy but I'm not sure how deep

(In reply to Andrew Cagney from comment #8)

It's a shallow copy, I found DUPLICATE confusing - it uses recursion leading me to suspect it is making a deep copy but I'm not sure how deep

I'm not sure I understand your concern here. Are you suggesting there's an intentional behaviour difference you're wanting to preserve?

It's "deep" in the sense that it uses the appropriate PKIX_PL_Object_Duplicate function, which is the correct function you want for independence of objects. e.g. it's the same function used for handling user input, and will either merely increase the refcount (if the object is immutable) or perform a deep copy of mutable fields. That's... what we want here, is it not?

Per suggestion, use PKIX_DUPLICATE() to duplicate the certificate list.

Flags: needinfo?(kjacobs.bugzilla)

https://treeherder.mozilla.org/jobs?repo=nss-try&revision=a882f579289854b2b9d4e9f19f025cc84449b78d
For patch v3, kevin if you want to r+ the patch, I'll go ahead and push it.

bob

Flags: needinfo?(kjacobs.bugzilla)

r=me.

In reviewing this, I notice an unrelated issue: pkix_CacheCert_Lookup never actually returns a hit. Internally, *pFound is set to PKIX_FALSE at the outset, then only set (again) to PKIX_FALSE. We still maintain the table and evict old entries, but even under a hit *pFound doesn't become true, so we don't return the result in pCerts. That should probably be a separate bug...

Flags: needinfo?(kjacobs.bugzilla)
Whiteboard: [nss-fx]
Severity: -- → S3
Priority: -- → P3

(In reply to Kevin Jacobs [:kjacobs] from comment #12)

r=me.

In reviewing this, I notice an unrelated issue: pkix_CacheCert_Lookup never actually returns a hit. Internally, *pFound is set to PKIX_FALSE at the outset, then only set (again) to PKIX_FALSE. We still maintain the table and evict old entries, but even under a hit *pFound doesn't become true, so we don't return the result in pCerts. That should probably be a separate bug...

Is there a bug?

(In reply to Andrew Cagney from comment #13)

Is there a bug?

 * FUNCTION: pkix_CacheCert_Lookup
 * DESCRIPTION:
 *
 *  Look up Cert Hash Table for a cached item based on "store" and Subject in
 *  "certSelParams" as the hash keys and returns values Certs in "pCerts".

It doesn't do that. This isn't a bug in your patch, though.

(In reply to Kevin Jacobs [:kjacobs] from comment #14)

(In reply to Andrew Cagney from comment #13)

Is there a bug?

Sorry, what I meant to ask is, was a new bug filed?

(In reply to Andrew Cagney from comment #15)

(In reply to Kevin Jacobs [:kjacobs] from comment #14)

(In reply to Andrew Cagney from comment #13)

Is there a bug?

Sorry, what I meant to ask is, was a new bug filed?

Ah, okay. I went ahead and filed bug 1692132.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: