NSS must force deinitialization in child after fork

NEW
Assigned to

Status

NSS
Libraries
P1
critical
8 years ago
7 years ago

People

(Reporter: Christophe Ravel, Assigned: Alexei Volkov)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

8 years ago
After a fork, the child process needs to reinitialize NSS. But it is not possible because NSS in the child keeps some information about the previous initialization from the parent and any new NSS_Init() call will end as sucess immediately, not doing any real initialization work.
NSS_Shutdown() in the child before doing NSS_Init() can fail for some reasons (see more later), which makes impossible to use NSS in the child.
(Reporter)

Comment 1

8 years ago
Somebody did some debugging, first going forward from where we call nss_Shutdown in the child process and then going forward until we find why we error out.

Note: NSS is still initialized and running in the parent process.

We error out because

g_default_trust_domain->->cache->issuerAndSN->count is greater than zero...it is one.

Here is all the gory details of my analysis, including the stack where we set the value to one.  Hopefully that will shed some light on what is causing the problem.

================================

Code failure path....starts with nss_Shutdown.  the first
routine that fails is...

STAN_Shutdown() -

mozilla/security/nss/lib/pki/pki3hack.c

NSS_IMPLEMENT PRStatus
STAN_Shutdown()
{
    PRStatus status = PR_SUCCESS;
    if (g_default_trust_domain) {
        if (NSSTrustDomain_Destroy(g_default_trust_domain) == PR_SUCCESS) { <<-- this fails
            g_default_trust_domain = NULL;
        } else {
            status = PR_FAILURE;
        }
    }

NSSTrustDomain_Destroy() -

mozilla/security/nss/lib/pki/trustdomain.c

NSS_IMPLEMENT PRStatus
NSSTrustDomain_Destroy (
  NSSTrustDomain *td
)
{
    PRStatus status = PR_SUCCESS;
    if (--td->refCount == 0) {
        /* Destroy each token in the list of tokens */
        if (td->tokens) {
            nssListIterator_Destroy(td->tokens);
            td->tokens = NULL;
        }
        if (td->tokenList) {
            nssList_Clear(td->tokenList, token_destructor);
            nssList_Destroy(td->tokenList);
            td->tokenList = NULL;
        }
        NSSRWLock_Destroy(td->tokensLock);
        td->tokensLock = NULL;
        status = nssTrustDomain_DestroyCache(td); <<-- FAIL
        if (status == PR_FAILURE) {
            return status;
        }


nssTrustDomain_DestroyCache() -

mozilla/security/nss/lib/pki/tdcache.c


/* The entries of the hashtable are currently dependent on the certificate(s)
 * that produced them.  That is, the entries will be freed when the cert is
 * released from the cache.  If there are certs in the cache at any time,
 * including shutdown, the hash table entries will hold memory.  In order for
 * clean shutdown, it is necessary for there to be no certs in the cache.
 */

extern const NSSError NSS_ERROR_INTERNAL_ERROR;
extern const NSSError NSS_ERROR_BUSY;

NSS_IMPLEMENT PRStatus
nssTrustDomain_DestroyCache (
  NSSTrustDomain *td
)
{
    if (!td->cache) {
        nss_SetError(NSS_ERROR_INTERNAL_ERROR);
        return PR_FAILURE;
    }
    if (nssHash_Count(td->cache->issuerAndSN) > 0) { <<-- nssHash_Count == 1 FAILS
        nss_SetError(NSS_ERROR_BUSY);
        return PR_FAILURE;
    }
    PZ_DestroyLock(td->cache->lock);
    nssHash_Destroy(td->cache->issuerAndSN);
    nssHash_Destroy(td->cache->subject);
    nssHash_Destroy(td->cache->nickname);
    nssHash_Destroy(td->cache->email);
    nssArena_Destroy(td->cache->arena);
    td->cache = NULL;
#ifdef DEBUG_CACHE
    PR_LOG(s_log, PR_LOG_DEBUG, ("Cache destroyed."));
#endif
    return PR_SUCCESS;
}

!! What is NSSTrustDomain?

!! From nsspkit.h

/*
 * NSSTrustDomain
 *
 * A Trust Domain is the field in which certificates may be validated.
 * A trust domain will generally have one or more cryptographic modules
 * open; these modules perform the cryptographic operations, and
 * provide the basic "root" trust information from which the trust in
 * a specific certificate or key depends.
 *
 * A client program, or a simple server, would typically have one
 * trust domain.  A server supporting multiple "virtual servers" might
 * have a separate trust domain for each virtual server.  The separate
 * trust domains might share some modules (e.g., a hardware crypto
 * accelerator) but not others (e.g., the tokens storing the different
 * servers' private keys, or the databases with each server's trusted
 * root certificates).
 *
 * This object descends from the "permananet database" in the old code.
 */

struct NSSTrustDomainStr;
typedef struct NSSTrustDomainStr NSSTrustDomain;

!! From pkit.h:

struct NSSTrustDomainStr {
    PRInt32 refCount;
    NSSArena *arena;
    NSSCallback *defaultCallback;
    nssList *tokenList;
    nssListIterator *tokens;
    nssTDCertificateCache *cache;
    NSSRWLock *tokensLock;
    void *spkDigestInfo;
    CERTStatusConfig *statusConfig;
};

typedef struct nssTDCertificateCacheStr nssTDCertificateCache;

!! From tdcache.c

102 /* Certificate cache routines */
103
104 /* XXX
105  * Locking is not handled well at all.  A single, global lock with sub-locks
106  * in the collection types.  Cleanup needed.
107  */
108
109 /* should it live in its own arena? */
110 struct nssTDCertificateCacheStr
111 {
112     PZLock *lock;
113     NSSArena *arena;
114     nssHash *issuerAndSN;
115     nssHash *subject;
116     nssHash *nickname;
117     nssHash *email;
118 };

/security/nss/lib/base/baset.h (View CVS log or CVS annotations)

    * line 111 -- typedef struct nssHashStr nssHash;

!! From lib/base/hash.c

70 struct nssHashStr {
71   NSSArena *arena;
72   PRBool i_alloced_arena;
73   PRLock *mutex;
74
75   /*
76    * The invariant that mutex protects is:
77    *   The count accurately reflects the hashtable state.
78    */
79
80   PLHashTable *plHashTable;
81   PRUint32 count;
82 };


> 0x0003dbc8,20/X
0x3dbc8:        0               39198           0               0               0
                3e3e8**         0               0               0               18
                39198           1c              0               0               0

> 3e3e8,10/X
0x3e3e8:        4ab38           38980           3e408**issuerAndSN         3e428           3e448
                3e468           38980           14              38980           0
                4ab98           4ac00           1               0               38980
                14
> 3e408,10/X
0x3e408:        38980           0               4ab98           4ac00           1 **count
                0               38980           14              38980           0
                4ac28           44ff8           1               0               38980
                14

!! Setting breakpoints on nssTrustDomain_DestroyCache() I was
!! able to verify that the NSSTrustDomain structure passed along
!! to nssTrustDomain_DestroyCache() is the global structure

g_default_trust_domain

!! I think this may be key - where is this messed with in the code?

!! Looks like this might be the only place in the nss code were we
!! set g_default_trust_domain

139 NSS_IMPLEMENT PRStatus
140 STAN_LoadDefaultNSS3TrustDomain (
141   void
142 )

!! Yep - that is the routine we init the stuff

whitestar1-1# mdb -p 1356
Loading modules: [ ld.so.1 libc.so.1 libuutil.so.1 ]
> STAN_LoadDefaultNSS3TrustDomain:b
> :c
mdb: stop at libnss3.so`STAN_LoadDefaultNSS3TrustDomain
mdb: target stopped at:
libnss3.so`STAN_LoadDefaultNSS3TrustDomain:     save      %sp, -0x60, %sp
mdb: You've got symbols!

> g_default_trust_domain/X
libnss3.so`g_default_trust_domain:
libnss3.so`g_default_trust_domain:              0

> :u
mdb: target stopped at:
libnss3.so`nss_Init+0x834:      tst       %o0

> g_default_trust_domain/X
libnss3.so`g_default_trust_domain:
libnss3.so`g_default_trust_domain:              3dbc8

> 3dbc8,10/X
0x3dbc8:        1               39198           0               3dbf8           33a80
                3e3e8 **        4aa20           0               0               18
                39198           1c              39198           452f0           3dc98
                2
> 3e3e8,10/X
0x3e3e8:        4ab38           38980           3e408 **        3e428           3e448
                3e468           38980           14              38980           0
                4ab98           4ac00           0               0               38980
                14
> 3e408,10/X
0x3e408:        38980           0               4ab98           4ac00           0 **count
                0               38980           14              38980           0
                4ac28           44ff8           0               0               38980
                14

!! but we did not set count to 1...so, set a watchpoint on
!! 0x3e418...

> 3e418:w
> :c
mdb: stop on write of [0x3e418, 0x3e419)
mdb: target stopped at:
libnss3.so`nssHash_Add+0x94:    st        %i2, [%i0 + 0x10]
> $c
libnss3.so`nssHash_Add+0x94(3e408, 6adb8, 1, 6b190, a4a4c, 0)
libnss3.so`nssTrustDomain_AddCertsToCache+0x1bc(3dbc8, ffbff1f4, 1, 0, 3e3e8, 6a470)
libnss3.so`cert_createObject+0x18(6ad68, 6a520, 0, 1c000, febaefe0, 4d41524b)
libnss3.so`nssPKIObjectCollection_GetObjects+0x54(6a4e8, ffbff324, 1, 6a798, b094c, 6ad68)
libnss3.so`nssPKIObjectCollection_GetCertificates+0xa8(6a4e8, 0, 2, 0, 6a4e8, ffbff324)
libnss3.so`nssTrustDomain_FindCertificateByIssuerAndSerialNumber+0x108(3dbc8, ffbff390, ffbff388, 33c00,
6800, 0)
libnss3.so`NSSTrustDomain_FindCertificateByEncodedCertificate+0x94(3dbc8, ffbff408, 695d0, b4974, 6800,
6800)
libnss3.so`CERT_NewTempCertificate+0x5c(3dbc8, ffbff474, 0, 0, 1, 3dbc8)
libssl3.so`ssl3_HandleCertificate+0x1dc(541a0, 421d4, ffbff474, 541a0, 0, 69538)
libssl3.so`ssl3_HandleHandshakeMessage+0x2a4(541a0, 5b735, 252, febd3230, 0, ffffd000)
libssl3.so`ssl3_HandleHandshake+0x11c(543f8, 54928, 541a0, 0, 5b731, 252)
libssl3.so`ssl3_HandleRecord+0x5a0(541a0, 0, 543f8, 1, 0, 16)
libssl3.so`ssl3_GatherCompleteHandshake+0x54(541a0, 0, 532d8, 1, 1, ffbff68c)
libssl3.so`ssl_GatherRecord1stHandshake+0x30(541a0, 37, 94cc1000, 94cc1000, 8000, 541a0)
libssl3.so`ssl_Do1stHandshake+0xec(fffffffe, febd8ed8, 53188, 541a0, 1, 0)
libssl3.so`ssl_SecureSend+0x1cc(0, 52d74, 44, 8000, 541a0, 94cc1000)
libssl3.so`ssl_Send+0x9c(33b78, 52d74, 44, febe0728, ffffffff, 541a0)
libldap.so.5`ber_flush+0x94(497b0, 52ca0, 0, 636f6d, ff00, ffffe800)
libldap.so.5`nsldapi_ber_flush+0x40(49598, 497b0, 52ca0, 0, 1, fed7586c)
libldap.so.5`nsldapi_send_server_request+0x634(49598, 52ca0, 1, 0, 49f10, 0)
libldap.so.5`nsldapi_send_initial_request+0x30(49598, 1, 60, 38310, 52ca0, 38310)
libldap.so.5`ldap_sasl_bind+0x514(49598, 38310, 0, ffbffaf0, 0, 0)
libldap.so.5`ldap_sasl_bind_s+0x60(49598, 38310, 0, ffbffaf0, 0, 0)
libkdb_ldap.so.1`krb5_ldap_bind+0xd8(37a80, 338f8, 0, 250a4, fedab3e0, 34bc0)
libkdb_ldap.so.1`krb5_ldap_initialize+0x1e4(37a80, 367a0, ffbffbc4, 24fb4, 0, fedca000)
libkdb_ldap.so.1`krb5_ldap_db_init+0x100(34eb8, 37a80, 3, 0, 1, fedca000)
libkdb_ldap.so.1`krb5_ldap_open+0x540(34eb8, 34bb0, 34ba0, 101, 37a80, 0)
libkdb.so.1`krb5_db_open+0xa8(34eb8, 0, 101, 35440, 7758, 32008)
0x1ae3c(34508, ffbfff1a, 35d00, 33e98, 35440, 1)
initialize_realms+0x4c8(34508, 35d00, ffbffe64, ff37b274, 0, 0)
main+0xc8(1, ffbffe64, ffbffe6c, 1fc00, ffbfff1a, 0)
_start+0x108(0, 0, 0, 0, 0, 0)

> :u
mdb: target stopped at:
libnss3.so`nssTrustDomain_AddCertsToCache+0x1c4:tst       %o0
> 3e408,10/X
0x3e408:        38980           0               4ab98           4ac00           1 ** count

!! and that set it...so, the stack were we set the count is directly
!! above.  So, maybe we have to clear the hash before we call nss_Shutdown?...or before we call STAN_Shutdown()?
Priority: -- → P1
Assignee: nobody → christophe.ravel.bugs
(Reporter)

Updated

8 years ago
Assignee: christophe.ravel.bugs → alexei.volkov.bugs

Comment 2

8 years ago
It looks like there's an open reference after the fork().

It's not really safe to just clear it because somewhere in the application address space is a pointer that references this trust domain, which will soon point to garbage.

Once you get past this, you most likely will still have slot references open. The application really needs to quiese it's NSS usage before the fork().

bob

Comment 3

8 years ago
(In reply to comment #2)
> It looks like there's an open reference after the fork().
> 
> It's not really safe to just clear it because somewhere in the application
> address space is a pointer that references this trust domain, which will soon
> point to garbage.
> 
> Once you get past this, you most likely will still have slot references open.
> The application really needs to quiese it's NSS usage before the fork().
> 
> bob

Does it mean all SSL connections must be shutdown before fork, parent cannot keep them open during fork? That is big limitation for NSS consumers and significant change from previous behaviour. Plus it is not required by PKCS#11 (which says that parent is untouched).

Comment 4

8 years ago
if you can shut down the connections on the child side (after the fork()), that would be fine, but you need to shutdown all those connections before you call NSS shutdown. Otherwise the child has pointers to objects that will become invalid.

PKCS #11 interface is handle based, so old handles can be identified as invalid. the NSS data structures are pointers (normally more convenient for applications), but it means that we can't check them for validity like PKCS #11.

bob

Comment 5

8 years ago
(In reply to comment #4)
> if you can shut down the connections on the child side (after the fork()), that
> would be fine, but you need to shutdown all those connections before you call
> NSS shutdown. Otherwise the child has pointers to objects that will become
> invalid.
> 

Which means - is it safe and supported by NSS to do that after fork or not?

> PKCS #11 interface is handle based, so old handles can be identified as
> invalid. the NSS data structures are pointers (normally more convenient for
> applications), but it means that we can't check them for validity like PKCS
> #11.
> 

Well, how usable are these data structures after fork? They could be marked as not usable by NSS somehow and cleaned up later when it will be safe to do it. Yes, it can look like leak but it is problem of application it did not cleanup. But it will not break libraries and applications using NSS for 10+ years.

Comment 6

8 years ago
> Which means - is it safe and supported by NSS to do that after fork or not?

Yes, it's safe to close down those data structures.

> Well, how usable are these data structures after fork? They could be marked as
> not usable by NSS somehow and cleaned up later when it will be safe to do it.
> Yes, it can look like leak but it is problem of application it did not cleanup.
> But it will not break libraries and applications using NSS for 10+ years.

None of this is new. Your app has been broken for 10+ years. The only change is the softoken now enforces the PKCS #11 fork() semantics.

You have never been able to shutdown NSS with leaked data structures. It's just before you never tried. There has always been PKCS #11 modules that enforce or require the PKCS #11 fork() semantics.

bob

Comment 7

8 years ago
(In reply to comment #6)
> > Which means - is it safe and supported by NSS to do that after fork or not?
> 
> Yes, it's safe to close down those data structures.
> 

OK, thanks.

> > Well, how usable are these data structures after fork? They could be marked as
> > not usable by NSS somehow and cleaned up later when it will be safe to do it.
> > Yes, it can look like leak but it is problem of application it did not cleanup.
> > But it will not break libraries and applications using NSS for 10+ years.
> 
> None of this is new. Your app has been broken for 10+ years. The only change is
> the softoken now enforces the PKCS #11 fork() semantics.
> 
> You have never been able to shutdown NSS with leaked data structures. It's just
> before you never tried. There has always been PKCS #11 modules that enforce or
> require the PKCS #11 fork() semantics.
> 

I did not say that impossibility to shutdown NSS with "leaked" data structures is new. New is the enforcement of the PKCS#11 fork() semantics (which is good) + with going further because PKCS#11 does not require the shutdown by consumer, it requires new initialization only (but probably not possible in NSS without significant changes). I understand nobody documented NSS behaviour after fork() but calling consumers broken because of change in way how it was working for 10+ years is not fully accurate ;-)

The biggest problem with this change is that NSS is dangerous to be used by other libraries which are in "child" relationship with libc. E.g. naming services like LDAP. Because they cannot control application's behaviour and application have no idea there is NSS somewhere deep in stack. Yes, we can mask the most of problems but with the actual implementation, every consumer must be NSS-aware to be safe in all corner cases. Not very good for such generic library like NSS.
You need to log in before you can comment on or make changes to this bug.