If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Crash in secoid code, trying to reinitialize after failed NSS shutdown in SECOID_FindOID_Util

REOPENED
Assigned to

Status

NSS
Libraries
P1
critical
REOPENED
8 years ago
6 years ago

People

(Reporter: Nelson Bolyard (seldom reads bugmail), Assigned: Alexei Volkov)

Tracking

(Blocks: 1 bug, {crash})

3.12.4
3.12.9
Sun
Solaris
crash
Dependency tree / graph

Firefox Tracking Flags

(blocking2.0 -)

Details

(Whiteboard: [See bug 702307 for the Gecko crash with the same signature], [mobile-crash], crash signature)

Attachments

(1 attachment, 3 obsolete attachments)

I am filing this bug in UNCONFIRMED state, because many of the details 
below are based on hearsay, and I am not very certain of them.

A certain tool used to upgrade a server installation from an old version 
of the server product to a new version of that product attempts to change
the password on the server's key DB.  ACCORDING TO THE REPORTER, the 
sequence of events is:
-  program initializes NSS successfully
-  program walks through slots looking for token by name, 
       (leaking at least one)
-  program changes the password on the token, succeeds
-  program attempts to shutdown NSS, fails (due to leaked slot reference)
 (It may be that the program forks here, and the steps below occur in a child)
-  program attempts to initialize NSS a second time.  
    NSS crashes during this second initialization attempt.  

I have attempted to reconstruct the compolete crash stack.
The crash stack is believed to be:

[libplds4.so+0x1858]      PL_HashTableLookupConst+0x4 [plhash.c#376]
[libnssutil3.so+0x11ad8]  SECOID_FindOID_Util+0x30    [secoid.c#1949]
[libnssutil3.so+0x11b74]  SECOID_FindOIDTag_Util+0xc  [secoid.c#1965]
[libnssdbm3.so+0x46d4]    nsslowkey_GetPWCheckEntry   [keydb.c#1384]
[libnssdbm3.so+0x58a8]    lg_GetMetaData              [keydb.c#2225]
[libsoftokn3.so+0x38454]  sftkdb_HasPasswordSet       [sftkpwd#688]
[libsoftokn3.so+0x159c8]  NSC_GetTokenInfo            [pkcs11.c#2978]
[libnss3.so+0x598d8]      PK11_InitToken              [pk11slot.c#1117]
[libnss3.so+0x59f7c]      PK11_InitSlot               [pk11slot.c#1366]
[libnss3.so+0x3f5e4]      SECMOD_LoadPKCS11Module     [pk11load.c#428] (*)
[libnss3.so+0x4f0fc]      SECMOD_LoadModule+0xe0      [pk11pars.c#392]
[libnss3.so+0x4f158]      SECMOD_LoadModule+0x13c     [pk11pars.c#412]
[libnss3.so+0x10a74]      nss_init                    [nssinit.c#537]
[libnss3.so+0x10e24]      NSS_Initialize+0x108        [nssinit.c#655]

(*) Note that this is AFTER C_Initialize was called on the slot/token 
by function secmod_ModuleInit which was called at line 378.

I believe that PL_HashTableLookupConst crashed because the first argument 
passed to it was NULL, or an otherwise bad pointer value, and it attempts to 
dereference that without checking it for NULL.  I will assume it was NULL.
There is an assertion, but no code to avoid crashing in the non-DEBUG case.  
So, this could be an NSPR bug.  But I'm interested in why SECOID_FindOID_Util 
passed a NULL value for the first argument.  

The relevant code in SECOID_FindOid_Util is

1942 SECOidData *
1943 SECOID_FindOID(const SECItem *oid)
1944 {
1945     SECOidData *ret;
1946 
1947     PR_ASSERT(oidhash != NULL);
1948     
1949     ret = PL_HashTableLookupConst ( oidhash, oid );

Here again, there is an assertion to detect a NULL oidhash value in DEBUG
builds, but no test to avoid a crash in non-DEBUG builds.  Fortunately, this
is in libUTIL which is not frozen for FIPS.

The question is, how did we get here with oidhash being NULL?  

The static variable oidhash is initialized by a single function, SECOID_Init().
SECOID_Init is called from three separate places in NSS shared libraries.
They are:
1) In nsc_CommonInitialize at line 2576 in softoken
2) In legacy_Open in the legacy nssdbm, 
     called from softoken function sftkdbCall_open, 
     called from sftk_DBInit, 
     called from SFTK_SlotReInit,
     called from SFTK_SlotInit  or  sftk_CreateNewSlot
     called from nsc_CommonInitialize at line 2648
3) In nss_Init at line 546,

Now, we know that NSC_Initialize has been called, but if the module still
thought it was loaded and initialized, then it NOT have called 
nsc_CommonInitialize.  So, if there is a way that the softoken stayed loaded
but libnssutil got unloaded and reloaded, THAT would explain this crash.

So, that is one hypothesis.
(Assignee)

Comment 1

8 years ago
Here is the stack on the debug build:
=>[4] SECOID_FindOID_Util(oid = 0xffbfd2a0), line 1947 in "secoid.c"
  [5] SECOID_FindOIDTag_Util(oid = 0xffbfd2a0), line 1965 in "secoid.c"
  [6] nsslowkey_GetPWCheckEntry(handle = 0x885888, entry = 0xffbfd338), line 1384 in "keydb.c"
  [7] lg_GetMetaData(sdb = 0x809818, id = 0xf88e3420 "password", item1 = 0xffbfd658, item2 = 0xffbfd64c), line 2225 in "keydb.c"
  [8] sftkdb_HasPasswordSet(keydb = 0x9f1948), line 688 in "sftkpwd.c"
  [9] NSC_GetTokenInfo(slotID = 2U, pInfo = 0xffbfd78c), line 2978 in "pkcs11.c"
  [10] PK11_InitToken(slot = 0xbd89b8, loadCerts = 1), line 1117 in "pk11slot.c"
  [11] PK11_InitSlot(mod = 0xb471f8, slotID = 2U, slot = 0xbd89b8), line 1366 in "pk11slot.c"
  [12] SECMOD_LoadPKCS11Module(mod = 0xb471f8), line 428 in "pk11load.c"
  [13] SECMOD_LoadModule(modulespec = 0x698320 " name="NSS Internal PKCS #11 Module" parameters="configdir='/space/v21/p5b02sbs/domains/domain1/config' certPrefix='' keyPrefix='' secmod='secmod.db' flags= updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescription='' " NSS="trustOrder=75 cipherOrder=100 slotParams={0x00000001=[slotFlags=RSA,RC4,RC2,DES,DH,SHA1,MD5,MD2,SSL,TLS,AES,Camellia,SEED,RANDOM askpw=any timeout=30 ] }  Flags=internal,critical"", parent = 0xb46bb0, recurse = 1), line 392 in "pk11pars.c"
  [14] SECMOD_LoadModule(modulespec = 0x7fb1a8 "name="NSS Internal Module" parameters="configdir='/space/v21/p5b02sbs/domains/domain1/config' certPrefix='' keyPrefix='' secmod='secmod.db' flags= updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescription='' " NSS="flags=internal,moduleDB,moduleDBOnly,critical"", parent = (nil), recurse = 1), line 412 in "pk11pars.c"
  [15] nss_Init(configdir = 0xaf9d48 "/space/v21/p5b02sbs/domains/domain1/config", certPrefix = 0xd43e456c "", keyPrefix = 0xd43e4570 "", secmodName = 0xd43e4574 "secmod.db", updateDir = 0xd1ebbf14 "", updCertPrefix = 0xd1ebbf18 "", updKeyPrefix = 0xd1ebbf1c "", updateID = 0xd1ebbf20 "", updateName = 0xd1ebbf24 "", readOnly = 0, noCertDB = 0, noModDB = 0, forceOpen = 0, noRootInit = 0, optimizeSpace = 0, noSingleThreadedModules = 1, allowAlreadyInitializedModules = 1, dontFinalizeModules = 1), line 537 in "nssinit.c"
  [16] NSS_Initialize(configdir = 0xaf9d48 "/space/v21/p5b02sbs/domains/domain1/config", certPrefix = 0xd43e456c "", keyPrefix = 0xd43e4570 "", secmodName = 0xd43e4574 "secmod.db", flags = 960U), line 668 in "nssinit.c"

The problem seems to be in NSC_GetTokenInfo line 2956 in pkcs11.c:  the handle returned by sftk_getKeyDB is not null as it suppose to be at this point of execution. 

    PORT_Memcpy(pInfo->label,slot->tokDescription,sizeof(pInfo->label));
    handle = sftk_getKeyDB(slot);
    pInfo->flags = CKF_RNG | CKF_DUAL_CRYPTO_OPERATIONS;
    if (handle == NULL) {
        ...
    } else {
	/*
	 * we have three possible states which we may be in:
	 *   (1) No DB password has been initialized. This also means we
	 *   have no keys in the key db.
	 *   (2) Password initialized to NULL. This means we have keys, but
	 *   the user has chosen not use a password.
	 *   (3) Finally we have an initialized password whicn is not NULL, and
	 *   we will need to prompt for it.
	 */
	if (sftkdb_HasPasswordSet(handle) == SECFailure) {


The handle pointer seems to be valid:
(dbx) print *handle
*handle = {
    db                = 0x809818
    ref               = 2
    type              = 1073741824U
    passwordKey       = {
        type = 10426712
        data = (nil)
        len  = 0
    }
    newKey            = (nil)
    oldKey            = (nil)
    updatePasswordKey = (nil)
    passwordLock      = 0x9f1988
    peerDB            = 0x80aa88
    update            = (nil)
    updateID          = (nil)
    updateDBIsInit    = -1
}
(dbx) print *handle->db
*handle->db = {
    private               = 0x8848d0
    version               = 1667575670
    sdb_type              = SDB_LEGACY
    sdb_flags             = 4
    app_private           = 0x9f1948
    sdb_FindObjectsInit   = 0xd40b4738 = &lg_FindObjectsInit()
    sdb_FindObjects       = 0xd40b4870 = &lg_FindObjects()
    sdb_FindObjectsFinal  = 0xd40b4970 = &lg_FindObjectsFinal()
    sdb_GetAttributeValue = 0xd40ae4d8 = &lg_GetAttributeValue()
    sdb_SetAttributeValue = 0xd40af430 = &lg_SetAttributeValue()
    sdb_CreateObject      = 0xd40b1d10 = &lg_CreateObject()
    sdb_DestroyObject     = 0xd40b1f60 = &lg_DestroyObject()
    sdb_GetMetaData       = 0xd40a91d8 = &lg_GetMetaData(SDB *sdb, const char *id, SECItem *item1, SECItem *item2)
    sdb_PutMetaData       = 0xd40a9318 = &lg_PutMetaData(SDB *sdb, const char *id, const SECItem *item1, const SECItem *item2)
    sdb_Begin             = 0xd40b5320 = &lg_Begin()
    sdb_Commit            = 0xd40b53a8 = &lg_Commit()
    sdb_Abort             = 0xd40b5438 = &lg_Abort()
    sdb_Reset             = 0xd40a9438 = &lg_Reset(SDB *sdb)
    sdb_Close             = 0xd40b5bb0 = &lg_Close()
    sdb_SetForkState      = 0xd40b5b60 = &lg_SetForkState()
}
(dbx) print *handle
*handle = {
    db                = 0x809818
    ref               = 2
    type              = 1073741824U
    passwordKey       = {
        type = 10426712
        data = (nil)
        len  = 0
    }
    newKey            = (nil)
    oldKey            = (nil)
    updatePasswordKey = (nil)
    passwordLock      = 0x9f1988
    peerDB            = 0x80aa88
    update            = (nil)
    updateID          = (nil)
    updateDBIsInit    = -1
}
(dbx) print *handle->db
*handle->db = {
    private               = 0x8848d0
    version               = 1667575670
    sdb_type              = SDB_LEGACY
    sdb_flags             = 4
    app_private           = 0x9f1948
    sdb_FindObjectsInit   = 0xd40b4738 = &lg_FindObjectsInit()
    sdb_FindObjects       = 0xd40b4870 = &lg_FindObjects()
    sdb_FindObjectsFinal  = 0xd40b4970 = &lg_FindObjectsFinal()
    sdb_GetAttributeValue = 0xd40ae4d8 = &lg_GetAttributeValue()
    sdb_SetAttributeValue = 0xd40af430 = &lg_SetAttributeValue()
    sdb_CreateObject      = 0xd40b1d10 = &lg_CreateObject()
    sdb_DestroyObject     = 0xd40b1f60 = &lg_DestroyObject()
    sdb_GetMetaData       = 0xd40a91d8 = &lg_GetMetaData(SDB *sdb, const char *id, SECItem *item1, SECItem *item2)
    sdb_PutMetaData       = 0xd40a9318 = &lg_PutMetaData(SDB *sdb, const char *id, const SECItem *item1, const SECItem *item2)
    sdb_Begin             = 0xd40b5320 = &lg_Begin()
    sdb_Commit            = 0xd40b53a8 = &lg_Commit()
    sdb_Abort             = 0xd40b5438 = &lg_Abort()
    sdb_Reset             = 0xd40a9438 = &lg_Reset(SDB *sdb)
    sdb_Close             = 0xd40b5bb0 = &lg_Close()
    sdb_SetForkState      = 0xd40b5b60 = &lg_SetForkState()
}
So, it appears that the stack you found with the debugger matches the stack
I reconstructed with MXR.  
Question: why do you think that handle should be NULL at line 2956 in 
NSC_GetTokenInfo?
(Assignee)

Comment 3

8 years ago
Never mind. I was looking at the wrong slot. More info to come...
Alexei figured out a lot more about this.  The crash occurs if the program 
calls NSS_Initialize with the value NSS_INIT_COOPERATE on the first 
call to NSS_Initialize.  If we change the code to not set that flag, the 
crash goes away.  This is only with 3.12, not 3.11.

The flag causes NSS to NOT call C_Finalize before unloading the module.

Somehow, with this flag set, NSS_Shutdown returns SECSuccess, but has not
called C_Finalize, and also apparently has not successfully unloaded the 
module.  NSS_Shutdown calls SECOID_Shutdown, which zaps the secoid hash table.
In the second NSS_Init, the module thinks it is still initialized, so it does
not initialize again.  Consequently, it does not call SECOID_Init.  So it goes
on to use the SECOID table without initializing it, believing it to still be
initialized, and crashes.

As yet unansweredd questions:
a) why does the softoken module NOT get unloaded by NSS_Shutdown ?
b) why is this only a problem in 3.12.3 and not in 3.11.4 ?
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 5

8 years ago
Hmm.... I always thought NSS_INIT_COOPERATE was a bad idea.

So, are we sure softoken does not get unloaded in 3.12.3? The part of the issue may be that it does.


OK, I think I may have a clear picture of what is going on here. In 3.11.4 softoken was statically linked. Because the application had a reference leak, the Shutdown failed, so C_Finalize was never called. After the fork(), there was not fork check, so softoken still looked initialized and would fail with CKR_ALREADY_INITIALIZED. NSS considers failure to initialize softoken a fatal error and failes NSS_Initialize(). The developers stumble onto the NSS_INIT_COOPERATE flag, which allows NSS to just use an already initialized softoken. Setting that flag causes the problems to go away. The flag prevents C_Finalize from being called still, but now NSS accepts an 'Already initialized' token as 'usable'. Alternately the app is a mixed app where it may call some other program that opens the softoken PKCS #11 module directly (Completely evil IMHO, but something that some sun engineers thought might have been a good idea). In this case it really needs the NSS_INIT_COOPERATE semantic, and that was just masking the fact that they did a fork().

Now we have NSS 3.12.3. NSS 3.12.3 does 3 things that thwart the magic semantics that allowed the app to work in 3.11.4. First it loads and unloads softoken dynamically. This is fine when C_Finalize is called before the unload.
Second, it does the fork() check. We will get a different error message if the call C_Initialize after the fork() now. Finally, softoken is no longer statically linked with libutil, it is dynamically linked.

The exact steps that cause the crash is still a bit mysterious to me. Is there something funny about the heap when we unload softoken without calling C_Finalize. The ultimate question about the crash why is the secoid hash table smashed?

I believe we could get back to exactly 3.11 semantics for this application as follows:

1) the app needs to turn off the fork handler().
2) add the following patch to NSS.
RCS file: /cvsroot/mozilla/security/nss/lib/pk11wrap/pk11load.c,v
retrieving revision 1.25
diff -u -p -r1.25 pk11load.c
--- pk11load.c	2 Oct 2008 00:56:15 -0000	1.25
+++ pk11load.c	25 Sep 2009 16:32:47 -0000
@@ -469,7 +469,7 @@ SECMOD_UnloadModule(SECMODModule *mod) {
      * if not, we should change this to SECFailure and move it above the
      * mod->loaded = PR_FALSE; */
     if (mod->internal) {
-        if (0 == PR_AtomicDecrement(&softokenLoadCount)) {
+        if (0 == PR_AtomicDecrement(&softokenLoadCount) && finalizeModules) {
           if (softokenLib) {
               disableUnload = PR_GetEnv("NSS_DISABLE_UNLOAD");
               if (!disableUnload) {


I don't think this is the right overall solution. While NSS_INIT_COOPERATE is really fundamentally counter to being able to handle proper PKCS #11 fork() semantics (OK, the whole idea that two different pieces of code can use a single PKCS #11 module in the same address space without knowing about each other is ludicrous without a change in the PKCS #11 implementation), there may be some apps that got this to mostly work by the fact they convinced all user to unload their module. I think we need to see why unloading the softoken module without a finalize is thrashing some of our data structures. If we can figure that out, we should be able to allow the app to work without turning off the fork() check.

bob

Comment 6

8 years ago
Oh... I think I have it....

In NSS 3.11 not only did we never unload the softoken, softoken was linked with it's own copy of lib libutil. Each component initialized and uninitialized the oid system separately.

Now they share a single copy.

Let's assume that the unload is failing for some reason in NSS 3.12. (which my above code will guarrentee, so it won't fix anything). This will make sure that softoken will think it is still initialized, so it won't actually run through it's NSC_Initialize function. NSS, however, will happily shutdown the SECOID system in NSS_Shutdown(). In NSS 3.11, this was not a problem (since softoken had it's own copy). In NSS 3.12, it now shuts down OIDS for both softoken and nssdbm. If softoken doesn't re-initalize, the OID system will not init. It's in NSC_Initialize the the legacy_Open function is called. That is the only place SECOID_Init is called in the nssdbm code. NSS's own SECOID_Init() call happens *After* we load all the tokens.

We can prevent this crash be moving that init call to before we load any modules in nss_Initialize.

I also suggest that we also move the SECOID_Shutdown call to after we shutdown the modules --- and call it only if the modules shutdown cleanly.

bob
(Assignee)

Comment 7

8 years ago
> Now they share a single copy. 

Thanks Bob. Now it make sense, why 3.11 didn't have a crash.

> We can prevent this crash be moving that init call to before we load any
> modules in nss_Initialize.

Right, but this is not enough. The nss_shutdown as it written today will leave unfinalized softoken module without underlining secoid table. Any application  will crash in an attempt to access the module.
 
> I also suggest that we also move the SECOID_Shutdown call to after we shutdown
> the modules --- and call it only if the modules shutdown cleanly.

So, the changes in NSS_Shutdown must be a part of the patch. The current code does not treat as an error the case when leaving unfinalized module behind.

454 SECStatus
455 SECMOD_UnloadModule(SECMODModule *mod) {
456     PRLibrary *library;
457     char *disableUnload = NULL;
458 
459     if (!mod->loaded) {
460         return SECFailure;
461     }
462     if (finalizeModules) {
463         if (!mod->moduleDBOnly) PK11_GETTAB(mod)->C_Finalize(NULL);
464     }

We can add "else" at line 464 and set an error that we will check later. This will work if no other code overwrites the set error. If it does, then nss will still clean up the secoid table.

Bob, it is true that if nss uses the cooperate flag, but initializes an additional module, then that module will not be finalized as well. Seem that nss may leak when the flag is used. But luckily, it is a rear case.

Comment 8

8 years ago
Alexei, Bob, perhaps we can change SECOID_Init() and SECOID_Shutdown()
to a reference-count semantics now that the secoid tables are shared
by multiple shared libraries?

Comment 9

8 years ago
> We can add "else" at line 464 and set an error that we will check later. This
> will work if no other code overwrites the set error. If it does, then nss will
> still clean up the secoid table.

yeah, sigh. I would be tempted just to punt on freeing the secoid table at that point. Softoken will free it when it finalizes.


> Bob, it is true that if nss uses the cooperate flag, but initializes an
> additional module, then that module will not be finalized as well. 

The flag is trying to do something fairly unnatural. Leaking is the least of our worries;).

Comment 10

8 years ago
re wtc:

That's probably the cleanest solution.

bob
It's clear to me that there's a pattern here.  Many of the init/finalize
function pairings in NSS have this same problem.  Examples include
  NSS_Init     / NSS_Shutdown
  C_Initialize / C_Finalize
  SECOID_INit  / SECOID_Shutdown

There are many more examples.  They all have the property that, when they
were designed, it was assumed that there would be only one caller, an 
assumption that has subsequently become false as software has become more
complex and multiple callers can exist who don't know about one another.

I think that they all have the same problem, and the quick-and-easy 
approaches to fixing them (such as mere reference counting) have the same problems with all of them.  Ref Counting works for APIs that are designed 
to use that from the start, but is a problem when added to APIs after the 
fact, APIs that did not initially use it.  

I think we may need to do something similar to what Bob has recently done 
for NSS_Init.  That is, add a new version of init that returns a unique 
handle, and a new finalize that requires that handle to be unique.  Then 
references can be associated with the unique handles.  

I think this is the solution to the PKCS#11 problem too, the one that Julien
tried to address with NSS_INIT_COOPERATE.  Unfortunately, I'm afraid pushing
that change through the PKCS#11 WG will be a Herculean task.
(Assignee)

Comment 12

8 years ago
Created attachment 405925 [details] [diff] [review]
Patch v1 - return uniq handles to context and use reference-count semantics
Assignee: nelson → alexei.volkov.bugs
Attachment #405925 - Flags: review?(nelson)
(Assignee)

Updated

8 years ago
Attachment #405925 - Attachment is patch: true
Attachment #405925 - Attachment mime type: application/octet-stream → text/plain

Comment 13

8 years ago
Comment on attachment 405925 [details] [diff] [review]
Patch v1 - return uniq handles to context and use reference-count semantics

Unlike NSS_Initialize and NSS_Shutdown, SECOID_Init and SECOID_Shutdown
are only used by NSS itself (shared libraries and a single command,
blapitest, which doesn't use NSS shared libraries), and they're only
recently exported in NSS 3.12 (from the new nssutil3 library).

A simple change to add reference-counting semantics to SECOID_Init and
SECOID_Shutdown will suffice.  Adding new functions SECOID_InitContext
and SECOID_ShutdownContext is an overkill, and will merely make it
hard on ourselves.

The SECOIDInitContext structure is not useful unless you envision
different NSS shared libraries will have different OID tables.
(Reporter)

Updated

8 years ago
Attachment #405925 - Flags: review?(nelson) → review-
Comment on attachment 405925 [details] [diff] [review]
Patch v1 - return uniq handles to context and use reference-count semantics

r- because the modified blapitest.c won't compile.

In addition, I have a few suggestions that might be improvements.
1. immediately after each call to SECOID_ShutdownContext, set the 
value of the corresponding static context pointer to NULL.
2. Immediately before each call to SECOID_InitContext, check and 
make sure that the corresponding static context pointer is NULL.

Comment 15

8 years ago
I also think we can fix this without adding a SECOID_InitContext and SECOID_ShutdownContext.

Before we add all those, we should have a patch that does not require changes to softoken and nssdbm. It can be as simple as simply moving the init/shutdown in NSS init to being as complete as adding reference counts to SECOID_Init as wtc suggests.

Please do not submit a patch that tries to change softoken or softoken/legacydb

bob
(Assignee)

Comment 16

8 years ago
(In reply to comment #15)
Understand that the patch is not suitable for 3.12.x because of the changes in the code inside fips crypto boundary.

> Please do not submit a patch that tries to change softoken or softoken/legacydb
Why not? I don't think we should impose 3.12 post fips validation rules on the next major release.

Comment 17

8 years ago
Alexei:

I object to the patch even without the FIPS crypto boundary
issue.  The solution should match the scope of the problem.
This problem is recently introduced in NSS 3.12 and only
affects NSS itself.  So we should use a simple
reference-counting solution.  Otherwise we are just making
it hard for ourselves.

Comment 18

8 years ago
1) So it seems to me the priority is to fix 3.12, at least first.
2) As Wan-Teh pointed out there are solutions that does not require mucking with the FIPS boundary. If such a solution works in both 3.12 and 3.13, we should adopt it.

I would be OK with a solution that changes code in the FIPS boundary if:
1) the 3.12 fix is already available, and 
2) a compelling reason is given why a general fix that doesn't change the FIPS boundary is deficient.

bob
(Assignee)

Comment 19

8 years ago
It is possible to use atomic function to adjust ref-count, but it seems not to be valuable in this case since SECOID init and shutdown functions are not capable to work in non-thread safe environment.

I wonder if adding use of atomic function will convey wrong impression about init and shutdown functions to the people who are going to use them. I prefer not to use atomic function, but what is your opinion, Wan-Teh, Bob?

Comment 20

8 years ago
The use of atomic functions is valuable after the first SECOID_Init
function is called.  I would still use atomic functions.  Note: SECOID_Init
is really hidden underneath the current NSS_Initialize and the new
NSS_InitContext function, which could be called simultaneously
by multiple threads.
(Assignee)

Comment 21

8 years ago
Created attachment 407557 [details] [diff] [review]
Patch v2 - simple reference-counting solution

Wan-Teh, Nelson, thanks for suggestion. Please review.
Attachment #405925 - Attachment is obsolete: true
Attachment #407557 - Flags: review?(wtc)
(Assignee)

Updated

8 years ago
Attachment #407557 - Attachment is obsolete: true
Attachment #407557 - Flags: review?(wtc)
(Assignee)

Comment 22

8 years ago
Comment on attachment 407557 [details] [diff] [review]
Patch v2 - simple reference-counting solution

Wrong patch.
(Assignee)

Comment 23

8 years ago
Created attachment 407558 [details] [diff] [review]
Patch v2 - simple reference-counting solution
Attachment #407558 - Flags: review?(wtc)

Comment 24

8 years ago
Comment on attachment 407558 [details] [diff] [review]
Patch v2 - simple reference-counting solution

>+static int           secoidInitCount;

Nit: Use PRInt32 instead to match the prototype of
PR_AtomicIncrement and PR_AtomicDecrement.

>+    if (secoidInitCount) {
>+        PR_AtomicIncrement(&secoidInitCount);
>+        return SECSuccess;
>+    }

This is wrong.  You should never test secoidInitCount
directly.  Instead, test the return value of
PR_AtomicIncrement.

This should read:

    if (PR_AtomicIncrement(&secoidInitCount) > 1) {
        /* Already initialized. */
        return SECSuccess;
    }

>@@ -1920,6 +1927,7 @@ SECOID_Init(void)
>     }
> 
>     PORT_Assert (i == SEC_OID_TOTAL);
>+    PR_AtomicIncrement(&secoidInitCount);

This should be removed.

> SECStatus
> SECOID_Shutdown(void)
> {
>+    secoidInitCount = PR_AtomicDecrement(&secoidInitCount);
>+    if (secoidInitCount > 0) {
>+        return SECSuccess;
>+    }

Same problem here: you can't test secoidInitCount directly.
Test the return value of PR_AtomicDecrement instead.  This
should read:

    if (PR_AtomicDecrement(&secoidInitCount) > 0) {
        return SECSuccess;
    }
Attachment #407558 - Flags: review?(wtc) → review-

Comment 25

8 years ago
>> SECStatus
>> SECOID_Shutdown(void)
>> {
>>+    secoidInitCount = PR_AtomicDecrement(&secoidInitCount);
>>+    if (secoidInitCount > 0) {
>>+        return SECSuccess;
>>+    }
>
> Same problem here: you can't test secoidInitCount directly.
> Test the return value of PR_AtomicDecrement instead.  This
> should read:
>
>    if (PR_AtomicDecrement(&secoidInitCount) > 0) {
>        return SECSuccess;
>    }

Note that your previous code (in the shutdown case) would have been acceptable if you used a temporary instead of the atomic variable itself (e.g.):

 SECStatus
 SECOID_Shutdown(void)
 {
+    PRInt32 mySecoidInitCount = PR_AtomicDecrement(&secoidInitCount);
+    if (mySecoidInitCount > 0) {
+        return SECSuccess;
+    }
(Reporter)

Updated

8 years ago
Priority: -- → P1
Target Milestone: --- → 3.12.5
(Assignee)

Comment 26

8 years ago
Created attachment 410040 [details] [diff] [review]
Patch v3 - simple reference-counting solution

OK, I think I've fixed most of the problems. I think we agree that code will operate with an assumption, that the first all to SECOID_Init will be done from thread safe env.

One thing that I'm not sure, is the code in SECOID_Shutdown right after atomic decrement function. At that point I need to decide what to do, if caller has called SECOID_Shutdown too many times. Nelson helped me out by suggesting to restore the value of secoidInitCount as it was before we decremented it and return a failure.
Attachment #407558 - Attachment is obsolete: true
Attachment #410040 - Flags: superreview?(rrelyea)
Attachment #410040 - Flags: review?(wtc)

Comment 27

8 years ago
Comment on attachment 410040 [details] [diff] [review]
Patch v3 - simple reference-counting solution

r+

I do have some comments.

There may be some races still, but I don't think they are possible in real life:

Caller 1 calls SECOID_Init(). -- initiallization starts/task switch.

Caller 2 calls SECOID_Init(), it returns and then calls a SECOID function without the hashtables set up (boom).

Second race:

Caller 1 has the oid table initialized. He is the last user and calls SECOID_Shutdown. We decrement the count and get the value *task switch*

Caller 2 call SECOID_Init(). Count is zero so he proceeds to initialize. We not the hash tables already initialized, asserts and returns an error.

I don't think Race 1 is possible in our current code unless we are racing init's (which is a completely different problem we are trying to solve). We have a comment in the code to this effect, so I think we can ignore this race.

Race 2 appears to be possible in the case where we don't have a clean shutdown. NSS shuts down, but the PKCS #11 module hasn't been Finalized because of an outstanding reference. That reference is freed on another thread at the same time as the process calls NSS_Init again. This is also unlikely. outstanding references seem to be lost references, not references that may come back.

RE handling multi shutdowns. I think your handling is fine. There is little we can do about a pathelogical situation. If recovery is possible, your code does the right think. I may suggest an Assert there, though. Debug builds should warn the caller something is horribly wrong -- fix it;).

bob
Attachment #410040 - Flags: superreview?(rrelyea) → superreview+

Comment 28

8 years ago
Comment on attachment 410040 [details] [diff] [review]
Patch v3 - simple reference-counting solution

r=wtc.

In general you should avoid fixing style issues because
such whitespace changes make the patch larger and obscure
the actual changes.

In lib/util/secoid.c:

>     for ( i = 0; i < SEC_OID_TOTAL; i++ ) {
>-	oid = &oids[i];
>+        PLHashEntry *entry;
>+        const SECOidData *oid = &oids[i];

The indentation of these two lines is different from the
the other top-level statements inside this for loop.  I
know their indentation is wrong (should be 4), but it's
more important to be consistent.

>+loser:
>+    /* Initial call of SECOID_Init is from thread safe env. At this point
>+     * secoidInitCount is equal to 1. SECOID_Shutdown will restore initial
>+     * value of globals and set secoidInitCount to 0. */
>+    SECOID_Shutdown();
>+    PORT_SetError(SEC_ERROR_LIBRARY_FAILURE);
>+    return SECFailure;

We should move the PORT_Assert(0) calls here, too.

In lib/util/secoid.h:

>+ * Caller application must ensure that initial call of the function is done
>+ * from thread safe environment.

Remove "application" because the caller may be a library.
Did you really mean an application?
Attachment #410040 - Flags: review?(wtc) → review+
(Assignee)

Comment 29

8 years ago
> >+loser:
> >+    /* Initial call of SECOID_Init is from thread safe env. At this point
> >+     * secoidInitCount is equal to 1. SECOID_Shutdown will restore initial
> >+     * value of globals and set secoidInitCount to 0. */
> >+    SECOID_Shutdown();
> >+    PORT_SetError(SEC_ERROR_LIBRARY_FAILURE);
> >+    return SECFailure;
> 
> We should move the PORT_Assert(0) calls here, too.

I didn't move assert thinking that it would be valuable to know a specific place where a problem has occurred.

> 
> In lib/util/secoid.h:
> 
> >+ * Caller application must ensure that initial call of the function is done
> >+ * from thread safe environment.
> 
> Remove "application" because the caller may be a library.
> Did you really mean an application?
Well, these are public functions now. We initialize it in blapitest.c:441 before decoding - giving an example of how to do so. It is statically linked test program, but if an application would use only utils, then it should not be so.
I can imagine, that an application at least may do the same thing: initialize the oid table and use it for verification of a data.
(Assignee)

Comment 30

8 years ago
> I may suggest an Assert there, though. Debug builds should
> warn the caller something is horribly wrong -- fix it;).
The code asserts that initCount>=0 after decrementing secoidInitCount.
(Assignee)

Comment 31

8 years ago
patch integrated
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Assignee)

Comment 32

8 years ago
Today I was informed, that web server integration tests failed due to the patch.
They run

modutil -list -nocertdb -dbdir nss

We assert at shutdown with the stack:
#4  0x003a1b9b in PR_Assert (s=0x10631d "initCount >= 0", file=0x106286 "secoid.c", ln=2085) at ../../../../pr/src/io/prlog.c:563
#5  0x00100756 in SECOID_Shutdown () at secoid.c:2085
#6  0x005724f2 in nsc_CommonFinalize (pReserved=0x0, isFIPS=0) at pkcs11.c:2752
#7  0x005726a8 in NSC_Finalize (pReserved=0x0) at pkcs11.c:2815
#8  0x001ffa12 in SECMOD_UnloadModule (mod=0x808c10) at pk11load.c:553
#9  0x0021aba2 in SECMOD_SlotDestroyModule (module=0x808c10, fromSlot=1) at pk11util.c:872
#10 0x002163ca in PK11_DestroySlot (slot=0x80c600) at pk11slot.c:451
#11 0x002163fa in PK11_FreeSlot (slot=0x80c600) at pk11slot.c:464
#12 0x0021aaa7 in SECMOD_DestroyModule (module=0x808c10) at pk11util.c:843
#13 0x0021ac04 in SECMOD_DestroyModuleListElement (element=0x4062d0) at pk11util.c:889
#14 0x0021ac37 in SECMOD_DestroyModuleList (list=0x4062d0) at pk11util.c:905
#15 0x00219845 in SECMOD_Shutdown () at pk11util.c:99
#16 0x001d101c in NSS_Shutdown () at nssinit.c:894
#17 0x00003b33 in main (argc=5, argv=0xbffff254) at modutil.c:917

The problem is an imbalanced number of callers to SECOID_Init and SECOID_Shutdown: due to -nocertdb argument, legacy_Open function that suppose to call SECOID_Init, is not called. http://mxr.mozilla.org/security/source/security/nss/lib/softoken/pkcs11.c#2145. Yet, we unconditionally call sftkdb_Shutdown from nsc_CommonFinalize.

The missing stack:
#0  SECOID_Init () at secoid.c:1865
#1  0x00688486 in legacy_Open (configdir=0x4062e0 "nss", certPrefix=0x4062f0 "", keyPrefix=0x406300 "", certVersion=8, keyVersion=3, flags=1, certDB=0xbfffebcc, keyDB=0xbfffebd0) at lginit.c:608
#2  0x0056aa70 in sftkdbCall_open (dir=0x4062e0 "nss", certPrefix=0x4062f0 "", keyPrefix=0x406300 "", certVersion=8, keyVersion=3, flags=1, isFIPS=0, certDB=0xbfffebcc, keyDB=0xbfffebd0) at lgglue.c:362
#3  0x00590ab7 in sftk_DBInit (configdir=0x4062e0 "nss", certPrefix=0x4062f0 "", keyPrefix=0x406300 "", updatedir=0x406280 "", updCertPrefix=0x0, updKeyPrefix=0x0, updateID=0x406290 "", readOnly=1, noCertDB=0, noKeyDB=0, forceOpen=0, isFIPS=0, certDB=0xbfffec94, keyDB=0xbfffec90) at sftkdb.c:2585
#4  0x005713b5 in SFTK_SlotReInit (slot=0x414b50, configdir=0x4062e0 "nss", updatedir=0x406280 "", updateID=0x406290 "", params=0x406668, moduleIndex=0) at pkcs11.c:2148
#5  0x00571733 in SFTK_SlotInit (configdir=0x4062e0 "nss", updatedir=0x406280 "", updateID=0x406290 "", params=0x406668, moduleIndex=0) at pkcs11.c:2260
#6  0x00572383 in nsc_CommonInitialize (pReserved=0xbfffee04, isFIPS=0) at pkcs11.c:2648
#7  0x0057242c in NSC_Initialize (pReserved=0xbfffee04) at pkcs11.c:2710
#8  0x001fed3a in secmod_ModuleInit (mod=0x808c10, reload=0xbfffef20, alreadyLoaded=0xbfffee6c) at pk11load.c:214
#9  0x001ff6b8 in secmod_LoadPKCS11Module (mod=0x808c10, oldModule=0xbfffef20) at pk11load.c:460
#10 0x0020d0a2 in SECMOD_LoadModule (modulespec=0x404da0 " name=\"NSS Internal PKCS #11 Module\" parameters=\"configdir='nss' certPrefix='' keyPrefix='' secmod='' flags=readOnly updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescript"..., parent=0x805c10, recurse=1) at pk11pars.c:775
#11 0x0020d155 in SECMOD_LoadModule (modulespec=0x403410 "name=\"NSS Internal Module\" parameters=\"configdir='nss' certPrefix='' keyPrefix='' secmod='' flags=readOnly updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescription='' \" N"..., parent=0x0, recurse=1) at pk11pars.c:804
#12 0x001d04e2 in nss_Init (configdir=0x300c0 "nss", certPrefix=0x22394 "", keyPrefix=0x22394 "", secmodName=0x0, updateDir=0x2f76d4 "", updCertPrefix=0x2f76d4 "", updKeyPrefix=0x2f76d4 "", updateID=0x2f76d4 "", updateName=0x2f76d4 "", readOnly=1, noCertDB=0, noModDB=0, forceOpen=0, noRootInit=0, optimizeSpace=0, noSingleThreadedModules=0, allowAlreadyInitializedModules=0, dontFinalizeModules=0) at nssinit.c:537
#13 0x001d08e2 in NSS_Initialize (configdir=0x300c0 "nss", certPrefix=0x22394 "", keyPrefix=0x22394 "", secmodName=0x0, flags=1) at nssinit.c:658
#14 0x0000336d in init_crypto (create=0, readOnly=1) at modutil.c:682
#15 0x00003877 in main (argc=4, argv=0xbffff268) at modutil.c:862
we may have to back this patch out in order to get 3.12.5 shipped with 
the TLS fix this week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 34

8 years ago
Bob, we may want to move SECOID_Init into the glue layer. I think it will be ok to initialize secoid table any time we load legacy library and define callbacks for it. I'm talking about moving the init call into sftkdbLoad_Legacy function an place it in the location after all symbols on the library have been loaded.
Shutdown call would go to sftkdbCall_Shutdown right before unloading a library.

What do you think?
That may be a good idea, but I doubt we'll do it in 3.12.x because of FIPS.
Because of the extremely urgent need to get the TLS renegotiation 
vulnerability fix released, this bug must be entirely fixed or the patch
backed out by close of business California time today. Sorry.
(Assignee)

Comment 37

8 years ago
patch in attachment 410040 [details] [diff] [review] is backed out.
(Assignee)

Updated

8 years ago
Blocks: 544289
We have a new topcrash on the trunk (#12) - http://tinyurl.com/23m9tzj links to the crash reports, which seem to match some of the stacks in Comment 0 posted by Nelson. Would this bug cover these crashes or should a file a new bug? All of the bug comments indicated they were checking for addons updates when they crashed.
Alexei, this was a P1 bug for Sun, and was never fixed.  
Do you still want to fix it?
Target Milestone: 3.12.5 → 3.12.8

Updated

7 years ago
Severity: normal → critical
Summary: Crash in secoid code, trying to reinitialize after failed NSS shutdown → Crash in secoid code, trying to reinitialize after failed NSS shutdown [@ PL_HashTableLookupConst ]

Updated

7 years ago
blocking2.0: --- → ?

Comment 40

7 years ago
moved to #3 top crash for firefox 4.0b5

guessing this is a new bug, or maybe new changes that exposed an old bug.  either way we need to figure out a way to get this new topcrash fixed.

The new stacks look like

Frame  	Module  	Signature [Expand]  	Source
0 	plds4.dll 	PL_HashTableLookupConst 	nsprpub/lib/ds/plhash.c:381
1 	nssutil3.dll 	SECOID_FindOID_Util 	security/nss/lib/util/secoid.c:1997
2 	nssutil3.dll 	SECOID_FindOIDTag_Util 	security/nss/lib/util/secoid.c:2013
3 	nssutil3.dll 	SECOID_GetAlgorithmTag_Util 	security/nss/lib/util/secalgid.c:49
4 	nss3.dll 	seckey_ExtractPublicKey 	security/nss/lib/cryptohi/seckey.c:1044
5 	xul.dll 	nsDataSignatureVerifier::VerifyData 	security/manager/ssl/src/nsDataSignatureVerifier.cpp:87
6 	xul.dll 	NS_InvokeByIndex_P 	xpcom/reflect/xptcall/src/md/win32/xptcinvoke.cpp:102
7 	xul.dll 	js::InvokeCommon<int > 	js/src/jsinterp.cpp:566
8 	xul.dll 	js::Invoke 	js/src/jsinterp.cpp:696
9 	xul.dll 	js::Interpret 	js/src/jsinterp.cpp:4707
10 	xul.dll 	js::InvokeCommon<int > 	js/src/jsinterp.cpp:577
11 	xul.dll 	js::Invoke 	js/src/jsinterp.cpp:696
12 	xul.dll 	js::InternalInvoke 	js/src/jsinterp.cpp:736
13 	xul.dll 	nsXPCWrappedJSClass::CallMethod 	js/src/xpconnect/src/xpcwrappedjsclass.cpp:1688
14 	xul.dll 	nsXPCWrappedJS::CallMethod 	js/src/xpconnect/src/xpcwrappedjs.cpp:570
15 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:114
16 	xul.dll 	SharedStub 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:141
17 	xul.dll 	nsDOMEventListenerWrapper::HandleEvent 	content/events/src/nsDOMEventTargetHelper.cpp:65
18 	xul.dll 	nsEventListenerManager::HandleEventSubType 	content/events/src/nsEventListenerManager.cpp:1112


nss was updated in july,  
2010-07-19 07:45 +0200    Kai Engert - Bug 575620, landing NSS 3.12.7 beta 2

and the js code further down in the the stack has changes in august.

if its addons manager as suggested by user comments that code has been changing too.   probably need to try and isolate this to a regression range where we saw and uptick or volume increase.

Comment 41

7 years ago
we went from 0-4 crashes per day during aug 1-17 on older releases to 56 crashes per day on aug 18.  then a spike yesterday to 243 crashes as beta5 went live to a larger testing pool.

20100816 4   , 1 3.6.8 , 1 3.6.6 , 1 3.6 , 1 3.5.11 ,
20100817    ,
20100818 56   , 53 4.0b5pre , 2 3.6.8 , 1 3.5.11 ,
20100819 40   , 36 4.0b5pre , 2 3.6.8 , 1 3.5.7 , 1 3.5.11 ,
20100820 30   , 29 4.0b5pre , 1 3.6.7 ,
20100821 25   4.0b5pre 25   ,
20100822 15   , 12 4.0b5pre , 1 3.7a5 , 1 3.6.3 , 1 3.5.11 ,
20100823 30   , 29 4.0b5pre , 1 3.6.8 ,
20100824 19   , 17 4.0b5pre , 1 3.6.8 , 1 3.6.3 ,
20100825 17  4.0b5pre  17   ,
20100826 19   , 15 4.0b5pre , 2 3.6 , 2 3.5.11 ,
20100827 22   , 19 4.0b5pre , 3 3.6.8 ,
20100828 13   , 11 4.0b5pre , 1 3.6.8 , 1 3.1b2 ,
20100829 21   , 15 4.0b5pre , 4 3.6 , 1 4.0b4 , 1 3.6.8 ,
20100830 22   , 19 4.0b5pre , 1 3.6b5 , 1 3.6.8 , 1 3.6.6 ,
20100831 28   , 13 4.0b5 , 11 4.0b5pre , 2 3.6.8 , 1 4.0b6pre , 1 4.0b4 ,
20100901 57   , 40 4.0b6pre , 10 4.0b5 , 4 4.0b5pre , 2 3.5.11 , 1 4.0b4 ,
20100902 42   , 25 4.0b6pre , 8 4.0b5 , 5 4.0b5pre , 2 3.5.11 , 1 3.6.8 , 
20100903 33   , 19 4.0b6pre , 5 4.0b5pre , 5 4.0b5 , 1 4.0b4 , 1 3.6.8 ,
20100904 28   , 17 4.0b6pre , 7 4.0b5pre , 3 4.0b5 , 1 3.6.9 ,
20100905 23   , 16 4.0b6pre , 2 4.0b5pre , 2 3.6.8 , 1 4.0b5 , 1 4.0b4 , 
20100906 18   , 13 4.0b6pre , 3 4.0b5pre , 2 4.0b5 ,
20100907 243   , 220 4.0b5 , 14 4.0b6pre , 5 4.0b5pre , 2 3.5.5 , 1 3.6.8 ,

Comment 42

7 years ago
of those spiking crashes on the 08/18 these are the build breakdown
43 from 4.0b5pre2010081811 
7  from 4.0b5pre2010081807 
3  from 4.0b5pre2010081803
Of course, what really needs to be fixed is the non-NSS bug that held 
onto a reference to an NSS object that caused the NSS shutdown to fail.  
Even if NSS doesn't crash, if it cannot shutdown, then thereafter the 
application is in an inconsistent security state..  The application 
should not continue.  

For Firefox, the top priority should be to fix whatever code OUTSIDE of NSS 
is holding a reference, thereby preventing NSS from shutting down.
Whiteboard: See comment 43
(Reporter)

Updated

7 years ago
Keywords: topcrash
See also bug 588931 comment 10 and bug 588511.
I think those provide major clues about how this happens.

Comment 45

7 years ago
I believe bug 588511 will have fixed the Mozilla topcrash here, so this doesn't block.
blocking2.0: ? → -

Updated

7 years ago
Target Milestone: 3.12.8 → 3.12.9
Crash Signature: [@ PL_HashTableLookupConst ]
Also occurred in Aurora (ffx 8 ) for Fennec 
https://crash-stats.mozilla.com/report/list?range_value=7&range_unit=days&date=2011-09-26%2008%3A00%3A00&signature=PL_HashTableLookupConst&version=Fennec%3A8.0a2
Keywords: crash
Whiteboard: See comment 43 → See comment 43, [mobile-crash]

Comment 47

6 years ago
It's #5 top crasher in Fennec 8.0.
I filed bug 702307 to deal with the Gecko / Firefox Mobile aspect of this. Please comment in bug 702307 for anything regarding Firefox crashes.
Keywords: topcrash
Whiteboard: See comment 43, [mobile-crash] → [See bug 702307 for the Gecko crash with the same signature]
Whiteboard: [See bug 702307 for the Gecko crash with the same signature] → [See bug 702307 for the Gecko crash with the same signature], [mobile-crash]

Updated

6 years ago
Depends on: 716345

Updated

6 years ago
Crash Signature: [@ PL_HashTableLookupConst ] → [@ PL_HashTableLookupConst ] [@ PL_HashTableLookupConst | SECOID_FindOID_Util]

Updated

6 years ago
Summary: Crash in secoid code, trying to reinitialize after failed NSS shutdown [@ PL_HashTableLookupConst ] → Crash in secoid code, trying to reinitialize after failed NSS shutdown in SECOID_FindOID_Util
You need to log in before you can comment on or make changes to this bug.