Open Bug 517615 Opened 16 years ago Updated 1 year ago

Crash in secoid code, trying to reinitialize after failed NSS shutdown in SECOID_FindOID_Util

Categories

(NSS :: Libraries, defect, P3)

3.12.4
Sun
Solaris

Tracking

(blocking2.0 -)

REOPENED
3.12.9
Tracking Status
blocking2.0 --- -

People

(Reporter: nelson, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, Whiteboard: [See bug 702307 for the Gecko crash with the same signature], [mobile-crash])

Crash Data

Attachments

(1 file, 3 obsolete files)

I am filing this bug in UNCONFIRMED state, because many of the details below are based on hearsay, and I am not very certain of them. A certain tool used to upgrade a server installation from an old version of the server product to a new version of that product attempts to change the password on the server's key DB. ACCORDING TO THE REPORTER, the sequence of events is: - program initializes NSS successfully - program walks through slots looking for token by name, (leaking at least one) - program changes the password on the token, succeeds - program attempts to shutdown NSS, fails (due to leaked slot reference) (It may be that the program forks here, and the steps below occur in a child) - program attempts to initialize NSS a second time. NSS crashes during this second initialization attempt. I have attempted to reconstruct the compolete crash stack. The crash stack is believed to be: [libplds4.so+0x1858] PL_HashTableLookupConst+0x4 [plhash.c#376] [libnssutil3.so+0x11ad8] SECOID_FindOID_Util+0x30 [secoid.c#1949] [libnssutil3.so+0x11b74] SECOID_FindOIDTag_Util+0xc [secoid.c#1965] [libnssdbm3.so+0x46d4] nsslowkey_GetPWCheckEntry [keydb.c#1384] [libnssdbm3.so+0x58a8] lg_GetMetaData [keydb.c#2225] [libsoftokn3.so+0x38454] sftkdb_HasPasswordSet [sftkpwd#688] [libsoftokn3.so+0x159c8] NSC_GetTokenInfo [pkcs11.c#2978] [libnss3.so+0x598d8] PK11_InitToken [pk11slot.c#1117] [libnss3.so+0x59f7c] PK11_InitSlot [pk11slot.c#1366] [libnss3.so+0x3f5e4] SECMOD_LoadPKCS11Module [pk11load.c#428] (*) [libnss3.so+0x4f0fc] SECMOD_LoadModule+0xe0 [pk11pars.c#392] [libnss3.so+0x4f158] SECMOD_LoadModule+0x13c [pk11pars.c#412] [libnss3.so+0x10a74] nss_init [nssinit.c#537] [libnss3.so+0x10e24] NSS_Initialize+0x108 [nssinit.c#655] (*) Note that this is AFTER C_Initialize was called on the slot/token by function secmod_ModuleInit which was called at line 378. I believe that PL_HashTableLookupConst crashed because the first argument passed to it was NULL, or an otherwise bad pointer value, and it attempts to dereference that without checking it for NULL. I will assume it was NULL. There is an assertion, but no code to avoid crashing in the non-DEBUG case. So, this could be an NSPR bug. But I'm interested in why SECOID_FindOID_Util passed a NULL value for the first argument. The relevant code in SECOID_FindOid_Util is 1942 SECOidData * 1943 SECOID_FindOID(const SECItem *oid) 1944 { 1945 SECOidData *ret; 1946 1947 PR_ASSERT(oidhash != NULL); 1948 1949 ret = PL_HashTableLookupConst ( oidhash, oid ); Here again, there is an assertion to detect a NULL oidhash value in DEBUG builds, but no test to avoid a crash in non-DEBUG builds. Fortunately, this is in libUTIL which is not frozen for FIPS. The question is, how did we get here with oidhash being NULL? The static variable oidhash is initialized by a single function, SECOID_Init(). SECOID_Init is called from three separate places in NSS shared libraries. They are: 1) In nsc_CommonInitialize at line 2576 in softoken 2) In legacy_Open in the legacy nssdbm, called from softoken function sftkdbCall_open, called from sftk_DBInit, called from SFTK_SlotReInit, called from SFTK_SlotInit or sftk_CreateNewSlot called from nsc_CommonInitialize at line 2648 3) In nss_Init at line 546, Now, we know that NSC_Initialize has been called, but if the module still thought it was loaded and initialized, then it NOT have called nsc_CommonInitialize. So, if there is a way that the softoken stayed loaded but libnssutil got unloaded and reloaded, THAT would explain this crash. So, that is one hypothesis.
Here is the stack on the debug build: =>[4] SECOID_FindOID_Util(oid = 0xffbfd2a0), line 1947 in "secoid.c" [5] SECOID_FindOIDTag_Util(oid = 0xffbfd2a0), line 1965 in "secoid.c" [6] nsslowkey_GetPWCheckEntry(handle = 0x885888, entry = 0xffbfd338), line 1384 in "keydb.c" [7] lg_GetMetaData(sdb = 0x809818, id = 0xf88e3420 "password", item1 = 0xffbfd658, item2 = 0xffbfd64c), line 2225 in "keydb.c" [8] sftkdb_HasPasswordSet(keydb = 0x9f1948), line 688 in "sftkpwd.c" [9] NSC_GetTokenInfo(slotID = 2U, pInfo = 0xffbfd78c), line 2978 in "pkcs11.c" [10] PK11_InitToken(slot = 0xbd89b8, loadCerts = 1), line 1117 in "pk11slot.c" [11] PK11_InitSlot(mod = 0xb471f8, slotID = 2U, slot = 0xbd89b8), line 1366 in "pk11slot.c" [12] SECMOD_LoadPKCS11Module(mod = 0xb471f8), line 428 in "pk11load.c" [13] SECMOD_LoadModule(modulespec = 0x698320 " name="NSS Internal PKCS #11 Module" parameters="configdir='/space/v21/p5b02sbs/domains/domain1/config' certPrefix='' keyPrefix='' secmod='secmod.db' flags= updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescription='' " NSS="trustOrder=75 cipherOrder=100 slotParams={0x00000001=[slotFlags=RSA,RC4,RC2,DES,DH,SHA1,MD5,MD2,SSL,TLS,AES,Camellia,SEED,RANDOM askpw=any timeout=30 ] } Flags=internal,critical"", parent = 0xb46bb0, recurse = 1), line 392 in "pk11pars.c" [14] SECMOD_LoadModule(modulespec = 0x7fb1a8 "name="NSS Internal Module" parameters="configdir='/space/v21/p5b02sbs/domains/domain1/config' certPrefix='' keyPrefix='' secmod='secmod.db' flags= updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescription='' " NSS="flags=internal,moduleDB,moduleDBOnly,critical"", parent = (nil), recurse = 1), line 412 in "pk11pars.c" [15] nss_Init(configdir = 0xaf9d48 "/space/v21/p5b02sbs/domains/domain1/config", certPrefix = 0xd43e456c "", keyPrefix = 0xd43e4570 "", secmodName = 0xd43e4574 "secmod.db", updateDir = 0xd1ebbf14 "", updCertPrefix = 0xd1ebbf18 "", updKeyPrefix = 0xd1ebbf1c "", updateID = 0xd1ebbf20 "", updateName = 0xd1ebbf24 "", readOnly = 0, noCertDB = 0, noModDB = 0, forceOpen = 0, noRootInit = 0, optimizeSpace = 0, noSingleThreadedModules = 1, allowAlreadyInitializedModules = 1, dontFinalizeModules = 1), line 537 in "nssinit.c" [16] NSS_Initialize(configdir = 0xaf9d48 "/space/v21/p5b02sbs/domains/domain1/config", certPrefix = 0xd43e456c "", keyPrefix = 0xd43e4570 "", secmodName = 0xd43e4574 "secmod.db", flags = 960U), line 668 in "nssinit.c" The problem seems to be in NSC_GetTokenInfo line 2956 in pkcs11.c: the handle returned by sftk_getKeyDB is not null as it suppose to be at this point of execution. PORT_Memcpy(pInfo->label,slot->tokDescription,sizeof(pInfo->label)); handle = sftk_getKeyDB(slot); pInfo->flags = CKF_RNG | CKF_DUAL_CRYPTO_OPERATIONS; if (handle == NULL) { ... } else { /* * we have three possible states which we may be in: * (1) No DB password has been initialized. This also means we * have no keys in the key db. * (2) Password initialized to NULL. This means we have keys, but * the user has chosen not use a password. * (3) Finally we have an initialized password whicn is not NULL, and * we will need to prompt for it. */ if (sftkdb_HasPasswordSet(handle) == SECFailure) { The handle pointer seems to be valid: (dbx) print *handle *handle = { db = 0x809818 ref = 2 type = 1073741824U passwordKey = { type = 10426712 data = (nil) len = 0 } newKey = (nil) oldKey = (nil) updatePasswordKey = (nil) passwordLock = 0x9f1988 peerDB = 0x80aa88 update = (nil) updateID = (nil) updateDBIsInit = -1 } (dbx) print *handle->db *handle->db = { private = 0x8848d0 version = 1667575670 sdb_type = SDB_LEGACY sdb_flags = 4 app_private = 0x9f1948 sdb_FindObjectsInit = 0xd40b4738 = &lg_FindObjectsInit() sdb_FindObjects = 0xd40b4870 = &lg_FindObjects() sdb_FindObjectsFinal = 0xd40b4970 = &lg_FindObjectsFinal() sdb_GetAttributeValue = 0xd40ae4d8 = &lg_GetAttributeValue() sdb_SetAttributeValue = 0xd40af430 = &lg_SetAttributeValue() sdb_CreateObject = 0xd40b1d10 = &lg_CreateObject() sdb_DestroyObject = 0xd40b1f60 = &lg_DestroyObject() sdb_GetMetaData = 0xd40a91d8 = &lg_GetMetaData(SDB *sdb, const char *id, SECItem *item1, SECItem *item2) sdb_PutMetaData = 0xd40a9318 = &lg_PutMetaData(SDB *sdb, const char *id, const SECItem *item1, const SECItem *item2) sdb_Begin = 0xd40b5320 = &lg_Begin() sdb_Commit = 0xd40b53a8 = &lg_Commit() sdb_Abort = 0xd40b5438 = &lg_Abort() sdb_Reset = 0xd40a9438 = &lg_Reset(SDB *sdb) sdb_Close = 0xd40b5bb0 = &lg_Close() sdb_SetForkState = 0xd40b5b60 = &lg_SetForkState() } (dbx) print *handle *handle = { db = 0x809818 ref = 2 type = 1073741824U passwordKey = { type = 10426712 data = (nil) len = 0 } newKey = (nil) oldKey = (nil) updatePasswordKey = (nil) passwordLock = 0x9f1988 peerDB = 0x80aa88 update = (nil) updateID = (nil) updateDBIsInit = -1 } (dbx) print *handle->db *handle->db = { private = 0x8848d0 version = 1667575670 sdb_type = SDB_LEGACY sdb_flags = 4 app_private = 0x9f1948 sdb_FindObjectsInit = 0xd40b4738 = &lg_FindObjectsInit() sdb_FindObjects = 0xd40b4870 = &lg_FindObjects() sdb_FindObjectsFinal = 0xd40b4970 = &lg_FindObjectsFinal() sdb_GetAttributeValue = 0xd40ae4d8 = &lg_GetAttributeValue() sdb_SetAttributeValue = 0xd40af430 = &lg_SetAttributeValue() sdb_CreateObject = 0xd40b1d10 = &lg_CreateObject() sdb_DestroyObject = 0xd40b1f60 = &lg_DestroyObject() sdb_GetMetaData = 0xd40a91d8 = &lg_GetMetaData(SDB *sdb, const char *id, SECItem *item1, SECItem *item2) sdb_PutMetaData = 0xd40a9318 = &lg_PutMetaData(SDB *sdb, const char *id, const SECItem *item1, const SECItem *item2) sdb_Begin = 0xd40b5320 = &lg_Begin() sdb_Commit = 0xd40b53a8 = &lg_Commit() sdb_Abort = 0xd40b5438 = &lg_Abort() sdb_Reset = 0xd40a9438 = &lg_Reset(SDB *sdb) sdb_Close = 0xd40b5bb0 = &lg_Close() sdb_SetForkState = 0xd40b5b60 = &lg_SetForkState() }
So, it appears that the stack you found with the debugger matches the stack I reconstructed with MXR. Question: why do you think that handle should be NULL at line 2956 in NSC_GetTokenInfo?
Never mind. I was looking at the wrong slot. More info to come...
Alexei figured out a lot more about this. The crash occurs if the program calls NSS_Initialize with the value NSS_INIT_COOPERATE on the first call to NSS_Initialize. If we change the code to not set that flag, the crash goes away. This is only with 3.12, not 3.11. The flag causes NSS to NOT call C_Finalize before unloading the module. Somehow, with this flag set, NSS_Shutdown returns SECSuccess, but has not called C_Finalize, and also apparently has not successfully unloaded the module. NSS_Shutdown calls SECOID_Shutdown, which zaps the secoid hash table. In the second NSS_Init, the module thinks it is still initialized, so it does not initialize again. Consequently, it does not call SECOID_Init. So it goes on to use the SECOID table without initializing it, believing it to still be initialized, and crashes. As yet unansweredd questions: a) why does the softoken module NOT get unloaded by NSS_Shutdown ? b) why is this only a problem in 3.12.3 and not in 3.11.4 ?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hmm.... I always thought NSS_INIT_COOPERATE was a bad idea. So, are we sure softoken does not get unloaded in 3.12.3? The part of the issue may be that it does. OK, I think I may have a clear picture of what is going on here. In 3.11.4 softoken was statically linked. Because the application had a reference leak, the Shutdown failed, so C_Finalize was never called. After the fork(), there was not fork check, so softoken still looked initialized and would fail with CKR_ALREADY_INITIALIZED. NSS considers failure to initialize softoken a fatal error and failes NSS_Initialize(). The developers stumble onto the NSS_INIT_COOPERATE flag, which allows NSS to just use an already initialized softoken. Setting that flag causes the problems to go away. The flag prevents C_Finalize from being called still, but now NSS accepts an 'Already initialized' token as 'usable'. Alternately the app is a mixed app where it may call some other program that opens the softoken PKCS #11 module directly (Completely evil IMHO, but something that some sun engineers thought might have been a good idea). In this case it really needs the NSS_INIT_COOPERATE semantic, and that was just masking the fact that they did a fork(). Now we have NSS 3.12.3. NSS 3.12.3 does 3 things that thwart the magic semantics that allowed the app to work in 3.11.4. First it loads and unloads softoken dynamically. This is fine when C_Finalize is called before the unload. Second, it does the fork() check. We will get a different error message if the call C_Initialize after the fork() now. Finally, softoken is no longer statically linked with libutil, it is dynamically linked. The exact steps that cause the crash is still a bit mysterious to me. Is there something funny about the heap when we unload softoken without calling C_Finalize. The ultimate question about the crash why is the secoid hash table smashed? I believe we could get back to exactly 3.11 semantics for this application as follows: 1) the app needs to turn off the fork handler(). 2) add the following patch to NSS. RCS file: /cvsroot/mozilla/security/nss/lib/pk11wrap/pk11load.c,v retrieving revision 1.25 diff -u -p -r1.25 pk11load.c --- pk11load.c 2 Oct 2008 00:56:15 -0000 1.25 +++ pk11load.c 25 Sep 2009 16:32:47 -0000 @@ -469,7 +469,7 @@ SECMOD_UnloadModule(SECMODModule *mod) { * if not, we should change this to SECFailure and move it above the * mod->loaded = PR_FALSE; */ if (mod->internal) { - if (0 == PR_AtomicDecrement(&softokenLoadCount)) { + if (0 == PR_AtomicDecrement(&softokenLoadCount) && finalizeModules) { if (softokenLib) { disableUnload = PR_GetEnv("NSS_DISABLE_UNLOAD"); if (!disableUnload) { I don't think this is the right overall solution. While NSS_INIT_COOPERATE is really fundamentally counter to being able to handle proper PKCS #11 fork() semantics (OK, the whole idea that two different pieces of code can use a single PKCS #11 module in the same address space without knowing about each other is ludicrous without a change in the PKCS #11 implementation), there may be some apps that got this to mostly work by the fact they convinced all user to unload their module. I think we need to see why unloading the softoken module without a finalize is thrashing some of our data structures. If we can figure that out, we should be able to allow the app to work without turning off the fork() check. bob
Oh... I think I have it.... In NSS 3.11 not only did we never unload the softoken, softoken was linked with it's own copy of lib libutil. Each component initialized and uninitialized the oid system separately. Now they share a single copy. Let's assume that the unload is failing for some reason in NSS 3.12. (which my above code will guarrentee, so it won't fix anything). This will make sure that softoken will think it is still initialized, so it won't actually run through it's NSC_Initialize function. NSS, however, will happily shutdown the SECOID system in NSS_Shutdown(). In NSS 3.11, this was not a problem (since softoken had it's own copy). In NSS 3.12, it now shuts down OIDS for both softoken and nssdbm. If softoken doesn't re-initalize, the OID system will not init. It's in NSC_Initialize the the legacy_Open function is called. That is the only place SECOID_Init is called in the nssdbm code. NSS's own SECOID_Init() call happens *After* we load all the tokens. We can prevent this crash be moving that init call to before we load any modules in nss_Initialize. I also suggest that we also move the SECOID_Shutdown call to after we shutdown the modules --- and call it only if the modules shutdown cleanly. bob
> Now they share a single copy. Thanks Bob. Now it make sense, why 3.11 didn't have a crash. > We can prevent this crash be moving that init call to before we load any > modules in nss_Initialize. Right, but this is not enough. The nss_shutdown as it written today will leave unfinalized softoken module without underlining secoid table. Any application will crash in an attempt to access the module. > I also suggest that we also move the SECOID_Shutdown call to after we shutdown > the modules --- and call it only if the modules shutdown cleanly. So, the changes in NSS_Shutdown must be a part of the patch. The current code does not treat as an error the case when leaving unfinalized module behind. 454 SECStatus 455 SECMOD_UnloadModule(SECMODModule *mod) { 456 PRLibrary *library; 457 char *disableUnload = NULL; 458 459 if (!mod->loaded) { 460 return SECFailure; 461 } 462 if (finalizeModules) { 463 if (!mod->moduleDBOnly) PK11_GETTAB(mod)->C_Finalize(NULL); 464 } We can add "else" at line 464 and set an error that we will check later. This will work if no other code overwrites the set error. If it does, then nss will still clean up the secoid table. Bob, it is true that if nss uses the cooperate flag, but initializes an additional module, then that module will not be finalized as well. Seem that nss may leak when the flag is used. But luckily, it is a rear case.
Alexei, Bob, perhaps we can change SECOID_Init() and SECOID_Shutdown() to a reference-count semantics now that the secoid tables are shared by multiple shared libraries?
> We can add "else" at line 464 and set an error that we will check later. This > will work if no other code overwrites the set error. If it does, then nss will > still clean up the secoid table. yeah, sigh. I would be tempted just to punt on freeing the secoid table at that point. Softoken will free it when it finalizes. > Bob, it is true that if nss uses the cooperate flag, but initializes an > additional module, then that module will not be finalized as well. The flag is trying to do something fairly unnatural. Leaking is the least of our worries;).
re wtc: That's probably the cleanest solution. bob
It's clear to me that there's a pattern here. Many of the init/finalize function pairings in NSS have this same problem. Examples include NSS_Init / NSS_Shutdown C_Initialize / C_Finalize SECOID_INit / SECOID_Shutdown There are many more examples. They all have the property that, when they were designed, it was assumed that there would be only one caller, an assumption that has subsequently become false as software has become more complex and multiple callers can exist who don't know about one another. I think that they all have the same problem, and the quick-and-easy approaches to fixing them (such as mere reference counting) have the same problems with all of them. Ref Counting works for APIs that are designed to use that from the start, but is a problem when added to APIs after the fact, APIs that did not initially use it. I think we may need to do something similar to what Bob has recently done for NSS_Init. That is, add a new version of init that returns a unique handle, and a new finalize that requires that handle to be unique. Then references can be associated with the unique handles. I think this is the solution to the PKCS#11 problem too, the one that Julien tried to address with NSS_INIT_COOPERATE. Unfortunately, I'm afraid pushing that change through the PKCS#11 WG will be a Herculean task.
Assignee: nelson → alexei.volkov.bugs
Attachment #405925 - Flags: review?(nelson)
Attachment #405925 - Attachment is patch: true
Attachment #405925 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 405925 [details] [diff] [review] Patch v1 - return uniq handles to context and use reference-count semantics Unlike NSS_Initialize and NSS_Shutdown, SECOID_Init and SECOID_Shutdown are only used by NSS itself (shared libraries and a single command, blapitest, which doesn't use NSS shared libraries), and they're only recently exported in NSS 3.12 (from the new nssutil3 library). A simple change to add reference-counting semantics to SECOID_Init and SECOID_Shutdown will suffice. Adding new functions SECOID_InitContext and SECOID_ShutdownContext is an overkill, and will merely make it hard on ourselves. The SECOIDInitContext structure is not useful unless you envision different NSS shared libraries will have different OID tables.
Attachment #405925 - Flags: review?(nelson) → review-
Comment on attachment 405925 [details] [diff] [review] Patch v1 - return uniq handles to context and use reference-count semantics r- because the modified blapitest.c won't compile. In addition, I have a few suggestions that might be improvements. 1. immediately after each call to SECOID_ShutdownContext, set the value of the corresponding static context pointer to NULL. 2. Immediately before each call to SECOID_InitContext, check and make sure that the corresponding static context pointer is NULL.
I also think we can fix this without adding a SECOID_InitContext and SECOID_ShutdownContext. Before we add all those, we should have a patch that does not require changes to softoken and nssdbm. It can be as simple as simply moving the init/shutdown in NSS init to being as complete as adding reference counts to SECOID_Init as wtc suggests. Please do not submit a patch that tries to change softoken or softoken/legacydb bob
(In reply to comment #15) Understand that the patch is not suitable for 3.12.x because of the changes in the code inside fips crypto boundary. > Please do not submit a patch that tries to change softoken or softoken/legacydb Why not? I don't think we should impose 3.12 post fips validation rules on the next major release.
Alexei: I object to the patch even without the FIPS crypto boundary issue. The solution should match the scope of the problem. This problem is recently introduced in NSS 3.12 and only affects NSS itself. So we should use a simple reference-counting solution. Otherwise we are just making it hard for ourselves.
1) So it seems to me the priority is to fix 3.12, at least first. 2) As Wan-Teh pointed out there are solutions that does not require mucking with the FIPS boundary. If such a solution works in both 3.12 and 3.13, we should adopt it. I would be OK with a solution that changes code in the FIPS boundary if: 1) the 3.12 fix is already available, and 2) a compelling reason is given why a general fix that doesn't change the FIPS boundary is deficient. bob
It is possible to use atomic function to adjust ref-count, but it seems not to be valuable in this case since SECOID init and shutdown functions are not capable to work in non-thread safe environment. I wonder if adding use of atomic function will convey wrong impression about init and shutdown functions to the people who are going to use them. I prefer not to use atomic function, but what is your opinion, Wan-Teh, Bob?
The use of atomic functions is valuable after the first SECOID_Init function is called. I would still use atomic functions. Note: SECOID_Init is really hidden underneath the current NSS_Initialize and the new NSS_InitContext function, which could be called simultaneously by multiple threads.
Wan-Teh, Nelson, thanks for suggestion. Please review.
Attachment #405925 - Attachment is obsolete: true
Attachment #407557 - Flags: review?(wtc)
Attachment #407557 - Attachment is obsolete: true
Attachment #407557 - Flags: review?(wtc)
Comment on attachment 407557 [details] [diff] [review] Patch v2 - simple reference-counting solution Wrong patch.
Attachment #407558 - Flags: review?(wtc)
Comment on attachment 407558 [details] [diff] [review] Patch v2 - simple reference-counting solution >+static int secoidInitCount; Nit: Use PRInt32 instead to match the prototype of PR_AtomicIncrement and PR_AtomicDecrement. >+ if (secoidInitCount) { >+ PR_AtomicIncrement(&secoidInitCount); >+ return SECSuccess; >+ } This is wrong. You should never test secoidInitCount directly. Instead, test the return value of PR_AtomicIncrement. This should read: if (PR_AtomicIncrement(&secoidInitCount) > 1) { /* Already initialized. */ return SECSuccess; } >@@ -1920,6 +1927,7 @@ SECOID_Init(void) > } > > PORT_Assert (i == SEC_OID_TOTAL); >+ PR_AtomicIncrement(&secoidInitCount); This should be removed. > SECStatus > SECOID_Shutdown(void) > { >+ secoidInitCount = PR_AtomicDecrement(&secoidInitCount); >+ if (secoidInitCount > 0) { >+ return SECSuccess; >+ } Same problem here: you can't test secoidInitCount directly. Test the return value of PR_AtomicDecrement instead. This should read: if (PR_AtomicDecrement(&secoidInitCount) > 0) { return SECSuccess; }
Attachment #407558 - Flags: review?(wtc) → review-
>> SECStatus >> SECOID_Shutdown(void) >> { >>+ secoidInitCount = PR_AtomicDecrement(&secoidInitCount); >>+ if (secoidInitCount > 0) { >>+ return SECSuccess; >>+ } > > Same problem here: you can't test secoidInitCount directly. > Test the return value of PR_AtomicDecrement instead. This > should read: > > if (PR_AtomicDecrement(&secoidInitCount) > 0) { > return SECSuccess; > } Note that your previous code (in the shutdown case) would have been acceptable if you used a temporary instead of the atomic variable itself (e.g.): SECStatus SECOID_Shutdown(void) { + PRInt32 mySecoidInitCount = PR_AtomicDecrement(&secoidInitCount); + if (mySecoidInitCount > 0) { + return SECSuccess; + }
Priority: -- → P1
Target Milestone: --- → 3.12.5
OK, I think I've fixed most of the problems. I think we agree that code will operate with an assumption, that the first all to SECOID_Init will be done from thread safe env. One thing that I'm not sure, is the code in SECOID_Shutdown right after atomic decrement function. At that point I need to decide what to do, if caller has called SECOID_Shutdown too many times. Nelson helped me out by suggesting to restore the value of secoidInitCount as it was before we decremented it and return a failure.
Attachment #407558 - Attachment is obsolete: true
Attachment #410040 - Flags: superreview?(rrelyea)
Attachment #410040 - Flags: review?(wtc)
Comment on attachment 410040 [details] [diff] [review] Patch v3 - simple reference-counting solution r+ I do have some comments. There may be some races still, but I don't think they are possible in real life: Caller 1 calls SECOID_Init(). -- initiallization starts/task switch. Caller 2 calls SECOID_Init(), it returns and then calls a SECOID function without the hashtables set up (boom). Second race: Caller 1 has the oid table initialized. He is the last user and calls SECOID_Shutdown. We decrement the count and get the value *task switch* Caller 2 call SECOID_Init(). Count is zero so he proceeds to initialize. We not the hash tables already initialized, asserts and returns an error. I don't think Race 1 is possible in our current code unless we are racing init's (which is a completely different problem we are trying to solve). We have a comment in the code to this effect, so I think we can ignore this race. Race 2 appears to be possible in the case where we don't have a clean shutdown. NSS shuts down, but the PKCS #11 module hasn't been Finalized because of an outstanding reference. That reference is freed on another thread at the same time as the process calls NSS_Init again. This is also unlikely. outstanding references seem to be lost references, not references that may come back. RE handling multi shutdowns. I think your handling is fine. There is little we can do about a pathelogical situation. If recovery is possible, your code does the right think. I may suggest an Assert there, though. Debug builds should warn the caller something is horribly wrong -- fix it;). bob
Attachment #410040 - Flags: superreview?(rrelyea) → superreview+
Comment on attachment 410040 [details] [diff] [review] Patch v3 - simple reference-counting solution r=wtc. In general you should avoid fixing style issues because such whitespace changes make the patch larger and obscure the actual changes. In lib/util/secoid.c: > for ( i = 0; i < SEC_OID_TOTAL; i++ ) { >- oid = &oids[i]; >+ PLHashEntry *entry; >+ const SECOidData *oid = &oids[i]; The indentation of these two lines is different from the the other top-level statements inside this for loop. I know their indentation is wrong (should be 4), but it's more important to be consistent. >+loser: >+ /* Initial call of SECOID_Init is from thread safe env. At this point >+ * secoidInitCount is equal to 1. SECOID_Shutdown will restore initial >+ * value of globals and set secoidInitCount to 0. */ >+ SECOID_Shutdown(); >+ PORT_SetError(SEC_ERROR_LIBRARY_FAILURE); >+ return SECFailure; We should move the PORT_Assert(0) calls here, too. In lib/util/secoid.h: >+ * Caller application must ensure that initial call of the function is done >+ * from thread safe environment. Remove "application" because the caller may be a library. Did you really mean an application?
Attachment #410040 - Flags: review?(wtc) → review+
> >+loser: > >+ /* Initial call of SECOID_Init is from thread safe env. At this point > >+ * secoidInitCount is equal to 1. SECOID_Shutdown will restore initial > >+ * value of globals and set secoidInitCount to 0. */ > >+ SECOID_Shutdown(); > >+ PORT_SetError(SEC_ERROR_LIBRARY_FAILURE); > >+ return SECFailure; > > We should move the PORT_Assert(0) calls here, too. I didn't move assert thinking that it would be valuable to know a specific place where a problem has occurred. > > In lib/util/secoid.h: > > >+ * Caller application must ensure that initial call of the function is done > >+ * from thread safe environment. > > Remove "application" because the caller may be a library. > Did you really mean an application? Well, these are public functions now. We initialize it in blapitest.c:441 before decoding - giving an example of how to do so. It is statically linked test program, but if an application would use only utils, then it should not be so. I can imagine, that an application at least may do the same thing: initialize the oid table and use it for verification of a data.
> I may suggest an Assert there, though. Debug builds should > warn the caller something is horribly wrong -- fix it;). The code asserts that initCount>=0 after decrementing secoidInitCount.
patch integrated
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Today I was informed, that web server integration tests failed due to the patch. They run modutil -list -nocertdb -dbdir nss We assert at shutdown with the stack: #4 0x003a1b9b in PR_Assert (s=0x10631d "initCount >= 0", file=0x106286 "secoid.c", ln=2085) at ../../../../pr/src/io/prlog.c:563 #5 0x00100756 in SECOID_Shutdown () at secoid.c:2085 #6 0x005724f2 in nsc_CommonFinalize (pReserved=0x0, isFIPS=0) at pkcs11.c:2752 #7 0x005726a8 in NSC_Finalize (pReserved=0x0) at pkcs11.c:2815 #8 0x001ffa12 in SECMOD_UnloadModule (mod=0x808c10) at pk11load.c:553 #9 0x0021aba2 in SECMOD_SlotDestroyModule (module=0x808c10, fromSlot=1) at pk11util.c:872 #10 0x002163ca in PK11_DestroySlot (slot=0x80c600) at pk11slot.c:451 #11 0x002163fa in PK11_FreeSlot (slot=0x80c600) at pk11slot.c:464 #12 0x0021aaa7 in SECMOD_DestroyModule (module=0x808c10) at pk11util.c:843 #13 0x0021ac04 in SECMOD_DestroyModuleListElement (element=0x4062d0) at pk11util.c:889 #14 0x0021ac37 in SECMOD_DestroyModuleList (list=0x4062d0) at pk11util.c:905 #15 0x00219845 in SECMOD_Shutdown () at pk11util.c:99 #16 0x001d101c in NSS_Shutdown () at nssinit.c:894 #17 0x00003b33 in main (argc=5, argv=0xbffff254) at modutil.c:917 The problem is an imbalanced number of callers to SECOID_Init and SECOID_Shutdown: due to -nocertdb argument, legacy_Open function that suppose to call SECOID_Init, is not called. http://mxr.mozilla.org/security/source/security/nss/lib/softoken/pkcs11.c#2145. Yet, we unconditionally call sftkdb_Shutdown from nsc_CommonFinalize. The missing stack: #0 SECOID_Init () at secoid.c:1865 #1 0x00688486 in legacy_Open (configdir=0x4062e0 "nss", certPrefix=0x4062f0 "", keyPrefix=0x406300 "", certVersion=8, keyVersion=3, flags=1, certDB=0xbfffebcc, keyDB=0xbfffebd0) at lginit.c:608 #2 0x0056aa70 in sftkdbCall_open (dir=0x4062e0 "nss", certPrefix=0x4062f0 "", keyPrefix=0x406300 "", certVersion=8, keyVersion=3, flags=1, isFIPS=0, certDB=0xbfffebcc, keyDB=0xbfffebd0) at lgglue.c:362 #3 0x00590ab7 in sftk_DBInit (configdir=0x4062e0 "nss", certPrefix=0x4062f0 "", keyPrefix=0x406300 "", updatedir=0x406280 "", updCertPrefix=0x0, updKeyPrefix=0x0, updateID=0x406290 "", readOnly=1, noCertDB=0, noKeyDB=0, forceOpen=0, isFIPS=0, certDB=0xbfffec94, keyDB=0xbfffec90) at sftkdb.c:2585 #4 0x005713b5 in SFTK_SlotReInit (slot=0x414b50, configdir=0x4062e0 "nss", updatedir=0x406280 "", updateID=0x406290 "", params=0x406668, moduleIndex=0) at pkcs11.c:2148 #5 0x00571733 in SFTK_SlotInit (configdir=0x4062e0 "nss", updatedir=0x406280 "", updateID=0x406290 "", params=0x406668, moduleIndex=0) at pkcs11.c:2260 #6 0x00572383 in nsc_CommonInitialize (pReserved=0xbfffee04, isFIPS=0) at pkcs11.c:2648 #7 0x0057242c in NSC_Initialize (pReserved=0xbfffee04) at pkcs11.c:2710 #8 0x001fed3a in secmod_ModuleInit (mod=0x808c10, reload=0xbfffef20, alreadyLoaded=0xbfffee6c) at pk11load.c:214 #9 0x001ff6b8 in secmod_LoadPKCS11Module (mod=0x808c10, oldModule=0xbfffef20) at pk11load.c:460 #10 0x0020d0a2 in SECMOD_LoadModule (modulespec=0x404da0 " name=\"NSS Internal PKCS #11 Module\" parameters=\"configdir='nss' certPrefix='' keyPrefix='' secmod='' flags=readOnly updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescript"..., parent=0x805c10, recurse=1) at pk11pars.c:775 #11 0x0020d155 in SECMOD_LoadModule (modulespec=0x403410 "name=\"NSS Internal Module\" parameters=\"configdir='nss' certPrefix='' keyPrefix='' secmod='' flags=readOnly updatedir='' updateCertPrefix='' updateKeyPrefix='' updateid='' updateTokenDescription='' \" N"..., parent=0x0, recurse=1) at pk11pars.c:804 #12 0x001d04e2 in nss_Init (configdir=0x300c0 "nss", certPrefix=0x22394 "", keyPrefix=0x22394 "", secmodName=0x0, updateDir=0x2f76d4 "", updCertPrefix=0x2f76d4 "", updKeyPrefix=0x2f76d4 "", updateID=0x2f76d4 "", updateName=0x2f76d4 "", readOnly=1, noCertDB=0, noModDB=0, forceOpen=0, noRootInit=0, optimizeSpace=0, noSingleThreadedModules=0, allowAlreadyInitializedModules=0, dontFinalizeModules=0) at nssinit.c:537 #13 0x001d08e2 in NSS_Initialize (configdir=0x300c0 "nss", certPrefix=0x22394 "", keyPrefix=0x22394 "", secmodName=0x0, flags=1) at nssinit.c:658 #14 0x0000336d in init_crypto (create=0, readOnly=1) at modutil.c:682 #15 0x00003877 in main (argc=4, argv=0xbffff268) at modutil.c:862
we may have to back this patch out in order to get 3.12.5 shipped with the TLS fix this week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bob, we may want to move SECOID_Init into the glue layer. I think it will be ok to initialize secoid table any time we load legacy library and define callbacks for it. I'm talking about moving the init call into sftkdbLoad_Legacy function an place it in the location after all symbols on the library have been loaded. Shutdown call would go to sftkdbCall_Shutdown right before unloading a library. What do you think?
That may be a good idea, but I doubt we'll do it in 3.12.x because of FIPS.
Because of the extremely urgent need to get the TLS renegotiation vulnerability fix released, this bug must be entirely fixed or the patch backed out by close of business California time today. Sorry.
patch in attachment 410040 [details] [diff] [review] is backed out.
Blocks: FIPS2010
We have a new topcrash on the trunk (#12) - http://tinyurl.com/23m9tzj links to the crash reports, which seem to match some of the stacks in Comment 0 posted by Nelson. Would this bug cover these crashes or should a file a new bug? All of the bug comments indicated they were checking for addons updates when they crashed.
Alexei, this was a P1 bug for Sun, and was never fixed. Do you still want to fix it?
Target Milestone: 3.12.5 → 3.12.8
Severity: normal → critical
Summary: Crash in secoid code, trying to reinitialize after failed NSS shutdown → Crash in secoid code, trying to reinitialize after failed NSS shutdown [@ PL_HashTableLookupConst ]
blocking2.0: --- → ?
moved to #3 top crash for firefox 4.0b5 guessing this is a new bug, or maybe new changes that exposed an old bug. either way we need to figure out a way to get this new topcrash fixed. The new stacks look like Frame Module Signature [Expand] Source 0 plds4.dll PL_HashTableLookupConst nsprpub/lib/ds/plhash.c:381 1 nssutil3.dll SECOID_FindOID_Util security/nss/lib/util/secoid.c:1997 2 nssutil3.dll SECOID_FindOIDTag_Util security/nss/lib/util/secoid.c:2013 3 nssutil3.dll SECOID_GetAlgorithmTag_Util security/nss/lib/util/secalgid.c:49 4 nss3.dll seckey_ExtractPublicKey security/nss/lib/cryptohi/seckey.c:1044 5 xul.dll nsDataSignatureVerifier::VerifyData security/manager/ssl/src/nsDataSignatureVerifier.cpp:87 6 xul.dll NS_InvokeByIndex_P xpcom/reflect/xptcall/src/md/win32/xptcinvoke.cpp:102 7 xul.dll js::InvokeCommon<int > js/src/jsinterp.cpp:566 8 xul.dll js::Invoke js/src/jsinterp.cpp:696 9 xul.dll js::Interpret js/src/jsinterp.cpp:4707 10 xul.dll js::InvokeCommon<int > js/src/jsinterp.cpp:577 11 xul.dll js::Invoke js/src/jsinterp.cpp:696 12 xul.dll js::InternalInvoke js/src/jsinterp.cpp:736 13 xul.dll nsXPCWrappedJSClass::CallMethod js/src/xpconnect/src/xpcwrappedjsclass.cpp:1688 14 xul.dll nsXPCWrappedJS::CallMethod js/src/xpconnect/src/xpcwrappedjs.cpp:570 15 xul.dll PrepareAndDispatch xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:114 16 xul.dll SharedStub xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:141 17 xul.dll nsDOMEventListenerWrapper::HandleEvent content/events/src/nsDOMEventTargetHelper.cpp:65 18 xul.dll nsEventListenerManager::HandleEventSubType content/events/src/nsEventListenerManager.cpp:1112 nss was updated in july, 2010-07-19 07:45 +0200 Kai Engert - Bug 575620, landing NSS 3.12.7 beta 2 and the js code further down in the the stack has changes in august. if its addons manager as suggested by user comments that code has been changing too. probably need to try and isolate this to a regression range where we saw and uptick or volume increase.
we went from 0-4 crashes per day during aug 1-17 on older releases to 56 crashes per day on aug 18. then a spike yesterday to 243 crashes as beta5 went live to a larger testing pool. 20100816 4 , 1 3.6.8 , 1 3.6.6 , 1 3.6 , 1 3.5.11 , 20100817 , 20100818 56 , 53 4.0b5pre , 2 3.6.8 , 1 3.5.11 , 20100819 40 , 36 4.0b5pre , 2 3.6.8 , 1 3.5.7 , 1 3.5.11 , 20100820 30 , 29 4.0b5pre , 1 3.6.7 , 20100821 25 4.0b5pre 25 , 20100822 15 , 12 4.0b5pre , 1 3.7a5 , 1 3.6.3 , 1 3.5.11 , 20100823 30 , 29 4.0b5pre , 1 3.6.8 , 20100824 19 , 17 4.0b5pre , 1 3.6.8 , 1 3.6.3 , 20100825 17 4.0b5pre 17 , 20100826 19 , 15 4.0b5pre , 2 3.6 , 2 3.5.11 , 20100827 22 , 19 4.0b5pre , 3 3.6.8 , 20100828 13 , 11 4.0b5pre , 1 3.6.8 , 1 3.1b2 , 20100829 21 , 15 4.0b5pre , 4 3.6 , 1 4.0b4 , 1 3.6.8 , 20100830 22 , 19 4.0b5pre , 1 3.6b5 , 1 3.6.8 , 1 3.6.6 , 20100831 28 , 13 4.0b5 , 11 4.0b5pre , 2 3.6.8 , 1 4.0b6pre , 1 4.0b4 , 20100901 57 , 40 4.0b6pre , 10 4.0b5 , 4 4.0b5pre , 2 3.5.11 , 1 4.0b4 , 20100902 42 , 25 4.0b6pre , 8 4.0b5 , 5 4.0b5pre , 2 3.5.11 , 1 3.6.8 , 20100903 33 , 19 4.0b6pre , 5 4.0b5pre , 5 4.0b5 , 1 4.0b4 , 1 3.6.8 , 20100904 28 , 17 4.0b6pre , 7 4.0b5pre , 3 4.0b5 , 1 3.6.9 , 20100905 23 , 16 4.0b6pre , 2 4.0b5pre , 2 3.6.8 , 1 4.0b5 , 1 4.0b4 , 20100906 18 , 13 4.0b6pre , 3 4.0b5pre , 2 4.0b5 , 20100907 243 , 220 4.0b5 , 14 4.0b6pre , 5 4.0b5pre , 2 3.5.5 , 1 3.6.8 ,
of those spiking crashes on the 08/18 these are the build breakdown 43 from 4.0b5pre2010081811 7 from 4.0b5pre2010081807 3 from 4.0b5pre2010081803
Of course, what really needs to be fixed is the non-NSS bug that held onto a reference to an NSS object that caused the NSS shutdown to fail. Even if NSS doesn't crash, if it cannot shutdown, then thereafter the application is in an inconsistent security state.. The application should not continue. For Firefox, the top priority should be to fix whatever code OUTSIDE of NSS is holding a reference, thereby preventing NSS from shutting down.
Whiteboard: See comment 43
Keywords: topcrash
See also bug 588931 comment 10 and bug 588511. I think those provide major clues about how this happens.
I believe bug 588511 will have fixed the Mozilla topcrash here, so this doesn't block.
blocking2.0: ? → -
Target Milestone: 3.12.8 → 3.12.9
Crash Signature: [@ PL_HashTableLookupConst ]
It's #5 top crasher in Fennec 8.0.
I filed bug 702307 to deal with the Gecko / Firefox Mobile aspect of this. Please comment in bug 702307 for anything regarding Firefox crashes.
Keywords: topcrash
Whiteboard: See comment 43, [mobile-crash] → [See bug 702307 for the Gecko crash with the same signature]
Whiteboard: [See bug 702307 for the Gecko crash with the same signature] → [See bug 702307 for the Gecko crash with the same signature], [mobile-crash]
Depends on: 716345
Crash Signature: [@ PL_HashTableLookupConst ] → [@ PL_HashTableLookupConst ] [@ PL_HashTableLookupConst | SECOID_FindOID_Util]
Summary: Crash in secoid code, trying to reinitialize after failed NSS shutdown [@ PL_HashTableLookupConst ] → Crash in secoid code, trying to reinitialize after failed NSS shutdown in SECOID_FindOID_Util

The bug assignee didn't login in Bugzilla in the last 7 months and this bug has priority 'P1'/severity 'critical'.
:beurdouche, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: alvolkov.bgs → nobody
Flags: needinfo?(bbeurdouche)

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: critical → S3
Flags: needinfo?(bbeurdouche)
Priority: P1 → P3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: