Open Bug 462569 Opened 17 years ago Updated 2 years ago

NSS_Shutdown succeeds in child, even though parent called NSS_Initialize and forked

Categories

(NSS :: Libraries, defect, P3)

3.12

Tracking

(Not tracked)

People

(Reporter: julien.pierre, Unassigned)

Details

Attachments

(1 obsolete file)

This problem was discovered in the context of a process that forks while NSS is initialized. The parent process called NSS_Initialize, then forked. The child process attempted to call NSS_Shutdown immediately. This NSS_Shutdown call should have failed. But it reported success. What happens is that our softoken PKCS#11 module detects that a fork occured, and returns CKR_DEVICE_ERROR for every PKCS#11 call in the child process. Unfortunately, NSS_Shutdown completely ignores these failure codes, even though it makes several calls to the PKCS#11 module, and returns SECSuccess. The stacks that I have observed where the PKCS#11 error codes were ignored during shutdown are : [1] __lwp_kill(0x1, 0x6), at 0xd10cf597 [2] _ti_thr_kill(0x1, 0x6), at 0xd0fa9ddd [3] thr_kill(0x1, 0x6), at 0xd113164e [4] raise(0x6), at 0xd10e7d8d [5] abort(0xd0ddaf18, 0xd11736b9, 0x804764c, 0xd0d9b02d, 0xd0dc971c, 0xd0dc9720), at 0xd10d03fe =>[6] PR_Assert(s = 0xd0dc971c "0", file = 0xd0dc9720 "pkcs11.c", ln = 3324), line 554 in "prlog.c" [7] NSC_CloseAllSessions(slotID = 1U), line 3324 in "pkcs11.c" [8] PK11_DestroySlot(slot = 0x808d658), line 431 in "pk11slot.c" [9] PK11_FreeSlot(slot = 0x808d658), line 461 in "pk11slot.c" [10] SECMOD_DestroyModule(module = 0x8064ca8), line 779 in "pk11util.c" [11] SECMOD_DestroyModuleListElement(element = 0x8061740), line 825 in "pk11util.c" [12] SECMOD_DestroyModuleList(list = 0x8061740), line 841 in "pk11util.c" [13] SECMOD_Shutdown(), line 99 in "pk11util.c" [14] NSS_Shutdown(), line 888 in "nssinit.c" [15] main(argc = 2, argv = 0x804777c), line 99 in "tstforkscen.c" [1] __lwp_kill(0x1, 0x6), at 0xd10cf597 [2] _ti_thr_kill(0x1, 0x6), at 0xd0fa9ddd [3] thr_kill(0x1, 0x6), at 0xd113164e [4] raise(0x6), at 0xd10e7d8d [5] abort(0xd0ddaf18, 0xd11736b9, 0x804764c, 0xd0d9b02d, 0xd0dc971c, 0xd0dc9720), at 0xd10d03fe =>[6] PR_Assert(s = 0xd0dc971c "0", file = 0xd0dc9720 "pkcs11.c", ln = 3324), line 554 in "prlog.c" [7] NSC_CloseAllSessions(slotID = 2U), line 3324 in "pkcs11.c" [8] PK11_DestroySlot(slot = 0x808da68), line 431 in "pk11slot.c" [9] PK11_FreeSlot(slot = 0x808da68), line 461 in "pk11slot.c" [10] SECMOD_DestroyModule(module = 0x8064ca8), line 779 in "pk11util.c" [11] SECMOD_DestroyModuleListElement(element = 0x8061740), line 825 in "pk11util.c" [12] SECMOD_DestroyModuleList(list = 0x8061740), line 841 in "pk11util.c" [13] SECMOD_Shutdown(), line 99 in "pk11util.c" [14] NSS_Shutdown(), line 888 in "nssinit.c" [15] main(argc = 2, argv = 0x804777c), line 99 in "tstforkscen.c" current thread: t@1 [1] __lwp_kill(0x1, 0x6), at 0xd10cf597 [2] _ti_thr_kill(0x1, 0x6), at 0xd0fa9ddd [3] thr_kill(0x1, 0x6), at 0xd113164e [4] raise(0x6), at 0xd10e7d8d [5] abort(0xd0ddaf18, 0xd11736b9, 0x8047600, 0xd0d98f4d, 0xd0dc950c, 0xd0dc9510), at 0xd10d03fe =>[6] PR_Assert(s = 0xd0dc950c "0", file = 0xd0dc9510 "pkcs11.c", ln = 2660), line 554 in "prlog.c" [7] NSC_Finalize(pReserved = (nil)), line 2660 in "pkcs11.c" [8] SECMOD_UnloadModule(mod = 0x8064ca8), line 463 in "pk11load.c" [9] SECMOD_SlotDestroyModule(module = 0x8064ca8, fromSlot = 1), line 808 in "pk11util.c" [10] PK11_DestroySlot(slot = 0x808da68), line 448 in "pk11slot.c" [11] PK11_FreeSlot(slot = 0x808da68), line 461 in "pk11slot.c" [12] SECMOD_DestroyModule(module = 0x8064ca8), line 779 in "pk11util.c" [13] SECMOD_DestroyModuleListElement(element = 0x8061740), line 825 in "pk11util.c" [14] SECMOD_DestroyModuleList(list = 0x8061740), line 841 in "pk11util.c" [15] SECMOD_Shutdown(), line 99 in "pk11util.c" [16] NSS_Shutdown(), line 888 in "nssinit.c" [17] main(argc = 2, argv = 0x804777c), line 99 in "tstforkscen.c" There are multiple ways to deal with this problem. 1) fix these stacks so that the PKCS#11 error gets propagated all the way to the top level. I think I looked at a few of them before and concluded it would be difficult because some functions returned void and were public functions. It could still be done by adding new functions that return something else potentially. But this approach may or may not be the way to go. Firstable, these are only the current stacks being used in NSS_Shutdown today with this particular module. There may well be other stacks in other cases that are ignored as well. So it could be a hard problem to fix in the general case. Also, in theory, regardless of any error from the PKCS#11 module, NSS shutdown might still proceed successfully after unloading the PKCS#11 module from memory, assuming the OS hasn't refcounted the library in the process for some other reason, and it truly gets unloaded. Ie. in cases where the process hasn't been forked, the current behavior of NSS_Shutdown() ignoring PKCS#11 module errors may be desirable to keep. 2) since the NSS library doesn't currently support forking for reasons other than just PKCS#11, we could add fork detection at the NSS_Initialize / NSS_Shutdown level, like we did in softoken. This would ensure that we always get the status right and NSS_Shutdown always fails after a fork regardless of module status. This is much easier to implement and to get right than the #1 approach. On the downside, if we do decide do support forking with NSS in the future, we would have to make some changes. But I think it may require new APIs anyway, because the current behavior of NSS after fork is undefined Comments welcome.
Option 3, remove the check from NSC_Finalize and NSC_CloseAllSessions, allowing them to succeed in this case and in other common cases.
Option 3 would be a change in our softoken, and would only work for that particular module. It wouldn't resolve the issue for any other PKCS#11 module. NSS_Shutdown would still succeed when it shouldn't in the child if those other modules have errors shutting down, which they might, given that the sequence parent: C_Initialize fork() child: C_CloseAllSessions() C_CloseAllSessions() C_Finalize() is not expected to work by the PKCS#11 spec.
After further investigation of different scenarios, I am leaning towards option 1. I think the current behavior of ignoring the PKCS#11 errors in NSS_Shutdown() is not useful after all. Even though NSS_Shutdown() can dlclose() all the modules, that doesn't accomplish a proper shutdown of the module. It will only clear some of the module's state. Certain state like file handles still remains in the process and cached in libc. Because the fd's are refcounted in libc, after the PKCS#11 module is later re-loaded into the process and re-initialized, it may end up still reusing the same file handles that were previously in use on the next fopen() or equivalent call. Ie. the same handles are shared with the parent's process. While Option 2) is simple to implement, its scope is fairly large. We probably don't want to add a CHECK_FORK() in every single current and future exported NSS API . So, option 1 is probably better.
So PKCS #11 requires tokens to be tolerant of forced unloads. Calling shutdown is only required if you need to call init in the future (though it's also nice for the application to signal that it's through). unloading the library, even while running is supposed to be allowed (as long as you don't actually try to call into it). That is why C_Finalize failures are not considered fatal to shutdown. I violently oppose option 1. bob
Bob, We have applications that do multiple NSS_Init / NSS_Shutdown . The DS (which is our forking app in question) actually does that. Mozilla also does that during profile switch. When we unload a module in NSS_Shutdown, there is no way for us to tell that the application will not subsequently call NSS_Init and reload the same module. Even though NSS_Shutdown doesn't take flags, perhaps we can use NSS_Init flags to control the checks we want to have or not. There is already a flag to control whether to finalize or not : NSS_INIT_NOPK11FINALIZE . This was put in for Java in case Java had called C_Initialize on a module first, in order to avoid a double call to C_Finalize, or a call in the wrong order. But this flag merely controls whether to call C_Finalize, not whether to check its status.
> When we unload a module in NSS_Shutdown, there is no way > for us to tell that the application will not subsequently call NSS_Init and > reload the same module. I never said it did. I expect such an instance. If the PKCS #11 module can't handle the unload/reload it's a bug in the module. > Even though NSS_Shutdown doesn't take flags, perhaps we can use NSS_Init flags > to control the checks we want to have or not. This seems like a complete non sequiter to me. What does this have to do with shutdown returning an error on C_Finalize failures? bob
Bob, Re: comment 6, Here is a real-world situation we have seen with the DS : 1. parent dlopens module (as a result of NSS_Init) 2. parent C_Initializes module (as a result of NSS_Init) 3. parent forks 4. child C_Finalizes module (as a result of NSS_Shutdown). C_Finalizes fail and doesn't perform its cleanup since this is not supposed to work per the PKCS#11 spec. The result is ignored by NSS_Shutdown 5. child dlcloses module (as a result of NSS_Shutdown) . At that point, all state in the module is lost, including the variable that held the fork detection state. 6. child dlopens module (as a result of NSS_Init) 7. child calls C_Initialize on module (as a result of NSS_Init). At that point, it gets back the same DB file handle as the parent due to libc caching of fd's that weren't closed. This is the problem we know may happen with several PKCS#11 modules, not just our own softoken. I think in step 4, the problem cannot be considered to be a PKCS#11 module bug. The PKCS#11 spec doesn't say that C_Finalize should be able to perform proper cleanup in this situation - after a fork. How would you propose that we deal with this situation ?
Bob, Re: comment 6, In response to your second question, I was proposing we add a new NSS_Init flag to control whether NSS_Shutdown checks for C_Finalize status, since we already have flags to NSS_Init that actually control NSS_Shutdown behavior. Depending on the default behavior we decide on, this new flag could be called NSS_INIT_STRICT_C_FINALIZE or NSS_INIT_RELAXED_C_FINALIZE . For this to be helpful with detecting problems in precompiled binary apps at Sun, the default behavior would have to be strict (do the status check) and the optional flag would have to be "relaxed" (ie. skip the C_Finalize status check).
Julien, even if C_Finalize detects a fork and returns failure, it should close the DBs. We don't need a special flag for that.
> 6. child dlopens module (as a result of NSS_Init) > 7. child calls C_Initialize on module (as a result of NSS_Init). At that point, > it gets back the same DB file handle as the parent due to libc caching of fd's > that weren't closed. Then there needs to be an 'on unload' handler to take care of this problem. bob
Nelson, re: comment 9, we can try do that for own PKCS#11 module. That would be going above and beyond the PKCS#11 spec requirements. But we have no way of knowing what exactly other PKCS#11 modules will do, for example, but not limited to, Solaris crypto framework. Remember that this bug is not specifically about softoken, so any fix we make that's only in softoken cannot be the whole answer to this bug. Bob, Re: comment 4, can you cite where the PKCS#11 spec requires modules to be tolerant of forced unloads ? I would like to understand specifically why you think it's so important for NSS_Shutdown to ignore C_Finalize errors. What cases are we going to break if we do this that are important for us to support ? Even in the context of a single process and platforms other than Unix and no forking, I don't think the behavior can really guaranteed in general by modules for the following sequence. 1. load PKCS#11 lib 2. C_Initialize 3. C_Finalize -> failure 4. unload PKCS#11 lib 5. reload PKCS#11 lib 6. C_Initialize . If C_Finalize fails at 3, it's very possible some resources will be leaked. These may be file descriptors, shard libraries, heap, locks, etc. These problems may or may not be of consequence. Most likely step 4 will not resolve these problems. And these issues of resource leaks may very well create unexpected behavior in step 6. Re: comment 10, The PKCS#11 spec doesn't say anything about doing any clean up in shared lib unload handlers as far as I know. Indeed, this mechanism may not have been available on all platforms at the time the spec was written. It is probably available now on every platform. The file descriptors are not the only issue. There are plenty of others, like making softoken unload the other shared libraries we load on-demand like freebl, the glue, sqlite, etc. I'm not against adding unload handlers in softoken wherever we can. But it doesn't do anything to help detecting or resolving issues with other PKCS#11 libraries. Other PKCS#11 libraries have the same kind of issues that we do, and may not implement any unload handlers even if our softoken does, since doing so is not dictated by the PKCS#11 spec.
I would also add that IMO, when a module returns an error in C_Finalize, this is an indication to the application (here, NSS) that the module is not able to free its resources. Why would the module be able to free its resources at unload time in an handler and not in C_Finalize ? In fact, if we can figure out how to do it in softoken, we shouldn't need to bother with unload handlers - we can just do it in C_Finalize and make it succeed, rather than error. Right now C_Finalize is failing in the fork case on purpose to avoid deadlocks in freeing resources. This fork case is the main case we are concerned here with, but not necessarily the only case where C_Finalize can fail. IMO, NSS cannot really predict whether the module's resource leak indicated by the C_Finalize failure is going to cause issues after unload/reload/reinitialization of the module. So far, the evidence with several modules says that it probably will cause at least some issues. On that basis, I think it is desireable for NSS_Shutdown to at least have a mode of detecting these errors.
NSS is torn here between the competing (and diametrically opposed) goals of a) wanting to succeed as much as possible, making the calling application as successful as possible, and b) recognizing that NSS Softoken is often used as a surrogate for other PKCS#11 modules that do little or no more than the PKCS#11 specification (some version) specifies, and wanting to help apps that do that succeed with other non-NSS modules by not allowing them to become dependent on non-standard behavior. Clearly, NSS's former lack of fork detection led to sloppy application programs being written, apps that worked only with NSS because they were dependent on its willingness to keep on going after a fork. The fork detection steps added in NSS 3.12 help to mitigate that problem, somewhat, by detecting the problem and (in effect) telling the app child "sorry, we're not going to just keep on going like nothing was wrong." The question is, how far is it necessary to go to achieve that objective? I think that, in some sense, NSS 3.12 has gone too far. Here's an analogy that isn't perfect but may help. Our fork checks are intended to have an effect similar to a barrier to stop cars from entering the freeway going the wrong way. We want to detect that the car is going the wrong way as soon as we can and give negative feedback that should motivate the driver to stop. It is in our interest, and the driver's, to do that. But once the driver sees the error of his way, he may then wish to turn around and exit. We _could_ create a situation for the driver's car wherein we effectively say to him, "You have entered the freeway going in the wrong way, and now we're not going to let you move in any direction. Any attempt to move forwards, backwards or sideways will result in all your tires going flat immediately, and perhaps worse damage. Your only options are to leave in a tow truck or on a stretcher." Clearly, it is not in our interest, nor in the driver's, to do that. But I would argue that disallowing the cleanup and shutdown is doing exactly that. Granted that it may not be possible to clean up 100% at that point, but a 90% clean up is preferable to a zero % clean up. Ultimately we cannot stop the application from unloading the PKCS#11 module (unless we crash) and then reloading it. So, it is in our own interest to do as good a job at cleaning up as we can, under the circumstances, so that if it subsequently reloads our PKCS#11 module, our module will behave as well as it possibly can. We clearly CAN close the DB files in C_Finalize or at module unload time, and it helps no one to refuse to do so. If we leak a little memory from the heap, well, that's unfortunate but not fatal. Recognizing that locks could be a problem (a solvable one, but one that we have not yet solved) we could choose to try to detect when the parent was in a state that could have left locks in use, and avoid trying to clean up locks in those cases. More importantly (in my view), we can try to detect when the parent was NOT in a state that could have left locks in use, and clean them up in that case.
Attached patch patch v1 (obsolete) — Splinter Review
The type of the constant in the old code was "const PRUint32" which is the same sa "const unsigned int". The type of the new value is also "const unsigned int".
Attachment #346195 - Flags: review?(julien.pierre.boogz)
Attachment #346195 - Attachment is obsolete: true
Attachment #346195 - Flags: review?(julien.pierre.boogz)
Comment on attachment 346195 [details] [diff] [review] patch v1 D'oh! I attached this patch to the wrong bug. Sorry.
Nelson, re: comment 13, I think you meant to write "apps that worked only with NSS -softoken- because they were dependent on its willingness to keep on going after a fork". Several things have changed at the same time in NSS 3.12 that are relevant to this problem. a) One change is that fork checks were added to softoken. This is the fix for bug 331096. b) The other change is that libnss dlopens/dlcloses softoken, which it never used to do in 3.11. softoken used to be always implicitly loaded with NSS. This was different from other PKCS#11 modules, which were always dlopen'ed/dlclose'ed. That change, combined with the behavior in NSS_Shutdown that we are discussing now of ignoring C_CloseAllSessions / C_Finalize errors, has the result of partially defeating the effect of the softoken fork check when the child calls NSS_Shutdown. The application is not notified that there was any problem with the shutdown of the softoken, which was one of the intents of adding the fork checks. Unfortunately some information gets lost around the way. c) I believe other modules' fork checks (SCF) may also be defeated by NSS_Shutdown ignoring C_Finalize errors in the child. I don't really like car analogies and I'm sorry I don't find it all that helpful. Much of your comments are about how strictly the softoken fork check are currently implemented. I think those are by definition separate problems from the lack of error checking in the upper-level (libnss, above the PKCS#11 line). But, since you brought it up, there is a very good reason why the fork checks prevent both finalize, initialize - and all other PKCS#11 function calls - because those calls are subject to deadlocks after fork. I think it's both in our interest, and our users' interest, to keep the behavior strict, until we have dealt with the deadlock problem. Once we have, we can start relaxing the fork checks and allow more softoken PKCS#11 calls to succeed after fork. It doesn't make sense to relax these checks before, because there is nothing the user can do to reliably recover in this situation until we do so. I think that should be the subject of a separate bug about softoken - and I thought there was already one filed, at least in the case of the C_Initialize call. As far as how helpful a 90% vs 0% cleanup is to the app, I'm not certain. NSS_Shutdown returns a SECStatus which doesn't leave much room for reporting a status in between. IMO, by returning SECSuccess when the cleanup is in fact less than 100% (C_Finalize or C_CloseAllSessions failure), NSS_Shutdown is lying to the application. Those last few % of shutdown can make the difference between shared file handles and very serious database corruption problems - or even database removal, if I need to remind you. If this bug was only about softoken, we could fix all of it completely in softoken by making it always return CKR_OK in those 2 calls (which we probably eventually will do). Then NSS_Shutdown would never have to worry about a failure status. It could continue to ignore it. However, other PKCS#11 modules can still fail to shut down too, even if ours never fails and has a 100% success of cleanup.
Severity: normal → S3
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: