Closed Bug 237934 Opened 21 years ago Closed 21 years ago

nss_InitLock not atomic

Categories

(NSS :: Libraries, defect, P1)

3.3.4
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: julien.pierre)

Details

(Keywords: fixed1.7)

Attachments

(2 files, 1 obsolete file)

We ran into a problem on a multi-processor machine in the SSL code in ssl3con.c where a thread tries to PZ_Unlock(symWrapKeyLock) and asserts because another thread owns the lock . There is only one place in NSS where that lock is locked - earlier in the same function. The lock is created by the following sequence : /* atomically initialize the lock */ if (!symWrapKeysLock) nss_InitLock(&symWrapKeysLock, nssILockOther); I think what happened is that multiple threads created the lock, and that lock creation is not truly atomic. Here is the code for nss_InitLock in util, which has not changed in current versions of NSS : /* Given the address of a (global) pointer to a PZLock, * atomicly create the lock and initialize the (global) pointer, * if it is not already created/initialized. */ SECStatus __nss_InitLock( PZLock **ppLock, nssILockType ltype ) { static PRInt32 initializers; PORT_Assert( ppLock != NULL); /* atomically initialize the lock */ while (!*ppLock) { PRInt32 myAttempt = PR_AtomicIncrement(&initializers); if (myAttempt == 1) { *ppLock = PZ_NewLock(ltype); (void) PR_AtomicDecrement(&initializers); break; } PR_Sleep(PR_INTERVAL_NO_WAIT); /* PR_Yield() */ (void) PR_AtomicDecrement(&initializers); } return (*ppLock != NULL) ? SECSuccess : SECFailure; } I believe this code is missing a test, which I added below : if (myAttempt == 1) { if (!*ppLock) { *ppLock = PZ_NewLock(ltype); } (void) PR_AtomicDecrement(&initializers); break; } The reason for adding this test is the following case, which could occur even on a single processor machine : 1) thread 1 gets suspended inside the loop, right after *ppLock is tested for NULL, but before PR_AtomicIncrement executes 2) thread 2 executes, increments the counter to 1, creates the lock, decrements it to 0, and exits the function 3) thread 1 resumes execution, increments the counter to 1, overwrites the lock, and decrements the counter Besides the atomicity bug, I think this function has performance issues. It uses a single initializers static variable to serialize creation of all locks in the process, regardless of the lock address . This also brings the following question : why are we creating this global lock dynamically in SSL, rather than at initialization time ? There aren't many places in NSS fortunately where nss_InitLock is called so the atomicity and performance problems hopefully should be limited, but we have actually run into the atomicity issue several times in tests.
Assignee: wchang0222 → julien.pierre.bugs
Priority: -- → P1
Target Milestone: --- → 3.9.1
Julien's fix is an improvement over the current code. We should definitely check in Julien's fix. Since this piece of code does not use locks, its correctness on multiprocessor systems with weak memory consistency models (e.g., SPARC, Alpha, and PowerPC) requires careful review. So I can't say whether Julien's fix addresses all the thread-safety issues. It is best that this code be replaced by code that uses PR_CallOnce. The SSL library doesn't have an initialization routine, so we can't create these locks at initialization time.
The SSL library has neither a global init function nor a global shutdown function. Having such a pair of functions would have been a cleaner solution for MANY issues in libSSL, but would have necessitates changes to applications that were unwilling to make such changes. So, that solution is one of many things that has been waiting for that magical day when we do a "Major" revision of libSSL and can finally make API changes. Using PR_RunOnce is not a simple solution. PR_RunOnce requires its own structure of info, of which there must be a separate instance for each object that is to be initialized only once. So, to use PR_RunOnce as part of the implementation of nss_InitLock necessitates allocating an instance of that structure for each lock thus initialized, and that allocation may also be subject to races. One could envision creating a new structure which contains both an PR_RunOnce structure and a lock pointer. Each place where a lock is today allocated and initialized with nss_InitLock would need to be changed so that an instance of that structure is allocated instead of a mere instance of the lock pointer, and all subsequent references to that lock pointer would have to become references to the lock pointer member of that structure. This could probably be accomplished with #defines to minimize the number of lines of code that actually must be changed.
I filed the Sun-internal bug that Julien was investigating, and we've been discussing the __nss_InitLock cache coherency issues on IRC. When I read Wan-Teh's update, I looked at the source to PR_CallOnce and found the following: PR_IMPLEMENT(PRStatus) PR_CallOnce( PRCallOnceType *once, PRCallOnceFN func) { if (!_pr_initialized) _PR_ImplicitInitialization(); if (!once->initialized) { if (PR_AtomicSet(&once->inProgress, 1) == 0) { once->status = (*func)(); PR_Lock(mod_init.ml); once->initialized = 1; PR_NotifyAllCondVar(mod_init.cv); PR_Unlock(mod_init.ml); } else { PR_Lock(mod_init.ml); while (!once->initialized) { PR_WaitCondVar(mod_init.cv, PR_INTERVAL_NO_TIMEOUT); } PR_Unlock(mod_init.ml); } } return once->status; } Here, once->initialized is inspected outside of mod_init.ml. I believe this is unsafe on SMP platforms that don't guarantee store order; immediately after CPU0 executes func and sets once->initialized = 1, it's possible for CPU1's cache to contain once->initialized == 1 WITHOUT containing the result of func. For example, it could be possible for CPU1 to return from PR_CallOnce only to discover a NULL symWrapKeysLock. Do you agree that PR_CallOnce is potentially unsafe?
Chris, you are right. The idea is that if we use PR_CalllOnce for all one-time initializations, only PR_CallOnce needs to be fixed. The while loop in __nss_InitLock can be rewritten as follows: /* atomically initialize the lock */ while (!*ppLock) { PRInt32 oldValue = PR_AtomicSet(&initializers, 1); if (oldValue == 0) { *ppLock = PZ_NewLock(ltype); (void) PR_AtomicSet(&initializers, 0); break; } PR_Sleep(PR_INTERVAL_NO_WAIT); /* PR_Yield() */ } One will recognize this as a form of spin lock. It follows that Julien's fix is the "locked doublecheck" (or rather "doublechecked locking") idiom. I have seen other people use homegrown spinlocks to safely create global objects for a library that doesn't have an initialization function. SSL's need can be met with the new PR_CallOnceWithArg function (added in NSPR 4.3).
Is there a separate bug tracking the PR_CallOnce deficiency? If not, should I file one?
Chris, No, there isn't a bug yet on PR_CallOnce - please open one against NSPR and assigned to Wan-Teh so it can be tracked separately. As far as the place to do the initialization, for SSL server apps, there are session cache initialization APIs which must be called . That would be a place where we could initialize these locks . However, for SSL client apps, there is no place to do it until the first time the lock is needed .
OS: SunOS → Solaris
If you need this bug fixed in NSS 3.9.1, we better have a patch attached and review it soon.
r=wtc on Julien's fix in the description of this bug. (The rewrite I proposed in comment 4 can be applied to the trunk.) Julien, have you verified your fix? By the way, we should also fix nss_InitMonitor and nssRWLock_AtomicCreate.
Re: comment 4, please explain how PR_CallOnceWithArg solves any problem. It STILL necessitates a separate NSPR callonce struct for every lock, monitor, etc, that must be initialized.
Wan-Teh, Unfortunately this bug is not easy to reproduce due to the low probability. I haven't reproduced it myself - the web server team did. So verifying the fix may be very difficult. We would indeed like to have this fix in 3.9.1 . We are however having some schedule changes so we have more time to come up with an elegant patch. I will email you about this.
Sorry I wasn't clear about the rewrite. I was referring to using PR_AtomicSet instead of PR_AtomicIncrement and PR_AtomicDecrement. I was not referring to using PR_CallOnceWithArg.
This is Julien's fix, applied to nss_InitLock, nss_InitMonitor, and NSSRWLock_AtomicCreate.
Attachment #144787 - Flags: superreview?(MisterSSL)
Attachment #144787 - Flags: review?(julien.pierre.bugs)
Target Milestone: 3.9.1 → 3.9.2
Comment on attachment 144787 [details] [diff] [review] Proposed patch (checked in) r=nelsonb I'm certain this patch makes the code no less correct than before. It probably fixes the problem on most machines. I cannot say it fixes the problem on all machines.
Attachment #144787 - Flags: superreview?(julien.pierre.bugs)
Attachment #144787 - Flags: superreview?(MisterSSL)
Attachment #144787 - Flags: review?(julien.pierre.bugs)
Attachment #144787 - Flags: review+
Comment on attachment 144787 [details] [diff] [review] Proposed patch (checked in) This code is better than before. It's OK to check in for 3.9.1 . But let's keep this bug open for 3.9.2 if we find it does not fix the problem 100%.
Attachment #144787 - Flags: superreview?(julien.pierre.bugs) → superreview+
Comment on attachment 144787 [details] [diff] [review] Proposed patch (checked in) Requesting drivers' approval for Mozilla 1.7 final. Mozilla doesn't need this fix. However, we are certifying the NSS 3.9.1 release, and it is our desire that Mozilla 1.7 final uses a certified NSS release. This fix has a very low risk.
Attachment #144787 - Flags: approval1.7?
Julien, please check this in on the trunk (or tell me and I will do it). The branch is waiting for mozilla driver approval.
Comment on attachment 144787 [details] [diff] [review] Proposed patch (checked in) I checked in this patch on the NSS trunk (NSS 3.10). Checking in nsslocks.c; /cvsroot/mozilla/security/nss/lib/util/nsslocks.c,v <-- nsslocks.c new revision: 1.4; previous revision: 1.3 done Checking in nssrwlk.c; /cvsroot/mozilla/security/nss/lib/util/nssrwlk.c,v <-- nssrwlk.c new revision: 1.6; previous revision: 1.5 done
Comment on attachment 144787 [details] [diff] [review] Proposed patch (checked in) a=chofmann for 1.7
Attachment #144787 - Flags: approval1.7? → approval1.7+
Comment on attachment 144787 [details] [diff] [review] Proposed patch (checked in) I checked in this patch on the NSS_3_9_BRANCH (NSS 3.9.1) and NSS_CLIENT_TAG (Mozilla 1.7 final).
Attachment #144787 - Attachment description: Proposed patch → Proposed patch (checked in)
Keywords: fixed1.7
This patch initializes the lock during the server session cache initialization . This should prevent the issue we have seen from ever happening again . It doesn't solve the problem for client applications, however; or for servers that don't setup the SID cache (I'm not sure if that even works).
Attachment #149974 - Flags: superreview?(MisterSSL)
Attachment #149974 - Flags: review?(bugz)
OS: Solaris → All
Hardware: Sun → All
Nelson, could you please review the new patch I attached last week ? I would like to get it into 3.9.2 . Thanks !
Attachment #149974 - Flags: superreview?(bugz)
Attachment #149974 - Flags: superreview?(MisterSSL)
Attachment #149974 - Flags: review?(bugz)
Attachment #149974 - Flags: review?(MisterSSL)
Comment on attachment 149974 [details] [diff] [review] work around issue for SSL server applications I have some minor nits with this patch. They're minor enough that I will still give it r+, but I encourage you to consider changing them anyway. 1. The patch adds explicit null initialization to several static external variables. -static PZLock * symWrapKeysLock; +static PZLock * symWrapKeysLock = NULL; and -static sslSessionID *cache; -static PZLock * cacheLock; +static sslSessionID *cache = NULL; +static PZLock * cacheLock = NULL; These initializations are unnecessary and IMO undesirable. The c language GUARANTEES that all external (meaning: not declared within a function) and static variables are initialized to an all-zeros value. This is true on ALL platforms. It's a guarantee of the language, not of the OS. This guarantee was in K&R rev 1, before ANSI c. So, adding the explicit = NULL assignment doesn't change the logical result of this code at all. Either way, the value will be initialized to all zeros. But the explicit initialization does increase the size of the code. It causes the symbol to be linked into the initialized data segment, rather than in the BSS segment, and causes the initial value to be placed into the executable file from which the initialized data segment is initialized. (Each program and shared-library's BSS segment is always zeroed before that program or shared library begins execution). So, I wouldn't initialize any static or external variables to zero or NULL. 2. Please put a blank line after the closing brace of the new function ssl_InitSymWrapKeysLock(), so that the function will be clearly separated from the next one. 3. Please remove the XXX comments in the function formerly known as lock_cache, now known as ssl_InitCacheLock, since they are no longer true, due to these changes. 4. When "lock_cache" was static, it was apparent that the cache being locked was the client ssl session cache, since that is the only cache implemented in sslnonce.c. But now that the function is being made external (not static), and it being called from other source files, its name should clearly indicate exactly which of SSL several caches it affects. Please rename ssl_InitCacheLock to ssl_InitClientSessionCacheLock since that is what the function actually does.
Attachment #149974 - Flags: review?(MisterSSL) → review+
Nelson, 1. I was unaware the C language guaranteed this. If so, wouldn't the NULL / 0 initial assignment get optimized out, rather than additional code being generated to do a second assignment ? I don't mind taking the assignments out, but I prefer to have them in anyway as I don't consider them harmful. The minimal load-time performance impact and code growth are a small price to pay for supportability, IMO. When code gets moved around, the necessary assignment could very easily be overlooked. For this reason, I think it is acceptable to explicitly initialize all variables, including globals in this case. 2. done 3. done 4. lock_cache is still static, even in the patch . The diff doesn't show that well. The lock still only gets locked from one file . The init only used to happen at the time we locked, but now a separate init function got added, and the init of the lock happens earlier than the first time we try to lock it in SSL server apps. Nevertheless, I have changed the name of the lock init function as suggested.
Attached patch updated patchSplinter Review
Attachment #149974 - Attachment is obsolete: true
Attachment #151183 - Flags: review?(nelson)
Comment on attachment 151183 [details] [diff] [review] updated patch Initializing a static or external (global) variable with any value, including zero, causes it to be allocated in the initialized data segment, increasing the size of that segment. No executable instructions are generated to do that initialization, and hence no such instructions can be optimized out. Rather, the initial value of the entire initialized data segment becomes a part of the binary file. More initialized global/static variables == larger binaries. Download size of the NSS shared libs has become an issue for AOL. NSS is considered too big. So, I object to changes that unnecessarily increase the download size by increasing the size of the initialized data segment to hold zero values, especially if done on a large/wide scale in the code. (This patch is not large scale. I'm just speaking to other potential changes.) I disagree that explicitly initialized globals and statics are more maintainable than implicitly initialized ones. One merely has to become accustomed to thinking of globals and statics as implicitly initialized to zero. Having said that, I'll let it go for *this* patch.
Attachment #151183 - Flags: review?(nelson) → review+
checked this in to the tip . Checking in ssl3con.c; /cvsroot/mozilla/security/nss/lib/ssl/ssl3con.c,v <-- ssl3con.c new revision: 1.68; previous revision: 1.67 done Checking in sslimpl.h; /cvsroot/mozilla/security/nss/lib/ssl/sslimpl.h,v <-- sslimpl.h new revision: 1.35; previous revision: 1.34 done Checking in sslnonce.c; /cvsroot/mozilla/security/nss/lib/ssl/sslnonce.c,v <-- sslnonce.c new revision: 1.15; previous revision: 1.14 done Checking in sslsnce.c; /cvsroot/mozilla/security/nss/lib/ssl/sslsnce.c,v <-- sslsnce.c new revision: 1.30; previous revision: 1.29 done Also to NSS_3_9_BRANCH (but my CVS command output got overwritten - so it's not here).
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Attachment #149974 - Flags: superreview?(bugz)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: