Closed Bug 237934 Opened 17 years ago Closed 17 years ago

nss_InitLock not atomic

Categories

(NSS :: Libraries, defect, P1)

3.3.4
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: julien.pierre)

Details

(Keywords: fixed1.7)

Attachments

(2 files, 1 obsolete file)

We ran into a problem on a multi-processor machine in the SSL code in ssl3con.c
where a thread tries to PZ_Unlock(symWrapKeyLock) and asserts because another
thread owns the lock .

There is only one place in NSS where that lock is locked - earlier in the same
function.

The lock is created by the following sequence :
    /* atomically initialize the lock */
    if (!symWrapKeysLock)
	nss_InitLock(&symWrapKeysLock, nssILockOther);

I think what happened is that multiple threads created the lock, and that lock
creation is not truly atomic. Here is the code for nss_InitLock in util, which
has not changed in current versions of NSS :

/* Given the address of a (global) pointer to a PZLock, 
 * atomicly create the lock and initialize the (global) pointer, 
 * if it is not already created/initialized.
 */

SECStatus 
__nss_InitLock(   PZLock    **ppLock, nssILockType ltype )
{
    static PRInt32  initializers;

    PORT_Assert( ppLock != NULL);

    /* atomically initialize the lock */
    while (!*ppLock) {
        PRInt32 myAttempt = PR_AtomicIncrement(&initializers);
        if (myAttempt == 1) {
	    *ppLock = PZ_NewLock(ltype);
            (void) PR_AtomicDecrement(&initializers);
            break;
        }
        PR_Sleep(PR_INTERVAL_NO_WAIT);          /* PR_Yield() */
        (void) PR_AtomicDecrement(&initializers);
    }

    return (*ppLock != NULL) ? SECSuccess : SECFailure;
}

I believe this code is missing a test, which I added below :

        if (myAttempt == 1) {
            if (!*ppLock) {
	       *ppLock = PZ_NewLock(ltype);
            }
            (void) PR_AtomicDecrement(&initializers);
            break;
        }

The reason for adding this test is the following case, which could occur even on
a single processor machine :

1) thread 1 gets suspended inside the loop, right after *ppLock is tested for
NULL, but before PR_AtomicIncrement executes

2) thread 2 executes, increments the counter to 1, creates the lock, decrements
it to 0, and exits the function

3) thread 1 resumes execution, increments the counter to 1, overwrites the lock,
and decrements the counter

Besides the atomicity bug, I think this function has performance issues. It uses
a single initializers static variable to serialize creation of all locks in the
process, regardless of the lock address .

This also brings the following question : why are we creating this global lock
dynamically in SSL, rather than at initialization time ? 

There aren't many places in NSS fortunately where nss_InitLock is called so the
atomicity and performance problems hopefully should be limited, but we have
actually run into the atomicity issue several times in tests.
Assignee: wchang0222 → julien.pierre.bugs
Priority: -- → P1
Target Milestone: --- → 3.9.1
Julien's fix is an improvement over the current code.
We should definitely check in Julien's fix.

Since this piece of code does not use locks, its
correctness on multiprocessor systems with weak
memory consistency models (e.g., SPARC, Alpha, and
PowerPC) requires careful review.  So I can't say
whether Julien's fix addresses all the thread-safety
issues.

It is best that this code be replaced by code that
uses PR_CallOnce.  The SSL library doesn't have an
initialization routine, so we can't create these
locks at initialization time.
The SSL library has neither a global init function nor a global shutdown 
function.   Having such a pair of functions would have been a cleaner 
solution for MANY issues in libSSL, but would have necessitates changes 
to applications that were unwilling to make such changes.  So, that 
solution is one of many things that has been waiting for that magical day
when we do a "Major" revision of libSSL and can finally make API changes.  

Using PR_RunOnce is not a simple solution.  PR_RunOnce requires its own
structure of info, of which there must be a separate instance for each 
object that is to be initialized only once.   So, to use PR_RunOnce as 
part of the implementation of nss_InitLock necessitates allocating an 
instance of that structure for each lock thus initialized, and that 
allocation may also be subject to races.  

One could envision creating a new structure which contains both an 
PR_RunOnce structure and a lock pointer.  Each place where a lock is
today allocated and initialized with nss_InitLock would need to be changed 
so that an instance of that structure is allocated instead of a mere 
instance of the lock pointer, and all subsequent references to that lock 
pointer would have to become references to the lock pointer member of 
that structure.   This could probably be accomplished with #defines 
to minimize the number of lines of code that actually must be changed.
I filed the Sun-internal bug that Julien was investigating, and we've been
discussing the __nss_InitLock cache coherency issues on IRC.

When I read Wan-Teh's update, I looked at the source to PR_CallOnce and found
the following:

PR_IMPLEMENT(PRStatus) PR_CallOnce(
    PRCallOnceType *once,
    PRCallOnceFN    func)
{
    if (!_pr_initialized) _PR_ImplicitInitialization();

    if (!once->initialized) {
        if (PR_AtomicSet(&once->inProgress, 1) == 0) {
            once->status = (*func)();
            PR_Lock(mod_init.ml);
            once->initialized = 1;
            PR_NotifyAllCondVar(mod_init.cv);
            PR_Unlock(mod_init.ml);
        } else {
            PR_Lock(mod_init.ml);
            while (!once->initialized) {
                PR_WaitCondVar(mod_init.cv, PR_INTERVAL_NO_TIMEOUT);
            }
            PR_Unlock(mod_init.ml);
        }
    }
    return once->status;
}

Here, once->initialized is inspected outside of mod_init.ml. I believe this is
unsafe on SMP platforms that don't guarantee store order; immediately after CPU0
executes func and sets once->initialized = 1, it's possible for CPU1's cache to
contain once->initialized == 1 WITHOUT containing the result of func. For
example, it could be possible for CPU1 to return from PR_CallOnce only to
discover a NULL symWrapKeysLock.

Do you agree that PR_CallOnce is potentially unsafe?
Chris, you are right.  The idea is that if we use
PR_CalllOnce for all one-time initializations, only
PR_CallOnce needs to be fixed.

The while loop in __nss_InitLock can be rewritten as
follows:

    /* atomically initialize the lock */
    while (!*ppLock) {
        PRInt32 oldValue = PR_AtomicSet(&initializers, 1);
        if (oldValue == 0) {
	    *ppLock = PZ_NewLock(ltype);
            (void) PR_AtomicSet(&initializers, 0);
            break;
        }
        PR_Sleep(PR_INTERVAL_NO_WAIT);          /* PR_Yield() */
    }

One will recognize this as a form of spin lock.  It follows
that Julien's fix is the "locked doublecheck" (or rather
"doublechecked locking") idiom.

I have seen other people use homegrown spinlocks to
safely create global objects for a library that doesn't
have an initialization function.

SSL's need can be met with the new PR_CallOnceWithArg function
(added in NSPR 4.3).
Is there a separate bug tracking the PR_CallOnce deficiency? If not, should I
file one?
Chris,

No, there isn't a bug yet on PR_CallOnce - please open one against NSPR and
assigned to Wan-Teh so it can be tracked separately.

As far as the place to do the initialization, for SSL server apps, there are
session cache initialization APIs which must be called . That would be a place
where we could initialize these locks . However, for SSL client apps, there is
no place to do it until the first time the lock is needed .
OS: SunOS → Solaris
If you need this bug fixed in NSS 3.9.1, we
better have a patch attached and review it
soon.
r=wtc on Julien's fix in the description of this bug.
(The rewrite I proposed in comment 4 can be applied to
the trunk.)

Julien, have you verified your fix?

By the way, we should also fix nss_InitMonitor and
nssRWLock_AtomicCreate.
Re: comment 4, please explain how PR_CallOnceWithArg solves any problem.
It STILL necessitates a separate NSPR callonce struct for every lock,
monitor, etc, that must be initialized.  
Wan-Teh,

Unfortunately this bug is not easy to reproduce due to the low probability. I
haven't reproduced it myself - the web server team did. So verifying the fix may
be very difficult.

We would indeed like to have this fix in 3.9.1 . We are however having some
schedule changes so we have more time to come up with an elegant patch. I will
email you about this.


Sorry I wasn't clear about the rewrite.  I was referring
to using PR_AtomicSet instead of PR_AtomicIncrement
and PR_AtomicDecrement.  I was not referring to using
PR_CallOnceWithArg.
This is Julien's fix, applied to nss_InitLock,
nss_InitMonitor, and NSSRWLock_AtomicCreate.
Attachment #144787 - Flags: superreview?(MisterSSL)
Attachment #144787 - Flags: review?(julien.pierre.bugs)
Target Milestone: 3.9.1 → 3.9.2
Comment on attachment 144787 [details] [diff] [review]
Proposed patch (checked in)

r=nelsonb
I'm certain this patch makes the code no less 
correct than before.  It probably fixes the problem on most machines.  I cannot
say it fixes the problem on all machines.
Attachment #144787 - Flags: superreview?(julien.pierre.bugs)
Attachment #144787 - Flags: superreview?(MisterSSL)
Attachment #144787 - Flags: review?(julien.pierre.bugs)
Attachment #144787 - Flags: review+
Comment on attachment 144787 [details] [diff] [review]
Proposed patch (checked in)

This code is better than before. It's OK to check in for 3.9.1 . But let's keep
this bug open for 3.9.2 if we find it does not fix the problem 100%.
Attachment #144787 - Flags: superreview?(julien.pierre.bugs) → superreview+
Comment on attachment 144787 [details] [diff] [review]
Proposed patch (checked in)

Requesting drivers' approval for Mozilla 1.7 final.

Mozilla doesn't need this fix.	However, we are
certifying the NSS 3.9.1 release, and it is our
desire that Mozilla 1.7 final uses a certified
NSS release.

This fix has a very low risk.
Attachment #144787 - Flags: approval1.7?
Julien, please check this in on the trunk (or tell me and I will do it).
The branch is waiting for mozilla driver approval.  
Comment on attachment 144787 [details] [diff] [review]
Proposed patch (checked in)

I checked in this patch on the NSS trunk (NSS 3.10).

Checking in nsslocks.c;
/cvsroot/mozilla/security/nss/lib/util/nsslocks.c,v  <--  nsslocks.c
new revision: 1.4; previous revision: 1.3
done
Checking in nssrwlk.c;
/cvsroot/mozilla/security/nss/lib/util/nssrwlk.c,v  <--  nssrwlk.c
new revision: 1.6; previous revision: 1.5
done
Comment on attachment 144787 [details] [diff] [review]
Proposed patch (checked in)

a=chofmann for 1.7
Attachment #144787 - Flags: approval1.7? → approval1.7+
Comment on attachment 144787 [details] [diff] [review]
Proposed patch (checked in)

I checked in this patch on the NSS_3_9_BRANCH (NSS 3.9.1)
and NSS_CLIENT_TAG (Mozilla 1.7 final).
Attachment #144787 - Attachment description: Proposed patch → Proposed patch (checked in)
Keywords: fixed1.7
This patch initializes the lock during the server session cache initialization
. This should prevent the issue we have seen from ever happening again .

It doesn't solve the problem for client applications, however; or for servers
that don't setup the SID cache (I'm not sure if that even works).
Attachment #149974 - Flags: superreview?(MisterSSL)
Attachment #149974 - Flags: review?(bugz)
OS: Solaris → All
Hardware: Sun → All
Nelson, could you please review the new patch I attached last week ? I would
like to get it into 3.9.2 .

Thanks !
Attachment #149974 - Flags: superreview?(bugz)
Attachment #149974 - Flags: superreview?(MisterSSL)
Attachment #149974 - Flags: review?(bugz)
Attachment #149974 - Flags: review?(MisterSSL)
Comment on attachment 149974 [details] [diff] [review]
work around issue for SSL server applications

I have some minor nits with this patch.  
They're minor enough that I will still give it r+, 
but I encourage you to consider changing them anyway.

1. The patch adds explicit null initialization to several static external 
variables.  

-static PZLock *	  symWrapKeysLock;
+static PZLock *	  symWrapKeysLock = NULL;
and
-static sslSessionID *cache;
-static PZLock *      cacheLock;
+static sslSessionID *cache = NULL;
+static PZLock *      cacheLock = NULL;

These initializations are unnecessary and IMO undesirable.

The c language GUARANTEES that all external (meaning: not declared within 
a function) and static variables are initialized to an all-zeros value.  
This is true on ALL platforms.	It's a guarantee of the language, not of 
the OS.  This guarantee was in K&R rev 1, before ANSI c.

So, adding the explicit = NULL assignment doesn't change the logical 
result of this code at all.  Either way, the value will be initialized
to all zeros.  But the explicit initialization does increase the size of 
the code.  It causes the symbol to be linked into the initialized data 
segment, rather than in the BSS segment, and causes the initial value to 
be placed into the executable file from which the initialized data 
segment is initialized.  (Each program and shared-library's BSS segment 
is always zeroed before that program or shared library begins execution).  

So, I wouldn't initialize any static or external variables to zero or NULL.

2. Please put a blank line after the closing brace of the new function 
ssl_InitSymWrapKeysLock(), so that the function will be clearly separated
from the next one.

3. Please remove the XXX comments in the function formerly known as 
lock_cache, now known as ssl_InitCacheLock, since they are no longer
true, due to these changes.

4. When "lock_cache" was static, it was apparent that the cache being
locked was the client ssl session cache, since that is the only cache
implemented in sslnonce.c.  But now that the function is being made 
external (not static), and it being called from other source files,
its name should clearly indicate exactly which of SSL several caches
it affects.  

Please rename ssl_InitCacheLock to ssl_InitClientSessionCacheLock
since that is what the function actually does.
Attachment #149974 - Flags: review?(MisterSSL) → review+
Nelson,

1. 
I was unaware the C language guaranteed this. If so, wouldn't the NULL / 0
initial assignment get optimized out, rather than additional code being
generated to do a second assignment ? 

I don't mind taking the assignments out, but I prefer to have them in anyway as
I don't consider them harmful. The minimal load-time performance impact and code
growth are a small price to pay for supportability, IMO. When code gets moved
around, the necessary assignment could very easily be overlooked. For this
reason, I think it is acceptable to explicitly initialize all variables,
including globals in this case.

2. done

3. done

4. lock_cache is still static, even in the patch . The diff doesn't show that
well. The lock still only gets locked from one file . The init only used to
happen at the time we locked, but now a separate init function got added, and
the init of the lock happens earlier than the first time we try to lock it in
SSL server apps. Nevertheless, I have changed the name of the lock init function
as suggested.
Attached patch updated patchSplinter Review
Attachment #149974 - Attachment is obsolete: true
Attachment #151183 - Flags: review?(nelson)
Comment on attachment 151183 [details] [diff] [review]
updated patch

Initializing a static or external (global) variable with any value, including
zero, causes it to be allocated in the initialized data segment, increasing 
the size of that segment.  No executable instructions are generated to do 
that initialization, and hence no such instructions can be optimized out. 
Rather, the initial value of the entire initialized data segment becomes a 
part of the binary file.  More initialized global/static variables == larger
binaries.

Download size of the NSS shared libs has become an issue for AOL.  
NSS is considered too big.  So, I object to changes that unnecessarily 
increase the download size by increasing the size of the initialized data 
segment to hold zero values, especially if done on a large/wide scale in
the code.  (This patch is not large scale.  I'm just speaking to other
potential changes.)

I disagree that explicitly initialized globals and statics are more 
maintainable than implicitly initialized ones.	One merely has to become
accustomed to thinking of globals and statics as implicitly initialized 
to zero.

Having said that, I'll let it go for *this* patch.
Attachment #151183 - Flags: review?(nelson) → review+
checked this in to the tip .

Checking in ssl3con.c;
/cvsroot/mozilla/security/nss/lib/ssl/ssl3con.c,v  <--  ssl3con.c
new revision: 1.68; previous revision: 1.67
done
Checking in sslimpl.h;
/cvsroot/mozilla/security/nss/lib/ssl/sslimpl.h,v  <--  sslimpl.h
new revision: 1.35; previous revision: 1.34
done
Checking in sslnonce.c;
/cvsroot/mozilla/security/nss/lib/ssl/sslnonce.c,v  <--  sslnonce.c
new revision: 1.15; previous revision: 1.14
done
Checking in sslsnce.c;
/cvsroot/mozilla/security/nss/lib/ssl/sslsnce.c,v  <--  sslsnce.c
new revision: 1.30; previous revision: 1.29
done

Also to NSS_3_9_BRANCH (but my CVS command output got overwritten - so it's not
here).
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Attachment #149974 - Flags: superreview?(bugz)
You need to log in before you can comment on or make changes to this bug.