Closed Bug 126087 Opened 23 years ago Closed 22 years ago

Hot Locks (PK11SymKey->refLock)

Categories

(NSS :: Libraries, defect, P1)

Sun
Solaris
defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: kirk.erickson, Assigned: kirk.erickson)

References

Details

(Whiteboard: [adt2 RTM])

Attachments

(11 files, 5 obsolete files)

15.30 KB, text/plain
Details
7.01 KB, text/plain
Details
8.35 KB, text/plain
Details
2.02 KB, patch
Details | Diff | Splinter Review
5.48 KB, text/html
Details
4.46 KB, text/html
Details
526 bytes, patch
Details | Diff | Splinter Review
665 bytes, patch
Details | Diff | Splinter Review
1.53 KB, patch
wtc
: review+
Details | Diff | Splinter Review
2.35 KB, patch
Details | Diff | Splinter Review
2.76 KB, patch
Details | Diff | Splinter Review
Created this bug to track changes expected to be completed by next
week to address the lock contention reported in a meeting called by
Wan-Teh with the Sun tuning team last Thursday.

The data from Ning Sun is the first attachment.  Note that iWS 6.0 SP1
was profiled using stock NSS (3.3); the sessionLock fix that recently
made it into 3.4 was not present (see bug 121523).

My plan is to benchmark before and after applying each of two sets of
changes in turn.

The first is to simply remove the lock in prng_GenerateGlobalRandomBytes().
It was reported that this lock is not needed at all.

The second is to remove all the locking we can around instances of
PK11SlotInfo->refLock, by using PR_Atomic instructions instead.

To test the latter, we need to ensure the underlying Sparc4 instructions
are engaged.  Wan-Teh said to create a symbolic link as follows:

kirke@soupnazi[10] pwd
/export2/kirke/workarea-nss/mozilla/dist/SunOS5.8_OPT.OBJ/lib
kirke@soupnazi[11] ln -s libultrasparc4.so libatomic.so
Note, the sessionLock fix already checked in addresses
the sessionLock (lock 6).
When I remove the lock in prng_GenerateGlobalRandomBytes(),
I'm unable to benchmark: I saw 100% CPU usage and 0.90 ops/sec.
I updated my tree this morning, and made sure everything worked fine
before making only the following change:

Index: mozilla/security/nss/lib/freebl/prng_fips1861.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/lib/freebl/prng_fips1861.c,v
retrieving revision 1.12
diff -b -u -r1.12 prng_fips1861.c
--- mozilla/security/nss/lib/freebl/prng_fips1861.c	15 Nov 2001 02:41:17 -0000
1.12
+++ mozilla/security/nss/lib/freebl/prng_fips1861.c	19 Feb 2002 19:02:31 -0000
@@ -366,13 +366,10 @@
 	PORT_SetError(SEC_ERROR_INVALID_ARGS);
 	return SECFailure;
     }
-    /* --- LOCKED --- */
-    PZ_Lock(rng->lock);
     /* Check the amount of seed data in the generator.  If not enough,
      * don't produce any data.
      */
     if (rng->seedCount < MIN_SEED_COUNT) {
-	PZ_Unlock(rng->lock);
 	PORT_SetError(SEC_ERROR_NEED_RANDOM);
 	return SECFailure;
     }
@@ -392,8 +389,6 @@
 	len -= num;
 	output += num;
     }
-    PZ_Unlock(rng->lock);
-    /* --- UNLOCKED --- */
     return rv;
 }

Bob, is this unsafe?  Or are the Sun folks right that this lock can be
removed, and I'm just removing it incorrectly?
As I look at the code, it appears to be unsafe.  There are two possible
issues here.  First, in a race condition, two different callers may get
the same random bits (this is bad because it creates an 'undetectable'
security leak).  The second problem is probably what is tripping you up:
in a race condition, avail may become negative.

We need to think about a way we can code grabbing more random data concurrently
without horking the state of our random number generator. We need to get the
resulting code reviewed very carefully to make sure we aren't introducing other
problems.

bob
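To make Bob's second failure mode concrete, here is a toy, single-threaded simulation (all names hypothetical, not NSS code) that interleaves the check and the decrement the way two unsynchronized threads could once the lock is removed:

```c
#include <assert.h>

/* Two callers interleaved by hand: both see avail >= need before either
 * decrements, so avail goes negative -- the failure mode described above. */
static int avail = 30;

static int  check(int need) { return avail >= need; } /* step 1: test */
static void take(int need)  { avail -= need; }        /* step 2: act  */

int simulate_race(void)
{
    int a_ok = check(20);   /* thread A passes the test */
    int b_ok = check(20);   /* thread B passes it too   */
    if (a_ok) take(20);
    if (b_ok) take(20);     /* avail is now negative    */
    return avail;
}
```

With the lock in place, the test and the decrement form one atomic step, so the second caller fails the check instead of driving avail below zero.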
The latest NSS library (Ian-0228-opt) works fine on our machine without
any errors.  Our 1-way SPECweb99-SSL result with the new library shows
slight perf degradation compared to the original NSS (367 vs. 354 ops/s).
But scalability has been improved significantly with the new library:
the 8-way result goes up by ~43%.  It scales well up to 2-way
but starts to tail off at 4-way.  We collected plockstat data on both
1-way and 4-way; the sleep count of the first 5 hot locks on 4-way
increased by over 100x compared to 1-way.

Please see the two attached files for detailed statistics and the stack
trace for each hot lock.  Here is a brief summary of the data.

1-way:

1. pk11 slot objectLock via pk11_handleObject()
2. symKey refLock used by PK11_FreeSymKey()
3. PK11SlotInfo reference lock via PK11_ReferenceSlot()
4. sessionLock
5. objectLock from pk11_GetObjectFromList()
6. connection queue (not related to NSS)
7. slot freeListLock pk11_getKeyFromList()
8. lock from NSFC cache (not related to NSS)
9. object attributeLock from pk11_FindAttribute()
4-way:

1. pk11 slot objectLock via pk11_handleObject()
2. symKey refLock used by PK11_FreeSymKey()
3. sessionLock
4. PK11SlotInfo reference lock via PK11_ReferenceSlot()
5. slot freeListLock pk11_getKeyFromList()
6. objectLock from pk11_GetObjectFromList()
7. prng_GenerateGlobalRandomBytes()
8. connection queue (not related to NSS)
9. lock from NSFC cache (not related to NSS)
10. nssPKIObject_AddRef() ?
11. LockArena() from NSPR
12. PK11SlotList lock from PK11_GetFirstSafe()
13. object attributeLock from pk11_FindAttribute()

The locks are sorted roughly according to the sleep count as
suggested by Alex.
Re: the PRNG lock.  It cannot just be removed.  It protects the full state of
the global PRNG.  The PRNG itself is a serial algorithm, that is, state[N+1] =
f(state[N]).  I don't see how this can be done without a single lock.

One way to avoid repeated accesses to the lock would be to increase the
amount of pseudo-random data generated on each pass.  Currently, only 20
bytes are generated.  I don't know whether that would decrease the total
amount of time spent waiting on the lock, however.
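A toy sketch of that trade-off (a pthread mutex standing in for PZ_Lock, a linear-congruential step standing in for the real FIPS 186-1 update; all names hypothetical): the per-pass chunk size directly sets how many lock acquisitions a large request costs, so raising it from 20 amortizes the lock without touching the serial state[N+1] = f(state[N]) property.

```c
#include <pthread.h>
#include <stddef.h>

#define CHUNK 20  /* bytes produced per locked pass, like the FIPS PRNG */

static pthread_mutex_t rng_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long rng_state = 1;
static unsigned long lock_acquisitions = 0;

/* toy serial state update: state[N+1] = f(state[N]) */
static unsigned long f(unsigned long s) { return s * 6364136223846793005UL + 1; }

/* Fill 'out' with 'len' bytes, producing CHUNK bytes per lock hold. */
void rng_generate(unsigned char *out, size_t len)
{
    while (len > 0) {
        size_t i, n = len < CHUNK ? len : CHUNK;
        pthread_mutex_lock(&rng_lock);
        lock_acquisitions++;
        for (i = 0; i < n; i++) {
            rng_state = f(rng_state);
            out[i] = (unsigned char)(rng_state >> 56);
        }
        pthread_mutex_unlock(&rng_lock);
        out += n;
        len -= n;
    }
}
```

A 200-byte request costs ten acquisitions at CHUNK 20; at CHUNK 160 it would cost two. Whether fewer, longer lock holds reduce total contention is exactly the open question above.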
I see a small gain on soupnazi (4p) with this patch.
It removes the reference to slot->objectLock in pk11_handleObject().

Rather than taking the highest ops/sec, I've made my harness 
throw out the first two samples, and average the remainder.

With a clean pull and build of today's tip:  532.22  14%
With only this change:			     538.00  13%

It's possible Ning will see a more dramatic improvement, but we need
Ian's checkin for her to run the tip under Deimos.
Kirk,

My checkins are all on the tip.  The last build I gave her was a tip build, as
of Friday.
A small gain in both full and restart times.
The entries with names ending with -atomicHO designate the
runs with the above patch in pk11_handleObject(), which uses
PR_AtomicIncrement() instead of slot->objectLock.

I've supplied Ning with new libs and commands with these bits
for iWS 6.0sp2 comparisons under Deimos
(ftp://box.red.iplanet.com/pub/deimos/2002-0304/).
This patch includes the previous patch, and addresses the 3rd hottest lock.
It removes slot->refLock from PK11_ReferenceSlot(), by using 
PR_AtomicIncrement().

I saw no speedup in my selfserv stress, but perhaps Ning will in her
iWS stress tests.  I'll update the bits on the ftp site in hopes she
can compare this patch with the previous patch.
Attachment #72469 - Attachment is obsolete: true
Comment on attachment 72670 [details] [diff] [review]
AtomicIncrement in pk11_handleObject and PK11_ReferenceSlot

Neither Ning nor I saw a gain with atomic instructions for hot locks (1) and
(3).
Attachment #72670 - Attachment is obsolete: true
Using PR_AtomicDecrement() rather than symKey->refLock
yielded a gain for me.  Will provide Ning with this library set.
I pulled from the tip and benchmarked before and after applying
this patch only.

204.84	full-zones		1% idle
214.32	full-zones-atomic2	1% idle
---------------------------------------
  9.48 ops/sec speedup on full handshakes
  9.48/204.84 = 4.63% gain

537.02	restart-zones		15% idle
542.07	restart-zones-atomic2	14% idle
----------------------------------------
  5.05 ops/sec speedup on restarts
  5.05/537.02 = 0.94% gain
Attachment #73359 - Attachment is obsolete: true
Ning and I both saw a gain with the previous patch.
This patch combines the previous change (removing symKey->refLock)
with a new one: it reduces contention on slot->freeListLock by using
atomic instructions on slot->keyCount.

To do more with the slot->freeListLock, we need to create additional
locks and round-robin their use.  This entails profiling to ascertain
an "ideal" number of locks, or a mechanism in software to detect the
contention and create locks on demand.  Before we take this step, I'd
like to see the big picture again.

Please do your 1- and 4-way runs again with this set, and let's make
sure it's still really hot and worth this effort.  I'd like to know
that contention hasn't fallen elsewhere before embarking down that path.

This change was based on comments back from Ning and Bob Relyea:

> The freeListLock is locking a very short piece of code, you can see this 
> in pk11skey.c:
> 
> In getKeyFromList:
> 
>     PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)
>     if (slot->freeSymKeysHead) {
>	  symKey = slot->freeSymKeysHead;
>	  slot->freeSymKeysHead = symKey->next;
>	  slot->keyCount--;
>     }
>     PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
> 
> and in  FreeSymKey:
> 
>	 PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)
>	  if (slot->keyCount < slot->maxKeyCount) {
>	     symKey->next = slot->freeSymKeysHead;
>	     slot->freeSymKeysHead = symKey;
>	     slot->keyCount++;
>	     symKey->slot = NULL;
>	     freeit = PR_FALSE;
>	  }
>	  PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
> I can see a couple of things you can do to limit the locks:
> 
> 1st, move the slot->keyCount stuff out of the locks and use atomic 
> increment/decrement. This mostly helps the free case because you can 
> simply move the lock inside the if statement.
> 
> 2nd, the only other thing that can be done is to use a round-robin set 
> of free lists, each with its own lock. You'll have to do profiling to 
> see how many free lists you would need. Be sure to adjust maxKeyCount 
> down by the number of freelists you have. It might be best to make the 
> number of lists a power of two; it makes the math for picking a 
> list faster. I don't think you really need to use locks to protect the 
> round-robin value since it's not a serious problem if you accidentally 
> skip a bucket once because of contention.
> 
> bob
> 
> Ning Sun wrote:
> 
> >Hi Kirk,
> >
> >I tested your new NSS bits, it shows about 2% improvement on 4-way
> >over the Ian-0228-opt version.
> >
> >Thanks a lot for the help.
> >
> >You are right. the second lock in my report should be slot->freeListLock 
> >instead of the refLock. The contention on this is very high. Hopefully 
> >we can do something about it.
> >
> >Thanks,
> >-ning
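Bob's round-robin suggestion might be sketched like this (pthreads and stub types in place of the real PK11 structures; every name here is hypothetical): a power-of-two number of free lists, each with its own lock, selected by an unlocked round-robin cursor whose occasional lost update merely skips a bucket.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_FREELISTS 4            /* power of two: list chosen by masking */

typedef struct FreeKey { struct FreeKey *next; } FreeKey;

typedef struct {
    pthread_mutex_t lock;
    FreeKey *head;
    int count;
} FreeList;

static FreeList freelists[NUM_FREELISTS];
static unsigned int rr;            /* round-robin cursor; races are harmless */
static int maxPerList = 16;        /* maxKeyCount divided by NUM_FREELISTS */

void freelists_init(void)
{
    int i;
    for (i = 0; i < NUM_FREELISTS; i++)
        pthread_mutex_init(&freelists[i].lock, NULL);
}

/* Unlocked on purpose: skipping a bucket under contention is acceptable. */
static FreeList *next_list(void)
{
    return &freelists[rr++ & (NUM_FREELISTS - 1)];
}

/* Returns 1 if cached; 0 means the caller should destroy the key. */
int freelist_put(FreeKey *key)
{
    FreeList *fl = next_list();
    int kept = 0;
    pthread_mutex_lock(&fl->lock);
    if (fl->count < maxPerList) {
        key->next = fl->head;
        fl->head = key;
        fl->count++;
        kept = 1;
    }
    pthread_mutex_unlock(&fl->lock);
    return kept;
}

FreeKey *freelist_get(void)
{
    FreeList *fl = next_list();
    FreeKey *key = NULL;
    pthread_mutex_lock(&fl->lock);
    if (fl->head) {
        key = fl->head;
        fl->head = key->next;
        fl->count--;
    }
    pthread_mutex_unlock(&fl->lock);
    return key;
}
```

Each of the N locks sees roughly 1/N of the traffic, which is why the right N has to come from profiling rather than guesswork.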

Compared to the previous patch, I saw a gain from

	214.32 (1% idle)
to	216.41 (0% idle) on full handshakes, and from

	542.07 (14% idle)
to	545.95 (15% idle) on restarts.

Compared to the tip (which is closed and about to release as NSS 3.4)

	216.41 - 204.84 = 11.57 ops/sec gain on full handshakes (5.65%)
	545.95 - 537.02 =  8.93 ops/sec gain on restarts (1.66%)
I stressed the previous patch overnight with no problems.
Here are the performance numbers from soupnazi, an
Ultra Sparc II machine with 4 450Mhz cpus.

It includes results from the current NSS 3.4 tip, NSS 3.3.1
and from November (just before Bob's big checkin) for comparison.

The results from this patch are named ...zones-atomic2a
The results from the previous version ...zones-atomic2

In these runs, the server was hit by nine machines on the
same subnet.  It ran with 100 threads, as did all clients.
Attachment #72495 - Attachment is obsolete: true
OS: Linux → Solaris
Priority: -- → P1
Target Milestone: --- → 3.4.1
Without the zone allocator, heap contention appears to mask gains.
Ning is using SmartHeap, and saw this patch perform better than the
previous patch, but found that official svbld bits performed better
than either 0308 or 0314.  As a result, I tried to isolate what
build environment differences might account for the general drop
with my builds.

I found that I was using an older compiler than Sonja, and upgraded.
I found no other differences in build environment variables, compiler
command lines, or linker command lines.

The only other difference I found was that I was doing a noimport
build, pulling NSPRPUB_RELEASE_4_2_BETA2 to pick up Wan-Teh's
version of the zone allocator.  I backed out to NSPRPUB_RELEASE_4_1_2,
which is the release NSS 3.4 will ship with as things stand.
Then I applied my own zone allocator changes, which I had used
previously with NSPR 4.1.2.

These are the results of my selfserv stress runs.
Once again, these changes appear to help selfserv stress, especially
on full handshake runs (207.39->222.58), where idle dropped back to
zero for the first time too.

Also, 561.50 is a new record for restarts (see restart-zones-34-atomic).
Both with and without this patch (see names with -atomic), NSS 3.4
(see names with -34) shows a jump in idle CPU time when the zone
allocator (-zones) is engaged.
Checked in patch above and closed this bug.
Will open a new bug when we get a new hotlist that shows
where contention has fallen.

The final patch addresses contention on slot->freeListLock
and &symKey->refCount, touching pk11wrap/pk11skey.c only.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
While debugging some changes I made, I noticed that tstclnt was not shutting
down cleanly as it used to.  It seemed the internal slot was being leaked
(several references), and after looking at checkins since 3.4 RTM I realized it
had to be here.  I made this change and it is fine now.  Checking in; just
noting the change here.
Comment on attachment 74085 [details] [diff] [review]
address hotlock 2: symKey->refLock and slot->freeListLock

Kirk,

Thanks for putting together this patch and testing it.

I see two issues with this patch that need to be addressed.

1. I don't think there is a gain to use atomic increment
and decrement on slot->keyCount because the slot->keyCount--
or slot->keyCount++ operations are only a small portion of
the code between lock and unlock; there is still code to
operate on the free list.  Roughly, you reduce the critical
region from 4 lines of code to 3 lines.

2. The use of atomic increment and decrement on symKey->refLock
should be a win, but you did not replace symKey->refCount++
with PR_AtomicIncrement(&sym->refCount). :-)

>@@ -264,15 +264,15 @@
> 	    PORT_Free(symKey->data.data);
> 	}
>         slot = symKey->slot;
>-        PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)
> 	if (slot->keyCount < slot->maxKeyCount) {
>+            PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)

With this change, you are now testing slot->keyCount before
you grab the lock.  This is not thread-safe.

> 	   symKey->next = slot->freeSymKeysHead;
> 	   slot->freeSymKeysHead = symKey;
>-	   slot->keyCount++;
>+	    PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
>+	    PR_AtomicIncrement(&slot->keyCount);
> 	   symKey->slot = NULL;
> 	   freeit = PR_FALSE;
>         }
>-	PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
>         if (freeit) {
> 	    pk11_CloseSession(symKey->slot, symKey->session,
> 							symKey->sessionOwner);

The change to shorten the critical region here is great!

I am pretty sure that
    freeit = PR_FALSE;
can be done after the unlock call.

However, I think
    symKey->slot = NULL;
cannot be after the unlock call.  It does not need to be
in the critical region, but it should be before the lock
call.
Attachment #74085 - Flags: needs-work+
To elaborate my comment on this change:

> 	   symKey->next = slot->freeSymKeysHead;
> 	   slot->freeSymKeysHead = symKey;
>-	   slot->keyCount++;
>+	    PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
>+	    PR_AtomicIncrement(&slot->keyCount);
> 	   symKey->slot = NULL;
> 	   freeit = PR_FALSE;
>         }
>-	PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
>         if (freeit) {
> 	    pk11_CloseSession(symKey->slot, symKey->session,
> 							symKey->sessionOwner);

Once you add symKey to the freelist and unlock, you no longer
have a valid reference to it, so you can't set symKey->slot
to NULL after you unlock.

You can set symKey->slot to NULL either before the lock call
or inside the critical region.
Wan-Teh wrote:

> >       if (slot->keyCount < slot->maxKeyCount) {
> >+            PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)
> 
> With this change, you are now testing slot->keyCount before
> you grab the lock.  This is not thread-safe.

I recall balking at this too, but Bob had reason to believe
it was safe.

maxKeyCount is initialized by PK11_NewSlotInfo()/PK11_InitToken()
in pk11slot.c, and never touched subsequently.  My recollection
is that it's safe because the reference in PK11_FreeSymKey() can
only occur after initialization is complete.

I'll come up with a new patch that addresses your other concerns,
and perhaps Bob will clear up this aspect.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This bug was brought out by my last fix.  The slot cannot be set to NULL before
leaving the freelist, or another thread may grab the key off the freelist and
have their slot NULL'ed.
Sorry, I should have read the last comments more carefully, but I was deep in
debugging.  I made the obvious change of moving the symKey->slot = NULL back
into the critical region, and checked it in to fix the tinderboxes.  The other
option, of doing it before the critical region, is just as valid, so if that is
preferable go ahead and move it again.
Wan Teh is write, we can't mark the slot as NULL once we've added the key to the
free list unless we are still locking the free list, however, we can set the
slot to NULL before we add the key to the free list. It's safe at that point
because we know there are no other references to the key (we are in the middle
of destroying it). So the key is free to be tweaked until we put it on the free
list. (just be sure to set it to NULL *after* we've made a copy of the slot
since we still need that ;).

bob
OK, I meant Wan-Teh is 'right' not 'write':). It's too late, I think I need to
go home;).
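Putting the ordering points above together, the safe shape of the caching path can be sketched like this (a pthread mutex standing in for PZ_Lock, stub types in place of the real PK11 structures, and cache_sym_key as a hypothetical name): clear symKey->slot while the key is still private to this thread, never after the unlock.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct SymKey SymKey;
typedef struct Slot {
    pthread_mutex_t freeListLock;
    SymKey *freeSymKeysHead;
    int keyCount;
    int maxKeyCount;
} Slot;
struct SymKey { SymKey *next; Slot *slot; };

/* Returns 1 if the key was cached on the slot's free list, 0 if the
 * caller must destroy it.  symKey->slot is cleared inside the critical
 * region: once we unlock, another thread may already have popped the
 * key off the list, so touching it then would race. */
int cache_sym_key(SymKey *symKey)
{
    Slot *slot = symKey->slot;   /* copy first; the destroy path needs it */
    int freeit = 1;
    pthread_mutex_lock(&slot->freeListLock);
    if (slot->keyCount < slot->maxKeyCount) {
        symKey->next = slot->freeSymKeysHead;
        slot->freeSymKeysHead = symKey;
        slot->keyCount++;
        symKey->slot = NULL;     /* must happen before the unlock */
        freeit = 0;
    }
    pthread_mutex_unlock(&slot->freeListLock);
    return !freeit;
}
```

Clearing the slot pointer before the lock call would also be valid, since no other thread holds a reference until the key is linked onto the list.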
There are still occasional tinderbox failures in stress tests, almost assuredly
due to these changes.  I wonder if Wan-Teh's other observation is the problem. 
That is, the PR_AtomicDecrement was not matched with a PR_AtomicIncrement.

We have:

PK11SymKey *
PK11_ReferenceSymKey(PK11SymKey *symKey)
{
    PK11_USE_THREADS(PZ_Lock(symKey->refLock);)
    symKey->refCount++;
    PK11_USE_THREADS(PZ_Unlock(symKey->refLock);)
    return symKey;
}

The lock no longer protects the refCount from being decremented, so it is
useless.  And I assume mixing the ++ operator with PR_AtomicDecrement is not safe.
Ian,

You are right. The PK11_ReferenceSymKey function
should say:

PK11SymKey *
PK11_ReferenceSymKey(PK11SymKey *symKey)
{
    PR_AtomicIncrement(&symKey->refCount++);
    return symKey;
}

Bob, is it okay to use NSPR atomic routines here?
NSPR atomic routines are implemented with NSPR
locks on machines that don't have the appropriate
atomic instructions.  It would appear that this
is not PKCS#11 compliant because we won't be using
the PKCS#11 mutexes.  Do we have stubs for
PR_AtomicIncrement and PR_AtomicDecrement?

Correction: the ++ operator should be deleted.

PK11SymKey *
PK11_ReferenceSymKey(PK11SymKey *symKey)
{
    PR_AtomicIncrement(&symKey->refCount);
    return symKey;
}

Wan-Teh,

I checked in that change a while ago, as an experiment to see if it fixes the
intermittent tbox failures.

As for not using a lock, IINM this should not affect PKCS#11 because it is above
the line.  That is, this code is in libnss, so we can do whatever we want.
> I see two issues with this patch that need to be addressed.
>
> 1. I don't think there is a gain to use atomic increment
> and decrement on slot->keyCount because the slot->keyCount--
> or slot->keyCount++ operations are only a small portion of
> the code between lock and unlock; there is still code to
> operate on the free list.  Roughly, you reduce the critical
> region from 4 lines of code to 3 lines.

I've created a new patch and attached it here for review
which backs out of the change in pk11_getKeyFromList():

    PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)
    if (slot->freeSymKeysHead) {
	symKey = slot->freeSymKeysHead;
	slot->freeSymKeysHead = symKey->next;
	slot->keyCount--;
    }
    PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)

> 2. The use of atomic increment and decrement on symKey->refLock
> should be a win, but you did not replace symKey->refCount++
> with PR_AtomicIncrement(&sym->refCount). :-)

I gather from Ian's and Wan-Teh's checkins that this has been
resolved.  I was addressing the hot lock which Ning pointed to
in PK11_FreeSymKey() and neglected the instance in
PK11_ReferenceSymKey().  We're setting symKey->slot to NULL
before the UNLOCK as things stand.
Comment on attachment 78918 [details] [diff] [review]
Patch to back out of atomic slot-keyCount change

Kirk,

Thanks for the patch.  This patch is correct but is
incomplete.  You also need to back out the
PR_AtomicIncrement(&slot->keyCount) change.

Also, now that we use atomic increment/decrement
routines on symKey->refCount, we can remove
symKey->refLock.  Could you take care of that too?
Attachment #78918 - Flags: needs-work+
Attachment #78918 - Attachment is obsolete: true
Ok, this patch catches both PK11_FreeSymKey() and
pk11_getKeyFromList().	This is what we had originally,
and what this patch takes us back to:

	PK11_USE_THREADS(PZ_Lock(slot->freeListLock);)
	if (slot->keyCount < slot->maxKeyCount) {
	   symKey->next = slot->freeSymKeysHead;
	   slot->freeSymKeysHead = symKey;
	   slot->keyCount++;
	   symKey->slot = NULL;
	   freeit = PR_FALSE;
	}
	PK11_USE_THREADS(PZ_Unlock(slot->freeListLock);)
Comment on attachment 78948 [details] [diff] [review]
backout of atomic slot->keyCount change

r=wtc.	Thanks.
Attachment #78948 - Flags: review+
Since we are using PR atomic routines on symKey->refCount,
symKey->refLock should be deleted.

PR atomic routines take a PRInt* pointer as the argument,
so symKey->refCount should be declared as PRInt32.
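A sketch of what that ends up looking like, with C11 stdatomic routines standing in for PR_AtomicIncrement/PR_AtomicDecrement and a hypothetical stub struct: the refLock field disappears entirely, and refCount gets an integer type the atomic routines accept (PRInt32 in NSPR, atomic_int here).

```c
#include <stdatomic.h>

/* refLock is gone; refCount is a plain 32-bit counter operated on
 * atomically. */
typedef struct {
    atomic_int refCount;   /* PRInt32 in the real code */
} SymKeyStub;

SymKeyStub *symkey_ref(SymKeyStub *key)
{
    atomic_fetch_add(&key->refCount, 1);
    return key;
}

/* Returns 1 when the last reference has been dropped and the key
 * should be destroyed. */
int symkey_unref(SymKeyStub *key)
{
    return atomic_fetch_sub(&key->refCount, 1) == 1;
}
```

The fetch-and-subtract returns the value before the decrement, so comparing it to 1 detects the final release without any lock around the counter.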
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
Moved to NSS 3.4.2.
Target Milestone: 3.4.1 → 3.4.2
Comment on attachment 82289 [details] [diff] [review]
Patch for the NSS_3_4_BRANCH

This patch has been checked into the NSS_3_4_BRANCH.
Marked the bug fixed.

Note that I updated the bug's summary to reflect what
lock contention actually got addressed.
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
Summary: Hot Locks (GenerateRandom and PK11SlotInfo->refLock) → Hot Locks (PK11SymKey->refLock)
Blocks: 145836
adt1.0.1+ (on ADT's behalf) for checkin to the 1.0 branch, pending Driver's
approval. Pls check this in asap. thanks!
Whiteboard: [adt2 RTM]
Keywords: mozilla1.0.1, fixed1.0.1
bishakhabanerjee@netscape.com - can you verify this bug fix in 1.01 branch? 
When verified, pls replace fixed1.0.1 keyword with verified1.0.1.  Thanks.
marking verified
Status: RESOLVED → VERIFIED