selfserv performance is down 7.4% relative to NSS 3.3.1

RESOLVED FIXED in 3.8

Status

Priority: P1
Severity: normal
Status: RESOLVED FIXED
Opened: 17 years ago
Closed: 16 years ago

People

(Reporter: kirk.erickson, Assigned: wtc)

Tracking

Sun
Solaris

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(21 attachments, 1 obsolete attachment)

(Reporter)

Description

17 years ago
After resolving bug 107777 (to get selfserv to log ops/sec) and hitting
it from multiple clients, comparisons of the tip and 3.3.1 stress runs
revealed a performance drop.  This bug was opened to collect the
benchmark and profiling data for reference.

2001-1128-1239 benchmarked iws-perf

                restart         full
                (6 clients)     (4 clients)
                -----------     -----------
3.3.1           210 (6%)        164
1103            220 (8%)        168
1127            221 (6%)        164

3.3.1-zones     585 (3%)        228
1103-zones      557 (3%)        224
1127-zones      536 (3%)        221

Previous results:
-----------------
3.3.1-zones     585             226
1120-zones      546             220

10% slowdown on restarts
 3% slowdown on fulls

1103 was just prior to checkin of the major set of changes for 3.4.
(Reporter)

Comment 1

17 years ago
Created attachment 62382 [details]
Notes from soupnazi runs
(Reporter)

Comment 2

17 years ago
Created attachment 62385 [details]
Execution Summary for 3.3.1

The following series of attachments are apples-to-apples profile
data for 3.3.1 and the tip as of 1211.  These are all with synctrace
turned off.

The execution summaries reveal the performance drop:
   User CPU:    -0.5%
   System CPU:  -0.2%
   Wait CPU:    -0.5%
   User Lock:   -1.2%

Additional analyzer output from runs with synctrace on will follow to
address the User Lock (attributed synchronization wait time).
(Reporter)

Comment 3

17 years ago
Created attachment 62387 [details]
Execution Summary for the tip as of Dec 11, 2001 (3.4 Beta)
(Reporter)

Comment 4

17 years ago
Created attachment 62389 [details]
ops/sec, command lines and runtime info (331)
(Reporter)

Comment 5

17 years ago
Created attachment 62390 [details]
ops/sec, command lines and runtime info
(Reporter)

Updated

17 years ago
Attachment #62389 - Attachment description: ca → ops/sec, command lines and runtime info (331)
(Reporter)

Comment 6

17 years ago
Created attachment 62391 [details]
Exclusive User CPU

Sorry Nelson.  I wanted to show counts, but don't see a way from the main
menu.  I believe when we look at callers/callees we'll be able to show
counts too.
(Reporter)

Comment 7

17 years ago
Created attachment 62392 [details]
Exclusive User CPU (1211)
(Reporter)

Comment 8

17 years ago
Created attachment 62395 [details]
shaCompress Callers/Callees (331)

Sorry, there's no way to save the caller/callee windows in text.
Shifting to capturing with xwd and converting to png.
Hitting the main routines we identified in the meeting yesterday
(see Exclusive CPU top dogs).

Still no sign of counts toggle...
(Reporter)

Comment 9

17 years ago
Created attachment 62397 [details]
shaCompress callers/callees (1211)
(Reporter)

Comment 10

17 years ago
Created attachment 62403 [details]
memcpy callers/callees (331)
(Reporter)

Comment 11

17 years ago
Created attachment 62404 [details]
memcpy callers/callees (1211)
(Reporter)

Comment 12

17 years ago
Created attachment 62406 [details]
MD5_Update callers/callees (331)

memcpy is the guy we're after.
It calls MD5_Update and MD5_End.
(Reporter)

Comment 13

17 years ago
Created attachment 62412 [details]
MD5_Update callers/callees (1211)
(Reporter)

Comment 14

17 years ago
Created attachment 62413 [details]
MD5_End callers/callees (331)
(Reporter)

Comment 15

17 years ago
Created attachment 62414 [details]
MD5_End callers/callees (1211)
(Reporter)

Comment 16

17 years ago
I realize now we need to ask for counts when collecting.
Apparently, I had this turned off for the above runs.

What follows (with counts included) may help for the runs
where we had synctrace turned on.  Note that unlike the runs above,
where the durations under 3.3.1 and 1211 were within a second of
each other, this pair differed by more, although the server was
stopped within a few seconds of the same time.  The duration for
the 331 run was 251.106 seconds, while the 1211 run was 275.003.
(Reporter)

Comment 17

17 years ago
Created attachment 62416 [details]
Inclusive Sync Wait (331)

Correction.  The only counts I've discovered are sync wait counts.
(Reporter)

Comment 18

17 years ago
Created attachment 62417 [details]
Inclusive Sync Wait (1211)
(Reporter)

Comment 19

17 years ago
Created attachment 62418 [details]
pthread_mutex_lock (331)
(Reporter)

Comment 20

17 years ago
Created attachment 62419 [details]
pthread_mutex_lock (1211)
(Assignee)

Updated

17 years ago
Target Milestone: --- → 3.4
(Reporter)

Comment 21

17 years ago
Created attachment 64528 [details]
Deleted Attachment

This comment has been deleted.
(Reporter)

Updated

17 years ago
Attachment #64528 - Attachment is obsolete: true
(Reporter)

Comment 22

17 years ago
Created attachment 64563 [details]
Bob's proposed course of action to address sessionLock contention

The report that I read (supplied by Sun) claimed that in one single SSL
connection, a single lock in PKCS11 was locked and released 400 times.
That number seems excessive.

Bob's proposed course of action (above) will certainly spread the
contention on that lock out over multiple locks, but I wonder: can
something be done to reduce the number of times that single lock is
locked and unlocked during a single SSL connection?
(Reporter)

Comment 24

17 years ago
> can something be done to reduce the number of times that single lock is
> locked and unlocked during a single SSL connection?

The sessionLock lives in PK11SlotStr and protects elements of PK11SlotStr
as well as the associated sessions.  I believe it's simply protecting too
much.  Bob's proposal would reduce the number of hits on this lock by
introducing 32 new locks to protect changes to PK11SessionStr.  Reducing
the use of the global sessionLock will certainly reduce contention.

We can use PR_AtomicIncrement() and friends when we know we have a 32-bit 
value to reduce the number of hits on the set of locks as a whole.

I'm wondering what the downside would be of simply having a slotLock in
PK11SlotStr and a sessionLock in PK11SessionStr.  I guess it's because
we're creating and destroying sessions often.  But this could be
addressed by maintaining a pool of sessionLocks.

I guess the tokObject hash table will be next.  Why try to protect these
with locks in PK11SlotStr?
Status: NEW → ASSIGNED

Comment 25

17 years ago
Spreading out the buckets will also help.  In a single SSL connection,
there are something like 8 sessions.  One other thing I was thinking
about with the lock buckets was how to spread them out: hash entries
0, 32, 64, 96, 128, etc. would be in one bucket, and 1, 33, 65, 97,
129, etc. in another, rather than grouping 0-31, 32-63, 64-95, etc.
This would reduce the acquisition of a single lock in SSL to something
under 50.  BTW, I suspect that the 400 number is more than just SSL
setup time.  It would be nice to see the SSL tap output.  I remember
cases where someone would be sending HTTP headers in multiple writes:
two writes per line, one with the line, the second of one byte.  Each
write (or read) would generate something on the order of 10 PKCS #11
calls.

As for Kirk's question, the reason the locks are in the slotstr is to
1) avoid the cost of creation and destruction of the locks, and 2)
they are protecting data in the slotstr (the hash table is part of the
slot structure).

Objects are created and destroyed with about the same or greater
frequency as sessions, but they aren't referenced as often.  Just
about every PKCS #11 call references a session.  About half reference
objects.
(Reporter)

Comment 26

17 years ago
> As for Kirk's question, the reason the locks are in the slotstr
> is to 1) avoid the cost of creation and destruction of the locks,
> and 2) they are protecting data in the slotstr (the hash table is
> part of the slot structure).

We can address (1) by pooling the locks.  I believe we should be
addressing (2) with a lock in PK11SlotStr, but fail to see the merit in
bucketing leaf (PK11SessionStr) protection there.

I'm countering your buckets approach to be certain we're pushing the
locks down as far as possible.  It's clear that the globalness is
what's killing us.  Pushing the sessionLock down into SessionStr
shifts the contention uniformly.  We wouldn't need to experiment with
bucket sizes and the like, and the code would be easier to understand
and maintain.

Since there are only a few slots, we could address (2) by allocating
a separate RWLock for slot->head (the session hash), so many readers
can traverse the list simultaneously.  If there's loads of creating
and destroying and not much referencing, bucketing may be necessary,
but this doesn't change my opinion about protecting the elements of
SessionStr with a lock in SessionStr.

Comment 27

17 years ago
> We can address (1) by pooling the locks.  I believe we should be
> addressing (2) with a lock in PK11SlotStr, but fail to see the merit in
> bucketing leaf (PK11SessionStr) protection there.

> I'm countering your buckets approach to be certain we're pushing the
> locks down as far as possible.  It's clear that the globalness is
> what's killing us.  Pushing the sessionLock down into SessionStr
> shifts the contention uniformly.  We wouldn't need to experiment with
> bucket sizes and the like, and the code would be easier to understand
> and maintain.

I think you misunderstand what is being protected: it's the hash tables
themselves, not the sessions.  All sessions in a particular hash table
*MUST* share the lock.  The lock is not protecting any session data.
The only exception would be locking the reference count.  Currently the
only place where the lock is acquired just to protect the session itself
is on session destruction.  To remove this you need to add a per-session
lock to handle the reference count.  You still need the bucket lock to
protect the hash table.  This adds 1) lock creation/destruction for
every session, and 2) additional lock calls for every SSL connection
(you have to lock the reference count pretty much every time you lock
the hash buckets in normal operation).

This increases the time you hold the hash bucket lock, increasing the chance of
contention.

We may find we want to do this, but this would be in addition to creating the
hash bucket lock.

> Since there are only a few slots, we could address (2) by allocating
> a separate RWLock for slot->head (the session hash), so many readers
> can traverse the list simultaneously.  If there's loads of creating
> and destroying and not much referencing, bucketing may be necessary,
> but this doesn't change my opinion about protecting the elements of
> SessionStr with a lock in SessionStr.

I thought about R/W locks.  Unfortunately there *are* loads of creations
and destructions of sessions.  R/W locks will cause the same or worse
issues than we have now.


(Reporter)

Comment 28

17 years ago
See bug 121523 for further info on the sessionLock.
The sessionLock changes attempt to address a scaling problem
that was present in 3.3.1.

This bug is for the performance drop attributable to changes 
between 3.3.1 and 3.4.

Bob believes he can regain a portion of the 7.4% drop.
I will attempt to test his conjecture that the drop is
attributable to softoken by benchmarking iWS with 
crypto hardware support.
(Assignee)

Updated

17 years ago
Priority: -- → P1
Whiteboard: This is P1 and targeted NSS 3.4.1.
(Assignee)

Updated

17 years ago
Target Milestone: 3.4 → ---
(Assignee)

Comment 29

17 years ago
Set target milestone to 3.4.1.
Target Milestone: --- → 3.4.1
(Assignee)

Comment 30

17 years ago
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
(Assignee)

Comment 31

17 years ago
Set target milestone to NSS 3.5.
Target Milestone: 3.4.1 → 3.5
(Assignee)

Comment 32

17 years ago
Moved to 3.6.
Whiteboard: This is P1 and targeted NSS 3.4.1.
Target Milestone: 3.5 → 3.6
(Assignee)

Comment 33

16 years ago
Moved to Future for now.  Will consider this
for NSS 3.7.
Target Milestone: 3.6 → Future
(Assignee)

Updated

16 years ago
Target Milestone: Future → 3.7
(Assignee)

Comment 34

16 years ago
Moved to target milestone 3.8 because the original
NSS 3.7 release has been renamed 3.8.
Target Milestone: 3.7 → 3.8
(Reporter)

Comment 35

16 years ago
Created attachment 116227 [details]
Selfserv stress comparison of 331, 342, and 372

Benchmarking iWS has become a lost art that went with Julien;
it'd likely take me several days to configure the harness if I
had the hardware.  Unfortunately, I lost root access to soupnazi,
where we installed the Deimos card earlier.  I'm also down to
5 client machines on soupnazi's subnet.  So following through
with Bob's proposal isn't feasible now.

I have located a 4-cpu machine I can hit from 15 clients:

SunOS dsame2 5.8 Generic_108528-16 sun4u sparc SUNW,Ultra-4
Sun Microsystems  sun4u Sun Enterprise 450 (4 X UltraSPARC-II 480MHz)
System clock frequency: 96 MHz	Memory size: 4096 Megabytes

So I decided to see where we stand, by comparing:

    NSS 3.3.1 nspr20/v4.2.2
    NSS 3.4.2 nspr20/v4.1.2
    NSS 3.7.2 nspr20/v4.1.2

I needed to add -L to the 3.3.1 and 3.4.2 selfserv (bug 107777), and
the zone allocator to NSPR 4.1.2 (bug 97485), to gather all runs with
NSPR_USE_ZONE_ALLOCATOR=1.

I've attached the selfserv stress results.  The performance
drop we saw last year (between 331 and 34) is not apparent
between 331 and 342.

                               restart   full
2001-1128-1239 3.3.1-zones     585       228
2003-0304-1352 3.3.1-zones     588       230

So soupnazi and dsame2 delivered similar throughput at 331.
The 342 results fall about midway between 331 and 372, and:

NSS 3.7.2 is 4.80% faster than NSS 3.3.1 on restarts.
NSS 3.7.2 is 3.70% faster than NSS 3.3.1 on full handshakes.

It's clear from these results that there's nothing to be gained by
any further investigation of the differences between 331 and 34.

Closing.
(Reporter)

Comment 36

16 years ago
Benchmark results indicate previous runs were inconclusive,
or changes occurred between 11/03/01 and 3.4.2 which addressed
the performance drop.

Closing.
Status: ASSIGNED → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → FIXED