Closed
Bug 116255
Opened 23 years ago
Closed 22 years ago
selfserv performance is down 7.4% relative to NSS 3.3.1
Categories
(NSS :: Libraries, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
3.8
People
(Reporter: kirk.erickson, Assigned: wtc)
Details
Attachments
(21 files, 1 obsolete file)
After resolving bug 107777 (to get selfserv to log ops/sec) and hitting the
server from multiple clients, stress comparisons of the tip against 3.3.1
revealed a performance drop. This bug was opened to collect the benchmark and
profiling data for reference.
2001-1128-1239 benchmarked iws-perf
restart full
(6 clients) (4 clients)
----------- -----------
3.3.1 210 (6%) 164
1103 220 (8%) 168
1127 221 (6%) 164
3.3.1-zones 585 (3%) 228
1103-zones 557 (3%) 224
1127-zones 536 (3%) 221
Previous results:
-----------------
3.3.1-zones 585 226
1120-zones 546 220
10% slowdown on restarts
3% slowdown on fulls
1103 was just prior to checkin of the major set of changes for 3.4.
Reporter
Comment 1•23 years ago
Reporter
Comment 2•23 years ago
The following series of attachments is apples-to-apples profile
data for 3.3.1 and the tip as of 1211. These are all with synctrace
turned off.
The execution summaries reveal the performance drop:
User CPU: - .5%
System CPU: - .2%
Wait CPU: - .5%
User Lock: -1.2%
Additional analyzer output from runs with synctrace on will follow to
address the User Lock (attributed synchronization wait time).
Reporter
Comment 3•23 years ago
Reporter
Comment 4•23 years ago
Reporter
Comment 5•23 years ago
Reporter
Updated•23 years ago
Attachment #62389 - Attachment description: ca → ops/sec, command lines and runtime info (331)
Reporter
Comment 6•23 years ago
Sorry Nelson. I wanted to show counts, but I don't see a way from the main
menu. I believe when we look at callers/callees we will be able to show counts
too.
Reporter
Comment 7•23 years ago
Reporter
Comment 8•23 years ago
Sorry, there's no way to save the caller/callee windows as text.
Shifting to capturing with xwd and converting to png.
Hitting the main routines we identified in the meeting yesterday
(see Exclusive CPU top dogs).
Still no sign of a counts toggle...
Reporter
Comment 9•23 years ago
Reporter
Comment 10•23 years ago
Reporter
Comment 11•23 years ago
Reporter
Comment 12•23 years ago
memcpy is the guy we're after.
Its callers are MD5_Update and MD5_End.
Reporter
Comment 13•23 years ago
Reporter
Comment 14•23 years ago
Reporter
Comment 15•23 years ago
Reporter
Comment 16•23 years ago
I realize now that we need to ask for counts when collecting.
Apparently, I had this turned off for the above runs.
What follows may help (I'll include counts) for the runs
where we had synctrace turned on. Note that unlike the runs above,
where the durations under 3.3.1 and 1211 were within a second of
each other, this pair differed by more, although the server was
stopped within a few seconds of the same time. The duration for the 331 run
was 251.106 seconds, while the 1211 run was 275.003.
Reporter
Comment 17•23 years ago
Correction. The only counts I've discovered are sync wait counts.
Reporter
Comment 18•23 years ago
Reporter
Comment 19•23 years ago
Reporter
Comment 20•23 years ago
Assignee
Updated•23 years ago
Target Milestone: --- → 3.4
Reporter
Comment 21•23 years ago
This comment has been deleted.
Reporter
Updated•23 years ago
Attachment #64528 - Attachment is obsolete: true
Reporter
Comment 22•23 years ago
Comment 23•23 years ago
The report that I read (supplied by Sun) claimed that in one single SSL
connection, a single lock in PKCS11 was locked and released 400 times.
That number seems excessive.
Bob's proposed course of action (above) will certainly spread the
contention on that lock out over multiple locks, but I wonder: can
something be done to reduce the number of times that single lock is
locked and unlocked during a single SSL connection?
Reporter
Comment 24•23 years ago
> can something be done to reduce the number of times that single lock is
> locked and unlocked during a single SSL connection?
The sessionLock lives in PK11SlotStr and protects elements of PK11SlotStr
as well as the associated sessions. I believe it's simply protecting too much.
Bob's proposal would reduce the number of hits on this lock by introducing
32 new locks to protect changes to PK11SessionStr. Reducing the use of the
global sessionLock will certainly reduce contention.
We can use PR_AtomicIncrement() and friends when we know we have a 32-bit
value to reduce the number of hits on the set of locks as a whole.
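For illustration only, a minimal sketch of keeping a session reference
count with NSPR atomics instead of taking the slot-wide sessionLock
(the structure and function names here are hypothetical, not the actual
softoken code):

  #include "pratom.h"   /* PR_AtomicIncrement / PR_AtomicDecrement */

  /* Hypothetical per-session structure; only the refcount matters here. */
  typedef struct ExampleSessionStr {
      PRInt32 refCount;          /* 32-bit, so it can be updated atomically */
      /* ... other session fields ... */
  } ExampleSession;

  /* Take a reference without touching the slot-wide sessionLock. */
  static void
  example_session_addref(ExampleSession *session)
  {
      PR_AtomicIncrement(&session->refCount);
  }

  /* Drop a reference; returns PR_TRUE when the caller should free the session. */
  static PRBool
  example_session_release(ExampleSession *session)
  {
      return (PR_AtomicDecrement(&session->refCount) == 0) ? PR_TRUE : PR_FALSE;
  }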
I'm wondering what the downside would be of simply having a slotLock in
PK11SlotStr and a sessionLock in PK11SessionStr. I guess it's because we're
creating and destroying sessions often, but this could be addressed by
maintaining a pool of sessionLocks.
I guess the tokObject hash table will be next. Why try to protect these
with locks in PK11SlotStr?
Status: NEW → ASSIGNED
Comment 25•23 years ago
Spreading out the buckets will also help. In a single SSL connection, there are
something like 8 sessions. One other thing I was thinking about with the lock
buckets was to spread them out, so hash entries 0, 32, 64, 96, 128, etc. would be
in one bucket and 1, 33, 65, 97, 129, etc. would be in another, rather than grouping
0-31, 32-63, 64-95, etc. This would reduce the acquisition of a single lock in
SSL to something under 50. BTW, I suspect that the 400 number is more than just
SSL setup time. It would be nice to see the SSL tap output. I remember cases
where someone would be sending HTTP headers in multiple writes: two writes per
line, one with the line, the second of one byte. Each write (or read) would
generate something on the order of 10 PKCS #11 calls.
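A minimal sketch of the interleaved bucket-to-lock mapping described above
(the lock count of 32 and all names are illustrative, not the actual
softoken code):

  #include "prlock.h"

  #define EXAMPLE_NUM_SESSION_LOCKS 32   /* illustrative bucket-lock count */

  static PRLock *exampleSessionLocks[EXAMPLE_NUM_SESSION_LOCKS];

  /* Interleaved mapping: hash entries 0, 32, 64, ... share lock 0,
   * entries 1, 33, 65, ... share lock 1, and so on.  Consecutive hash
   * entries land on different locks, instead of grouping entries 0-31
   * under a single lock. */
  static PRLock *
  example_lock_for_hash_entry(unsigned int hashIndex)
  {
      return exampleSessionLocks[hashIndex % EXAMPLE_NUM_SESSION_LOCKS];
  }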
As for Kirk's question, the reason the locks are in the slotstr is 1) to
avoid the cost of creation and destruction of the locks, and 2) they are
protecting data in the slotstr (the hash table is part of the slot structure).
Objects are created and destroyed with about the same or greater frequency as
sessions, but they aren't referenced as often. Just about every PKCS #11 call
references a session. About half reference objects.
Reporter
Comment 26•23 years ago
> As for Kirk's question, the reason the locks are in the slotstr is 1) to
> avoid the cost of creation and destruction of the locks, and 2) they are
> protecting data in the slotstr (the hash table is part of the slot
> structure).
We can address (1) by pooling the locks. I believe we should be
addressing (2) with a lock in PK11SlotStr, but fail to see the merit in
bucketing leaf (PK11SessionStr) protection there.
I'm countering your buckets approach to be certain we're pushing the
locks down as far as possible. It's clear that the globalness is what's
killing us. Pushing the sessionLock down into SessionStr shifts the
contention uniformly. We wouldn't need to experiment with bucket sizes
and the like, and the code would be easier to understand and maintain.
Since there are only a few slots, we could address (2) by allocating
a separate RWLock for slot->head (the session hash), so many readers
can traverse the list simultaneously. If there's loads of creating
and destroying and not much referencing, bucketing may be necessary,
but this doesn't change my opinion about protecting the elements of
SessionStr with a lock in SessionStr.
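For reference, a reader/writer lock over the session hash could look roughly
like the following with NSPR's PRRWLock (all names are hypothetical; this is
a sketch of the idea, which Bob argues against below, not a proposed patch):

  #include "prrwlock.h"

  /* Hypothetical: one rwlock per slot guarding slot->head (the session hash). */
  static PRRWLock *exampleSessionHashLock;

  static void
  example_init(void)
  {
      exampleSessionHashLock = PR_NewRWLock(PR_RWLOCK_RANK_NONE, "session hash");
  }

  /* Lookups take the read lock, so many readers can walk the hash at once. */
  static void
  example_find_session(void)
  {
      PR_RWLock_Rlock(exampleSessionHashLock);
      /* ... traverse the session hash ... */
      PR_RWLock_Unlock(exampleSessionHashLock);
  }

  /* Session creation and destruction take the write lock. */
  static void
  example_add_or_remove_session(void)
  {
      PR_RWLock_Wlock(exampleSessionHashLock);
      /* ... insert or remove a session ... */
      PR_RWLock_Unlock(exampleSessionHashLock);
  }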
Comment 27•23 years ago
> We can address (1) by pooling the locks. I believe we should be
> addressing (2) with a lock in PK11SlotStr, but fail to see the merit in
> bucketing leaf (PK11SessionStr) protection there.
> I'm countering your buckets approach to be certain we're pushing the
> locks down as far as possible. It's clear that the globalness is what's
> killing us. Pushing the sessionLock down into SessionStr shifts the
> contention uniformly. We wouldn't need to experiment with bucket sizes
> and the like, and the code would be easier to understand and maintain.
I think you misunderstand what is being protected: it's the hash tables
themselves, not the session. All sessions in a particular hash table
*MUST* share the lock. The lock is not protecting any session data.
The only exception would be locking the reference count. Currently the only
place where the lock is acquired just to protect the session itself is on session
destruction. To remove this you need to add a per-session lock to handle the
reference count. You still need the bucket lock to protect the hash table. This
adds 1) lock creation/destruction for every session, and 2) additional lock calls for
every SSL connection (you have to lock the reference count pretty much every time you
lock the hash buckets in normal operation).
This increases the time you hold the hash bucket lock, increasing the chance of
contention.
We may find we want to do this, but this would be in addition to creating the
hash bucket lock.
> Since there are only a few slots, we could address (2) by allocating
> a separate RWLock for slot->head (the session hash), so many readers
> can traverse the list simultaneously. If there's loads of creating
> and destroying and not much referencing, bucketing may be necessary,
> but this doesn't change my opinion about protecting the elements of
> SessionStr with a lock in SessionStr.
I thought about R/W locks. Unfortunately there *are* loads of creations and
destructions of sessions. R/W locks will cause the same or worse issues than we
have now.
Reporter
Comment 28•23 years ago
See bug 121523 for further info on the sessionLock.
The sessionLock changes attempt to address a scaling problem
that was present in 3.3.1.
This bug is for the performance drop attributable to changes
between 3.3.1 and 3.4.
Bob believes he can regain a portion of the 7.4% drop.
I will attempt to test his conjecture that the drop is
attributable to softoken by benchmarking iWS with
crypto hardware support.
Assignee
Updated•23 years ago
Priority: -- → P1
Whiteboard: This is P1 and targeted NSS 3.4.1.
Assignee
Updated•23 years ago
Target Milestone: 3.4 → ---
Assignee
Comment 30•23 years ago
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
Assignee
Comment 32•23 years ago
Moved to 3.6.
Whiteboard: This is P1 and targeted NSS 3.4.1.
Target Milestone: 3.5 → 3.6
Assignee
Comment 33•22 years ago
Moved to Future for now. Will consider this
for NSS 3.7.
Target Milestone: 3.6 → Future
Assignee
Updated•22 years ago
Target Milestone: Future → 3.7
Assignee
Comment 34•22 years ago
Moved to target milestone 3.8 because the original
NSS 3.7 release has been renamed 3.8.
Target Milestone: 3.7 → 3.8
Reporter
Comment 35•22 years ago
Benchmarking iWS has become a lost art that went with Julien;
it'd likely take me several days to configure the harness if I
had the hardware. Unfortunately, I lost root access to soupnazi,
where we installed the Deimos card earlier. I'm also down to
5 client machines on soupnazi's subnet. So following through
with Bob's proposal isn't feasible now.
I have located a 4-cpu machine I can hit from 15 clients:
SunOS dsame2 5.8 Generic_108528-16 sun4u sparc SUNW,Ultra-4
Sun Microsystems sun4u Sun Enterprise 450 (4 X UltraSPARC-II 480MHz)
System clock frequency: 96 MHz Memory size: 4096 Megabytes
So I decided to see where we stand, by comparing:
NSS 3.3.1 nspr20/v4.2.2
NSS 3.4.2 nspr20/v4.1.2
NSS 3.7.2 nspr20/v4.1.2
I needed to add -L to the 3.3.1 and 3.4.2 selfserv (bug 107777), and the
zone allocator to NSPR 4.1.2 (bug 97485), to gather all runs with
NSPR_USE_ZONE_ALLOCATOR=1.
I've attached the selfserv stress results. The performance
drop we saw last year (between 331 and 34) is not apparent
between 331 and 342.
2001-1128-1239 3.3.1-zones 585 228
2003-0304-1352 3.3.1-zones 588 230
So soupnazi and dsame2 delivered similar throughput at 331.
The 342 results fall about midway between 331 and 372, and:
NSS 3.7.2 is 4.80% faster than NSS 3.3.1 on restarts.
NSS 3.7.2 is 3.70% faster than NSS 3.3.1 on full handshakes.
It's clear from these results that there's nothing to be gained by
any further investigation of the differences between 331 and 34.
Closing.
Reporter
Comment 36•22 years ago
Benchmark results indicate the previous runs were inconclusive,
or that changes occurred between 11/03/01 and 3.4.2 which addressed
the performance drop.
Closing.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED