Closed Bug 116255 Opened 23 years ago Closed 22 years ago

selfserv performance is down 7.4% relative to NSS 3.3.1

Categories

(NSS :: Libraries, defect, P1)

Sun
Solaris
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kirk.erickson, Assigned: wtc)

Details

Attachments

(21 files, 1 obsolete file)

8.43 KB, text/plain
3.52 KB, text/plain
3.58 KB, text/plain
2.38 KB, text/plain
2.35 KB, text/plain
28.95 KB, text/plain
32.51 KB, text/plain
5.74 KB, image/png
5.75 KB, image/png
12.79 KB, image/png
12.72 KB, image/png
5.29 KB, image/png
5.39 KB, image/png
3.39 KB, image/png
3.60 KB, image/png
16.53 KB, image/png
15.97 KB, image/png
5.86 KB, image/png
5.44 KB, image/png
3.49 KB, text/plain
2.40 KB, text/html
After resolving bug 107777 (to get selfserv to log ops/sec) and hitting it from multiple clients, comparing stress results for the tip against 3.3.1 revealed a performance drop. This bug was opened to collect the benchmark and profiling data for reference.

2001-1128-1239 benchmarked iws-perf:

              restart      full
              (6 clients)  (4 clients)
              -----------  -----------
3.3.1         210 (6%)     164
1103          220 (8%)     168
1127          221 (6%)     164
3.3.1-zones   585 (3%)     228
1103-zones    557 (3%)     224
1127-zones    536 (3%)     221

Previous results:
-----------------
3.3.1-zones   585          226
1120-zones    546          220

10% slowdown on restarts
 3% slowdown on fulls

1103 was just prior to the checkin of the major set of changes for 3.4.
The following series of attachments is apples-to-apples profile data for 3.3.1 and the tip as of 1211, all gathered with synctrace turned off. The execution summaries reveal the performance drop:

  User CPU:   -0.5%
  System CPU: -0.2%
  Wait CPU:   -0.5%
  User Lock:  -1.2%

Additional analyzer output from runs with synctrace on will follow, to address the User Lock figure (attributed synchronization wait time).
Attachment #62389 - Attachment description: ca → ops/sec, command lines and runtime info (331)
Attached file Exclusive User CPU
Sorry Nelson. I wanted to show counts, but I don't see a way to do that from the main menu. I believe that when we look at callers/callees we'll be able to show counts too.
Sorry, there's no way to save the caller/callee windows as text. Shifting to capturing with xwd and converting to png. Hitting the main routines we identified in the meeting yesterday (see Exclusive CPU top dogs). Still no sign of a counts toggle...
memcpy is the guy we're after. It is called by MD5_Update and MD5_End.
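For context on why memcpy dominates under MD5_Update, here is a rough sketch of the usual streaming-hash update pattern. This is not the actual freebl implementation; the names (HypMD5Context, hyp_md5_update, hyp_md5_compress) are made up for illustration. Input that does not fill a complete 64-byte block is memcpy'd into a context buffer and carried over to the next call, which is the kind of traffic a profiler attributes to MD5_Update.

    #include <string.h>

    #define HYP_MD5_BLOCK_LENGTH 64

    /* Illustrative context only; the real MD5Context in freebl differs. */
    typedef struct {
        unsigned char buf[HYP_MD5_BLOCK_LENGTH];  /* leftover partial block */
        unsigned int  bufLen;                     /* bytes currently buffered */
        /* ... chaining variables, bit count, etc. ... */
    } HypMD5Context;

    static void
    hyp_md5_compress(HypMD5Context *cx, const unsigned char *block)
    {
        /* A real implementation mixes |block| into the chaining variables. */
        (void)cx;
        (void)block;
    }

    /* Typical streaming-update shape: buffer partial input, compress only
     * complete 64-byte blocks.  The memcpy calls are the traffic the
     * profiler attributes to MD5_Update. */
    static void
    hyp_md5_update(HypMD5Context *cx, const unsigned char *in, unsigned int len)
    {
        if (cx->bufLen != 0) {
            unsigned int fill = HYP_MD5_BLOCK_LENGTH - cx->bufLen;
            if (fill > len)
                fill = len;
            memcpy(cx->buf + cx->bufLen, in, fill);   /* top up partial block */
            cx->bufLen += fill;
            in += fill;
            len -= fill;
            if (cx->bufLen == HYP_MD5_BLOCK_LENGTH) {
                hyp_md5_compress(cx, cx->buf);
                cx->bufLen = 0;
            }
        }
        while (len >= HYP_MD5_BLOCK_LENGTH) {          /* whole blocks, no copy */
            hyp_md5_compress(cx, in);
            in += HYP_MD5_BLOCK_LENGTH;
            len -= HYP_MD5_BLOCK_LENGTH;
        }
        if (len != 0) {
            memcpy(cx->buf, in, len);                  /* stash the remainder */
            cx->bufLen = len;
        }
    }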
I realize now that we need to ask for counts when collecting. Apparently I had this turned off for the above runs. What follows may help; I'll include counts for the runs where we had synctrace turned on. Note that unlike the runs above, where the durations under 3.3.1 and 1211 were within a second of each other, this pair differed by more, although the server was stopped within a few seconds of the same time. The duration of the 331 run was 251.106 seconds, while the 1211 run took 275.003 seconds.
Correction. The only counts I've discovered are sync wait counts.
Target Milestone: --- → 3.4
Attached file Deleted Attachment (obsolete) —
This comment has been deleted.
Attachment #64528 - Attachment is obsolete: true
The report that I read (supplied by Sun) claimed that in one single SSL connection, a single lock in PKCS #11 was locked and released 400 times. That number seems excessive. Bob's proposed course of action (above) will certainly spread the contention on that lock out over multiple locks, but I wonder, can something be done to reduce the number of times that single lock is locked and unlocked during a single SSL connection?
> can something be done to reduce the number of times that single lock is
> locked and unlocked during a single SSL connection?

The sessionLock lives in PK11SlotStr and protects elements of PK11SlotStr as well as the associated sessions. I believe it's simply protecting too much.

Bob's proposal would reduce the number of hits on this lock by introducing 32 new locks to protect changes to PK11SessionStr. Reducing the use of the global sessionLock will certainly reduce contention. We can use PR_AtomicIncrement() and friends when we know we have a 32-bit value, to reduce the number of hits on the set of locks as a whole.

I'm wondering what the downside is of simply having a slotLock in PK11SlotStr and a sessionLock in PK11SessionStr. I guess it's because we're creating and destroying sessions often, but this could be addressed by maintaining a pool of sessionLocks.

I guess the tokObject hash table will be next. Why try to protect these with locks in PK11SlotStr?
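As a minimal sketch of the PR_AtomicIncrement() idea for the 32-bit reference count, assuming a hypothetical session struct (this is not the real PK11SessionStr, and the helper names are made up):

    #include "nspr.h"

    /* Hypothetical session struct -- not the real PK11SessionStr layout. */
    typedef struct HypSessionStr {
        PRInt32 refCount;            /* touched only with NSPR atomic ops */
        /* ... other per-session state ... */
    } HypSession;

    /* Take a reference without acquiring the slot-wide sessionLock. */
    static void
    hyp_session_addref(HypSession *sess)
    {
        PR_AtomicIncrement(&sess->refCount);
    }

    /* Drop a reference; PR_TRUE means the caller should destroy the session. */
    static PRBool
    hyp_session_release(HypSession *sess)
    {
        return (PR_AtomicDecrement(&sess->refCount) == 0) ? PR_TRUE : PR_FALSE;
    }

Destroying a session would still have to unlink it from the slot's hash table under whatever lock protects that table, which is the trade-off discussed below.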
Status: NEW → ASSIGNED
Spreading out the buckets will also help. In a single SSL connection, there are something like 8 sessions.

One other thing I was thinking about with the lock buckets was to spread them out, so hash entries 0, 32, 64, 96, 128, etc. would be in one bucket, and 1, 33, 65, 97, 129, etc. in another, rather than grouping 0-31, 32-63, 64-95, etc. This would reduce the acquisition of a single lock in SSL to something under 50.

BTW, I suspect that the 400 number is more than just SSL setup time. It would be nice to see the SSL tap output. I remember cases where someone would be sending HTTP headers in multiple writes: two writes per line, one with the line, the second of one byte. Each write (or read) would generate something on the order of 10 PKCS #11 calls.

As for Kirk's question, the reason the locks are in the slotstr is 1) to avoid the cost of creating and destroying the locks, and 2) they are protecting data in the slotstr (the hash table is part of the slot structure). Objects are created and destroyed with about the same or greater frequency as sessions, but they aren't referenced as often. Just about every PKCS #11 call references a session. About half reference objects.
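A sketch of the interleaved bucketing Bob describes, using NSPR locks; the bucket count and all names here are hypothetical:

    #include "nspr.h"

    #define HYP_SESSION_LOCK_COUNT 32        /* hypothetical bucket count */

    /* Assume these are created with PR_NewLock() at slot init time. */
    static PRLock *hypSessionLocks[HYP_SESSION_LOCK_COUNT];

    /* Interleaved: handles 0, 32, 64, ... share lock 0; handles 1, 33,
     * 65, ... share lock 1.  Consecutive handles spread across locks. */
    static PRLock *
    hyp_lock_for_handle(PRUint32 handle)
    {
        return hypSessionLocks[handle % HYP_SESSION_LOCK_COUNT];
    }

    /* Range-grouped alternative (0-31 in one bucket, 32-63 in the next):
     * consecutive handles from one connection all hit the same lock,
     * which is exactly what the interleaving above avoids. */
    static PRLock *
    hyp_lock_for_handle_grouped(PRUint32 handle)
    {
        return hypSessionLocks[(handle / HYP_SESSION_LOCK_COUNT)
                               % HYP_SESSION_LOCK_COUNT];
    }

With the interleaved mapping, the handful of consecutive session handles created by one SSL connection land on different locks instead of all contending for the same one.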
> As for Kirk's question, the reason the locks are in the slotstr is
> 1) to avoid the cost of creating and destroying the locks, and
> 2) they are protecting data in the slotstr (the hash table is part
> of the slot structure).

We can address (1) by pooling the locks. I believe we should be addressing (2) with a lock in PK11SlotStr, but I fail to see the merit in bucketing leaf (PK11SessionStr) protection there.

I'm countering your buckets approach to be certain we're pushing the locks down as far as possible. It's clear that the globalness is what's killing us. Pushing the sessionLock down into SessionStr shifts the contention uniformly. We wouldn't need to experiment with bucket sizes and the like, and the code would be easier to understand and maintain.

Since there are only a few slots, we could address (2) by allocating a separate RWLock for slot->head (the session hash), so many readers can traverse the list simultaneously. If there's loads of creating and destroying and not much referencing, bucketing may be necessary, but this doesn't change my opinion about protecting the elements of SessionStr with a lock in SessionStr.
> We can address (1) by pooling the locks. I believe we should be
> addressing (2) with a lock in PK11SlotStr, but I fail to see the merit
> in bucketing leaf (PK11SessionStr) protection there.
>
> I'm countering your buckets approach to be certain we're pushing the
> locks down as far as possible. It's clear that the globalness is
> what's killing us. Pushing the sessionLock down into SessionStr shifts
> the contention uniformly. We wouldn't need to experiment with bucket
> sizes and the like, and the code would be easier to understand and
> maintain.

I think you misunderstand what is being protected: it's the hash tables themselves, not the sessions. All sessions in a particular hash table *MUST* share the lock. The lock is not protecting any session data. The only exception would be locking the reference count.

Currently the only place where the lock is acquired just to protect the reference count is on session destruction. To remove this you need to add a per-session lock to handle the reference count. You still need the bucket lock to protect the hash table. This adds 1) lock creation/destruction every session, and 2) additional lock calls every SSL connection (you have to lock the reference count pretty much every time you lock the hash buckets in normal operation). This increases the time you hold the hash bucket lock, increasing the chance of contention. We may find we want to do this, but it would be in addition to creating the hash bucket lock.

> Since there are only a few slots, we could address (2) by allocating
> a separate RWLock for slot->head (the session hash), so many readers
> can traverse the list simultaneously. If there's loads of creating
> and destroying and not much referencing, bucketing may be necessary,
> but this doesn't change my opinion about protecting the elements of
> SessionStr with a lock in SessionStr.

I thought about R/W locks. Unfortunately, there *are* loads of creations and destructions of sessions. R/W locks would cause the same or worse issues than we have now.
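To illustrate why the bucket lock also covers the reference count, a rough sketch of a lookup path (purely illustrative names, not the real softoken code): the same lock that protects the hash chain makes the refcount bump safe, so moving the refcount under a per-session lock would add a second lock acquisition on this path.

    #include "nspr.h"

    #define HYP_BUCKET_COUNT 32              /* illustrative only */

    typedef struct HypSessionStr {
        struct HypSessionStr *next;          /* hash-chain link */
        PRUint32              handle;
        PRInt32               refCount;
    } HypSession;

    /* Assume the table and its locks are set up at slot init (PR_NewLock). */
    static HypSession *hypHashTable[HYP_BUCKET_COUNT];
    static PRLock     *hypBucketLocks[HYP_BUCKET_COUNT];

    static HypSession *
    hyp_find_session(PRUint32 handle)
    {
        PRUint32    bucket = handle % HYP_BUCKET_COUNT;
        HypSession *sess;

        PR_Lock(hypBucketLocks[bucket]);       /* protects the chain walk */
        for (sess = hypHashTable[bucket]; sess != NULL; sess = sess->next) {
            if (sess->handle == handle) {
                sess->refCount++;              /* safe: same bucket lock */
                break;
            }
        }
        PR_Unlock(hypBucketLocks[bucket]);
        return sess;                           /* NULL if not found */
    }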
See bug 121523 for further info on the sessionLock. The sessionLock changes attempt to address a scaling problem that was present in 3.3.1. This bug is for the performance drop attributable to changes between 3.3.1 and 3.4. Bob believes he can regain a portion of the 7.4% drop. I will attempt to test his conjecture that the drop is attributable to softoken by benchmarking iWS with crypto hardware support.
Priority: -- → P1
Whiteboard: This is P1 and targeted NSS 3.4.1.
Target Milestone: 3.4 → ---
Set target milestone to 3.4.
Target Milestone: --- → 3.4.1
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
Set target milestone to NSS 3.5.
Target Milestone: 3.4.1 → 3.5
Moved to 3.6.
Whiteboard: This is P1 and targeted NSS 3.4.1.
Target Milestone: 3.5 → 3.6
Moved to Future for now. Will consider this for NSS 3.7.
Target Milestone: 3.6 → Future
Target Milestone: Future → 3.7
Moved to target milestone 3.8 because the original NSS 3.7 release has been renamed 3.8.
Target Milestone: 3.7 → 3.8
Benchmarking iWS has become a lost art that went with Julien; it'd likely take me several days to configure the harness even if I had the hardware. Unfortunately, I lost root access to soupnazi, where we installed the Deimos card earlier, and I'm also down to 5 client machines on soupnazi's subnet, so following through with Bob's proposal isn't feasible now.

I have located a 4-CPU machine I can hit from 15 clients:

  SunOS dsame2 5.8 Generic_108528-16 sun4u sparc SUNW,Ultra-4
  Sun Microsystems sun4u Sun Enterprise 450 (4 X UltraSPARC-II 480MHz)
  System clock frequency: 96 MHz
  Memory size: 4096 Megabytes

So I decided to see where we stand by comparing:

  NSS 3.3.1   nspr20/v4.2.2
  NSS 3.4.2   nspr20/v4.1.2
  NSS 3.7.2   nspr20/v4.1.2

I needed to add -L to the 3.3.1 and 3.4.2 selfserv (bug 107777), and the zone allocator to NSPR 4.1.2 (bug 97485), so that all runs could be gathered with NSPR_USE_ZONE_ALLOCATOR=1. I've attached the selfserv stress results.

The performance drop we saw last year (between 331 and 34) is not apparent between 331 and 342:

  2001-1128-1239   3.3.1-zones   585   228
  2003-0304-1352   3.3.1-zones   588   230

So soupnazi and dsame2 delivered similar throughput at 331. The 342 results fall about midway between 331 and 372, and:

  NSS 3.7.2 is 4.80% faster than NSS 3.3.1 on restarts.
  NSS 3.7.2 is 3.70% faster than NSS 3.3.1 on full handshakes.

It's clear from these results that there's nothing to be gained by any further investigation of the differences between 331 and 34. Closing.
Benchmark results indicate that the previous runs were inconclusive, or that changes occurred between 11/03/01 and 3.4.2 which addressed the performance drop. Closing.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED