selfserv performance is down 7.4% relative to NSS 3.3.1

RESOLVED FIXED in 3.8

Status

Priority: P1
Severity: normal
Status: RESOLVED FIXED
Opened: 17 years ago
Closed: 16 years ago

People

(Reporter: kirk.erickson, Assigned: wtc)

Tracking

Sun
Solaris

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(21 attachments, 1 obsolete attachment)

(Reporter)

Description

17 years ago
After resolving bug 107777 (to get selfserv to log ops/sec) and hitting
it from multiple clients, comparisons of the tip and 3.3.1 stress runs
revealed a performance drop.  This bug was opened to collect the
benchmark and profiling data for reference.

2001-1128-1239 benchmarked iws-perf

                restart         full
                (6 clients)     (4 clients)
                -----------     -----------
3.3.1           210 (6%)        164
1103            220 (8%)        168
1127            221 (6%)        164

3.3.1-zones     585 (3%)        228
1103-zones      557 (3%)        224
1127-zones      536 (3%)        221

Previous results:
-----------------
3.3.1-zones     585             226
1120-zones      546             220

10% slowdown on restarts
 3% slowdown on fulls

1103 was just prior to checkin of the major set of changes for 3.4.
(Reporter)

Comment 1

17 years ago
Created attachment 62382 [details]
Notes from soupnazi runs
(Reporter)

Comment 2

17 years ago
Created attachment 62385 [details]
Execution Summary for 3.3.1

The following series of attachments are apples-to-apples profile
data for 3.3.1 and the tip as of 1211.  These are all with synctrace
turned off.

The execution summaries reveal the performance drop:
   User CPU:    -0.5%
   System CPU:  -0.2%
   Wait CPU:    -0.5%
   User Lock:   -1.2%

Additional analyzer output from runs with synctrace on will follow to
address the User Lock (attributed synchronization wait time).
(Reporter)

Comment 3

17 years ago
Created attachment 62387 [details]
Execution Summary for the tip as of Dec 11, 2001 (3.4 Beta)
(Reporter)

Comment 4

17 years ago
Created attachment 62389 [details]
ops/sec, command lines and runtime info (331)
(Reporter)

Comment 5

17 years ago
Created attachment 62390 [details]
ops/sec, command lines and runtime info
(Reporter)

Updated

17 years ago
Attachment #62389 - Attachment description: ca → ops/sec, command lines and runtime info (331)
(Reporter)

Comment 6

17 years ago
Created attachment 62391 [details]
Exclusive User CPU

Sorry Nelson.  I wanted to show counts, but don't see a way from the main
menu.  I believe when we look at callers/callees we'll be able to show
counts too.
(Reporter)

Comment 7

17 years ago
Created attachment 62392 [details]
Exclusive User CPU (1211)
(Reporter)

Comment 8

17 years ago
Created attachment 62395 [details]
shaCompress Callers/Callees (331)

Sorry, there's no way to save the caller/callee windows in text.
Shifting to capturing with xwd and converting to png.
Hitting the main routines we identified in the meeting yesterday
(see Exclusive CPU top dogs).

Still no sign of counts toggle...
(Reporter)

Comment 9

17 years ago
Created attachment 62397 [details]
shaCompress callers/callees (1211)
(Reporter)

Comment 10

17 years ago
Created attachment 62403 [details]
memcpy callers/callees (331)
(Reporter)

Comment 11

17 years ago
Created attachment 62404 [details]
memcpy callers/callees (1211)
(Reporter)

Comment 12

17 years ago
Created attachment 62406 [details]
MD5_Update callers/callees (331)

memcpy is the guy we're after.
It calls MD5_Update and MD5_End.
(Reporter)

Comment 13

17 years ago
Created attachment 62412 [details]
MD5_Update callers/callees (1211)
(Reporter)

Comment 14

17 years ago
Created attachment 62413 [details]
MD5_End callers/callees (331)
(Reporter)

Comment 15

17 years ago
Created attachment 62414 [details]
MD5_End callers/callees (1211)
(Reporter)

Comment 16

17 years ago
I realize now we need to ask for counts when collecting.
Apparently, I had this turned off for the above runs.

What follows (with counts included) may help for the runs
where we had synctrace turned on.  Note that unlike the runs above,
where the durations under 3.3.1 and 1211 were within a second of
each other, this pair differed by more, although the server was
stopped within a few seconds of the same time.  The duration for
the 331 run was 251.106 seconds, while the 1211 run was 275.003.
(Reporter)

Comment 17

17 years ago
Created attachment 62416 [details]
Inclusive Sync Wait (331)

Correction.  The only counts I've discovered are sync wait counts.
(Reporter)

Comment 18

17 years ago
Created attachment 62417 [details]
Inclusive Sync Wait (1211)
(Reporter)

Comment 19

17 years ago
Created attachment 62418 [details]
pthread_mutex_lock (331)
(Reporter)

Comment 20

17 years ago
Created attachment 62419 [details]
pthread_mutex_lock (1211)
(Assignee)

Updated

17 years ago
Target Milestone: --- → 3.4
(Reporter)

Comment 21

17 years ago
Created attachment 64528 [details]
Deleted Attachment

This comment has been deleted.
(Reporter)

Updated

17 years ago
Attachment #64528 - Attachment is obsolete: true
(Reporter)

Comment 22

17 years ago
Created attachment 64563 [details]
Bob's proposed course of action to address sessionLock contention

The report that I read (supplied by Sun) claimed that in one single SSL
connection, a single lock in PKCS11 was locked and released 400 times.
That number seems excessive.

Bob's proposed course of action (above) will certainly spread the
contention on that lock out over multiple locks, but I wonder: can
something be done to reduce the number of times that single lock is
locked and unlocked during a single SSL connection?
(Reporter)

Comment 24

17 years ago
> can something be done to reduce the number of times that single lock is
> locked and unlocked during a single SSL connection?

The sessionLock lives in PK11SlotStr and protects elements of PK11SlotStr
as well as the associated sessions.  I believe it's simply protecting too
much.  Bob's proposal would reduce the number of hits on this lock by
introducing 32 new locks to protect changes to PK11SessionStr.  Reducing
the use of the global sessionLock will certainly reduce contention.

We can use PR_AtomicIncrement() and friends when we know we have a 32-bit 
value to reduce the number of hits on the set of locks as a whole.

I'm wondering what the downside would be of simply having a slotLock in
PK11SlotStr and a sessionLock in PK11SessionStr.  I guess it's because
we're creating and destroying sessions often.  But this could be
addressed by maintaining a pool of sessionLocks.

I guess the tokObject hash table will be next.  Why try to protect these
with locks in PK11SlotStr?
Status: NEW → ASSIGNED

Comment 25

17 years ago
Spreading out the buckets will also help.  In a single SSL connection,
there are something like 8 sessions.  One other thing I was thinking
about with the lock buckets was how to spread them out: hash entries
0, 32, 64, 96, 128, etc. would be in one bucket, and 1, 33, 65, 97,
129, etc. in another, rather than grouping 0-31, 32-63, 64-95, etc.
This would reduce the acquisition of a single lock in SSL to something
under 50.  BTW, I suspect that the 400 number is more than just SSL
setup time.  It would be nice to see the SSL tap output.  I remember
cases where someone would be sending HTTP headers in multiple writes:
two writes per line, one with the line, the second of one byte.  Each
write (or read) would generate something on the order of 10 PKCS #11
calls.

As for Kirk's question, the reason the locks are in the slotstr is to
1) avoid the cost of creation and destruction of the locks, and 2)
they are protecting data in the slotstr (the hash table is part of the
slot structure).

Objects are created and destroyed with about the same or greater
frequency as sessions, but they aren't referenced as often.  Just
about every PKCS #11 call references a session.  About half reference
objects.
(Reporter)

Comment 26

17 years ago
> As for Kirk's question, the reason the locks are in the slotstr
> is to 1) avoid the cost of creation and destruction of the locks,
> and 2) they are protecting data in the slotstr (the hash table is
> part of the slot structure).

We can address (1) by pooling the locks.  I believe we should be
addressing (2) with a lock in PK11SlotStr, but fail to see the merit in
bucketing leaf (PK11SessionStr) protection there.

I'm countering your buckets approach to be certain we're pushing the
locks down as far as possible.  It's clear that the globalness is
what's killing us.  Pushing the sessionLock down into SessionStr
shifts the contention uniformly.  We wouldn't need to experiment with
bucket sizes and the like, and the code would be easier to understand
and maintain.

Since there are only a few slots, we could address (2) by allocating
a separate RWLock for slot->head (the session hash), so many readers
can traverse the list simultaneously.  If there's loads of creating
and destroying and not much referencing, bucketing may be necessary,
but this doesn't change my opinion about protecting the elements of
SessionStr with a lock in SessionStr.

Comment 27

17 years ago
> We can address (1) by pooling the locks.  I believe we should be
> addressing (2) with a lock in PK11SlotStr, but fail to see the merit in
> bucketing leaf (PK11SessionStr) protection there.

> I'm countering your buckets approach to be certain we're pushing the
> locks down as far as possible.  It's clear that the globalness is
> what's killing us.  Pushing the sessionLock down into SessionStr
> shifts the contention uniformly.  We wouldn't need to experiment with
> bucket sizes and the like, and the code would be easier to understand
> and maintain.

I think you misunderstand what is being protected: it's the hash tables
themselves, not the sessions.  All sessions in a particular hash table
*MUST* share the lock.  The lock is not protecting any session data.
The only exception would be locking the reference count.  Currently the
only place where the lock is acquired just to protect the session itself
is on session destruction.  To remove this you need to add a per-session
lock to handle the reference count.  You still need the bucket lock to
protect the hash table.  This adds 1) lock creation/destruction for
every session, and 2) additional lock calls for every SSL connection
(you have to lock the reference count pretty much every time you lock
the hash buckets in normal operation).

This increases the time you hold the hash bucket lock, increasing the chance of
contention.

We may find we want to do this, but this would be in addition to creating the
hash bucket lock.

> Since there are only a few slots, we could address (2) by allocating
> a separate RWLock for slot->head (the session hash), so many readers
> can traverse the list simultaneously.  If there's loads of creating
> and destroying and not much referencing, bucketing may be necessary,
> but this doesn't change my opinion about protecting the elements of
> SessionStr with a lock in SessionStr.

I thought about R/W locks.  Unfortunately there *are* loads of creations
and destructions of sessions.  R/W locks will cause the same or worse
issues than we have now.


(Reporter)

Comment 28

17 years ago
See bug 121523 for further info on the sessionLock.
The sessionLock changes attempt to address a scaling problem
that was present in 3.3.1.

This bug is for the performance drop attributable to changes 
between 3.3.1 and 3.4.

Bob believes he can regain a portion of the 7.4% drop.
I will attempt to test his conjecture that the drop is
attributable to softoken by benchmarking iWS with 
crypto hardware support.
(Assignee)

Updated

17 years ago
Priority: -- → P1
Whiteboard: This is P1 and targeted NSS 3.4.1.
(Assignee)

Updated

17 years ago
Target Milestone: 3.4 → ---
(Assignee)

Comment 29

17 years ago
Set target milestone to 3.4.1.
Target Milestone: --- → 3.4.1
(Assignee)

Comment 30

17 years ago
Changed the QA contact to Bishakha.
QA Contact: sonja.mirtitsch → bishakhabanerjee
(Assignee)

Comment 31

17 years ago
Set target milestone to NSS 3.5.
Target Milestone: 3.4.1 → 3.5
(Assignee)

Comment 32

17 years ago
Moved to 3.6.
Whiteboard: This is P1 and targeted NSS 3.4.1.
Target Milestone: 3.5 → 3.6
(Assignee)

Comment 33

16 years ago
Moved to Future for now.  Will consider this
for NSS 3.7.
Target Milestone: 3.6 → Future
(Assignee)

Updated

16 years ago
Target Milestone: Future → 3.7
(Assignee)

Comment 34

16 years ago
Moved to target milestone 3.8 because the original
NSS 3.7 release has been renamed 3.8.
Target Milestone: 3.7 → 3.8
(Reporter)

Comment 35

16 years ago
Created attachment 116227 [details]
Selfserv stress comparison of 331, 342, and 372

Benchmarking iWS has become a lost art that went with Julien;
it'd likely take me several days to configure the harness if I
had the hardware.  Unfortunately, I lost root access to soupnazi,
where we installed the Deimos card earlier.  I'm also down to
5 client machines on soupnazi's subnet.  So following through
with Bob's proposal isn't feasible now.

I have located a 4-cpu machine I can hit from 15 clients:

SunOS dsame2 5.8 Generic_108528-16 sun4u sparc SUNW,Ultra-4
Sun Microsystems  sun4u Sun Enterprise 450 (4 X UltraSPARC-II 480MHz)
System clock frequency: 96 MHz	Memory size: 4096 Megabytes

So I decided to see where we stand, by comparing:

    NSS 3.3.1 nspr20/v4.2.2
    NSS 3.4.2 nspr20/v4.1.2
    NSS 3.7.2 nspr20/v4.1.2

I needed to add -L to the 3.3.1 and 3.4.2 selfserv (bug 107777), and
the zone allocator to NSPR 4.1.2 (bug 97485), to gather all runs with
NSPR_USE_ZONE_ALLOCATOR=1.

I've attached the selfserv stress results.  The performance
drop we saw last year (between 331 and 34) is not apparent
between 331 and 342.

                               restart   full
2001-1128-1239 3.3.1-zones     585       228
2003-0304-1352 3.3.1-zones     588       230

So soupnazi and dsame2 delivered similar throughput at 331.
The 342 results fall about midway between 331 and 372, and:

NSS 3.7.2 is 4.80% faster than NSS 3.3.1 on restarts.
NSS 3.7.2 is 3.70% faster than NSS 3.3.1 on full handshakes.

It's clear from these results that there's nothing to be gained by
any further investigation of the differences between 331 and 34.

Closing.
(Reporter)

Comment 36

16 years ago
Benchmark results indicate previous runs were inconclusive,
or changes occurred between 11/03/01 and 3.4.2 which addressed
the performance drop.

Closing.
Status: ASSIGNED → RESOLVED
Last Resolved: 16 years ago
Resolution: --- → FIXED