On Solaris 2.8, running NES with client auth required, the server ended in a deadlock after 1h30 and 73695 full SSL handshakes.
Attachment #68589 - Attachment mime type: application/octet-stream → text/plain
The lock that most of the threads are waiting on protects the list of active tokens in the trust domain. The list is short, and the lock would be pounded heavily in a stressed environment. In NSS 3.3, the call would have gone straight to the temp db, followed by the perm db. This is a new level of locking, more akin to the PK11SlotList from NSS 3.3. That lock was R/W, perhaps this one should be as well?
Ian, Replacing a normal lock by a reader-writer lock will only change the performance characteristics and will not solve the looping or deadlock problems we are seeing.
All the threads are blocked in poll(), pthread_mutex_lock(), or pthread_cond_wait(), etc. The only thread that is running is:

----------------- lwp# 93 / thread# 168 --------------------
feb827e8 PL_HashTableRawLookup (214c90, 551d1300, c2ccc0, 1, 0, c2cd59) + 98
feb832e4 PL_HashTableLookup (214c90, c2ccc0, 80, 0, 0, 0) + 5c
fe608d40 SECOID_FindOID (c2ccc0, ddad48, 22e, 0, 0, f65250) + 90
fe608f5c SECOID_KnownCertExtenOID (c2ccc0, ddabb8, fe658c9c, ddad48, 22e, ddad44) + 2c
fe5e67f8 cert_HasUnknownCriticalExten (c2cd38, ddabb8, fe658c9c, ddabf4, 84, ddad48) + 98
fe5e0898 CERT_DecodeDERCertificate (fb360bc4, 1, 0, 0, 0, db9950) + 190
fe61afe0 nssDecodedPKIXCertificate_Create (0, 873d18, 0, 4, 6b, 874078) + 78
fe617e3c nssDecodedCert_Create (0, 873d18, 1, 8, fffffff8, 843526) + 4c
fe60e650 nssCertificate_GetDecoding (873cf0, fb360d20, 8, 8432f8, 9c3f8, 9c428) + 50
fe61f8e4 get_token_cert (9c3c0, 9c3f8, f4262204, 0, 9c3f8, 9c428) + 304
fe6200cc retrieve_cert (9c3c0, 9c3f8, f4262204, fb361058, 0, 8d9ce9) + 1d4
fe61f3bc traverse_objects_by_template (9c3c0, 0, fb360fc0, 3, fe61fef8, fb361058) + 354
fe620700 nssToken_TraverseCertificatesBySubject (9c3c0, 0, 531f40, fb361058, 0, 0) + 258
fe614440 NSSTrustDomain_FindBestCertificateBySubject (9c240, 531f40, ec7730, fb3611c0, 0, 9aa0c8) + 128
fe60ec68 NSSCertificate_BuildChain (531f10, ec7730, fb3611c0, 0, fb3611b4, 2) + 2b8
fe59d228 CERT_FindCertIssuer (846a88, 39975, ad54463e, 0, 0, 5f4881) + e8
fe59dc64 CERT_VerifyCertChain (9c240, 846a88, 1, 0, 39975, ad54463e) + 52c
fe59f070 CERT_VerifyCert (39975, ad54463e, 1, 0, 39975, ad54463e) + 610
fe59f290 CERT_VerifyCertNow (39975, ad54463e, 1, 0, 0, 73cdd0) + 78
fe73231c SSL_AuthCertificate (9c240, 599eb8, 1, 1, 1, 93b968) + 104
fe72c8d0 ssl3_HandleCertificate (ddee48, 102730c, 26a, a91a95, 0, fe819f58) + 728
fe72e854 ssl3_HandleHandshakeMessage (ddee48, 102730c, 26a, 0, 8a, 1027308) + 63c
fe72ef28 ssl3_HandleHandshake (ddee48, a5971c, 0, ffffffff, fb3616b4, 1027308) + 2e0
fe72faa4 ssl3_HandleRecord (1027308, 37a, fb3616c4, fffffff8, 28, 821408) + 874
fe7315d4 ssl3_GatherCompleteHandshake (ddee48, 0, 46, ffffffff, fffffff8, 8bd6b8) + 10c
fe735f68 ssl_GatherRecord1stHandshake (ddee48, a8, 20, ffffffff, fffffff8, 8bd6b8) + 100
fe741c00 ssl_Do1stHandshake (ddee48, a8, e2aa00, 4, 0, ddef40) + 340
fe744570 ssl_SecureRecv (ddee48, d84760, 1fff, 0, 0, 745ed1) + 230
fe74cbc4 ssl_Recv (599eb8, d84760, 1fff, 0, 1e8480, 19) + 12c
fea9f118 PR_Recv (599eb8, d84760, 1fff, 0, 1e8480, 0) + 68
ff1c1bd8 __0fNDaemonSessionNGetConnectionv (c9eb78, 0, 1, 1, fe83c7a0, 0) + 4a0
ff1c1e48 __0fNDaemonSessionDrunv (c9eb78, 99, 98, 98, 0, 0) + f8
ff045c10 __0fGThreadErun_v (c9eb78, a8, 24f10, fe8cfb60, 10, 697d09) + 60
ff045b8c ThreadMain (c9eb78, 4, fe8ce000, 4, c9f070, 0) + 34
feadcb34 _pt_root (c9f070, fc633d18, 0, 5, 1, fe401000) + 1a4
fe8bbc08 _thread_start (c9f070, 0, 0, 0, 0, 0) + 40
I can reproduce this every time I run the test. One time it happened after 5 minutes; another time it took 45 minutes. I'm using a build with NSPR from /s/b/c. Since I don't have the source, it's hard to see what is going on there. But the running thread is definitely stuck in PL_HashTableRawLookup. It bounces back and forth between the same two lines of code. http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#185 What I notice is that he == *hep == he->next, i.e. the entry's next pointer refers to the entry itself, so the while loop never terminates. It seems that the OID hashtable is corrupt, but I don't know how/why yet.
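To illustrate how a chain can end up with he == he->next, here is a minimal sketch of a move-to-front chain lookup. It is a simplified model written for this comment, not the NSPR source: the Entry type and the functions lookup_mtf and chain_terminates are hypothetical names, and string keys stand in for the real hashed keys. The point is only that a successful lookup writes to the chain, so two unsynchronized threads relinking the same bucket could plausibly leave an entry pointing at itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified chain entry; not the real PLHashEntry. */
typedef struct Entry {
    struct Entry *next;
    const char *key;
    int value;
} Entry;

/*
 * Move-to-front lookup: a hit that is not already at the head of the
 * chain is unlinked and relinked at the head. Note that a "lookup"
 * therefore WRITES to the chain (three pointer stores, no locking).
 */
static Entry *lookup_mtf(Entry **bucket, const char *key)
{
    Entry **hep = bucket;
    Entry *he;

    assert(bucket != NULL && key != NULL);
    while ((he = *hep) != NULL) {
        if (strcmp(he->key, key) == 0) {
            if (hep != bucket) {
                *hep = he->next;     /* unlink from current position */
                he->next = *bucket;  /* point at the old head */
                *bucket = he;        /* become the new head */
            }
            return he;
        }
        hep = &he->next;
    }
    return NULL;
}

/*
 * Bounded walk used only for diagnosis: returns 1 if the chain reaches
 * NULL within max_steps entries, 0 otherwise (a cycle is suspected).
 */
static int chain_terminates(const Entry *head, size_t max_steps)
{
    size_t i;

    for (i = 0; head != NULL && i < max_steps; i++)
        head = head->next;
    return head == NULL;
}
```

In this model, once an entry's next pointer refers to itself, any lookup for a key that is not found before that entry spins forever, which matches the symptom of a single thread bouncing between the same two lines.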
After debugging with Wan-Teh, we decided that the OID hashtable needs to use PL_HashTableLookupConst instead of PL_HashTableLookup. This is because PL_HashTableLookup may change the contents of the hashtable, and the OID hashtable is not threadsafe. The reason this occurs in 3.4 is that prior versions of NSS used DBM for the OID hash. I found an additional bug by reviewing secoid.c. Patch coming.
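The difference between the two lookup styles can be sketched as follows. This is a simplified model, not the NSPR implementation: OidEntry and lookup_const are hypothetical names used only for illustration. A read-only lookup walks the chain through const pointers and never stores to it, so concurrent readers cannot relink entries underneath each other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified chain entry; not the real PLHashEntry. */
typedef struct OidEntry {
    struct OidEntry *next;
    const char *key;
    int tag;
} OidEntry;

/*
 * Read-only lookup: walks the chain without reordering it. Because the
 * chain is only read, any number of threads may run this concurrently,
 * provided no other thread is inserting or removing entries.
 */
static const OidEntry *lookup_const(const OidEntry *head, const char *key)
{
    const OidEntry *he;

    assert(key != NULL);
    for (he = head; he != NULL; he = he->next) {
        if (strcmp(he->key, key) == 0)
            return he;
    }
    return NULL;
}
```

The trade-off is that frequently used entries are no longer moved to the front of their chain. A read-only lookup also gives no protection against a thread that adds or removes entries at the same time; that case still needs a lock.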
Patch was checked in as rev 1.11 of secoid.c. Bug 124923 was opened about providing a lock for the dynamic OID hash. I have run selfserv on Solaris under the same conditions that previously caused a hang three out of three times (after about 20 minutes, 5 minutes, and 45 minutes). It has now been running for three hours without any problem. Marking fixed. Julien, you may wish to verify with NES.
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED