SSL server stress test ends up in infinite loop on Solaris

RESOLVED FIXED in 3.4

Status

P1
normal
RESOLVED FIXED
17 years ago
17 years ago

People

(Reporter: julien.pierre, Assigned: wtc)

Tracking

Sun
Solaris


Details

Attachments

(2 attachments)

(Reporter)

Description

17 years ago
On Solaris 2.8, running NES with client authentication required, the server
hung in an apparent deadlock after 1 hour 30 minutes and 73695 full SSL
handshakes.
(Reporter)

Updated

17 years ago
Priority: -- → P1
Target Milestone: --- → 3.4
(Reporter)

Comment 1

17 years ago
Created attachment 68589 [details]
Stack of all threads in the web server looping
(Assignee)

Updated

17 years ago
Attachment #68589 - Attachment mime type: application/octet-stream → text/plain

Comment 2

17 years ago
The lock that most of the threads are waiting on protects the list of active
tokens in the trust domain.  The list is short, and the lock would be pounded
heavily in a stressed environment.

In NSS 3.3, the call would have gone straight to the temp db, followed by the
perm db.  This is a new level of locking, more akin to the PK11SlotList from NSS
3.3.  That lock was a reader-writer lock; perhaps this one should be as well?
(Assignee)

Comment 3

17 years ago
Ian,

Replacing a normal lock with a reader-writer lock will only
change the performance characteristics; it will not solve
the looping or deadlock problems we are seeing.
(Assignee)

Comment 4

17 years ago
All the other threads are blocked in poll(), pthread_mutex_lock(),
pthread_cond_wait(), and similar calls.  The only thread that is running is:

-----------------  lwp# 93 / thread# 168  --------------------
 feb827e8 PL_HashTableRawLookup (214c90, 551d1300, c2ccc0, 1, 0, c2cd59) + 98
 feb832e4 PL_HashTableLookup (214c90, c2ccc0, 80, 0, 0, 0) + 5c
 fe608d40 SECOID_FindOID (c2ccc0, ddad48, 22e, 0, 0, f65250) + 90
 fe608f5c SECOID_KnownCertExtenOID (c2ccc0, ddabb8, fe658c9c, ddad48, 22e, ddad44) + 2c
 fe5e67f8 cert_HasUnknownCriticalExten (c2cd38, ddabb8, fe658c9c, ddabf4, 84, ddad48) + 98
 fe5e0898 CERT_DecodeDERCertificate (fb360bc4, 1, 0, 0, 0, db9950) + 190
 fe61afe0 nssDecodedPKIXCertificate_Create (0, 873d18, 0, 4, 6b, 874078) + 78
 fe617e3c nssDecodedCert_Create (0, 873d18, 1, 8, fffffff8, 843526) + 4c
 fe60e650 nssCertificate_GetDecoding (873cf0, fb360d20, 8, 8432f8, 9c3f8, 9c428) + 50
 fe61f8e4 get_token_cert (9c3c0, 9c3f8, f4262204, 0, 9c3f8, 9c428) + 304
 fe6200cc retrieve_cert (9c3c0, 9c3f8, f4262204, fb361058, 0, 8d9ce9) + 1d4
 fe61f3bc traverse_objects_by_template (9c3c0, 0, fb360fc0, 3, fe61fef8, fb361058) + 354
 fe620700 nssToken_TraverseCertificatesBySubject (9c3c0, 0, 531f40, fb361058, 0, 0) + 258
 fe614440 NSSTrustDomain_FindBestCertificateBySubject (9c240, 531f40, ec7730, fb3611c0, 0, 9aa0c8) + 128
 fe60ec68 NSSCertificate_BuildChain (531f10, ec7730, fb3611c0, 0, fb3611b4, 2) + 2b8
 fe59d228 CERT_FindCertIssuer (846a88, 39975, ad54463e, 0, 0, 5f4881) + e8
 fe59dc64 CERT_VerifyCertChain (9c240, 846a88, 1, 0, 39975, ad54463e) + 52c
 fe59f070 CERT_VerifyCert (39975, ad54463e, 1, 0, 39975, ad54463e) + 610
 fe59f290 CERT_VerifyCertNow (39975, ad54463e, 1, 0, 0, 73cdd0) + 78
 fe73231c SSL_AuthCertificate (9c240, 599eb8, 1, 1, 1, 93b968) + 104
 fe72c8d0 ssl3_HandleCertificate (ddee48, 102730c, 26a, a91a95, 0, fe819f58) + 728
 fe72e854 ssl3_HandleHandshakeMessage (ddee48, 102730c, 26a, 0, 8a, 1027308) + 63c
 fe72ef28 ssl3_HandleHandshake (ddee48, a5971c, 0, ffffffff, fb3616b4, 1027308) + 2e0
 fe72faa4 ssl3_HandleRecord (1027308, 37a, fb3616c4, fffffff8, 28, 821408) + 874
 fe7315d4 ssl3_GatherCompleteHandshake (ddee48, 0, 46, ffffffff, fffffff8, 8bd6b8) + 10c
 fe735f68 ssl_GatherRecord1stHandshake (ddee48, a8, 20, ffffffff, fffffff8, 8bd6b8) + 100
 fe741c00 ssl_Do1stHandshake (ddee48, a8, e2aa00, 4, 0, ddef40) + 340
 fe744570 ssl_SecureRecv (ddee48, d84760, 1fff, 0, 0, 745ed1) + 230
 fe74cbc4 ssl_Recv (599eb8, d84760, 1fff, 0, 1e8480, 19) + 12c
 fea9f118 PR_Recv  (599eb8, d84760, 1fff, 0, 1e8480, 0) + 68
 ff1c1bd8 __0fNDaemonSessionNGetConnectionv (c9eb78, 0, 1, 1, fe83c7a0, 0) + 4a0
 ff1c1e48 __0fNDaemonSessionDrunv (c9eb78, 99, 98, 98, 0, 0) + f8
 ff045c10 __0fGThreadErun_v (c9eb78, a8, 24f10, fe8cfb60, 10, 697d09) + 60
 ff045b8c ThreadMain (c9eb78, 4, fe8ce000, 4, c9f070, 0) + 34
 feadcb34 _pt_root (c9f070, fc633d18, 0, 5, 1, fe401000) + 1a4
 fe8bbc08 _thread_start (c9f070, 0, 0, 0, 0, 0) + 40

Comment 5

17 years ago
I can reproduce this every time I run the test.  One time it happened after 5
minutes; another time it took 45 minutes.

I'm using a build with NSPR from /s/b/c.  Since I don't have the source, it's
hard to see what is going on there.  But the running thread is definitely stuck
in PL_HashTableRawLookup.  It bounces back and forth between the same two lines
of code.

http://lxr.mozilla.org/mozilla/source/nsprpub/lib/ds/plhash.c#185

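What I notice is that he == *hep == he->next, i.e. the entry's next pointer
points back at the entry itself.  The while loop that walks the bucket chain
therefore never reaches a NULL next pointer and never terminates.  It seems
that the OID hashtable is corrupt, but I don't know how or why yet.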
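
To illustrate the failure mode described above, here is a minimal C sketch of a
bucket chain whose entry links to itself.  The Entry type and both function
names are hypothetical simplifications for illustration, not the NSPR
PLHashEntry layout or the code in plhash.c.  The walk is given a step limit so
that the sketch terminates; an unbounded walk over the same chain would spin
forever, as the running thread does.

```c
#include <stddef.h>

/* Hypothetical, simplified bucket entry (not the real PLHashEntry). */
typedef struct Entry {
    struct Entry *next;
    int key;
} Entry;

/*
 * Walk a bucket chain looking for `key`, giving up after `max_steps` entries.
 * Returns 1 if the key is found, 0 if the chain ends without a match, and -1
 * if the step limit is reached, which suggests a cycle such as he->next == he.
 */
int bounded_chain_lookup(Entry *head, int key, int max_steps)
{
    Entry *he = head;
    int steps = 0;

    while (he != NULL) {
        if (he->key == key)
            return 1;
        if (++steps >= max_steps)
            return -1;          /* chain never ended: probably corrupt */
        he = he->next;
    }
    return 0;
}

/* Build a one-entry chain whose next pointer refers to itself and search it
 * for a key that is not present. */
int demo_self_linked(void)
{
    Entry e = { NULL, 1 };
    e.next = &e;                /* the corrupt state: he->next == he */
    return bounded_chain_lookup(&e, 2, 1000);
}
```

Searching the self-linked chain for a key that is present would still succeed;
only a lookup that has to walk past the corrupt entry gets stuck.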

Comment 6

17 years ago
After debugging with Wan-Teh, we decided that the OID hashtable needs to use
PL_HashTableLookupConst instead of PL_HashTableLookup.  This is because
PL_HashTableLookup may change the contents of the hashtable (it moves the
entry it finds to the head of its bucket chain), and the OID hashtable is not
protected by a lock, so concurrent lookups are not threadsafe.

The reason this occurs in 3.4 is that prior versions of NSS used DBM for the OID
hash.

I found an additional bug by reviewing secoid.c.  Patch coming.
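
The difference between the two kinds of lookup can be sketched in C as follows.
This is a simplified illustration with hypothetical names (Node,
lookup_move_to_front, lookup_const), assuming a plain singly linked bucket; it
is not the NSPR source.  The first function rewrites three pointers on a hit,
the second only reads.

```c
#include <stddef.h>

/* Hypothetical, simplified bucket node (not the real PLHashEntry). */
typedef struct Node {
    struct Node *next;
    int key;
    int value;
} Node;

/* Mutating lookup: on a hit, unlink the node and move it to the head of the
 * bucket.  Every successful lookup of a non-head node writes to the chain. */
Node *lookup_move_to_front(Node **bucket, int key)
{
    Node **hep = bucket;
    Node *he;

    while ((he = *hep) != NULL) {
        if (he->key == key) {
            if (hep != bucket) {
                *hep = he->next;     /* unlink from current position */
                he->next = *bucket;  /* point at the old head */
                *bucket = he;        /* become the new head */
            }
            return he;
        }
        hep = &he->next;
    }
    return NULL;
}

/* Read-only lookup: walks the chain without modifying it, so concurrent
 * readers cannot corrupt it as long as nobody else writes. */
const Node *lookup_const(const Node *bucket, int key)
{
    const Node *he;

    for (he = bucket; he != NULL; he = he->next) {
        if (he->key == key)
            return he;
    }
    return NULL;
}
```

One plausible interleaving with the mutating version: two threads find the same
non-head node, both unlink it, the first finishes moving it to the head, and
the second then executes he->next = *bucket while *bucket is already he.  That
leaves he->next == he at the head of the bucket, which would match the state
observed in the debugger.  This is an inference from the sketch, not something
verified against the NSPR build in question.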

Comment 7

17 years ago
Created attachment 68921 [details] [diff] [review]
fix thread safety issues in secoid.c

Comment 8

17 years ago
The patch was checked in as rev 1.11 of secoid.c.  Bug 124923 was opened about
providing a lock for the dynamic OID hash.

I have run selfserv on Solaris under the same conditions that previously caused
a hang three out of three times (after about 20 minutes, 5 minutes, and 45
minutes).  It has now been running for three hours without any problem.

Marking fixed.  Julien, you may wish to verify with NES.
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED