Closed Bug 761715 Opened 13 years ago Closed 13 years ago

Occasional selfserv lockup

Categories

(NSS :: Tools, defect)

3.13.4
x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: KaiE, Unassigned)

Details

I had already reported that we experience an occasional lockup of the NSS test suite on Windows machines. I was never able to find out what went wrong.

Today I saw the same effect for the first time on my Linux system. Maybe this issue is the same as the Windows lockup.

Symptoms:
- the test suite is stuck
- the last information printed by the test scripts is an attempt to kill selfserv (on Windows I see "kill pid-of-selfserv", today on Linux I saw "kill -USR1 pid-of-selfserv")
- the selfserv process with the matching pid is still alive

Today I was able to attach with the debugger (on Linux), and I found the following state:
- 9 threads
- all threads blocked inside PR_WaitCondVar, all waiting for the same condition variable
- the call to PR_WaitCondVar in the main thread was triggered by PR_Cleanup

The main thread is inside this loop, which waits with no timeout:

  991     while (pt_book.user > pt_book.this_many)
  992         PR_WaitCondVar(pt_book.cv, PR_INTERVAL_NO_TIMEOUT);

  (gdb) print pt_book.user
  $1 = 9
  (gdb) print pt_book.this_many
  $2 = 1

The other threads are inside _pt_root here:

  165     if (detached)
  166     {
  167         while (!thred->okToDelete)
  168             PR_WaitCondVar(pt_book.cv, PR_INTERVAL_NO_TIMEOUT);
  169     }

Example variable values from one of the worker threads:

  (gdb) print thred->okToDelete
  $3 = 151295240
  (gdb) print *thred
  $5 = {state = 3, priority = PR_PRIORITY_NORMAL, arg = 0x9084d6c,
    startFunc = 0x804cf7d <thread_wrapper>, stack = 0x9085000,
    environment = 0x0, dump = 0, dumpArg = 0x0, tpdLength = 0,
    privateData = 0x0, errorCode = -5950, osErrorCode = 2,
    errorStringLength = 0, errorStringSize = 0, errorString = 0x0,
    name = 0x0, id = 3012242240, okToDelete = 151295240, waiting = 0x0,
    sp = 0x0, next = 0x0, prev = 0x908e0a8, suspend = 0,
    suspendResumeMutex = {__data = { __lock = 0, __count = 0, __owner = 0,
        __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
      __size = '\000' <repeats 23 times>, __align = 0},
    suspendResumeCV = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
        __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
        __broadcast_seq = 0}, __size = '\000' <repeats 47 times>,
      __align = 0}, interrupt_blocked = 0, syspoll_list = 0x0,
    syspoll_count = 0}
(Threads 3 to 9 same as thread 2, but each with a unique argument to _pt_root.)

Thread 2 (Thread 0xb38b2b40 (LWP 26730)):
#0  0xb772d424 in __kernel_vsyscall ()
#1  0x4fb0c85c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb74c3c58 in PR_WaitCondVar (cvar=0x9049508, timeout=4294967295) at ../../../../pr/src/pthreads/ptsynch.c:385
#3  0xb74cb139 in _pt_root (arg=0x908e1f0) at ../../../../pr/src/pthreads/ptthread.c:168
#4  0x4fb08cd3 in start_thread () from /lib/libpthread.so.0
#5  0x4fa11a2e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xb746f6c0 (LWP 26707)):
#0  0xb772d424 in __kernel_vsyscall ()
#1  0x4fb0c85c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb74c3c58 in PR_WaitCondVar (cvar=0x9049508, timeout=4294967295) at ../../../../pr/src/pthreads/ptsynch.c:385
#3  0xb74cc68c in PR_Cleanup () at ../../../../pr/src/pthreads/ptthread.c:992
#4  0x0805122e in main (argc=16, argv=0xbfaff284) at selfserv.c:2401
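For readers unfamiliar with the pattern in the traces above, here is a minimal sketch of predicate-guarded waits on one shared condition variable, written with plain pthreads. It is not NSPR's code: book_lock, book_cv, active_workers, ok_to_exit and run_shutdown_demo are hypothetical names chosen for illustration. Each worker waits without a timeout until its own flag is set, and the coordinating thread waits on the same condition variable until the worker count drops to zero.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define WORKERS 4

/* Hypothetical bookkeeping, loosely modeled on the report; not NSPR code. */
static pthread_mutex_t book_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  book_cv   = PTHREAD_COND_INITIALIZER;
static int  active_workers = 0;
static bool ok_to_exit[WORKERS];

static void *worker(void *arg)
{
    int id = *(int *)arg;

    pthread_mutex_lock(&book_lock);
    /* Wait with no timeout until this worker's own flag is set. */
    while (!ok_to_exit[id])
        pthread_cond_wait(&book_cv, &book_lock);
    active_workers--;
    /* Wake the coordinator so it can re-check its own predicate. */
    pthread_cond_broadcast(&book_cv);
    pthread_mutex_unlock(&book_lock);
    return NULL;
}

/* Returns the number of workers still registered after shutdown (0 on success). */
int run_shutdown_demo(void)
{
    pthread_t threads[WORKERS];
    int ids[WORKERS];
    int created = 0;
    int remaining;

    pthread_mutex_lock(&book_lock);
    for (int i = 0; i < WORKERS; i++)
        ok_to_exit[i] = false;
    active_workers = 0;
    pthread_mutex_unlock(&book_lock);

    for (int i = 0; i < WORKERS; i++) {
        ids[i] = i;
        if (pthread_create(&threads[i], NULL, worker, &ids[i]) != 0)
            break;              /* only wait for threads that really started */
        created++;
    }

    pthread_mutex_lock(&book_lock);
    active_workers = created;
    for (int i = 0; i < created; i++)
        ok_to_exit[i] = true;
    /* All waiters share one condition variable, so broadcast, not signal. */
    pthread_cond_broadcast(&book_cv);
    while (active_workers > 0)
        pthread_cond_wait(&book_cv, &book_lock);
    remaining = active_workers;
    pthread_mutex_unlock(&book_lock);

    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    return remaining;
}
```

Build with -pthread and call run_shutdown_demo() from main. The handshake only completes if every waiter evaluates its predicate against the memory the setter actually wrote. If a worker never sees its flag as set, it stays in pthread_cond_wait, the count never drops, and the coordinator blocks forever as well, which is the shape of the hang in the traces above.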
I'm confused. Right now, with latest trunk, no patches, I run into this bug each time the test suite attempts to kill selfserv for the first time. Either something on my Fedora 16 system has changed, or something in NSPR/NSS has changed recently.
OK, this regression was introduced by bug 758837. After backing out http://kuix.de/nss-snag/HEAD/nsprpub/3860.patch I no longer run into this deadlock.

But why does this deadlock occur on my local Fedora 16 machine (32-bit, 4 cores), but not on any of the NSS trunk tinderbox machines?

I used the following command for test execution:

  NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix" NSS_SSL_TESTS="bypass_normal" HOST=localhost DOMSUF=localdomain ./all.sh
Depends on: 758837
> NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix"
> NSS_SSL_TESTS="bypass_normal" HOST=localhost DOMSUF=localdomain ./all.sh

It's not necessary to use this special command to start the test. I just reproduced the deadlock using the standard command to run all the tests (no environment variables), and ran into it on the first attempt to kill selfserv.
Properties of my test system:
- Fedora 16
- 8 GB RAM
- 32 bit
- Kernel 3.3.7
- PAE flavor kernel (supporting more than 3 GB RAM on 32 bit systems)
Name        : glibc
Version     : 2.14.90
Release     : 24.fc16.7
Architecture: i686
Thank you for tracking this down. This is caused by the lack of header file dependencies in the NSPR build system. Please try doing a clean build (make clean; make all) in NSPR.

I just checked in a dummy whitespace change to mozilla/nsprpub/config/prdepend.h to force a clean build on the tinderboxes. The problem should be fixed in the next build cycle.

Checking in prdepend.h;
/cvsroot/mozilla/nsprpub/config/prdepend.h,v  <--  prdepend.h
new revision: 1.17; previous revision: 1.16
done
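As a rough illustration of why a build without header dependency tracking can misbehave, the sketch below assumes that a header change inserted a field into a struct and that one object file was not recompiled. The structs thread_v1 and thread_v2 and the function stale_read_ok_to_delete are hypothetical; they are not the NSPR definitions, and this is not a confirmed reconstruction of what happened in this bug.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "old" layout, as seen by an object file that was not rebuilt. */
struct thread_v1 {
    int   state;
    void *arg;
    int   ok_to_delete;
};

/* Hypothetical "new" layout after a header change inserts a field. */
struct thread_v2 {
    int   state;
    void *arg;
    char *name;          /* newly added field shifts everything after it */
    int   ok_to_delete;
};

/*
 * Reads ok_to_delete the way stale code compiled against the old header
 * would: at the old offset, inside an object that has the new layout.
 */
int stale_read_ok_to_delete(const struct thread_v2 *t)
{
    const unsigned char *base = (const unsigned char *)t;
    int value;

    memcpy(&value, base + offsetof(struct thread_v1, ok_to_delete),
           sizeof value);
    return value;
}
```

With name set to NULL and ok_to_delete set to 1 in a thread_v2 object, freshly compiled code sees the flag as set, while the stale read lands on the bytes of name instead. Two object files that disagree like this about a structure's layout can wait on a flag that the other side never appears to set, which is why a clean rebuild is the fix.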
Thanks for pointing out the need to clobber NSPR. As a consequence I updated my local NSS build script to use "make nss_clean_all" (formerly I simply ran "make clean"...)
Status: NEW → RESOLVED
Closed: 13 years ago
No longer depends on: 758837
Resolution: --- → WORKSFORME