Bug 761715: Occasional selfserv lockup
Status: RESOLVED WORKSFORME (opened 13 years ago, closed 13 years ago)
Categories: NSS :: Tools, defect
Tracking: Not tracked
People: Reporter: KaiE, Unassigned
I had already reported that we experience an occasional lockup of the NSS test suite on Windows machines. I was never able to find out what went wrong. Today I saw the same effect for the first time on my Linux system, so this issue may be the same as the Windows lockup.
Symptom:
- test suite is stuck
- last information printed by test scripts is an attempt to kill selfserv (on Windows I see "kill pid-of-selfserv", today on Linux I saw "kill -USR1 pid-of-selfserv")
- selfserv process with matching pid is still alive
Today I was able to attach with the debugger (on Linux),
and I found the following state:
- 9 threads
- all threads blocked inside PR_WaitCondVar, all waiting for the same condition
- call to PR_WaitCondVar in main thread was triggered by PR_Cleanup
The main thread is inside PR_Cleanup, waiting with "no timeout":
991 while (pt_book.user > pt_book.this_many)
992 PR_WaitCondVar(pt_book.cv, PR_INTERVAL_NO_TIMEOUT);
(gdb) print pt_book.user
$1 = 9
(gdb) print pt_book.this_many
$2 = 1
Other threads are inside _pt_root here:
165 if (detached)
166 {
167 while (!thred->okToDelete)
168 PR_WaitCondVar(pt_book.cv, PR_INTERVAL_NO_TIMEOUT);
169 }
Example variable values from one of the worker threads:
(gdb) print thred->okToDelete
$3 = 151295240
(gdb) print *thred
$5 = {state = 3, priority = PR_PRIORITY_NORMAL, arg = 0x9084d6c, startFunc = 0x804cf7d <thread_wrapper>, stack = 0x9085000, environment = 0x0, dump = 0, dumpArg = 0x0, tpdLength = 0, privateData = 0x0, errorCode = -5950,
osErrorCode = 2, errorStringLength = 0, errorStringSize = 0, errorString = 0x0, name = 0x0, id = 3012242240, okToDelete = 151295240, waiting = 0x0, sp = 0x0, next = 0x0, prev = 0x908e0a8, suspend = 0, suspendResumeMutex = {__data = {
__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, suspendResumeCV = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
__wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, interrupt_blocked = 0, syspoll_list = 0x0, syspoll_count = 0}
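The two loop conditions quoted above can be checked against the values gdb printed. The following is only a sketch: the helper names are hypothetical, and `detached` is assumed to be true because the worker threads are parked at line 168, inside the `if (detached)` block.

```c
#include <assert.h>
#include <stdbool.h>

/* Loop condition quoted from ptthread.c line 991 (PR_Cleanup). */
static bool cleanup_keeps_waiting(int user, int this_many)
{
    return user > this_many;
}

/* Loop condition quoted from ptthread.c lines 165-167 (_pt_root). */
static bool worker_keeps_waiting(bool detached, unsigned int ok_to_delete)
{
    return detached && !ok_to_delete;
}

/* Values copied from the gdb session above. */
enum { OBSERVED_USER = 9, OBSERVED_THIS_MANY = 1 };
static const unsigned int OBSERVED_OK_TO_DELETE = 151295240u; /* 0x9049508 */
```

With the observed values, `cleanup_keeps_waiting(OBSERVED_USER, OBSERVED_THIS_MANY)` is true, which matches the main thread being blocked. `worker_keeps_waiting(true, OBSERVED_OK_TO_DELETE)` is false, yet the workers are still waiting. The printed value is also odd for a flag: 151295240 is 0x9049508 in hex, the same value as the `cvar` argument in the backtraces. One possible reading, not proven by this data alone, is that gdb and the compiled code disagree about where `okToDelete` lives in the structure.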
Comment 1 (Reporter) • 13 years ago
(Threads 3 to 9 same as thread 2, but each with a unique argument to _pt_root.)
Thread 2 (Thread 0xb38b2b40 (LWP 26730)):
#0 0xb772d424 in __kernel_vsyscall ()
#1 0x4fb0c85c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0xb74c3c58 in PR_WaitCondVar (cvar=0x9049508, timeout=4294967295) at ../../../../pr/src/pthreads/ptsynch.c:385
#3 0xb74cb139 in _pt_root (arg=0x908e1f0) at ../../../../pr/src/pthreads/ptthread.c:168
#4 0x4fb08cd3 in start_thread () from /lib/libpthread.so.0
#5 0x4fa11a2e in clone () from /lib/libc.so.6
Thread 1 (Thread 0xb746f6c0 (LWP 26707)):
#0 0xb772d424 in __kernel_vsyscall ()
#1 0x4fb0c85c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0xb74c3c58 in PR_WaitCondVar (cvar=0x9049508, timeout=4294967295) at ../../../../pr/src/pthreads/ptsynch.c:385
#3 0xb74cc68c in PR_Cleanup () at ../../../../pr/src/pthreads/ptthread.c:992
#4 0x0805122e in main (argc=16, argv=0xbfaff284) at selfserv.c:2401
Comment 2 (Reporter) • 13 years ago
I'm confused. Right now, with latest trunk, no patches, I run into this bug each time the test suite attempts to kill selfserv for the first time.
Either something on my Fedora 16 system has changed, or something in NSPR/NSS has changed recently.
Comment 3 (Reporter) • 13 years ago
Ok, this regression was introduced by bug 758837. After backing out http://kuix.de/nss-snag/HEAD/nsprpub/3860.patch I no longer run into this deadlock.
But why does this deadlock occur on my local Fedora 16 machine (32-bit, 4 cores), but not on any of the NSS trunk tinderbox machines?
I used the following command for test execution:
NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix" NSS_SSL_TESTS="bypass_normal" HOST=localhost DOMSUF=localdomain ./all.sh
Depends on: 758837
Comment 4 (Reporter) • 13 years ago
> NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix"
> NSS_SSL_TESTS="bypass_normal" HOST=localhost DOMSUF=localdomain ./all.sh
It's not necessary to use this special command to start the test.
I just reproduced the deadlock using the standard command to run all the tests (no environment variables), and ran into it on the first attempt to kill selfserv.
Comment 5 (Reporter) • 13 years ago
Properties of my test system:
- Fedora 16
- 8 GB RAM
- 32 bit
- Kernel 3.3.7
- PAE flavor kernel (supporting more than 3 GB RAM on 32 bit systems)
Comment 6 (Reporter) • 13 years ago
Name : glibc
Version : 2.14.90
Release : 24.fc16.7
Architecture: i686
Comment 7 • 13 years ago
Thank you for tracking this down. This is caused by the lack
of header file dependencies in the NSPR build system.
Please try doing a clean build (make clean; make all) in NSPR.
I just checked in a dummy white space change to
mozilla/nsprpub/config/prdepend.h to force a clean build on
the tinderboxes. The problem should be fixed in the next
build cycle.
Checking in prdepend.h;
/cvsroot/mozilla/nsprpub/config/prdepend.h,v <-- prdepend.h
new revision: 1.17; previous revision: 1.16
done
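To illustrate why missing header dependencies can produce this kind of hang, here is a minimal sketch. It assumes the stale object files were compiled against an older structure layout; the comment above does not say which header or field changed, so both structures and all names below are hypothetical. Code built against the new header sets a flag, and code built against the old header reads it from the wrong offset and never sees it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "old" layout, as a stale object file might still assume. */
struct thread_old {
    int state;
    int ok_to_delete;   /* flag polled by the worker loop */
    void *waiting;
};

/* Hypothetical "new" layout after a field was inserted in the header. */
struct thread_new {
    int state;
    void *cv;           /* newly inserted field (illustrative only) */
    int ok_to_delete;
    void *waiting;
};

/* Set the flag the way freshly compiled code would. */
static void set_flag_new(unsigned char *obj)
{
    int one = 1;
    memcpy(obj + offsetof(struct thread_new, ok_to_delete), &one, sizeof one);
}

/* Read the flag the way stale code would. */
static int read_flag_old(const unsigned char *obj)
{
    int v;
    memcpy(&v, obj + offsetof(struct thread_old, ok_to_delete), sizeof v);
    return v;
}

/* Returns the flag value seen by the stale reader after the new writer set it. */
static int stale_reader_sees_flag(void)
{
    unsigned char obj[sizeof(struct thread_new)];
    memset(obj, 0, sizeof obj);
    set_flag_new(obj);
    return read_flag_old(obj);
}
```

In this sketch `stale_reader_sees_flag()` returns 0 even though the flag was set, so a loop like `while (!thred->okToDelete)` in stale code would keep waiting. A clean rebuild removes the mismatch because every object file is compiled against the same header.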
Comment 8 (Reporter) • 13 years ago
Thanks for pointing out the need to clobber NSPR. As a consequence I updated my local NSS build script to use "make nss_clean_all" (formerly I simply ran "make clean"...)