761715 - Occasional selfserv lockup

Reporter

Description

13 years ago

I had already reported that we experience an occasional lockup of the NSS test suite on Windows machines. I was never able to find out what went wrong.

Today I saw the same effect for the first time on my Linux system; maybe this issue is the same as the Windows lockup.

Symptoms:
- the test suite is stuck
- the last information printed by the test scripts is an attempt to kill selfserv (on Windows I see "kill pid-of-selfserv", today on Linux I saw "kill -USR1 pid-of-selfserv")
- the selfserv process with the matching pid is still alive

Today I was able to attach with the debugger (on Linux), and I found the following state:
- 9 threads
- all threads blocked inside PR_WaitCondVar, all waiting for the same condition
- the call to PR_WaitCondVar in the main thread was triggered by PR_Cleanup

The main thread is inside PR_Cleanup, waiting with "no timeout":

991   while (pt_book.user > pt_book.this_many)
992       PR_WaitCondVar(pt_book.cv, PR_INTERVAL_NO_TIMEOUT);

(gdb) print pt_book.user
$1 = 9
(gdb) print pt_book.this_many
$2 = 1

The other threads are inside _pt_root here:

165   if (detached)
166   {
167       while (!thred->okToDelete)
168           PR_WaitCondVar(pt_book.cv, PR_INTERVAL_NO_TIMEOUT);
169   }

Example variable values from one of the worker threads:

(gdb) print thred->okToDelete
$3 = 151295240
(gdb) print *thred
$5 = {state = 3, priority = PR_PRIORITY_NORMAL, arg = 0x9084d6c,
  startFunc = 0x804cf7d <thread_wrapper>, stack = 0x9085000,
  environment = 0x0, dump = 0, dumpArg = 0x0, tpdLength = 0,
  privateData = 0x0, errorCode = -5950, osErrorCode = 2,
  errorStringLength = 0, errorStringSize = 0, errorString = 0x0,
  name = 0x0, id = 3012242240, okToDelete = 151295240, waiting = 0x0,
  sp = 0x0, next = 0x0, prev = 0x908e0a8, suspend = 0,
  suspendResumeMutex = {__data = {__lock = 0, __count = 0, __owner = 0,
      __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
    __size = '\000' <repeats 23 times>, __align = 0},
  suspendResumeCV = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
      __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
      __broadcast_seq = 0}, __size = '\000' <repeats 47 times>,
    __align = 0},
  interrupt_blocked = 0, syspoll_list = 0x0, syspoll_count = 0}
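
The wait loop in PR_Cleanup quoted above follows the usual condition-variable pattern: the cleaning-up thread sleeps until a shared thread count drops to an allowed remainder, and every exiting worker decrements the count and signals. A minimal sketch of that pattern using plain pthreads rather than NSPR is given here for orientation. The `struct book`, `worker` and `run_cleanup_demo` names are hypothetical stand-ins, not NSPR code, and the assumption that the waiting thread counts itself (hence `this_many = 1`) is inferred from the debugger values, not confirmed by the report.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16

/* Hypothetical stand-in for the pt_book bookkeeping seen in the debugger. */
struct book {
    pthread_mutex_t ml;
    pthread_cond_t cv;
    int user;       /* threads still counted as alive */
    int this_many;  /* threads allowed to remain when cleanup finishes */
};

static struct book book = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0
};

/* Each worker announces its exit by decrementing the count and waking waiters. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&book.ml);
    book.user -= 1;
    pthread_cond_broadcast(&book.cv);
    pthread_mutex_unlock(&book.ml);
    return NULL;
}

/* Starts nworkers threads, waits for all of them to check out, and
 * returns the final value of book.user. */
int run_cleanup_demo(int nworkers)
{
    pthread_t tids[MAX_WORKERS];
    int started = 0, remaining, i;

    assert(nworkers >= 0 && nworkers <= MAX_WORKERS);

    pthread_mutex_lock(&book.ml);
    book.user = nworkers + 1;   /* assumption: the waiting thread counts itself */
    book.this_many = 1;
    pthread_mutex_unlock(&book.ml);

    for (i = 0; i < nworkers; i++) {
        if (pthread_create(&tids[started], NULL, worker, NULL) == 0) {
            started++;
        } else {
            /* A thread that never started will never check out. */
            pthread_mutex_lock(&book.ml);
            book.user -= 1;
            pthread_mutex_unlock(&book.ml);
        }
    }

    /* Same shape as the loop at ptthread.c:991: wait with no timeout. */
    pthread_mutex_lock(&book.ml);
    while (book.user > book.this_many)
        pthread_cond_wait(&book.cv, &book.ml);
    remaining = book.user;
    pthread_mutex_unlock(&book.ml);

    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return remaining;
}
```

In this sketch `run_cleanup_demo(8)` returns once the count has dropped to the allowed remainder. In the hung selfserv the corresponding count stayed at 9 while the allowed remainder was 1, so a wait of this shape without a timeout can never return.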

Kai Engert [:KaiE:]

Reporter

Comment 1

13 years ago

(Threads 3 to 9 same as thread 2, but each with a unique argument to _pt_root.)

Thread 2 (Thread 0xb38b2b40 (LWP 26730)):
#0  0xb772d424 in __kernel_vsyscall ()
#1  0x4fb0c85c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb74c3c58 in PR_WaitCondVar (cvar=0x9049508, timeout=4294967295) at ../../../../pr/src/pthreads/ptsynch.c:385
#3  0xb74cb139 in _pt_root (arg=0x908e1f0) at ../../../../pr/src/pthreads/ptthread.c:168
#4  0x4fb08cd3 in start_thread () from /lib/libpthread.so.0
#5  0x4fa11a2e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xb746f6c0 (LWP 26707)):
#0  0xb772d424 in __kernel_vsyscall ()
#1  0x4fb0c85c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb74c3c58 in PR_WaitCondVar (cvar=0x9049508, timeout=4294967295) at ../../../../pr/src/pthreads/ptsynch.c:385
#3  0xb74cc68c in PR_Cleanup () at ../../../../pr/src/pthreads/ptthread.c:992
#4  0x0805122e in main (argc=16, argv=0xbfaff284) at selfserv.c:2401

Kai Engert [:KaiE:]

Reporter

Comment 2

13 years ago

I'm confused. Right now, with the latest trunk and no patches, I run into this bug every time the test suite attempts to kill selfserv for the first time. Either something on my Fedora 16 system has changed, or something in NSPR/NSS has changed recently.

Kai Engert [:KaiE:]

Reporter

Comment 3

13 years ago

Ok, this regression was introduced by bug 758837. After backing out
http://kuix.de/nss-snag/HEAD/nsprpub/3860.patch
I no longer run into this deadlock.

But why does this deadlock occur on my local Fedora 16 machine (32-bit, 4 cores), but not on any of the NSS trunk tinderbox machines?

I used the following command for test execution:

NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix" NSS_SSL_TESTS="bypass_normal" HOST=localhost DOMSUF=localdomain ./all.sh

Depends on: 758837

Kai Engert [:KaiE:]

Reporter

Comment 4

13 years ago

> NSS_TESTS="ssl" NSS_SSL_RUN="stress" NSS_CYCLES="pkix"
> NSS_SSL_TESTS="bypass_normal" HOST=localhost DOMSUF=localdomain ./all.sh

It's not necessary to use this special command to start the test. I just reproduced the deadlock using the standard command to run all the tests (no environment variables), and ran into it on the first attempt to kill selfserv.

Kai Engert [:KaiE:]

Reporter

Comment 5

13 years ago

Properties of my test system:
- Fedora 16
- 8 GB RAM
- 32 bit
- Kernel 3.3.7
- PAE flavor kernel (supporting more than 3 GB RAM on 32 bit systems)

Kai Engert [:KaiE:]

Reporter

Comment 6

13 years ago

Name        : glibc
Version     : 2.14.90
Release     : 24.fc16.7
Architecture: i686

Wan-Teh Chang

Comment 7

13 years ago

Thank you for tracking this down. This is caused by the lack of header file dependencies in the NSPR build system. Please try doing a clean build (make clean; make all) in NSPR.

I just checked in a dummy white space change to mozilla/nsprpub/config/prdepend.h to force a clean build on the tinderboxes. The problem should be fixed in the next build cycle.

Checking in prdepend.h;
/cvsroot/mozilla/nsprpub/config/prdepend.h,v  <--  prdepend.h
new revision: 1.17; previous revision: 1.16
done
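
Missing header dependencies mean that, after a header changes, some object files may still be compiled against the old definitions. One plausible way this produces a hang like the one reported is a struct layout mismatch: code built against the old header reads a field at the old offset and finds some other field's bytes. The sketch below illustrates only that general mechanism. The two struct definitions and the `extra` field are hypothetical; the thread does not show what bug 758837 actually changed. It is worth noting, though, that the value printed for `okToDelete` in the description (151295240) equals the `cvar` address in the backtrace (0x9049508), which is at least consistent with fields being read at the wrong offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "old" header layout, as baked into a stale object file. */
struct thread_old {
    int state;
    int okToDelete;
};

/* Hypothetical "new" header layout after a field was inserted. */
struct thread_new {
    int state;
    int extra;       /* newly added field; shifts okToDelete */
    int okToDelete;
};

/* Reads okToDelete the way a stale object file would: at the OLD offset,
 * from an object that was laid out with the NEW definition. */
int ok_to_delete_seen_by_stale_code(const struct thread_new *t)
{
    int value;
    size_t old_offset = offsetof(struct thread_old, okToDelete);

    assert(old_offset + sizeof value <= sizeof *t);
    memcpy(&value, (const char *)t + old_offset, sizeof value);
    return value;
}

/* Builds an object with the new layout and returns what stale code sees. */
int stale_layout_demo(void)
{
    struct thread_new t;

    t.state = 3;
    t.extra = 151295240;  /* unrelated data now living at the old offset */
    t.okToDelete = 0;     /* the real flag says "not yet" */
    return ok_to_delete_seen_by_stale_code(&t);
}
```

Here the up-to-date code stores 0 in `okToDelete`, while the stale reader sees the contents of `extra` instead. A clean rebuild removes the disagreement because every object file is then compiled against the same header.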

Kai Engert [:KaiE:]

Reporter

Comment 8

13 years ago

Thanks for pointing out the need to clobber NSPR. As a consequence I updated my local NSS build script to use "make nss_clean_all" (formerly I simply ran "make clean"...)

Status: NEW → RESOLVED

Closed: 13 years ago

No longer depends on: 758837

Resolution: --- → WORKSFORME

Categories

(NSS :: Tools, defect)

Tracking

(Not tracked)

People

(Reporter: KaiE, Unassigned)
