I have been running the QA in a loop on my home machine and saw many crashes in tstclnt. This happens pretty much every day. Here is a list of threads : 0 > 6012 Main Thread Main Thread _PR_MD_CLEANUP_BEFORE_EXIT Normal 0 stack : ntdll.dll!00000000773c5eda() [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] kernel32.dll!000000007727ab1a() ws2_32.dll!000007fefe059121() ws2_32.dll!000007fefe058cbe() > libnspr4.dll!_PR_MD_CLEANUP_BEFORE_EXIT() Line 141 C libnspr4.dll!_PR_CleanupBeforeExit() Line 353 C libnspr4.dll!PR_Cleanup() Line 462 C tstclnt.exe!main(int argc=8, char * * argv=0x00000000027d96c0) Line 694 C tstclnt.exe!__tmainCRTStartup() Line 586 + 0x19 bytes C kernel32.dll!000000007727495d() ntdll.dll!00000000773a8791() 0 6020 Worker Thread _threadstartex PR_Assert Normal 0 ntdll.dll!00000000773c4ea0() [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] libnspr4.dll!PR_Assert(const char * s=0x0000000000bc23e0, const char * file=0x0000000000bc2508, int ln=236) Line 550 C libnspr4.dll!PR_Lock(PRLock * lock=0x0000000000000000) Line 244 C libnspr4.dll!_PR_NativeRunThread(void * arg=0x00000000027da280) Line 432 C libnspr4.dll!pr_root(void * arg=0x00000000027da280) Line 199 C > msvcr90.dll!_callthreadstartex() Line 348 + 0xd bytes C msvcr90.dll!_threadstartex(void * ptd=0x0000000000000000) Line 326 + 0x5 bytes C kernel32.dll!000000007727495d() ntdll.dll!00000000773a8791() 0 1792 Worker Thread _threadstartex _PR_MD_WAIT Highest 0 stack : [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] kernel32.dll!000000007727ce00() > libnspr4.dll!_PR_MD_WAIT(PRThread * thread=0x00000000027dbd20, unsigned int ticks=4294967295) Line 540 + 0x16 bytes C libnspr4.dll!_PR_WaitCondVar(PRThread * thread=0x00000000027dbd20, PRCondVar * cvar=0x00000000027d83e0, PRLock * lock=0x00000000027dad90, unsigned int timeout=4294967295) Line 261 + 0xe bytes C libnspr4.dll!PR_WaitCondVar(PRCondVar * cvar=0x00000000027d83e0, unsigned int timeout=4294967295) Line 552 C libnspr4.dll!ContinuationThread(void * arg=0x0000000000000000) Line 4309 + 0x11 bytes C libnspr4.dll!_PR_NativeRunThread(void * arg=0x00000000027dbd20) Line 445 C libnspr4.dll!pr_root(void * arg=0x00000000027dbd20) Line 199 C msvcr90.dll!_callthreadstartex() Line 348 + 0xd bytes C msvcr90.dll!_threadstartex(void * ptd=0x0000000000000000) Line 326 + 0x5 bytes C kernel32.dll!000000007727495d() ntdll.dll!00000000773a8791() 0 3980 Worker Thread Win64 Thread 00000000773c5faa Normal 0 stack : > ntdll.dll!00000000773c5faa() [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] ntdll.dll!000000007739b508() ntdll.dll!0000000077448781() kernel32.dll!000000007727495d() ntdll.dll!00000000773a8791()
Julien, It appears to me that your tstclnt process was a multi-threaded process. One thread called PR_Cleanup while another thread was still active (had not been joined, and had not terminated). I think that's the immediate cause of the crash. But tstclnt is a single-threaded program. (!?!?!) Now, I seem to recall that the WinNT flavor of NSPR runs a separate thread to handle the "io completion ports" (async IO) used only by WinNT NSPR. Am I right that you are building and testing with the WinNT flavor of NSPR?
Nelson, Yes, I was testing the WINNT flavor of NSPR. Actually, tstclnt is multi-threaded on Windows and OS/2 . See http://mxr.mozilla.org/security/source/security/nss/cmd/tstclnt/tstclnt.c#898 . cvs blame shows that you added the PR_CreateThread call as part of bug 206031 .
I think the bug is that there is no PR_JoinThread call for that thread. One needs to be added, before the PR_Cleanup call.
Created attachment 369801 [details] [diff] [review] add required PR_JoinThread call
Comment on attachment 369801 [details] [diff] [review] add required PR_JoinThread call That won't work. It's an unjoinable thread. You can make it a jointable thread, but you also need to do something to make it terminate (return out of its thread_main function). I suspect there's a race going on here. If the other thread ends first, there's no problem, but otherwise there is a problem in the WinNT flavor of NSPR. Hmm. I wonder how the behavior of NSPR is defined for the case where threads remain at PR_Cleanup. IOW, I wonder if the crash is a bug in NSPR.
Let's ask Wan-Teh.
Good catch, Nelson. Yes, it's a race - PR_Cleanup has many known problems. Can we simply fix this by making the thread joinable ? thread_main looks like it is finished - it doesn't show in any of the stacks I reported in comment 0.
I think even if PR_Cleanup was perfect, there is still a bug here in tstclnt, because there is no code on WIN95/WINNT/OS/2 that ensure that thread_main has completed before PR_Cleanup is called. And thread_main has a loop that calls NSPR functions. So, this is definitely a bug in tstclnt, and not in the WINNT version of NSPR. Most likely no one else ran into this race before because they weren't running tstclnt on a quad-core AMD box, which I am. The question is now, how do we ensure that thread_main completes ? It uses PR_Send with PR_INTERVAL_NO_TIMEOUT . This is probably a job for PR_Interrupt.
Even PR_Interrupt won't work. From https://developer.mozilla.org/en/PR_Interrupt : File I/O is considered instantaneous, so file I/O functions cannot be interrupted. Unfortunately the standard input, output, and error streams are treated as files by NSPR, so a PR_Read call on PR_STDIN cannot be interrupted even though it may block indefinitely. Unfortunately, that's precisely the case of one of the two calls in the loop in thread_main.
I think just making the thread unjoinable is fine. Even in interactive mode, one can interrupt thread_main with control-Z to end stdin.
I meant joinable, of course.
Unfortunately, making the thread joinable didn't work. I ended up with many hung tstclnt in all.sh . I don't think this is because of the blocking read on PR_STDIN, though, since tstclnt is always fed a file which is NULL terminated in the QA scripts. So, this hang may be because of the other call in the thread_main loop, which is a PR_Send . Maybe we can use PR_Interrupt to stop that call. I will try that approach. Otherwise, commenting out PR_Cleanup on Windows & OS/2 is probably the simplest solution.
After I applied a patch to use PR_Interrupt, I got a deadlock in the C runtime. Unfortunately, this deadlock happened in an optimized build, so there is not much information available. There was only one thread in this deadlocked process :-( It was deadlocked for over a half hour. ntdll.dll!00000000777d6d9a() [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] ntdll.dll!0000000077792610() ntdll.dll!00000000777ae31e() ntdll.dll!00000000777b3052() ntdll.dll!00000000777b2f80() msvcr90.dll!doexit(int code=0, int quick=0, int retcaller=0) Line 644 + 0x11 bytes C > tstclnt.exe!__tmainCRTStartup() Line 599 C kernel32.dll!000000007768495d() ntdll.dll!00000000777b8791()
not that it matters much, but https://developer.mozilla.org/en/docs/How_to_get_a_stacktrace_with_WinDbg .symfix+ c:\symbols !analyze -v -hang and also !locks and .hh !locks and such.
Thanks, timeless. But given the libraries where that last deadlock happened, I think there is not much I can do about it. I have been testing another patch for a couple days and haven't run into another tstclnt problem yet. I will submit it later tonight when I get home if nothing else is found.
Unfortunately I lost that patch. I will have to rewrite it.
Is this an NSPR bug? It's marked as an NSS bug.
Nelson, No, it's not an NSPR bug, it's a tstclnt bug. On Windows and OS/2, tstclnt spawns a thread, and that thread keeps making NSPR calls in a loop, and is never terminated. tstclnt then calls PR_Cleanup . Regardless of how broken PR_Cleanup is, this is tstclnt bug. tstclnt needs to make sure that the background thread stops making any NSPR calls before invoking PR_Cleanup.