Open Bug 485659 Opened 15 years ago Updated 2 years ago

tstclnt crashes during PR_Cleanup on Windows - race condition

Categories

(NSS :: Tools, defect, P2)

x86_64
Windows Vista

Tracking

(Not tracked)

People

(Reporter: julien.pierre, Unassigned)

References

Details

Attachments

(1 file)

I have been running the QA in a loop on my home machine and saw many crashes in tstclnt. This happens pretty much every day.

Here is a list of threads :

0	>	6012	Main Thread	Main Thread	_PR_MD_CLEANUP_BEFORE_EXIT	Normal	0

stack :

 	ntdll.dll!00000000773c5eda() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	kernel32.dll!000000007727ab1a() 	
 	ws2_32.dll!000007fefe059121() 	
 	ws2_32.dll!000007fefe058cbe() 	
>	libnspr4.dll!_PR_MD_CLEANUP_BEFORE_EXIT()  Line 141	C
 	libnspr4.dll!_PR_CleanupBeforeExit()  Line 353	C
 	libnspr4.dll!PR_Cleanup()  Line 462	C
 	tstclnt.exe!main(int argc=8, char * * argv=0x00000000027d96c0)  Line 694	C
 	tstclnt.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes	C
 	kernel32.dll!000000007727495d() 	
 	ntdll.dll!00000000773a8791() 	

0	 	6020	Worker Thread	_threadstartex	PR_Assert	Normal	0

 	ntdll.dll!00000000773c4ea0() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	libnspr4.dll!PR_Assert(const char * s=0x0000000000bc23e0, const char * file=0x0000000000bc2508, int ln=236)  Line 550	C
 	libnspr4.dll!PR_Lock(PRLock * lock=0x0000000000000000)  Line 244	C
 	libnspr4.dll!_PR_NativeRunThread(void * arg=0x00000000027da280)  Line 432	C
 	libnspr4.dll!pr_root(void * arg=0x00000000027da280)  Line 199	C
>	msvcr90.dll!_callthreadstartex()  Line 348 + 0xd bytes	C
 	msvcr90.dll!_threadstartex(void * ptd=0x0000000000000000)  Line 326 + 0x5 bytes	C
 	kernel32.dll!000000007727495d() 	
 	ntdll.dll!00000000773a8791() 	


0	 	1792	Worker Thread	_threadstartex	_PR_MD_WAIT	Highest	0

stack :

 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	kernel32.dll!000000007727ce00() 	
>	libnspr4.dll!_PR_MD_WAIT(PRThread * thread=0x00000000027dbd20, unsigned int ticks=4294967295)  Line 540 + 0x16 bytes	C
 	libnspr4.dll!_PR_WaitCondVar(PRThread * thread=0x00000000027dbd20, PRCondVar * cvar=0x00000000027d83e0, PRLock * lock=0x00000000027dad90, unsigned int timeout=4294967295)  Line 261 + 0xe bytes	C
 	libnspr4.dll!PR_WaitCondVar(PRCondVar * cvar=0x00000000027d83e0, unsigned int timeout=4294967295)  Line 552	C
 	libnspr4.dll!ContinuationThread(void * arg=0x0000000000000000)  Line 4309 + 0x11 bytes	C
 	libnspr4.dll!_PR_NativeRunThread(void * arg=0x00000000027dbd20)  Line 445	C
 	libnspr4.dll!pr_root(void * arg=0x00000000027dbd20)  Line 199	C
 	msvcr90.dll!_callthreadstartex()  Line 348 + 0xd bytes	C
 	msvcr90.dll!_threadstartex(void * ptd=0x0000000000000000)  Line 326 + 0x5 bytes	C
 	kernel32.dll!000000007727495d() 	
 	ntdll.dll!00000000773a8791() 	


0	 	3980	Worker Thread	Win64 Thread	00000000773c5faa	Normal	0

stack :

>	ntdll.dll!00000000773c5faa() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	ntdll.dll!000000007739b508() 	
 	ntdll.dll!0000000077448781() 	
 	kernel32.dll!000000007727495d() 	
 	ntdll.dll!00000000773a8791()
Julien, It appears to me that your tstclnt process was a multi-threaded 
process.  One thread called PR_Cleanup while another thread was still 
active (had not been joined, and had not terminated).  I think that's the
immediate cause of the crash.

But tstclnt is a single-threaded program.  (!?!?!)  

Now, I seem to recall that the WinNT flavor of NSPR runs a separate thread
to handle the "io completion ports" (async IO) used only by WinNT NSPR.

Am I right that you are building and testing with the WinNT flavor of NSPR?
Nelson,

Yes, I was testing the WINNT flavor of NSPR.

Actually, tstclnt is multi-threaded on Windows and OS/2 . See http://mxr.mozilla.org/security/source/security/nss/cmd/tstclnt/tstclnt.c#898 .
cvs blame shows that you added the PR_CreateThread call as part of bug 206031 .
I think the bug is that there is no PR_JoinThread call for that thread. One needs to be added, before the PR_Cleanup call.
Assignee: nobody → julien.pierre.boogz
Attachment #369801 - Flags: review?(nelson)
Version: 3.11.10 → trunk
Comment on attachment 369801 [details] [diff] [review]
add required PR_JoinThread call

That won't work.  It's an unjoinable thread.  You can make it a jointable thread, but you also need to do something to make it terminate (return out of its thread_main function). I suspect there's a race going on here.  If the other thread ends first, there's no problem, but otherwise there is a problem in the WinNT flavor of NSPR.  

Hmm. I wonder how the behavior of NSPR is defined for the case where threads remain at PR_Cleanup.  IOW, I wonder if the crash is a bug in NSPR.
Attachment #369801 - Flags: review?(nelson) → review-
Let's ask Wan-Teh.
Good catch, Nelson. Yes, it's a race - PR_Cleanup has many known problems. 
Can we simply fix this by making the thread joinable ? thread_main looks like it is finished - it doesn't show in any of the stacks I reported in comment 0.
I think even if PR_Cleanup was perfect, there is still a bug here in tstclnt, because there is no code on WIN95/WINNT/OS/2 that ensure that thread_main has completed before PR_Cleanup is called. And thread_main has a loop that calls NSPR functions. So, this is definitely a bug in tstclnt, and not in the WINNT version of NSPR. Most likely no one else ran into this race before because they weren't running tstclnt on a quad-core AMD box, which I am.

The question is now, how do we ensure that thread_main completes ?
It uses PR_Send with PR_INTERVAL_NO_TIMEOUT . This is probably a job for PR_Interrupt.
Even PR_Interrupt won't work.

From https://developer.mozilla.org/en/PR_Interrupt :

File I/O is considered instantaneous, so file I/O functions cannot be interrupted. Unfortunately the standard input, output, and error streams are treated as files by NSPR, so a PR_Read call on PR_STDIN cannot be interrupted even though it may block indefinitely.

Unfortunately, that's precisely the case of one of the two calls in the loop in thread_main.
I think just making the thread unjoinable is fine. Even in interactive mode, one can interrupt thread_main with control-Z to end stdin.
I meant joinable, of course.
Unfortunately, making the thread joinable didn't work. I ended up with many hung tstclnt in all.sh . I don't think this is because of the blocking read on PR_STDIN, though, since tstclnt is always fed a file which is NULL terminated in the QA scripts. So, this hang may be because of the other call in the thread_main loop, which is a PR_Send . Maybe we can use PR_Interrupt to stop that call. I will try that approach. Otherwise, commenting out PR_Cleanup on Windows & OS/2 is probably the simplest solution.
After I applied a patch to use PR_Interrupt, I got a deadlock in the C runtime. Unfortunately, this deadlock happened in an optimized build, so there is not much information available.

There was only one thread in this deadlocked process :-( It was deadlocked for over a half hour.

 	ntdll.dll!00000000777d6d9a() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	ntdll.dll!0000000077792610() 	
 	ntdll.dll!00000000777ae31e() 	
 	ntdll.dll!00000000777b3052() 	
 	ntdll.dll!00000000777b2f80() 	
 	msvcr90.dll!doexit(int code=0, int quick=0, int retcaller=0)  Line 644 + 0x11 bytes	C
>	tstclnt.exe!__tmainCRTStartup()  Line 599	C
 	kernel32.dll!000000007768495d() 	
 	ntdll.dll!00000000777b8791()
not that it matters much, but 

https://developer.mozilla.org/en/docs/How_to_get_a_stacktrace_with_WinDbg

.symfix+ c:\symbols
!analyze -v -hang

and also !locks and .hh !locks and such.
Thanks, timeless. But given the libraries where that last deadlock happened, I think there is not much I can do about it.

I have been testing another patch for a couple days and haven't run into another tstclnt problem yet. I will submit it later tonight when I get home if nothing else is found.
Summary: tstclnt crashes during PR_Cleanup → tstclnt crashes during PR_Cleanup on Windows - race condition
Priority: -- → P2
Unfortunately I lost that patch. I will have to rewrite it.
Is this an NSPR bug?  
It's marked as an NSS bug.
Nelson,

No, it's not an NSPR bug, it's a tstclnt bug.

On Windows and OS/2, tstclnt spawns a thread, and that thread keeps making NSPR calls in a loop, and is never terminated. tstclnt then calls PR_Cleanup .

Regardless of how broken PR_Cleanup is, this is tstclnt bug. tstclnt needs to make sure that the background thread stops making any NSPR calls before invoking PR_Cleanup.
Assignee: julien.pierre.boogz → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: