Closed Bug 161998 Opened 22 years ago Closed 22 years ago

stresstest failures on Windows NT

Categories

(NSPR :: NSPR, defect, P1)

x86
Windows NT
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: wtc)

Details

Attachments

(5 files)

The Sun-released version of NSS 3.4.2 (default build) uses NSPR 4.1.2 and shows
no problems. The exact same NSS 3.4.2 build with NSPR 4.2 fails the stress
tests. (NT DEBUG build running on NT only; Win2K is fine.)

ssl.sh: SSL Stress Test  ===============================
ssl.sh: Stress SSL2 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n nss-nt.red.iplanet.com \
         -w nss   -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:10:05 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C A \
          nss-nt.red.iplanet.com
strsclnt started at Tue Aug  6 05:10:11 PDT 2002
strsclnt: 1 server certificates tested.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 0 cache misses, 0 cache not reusable
[2] + Done(139) ?
  199	Segmentation violation	strsclnt
strsclnt completed at Tue Aug  6 05:10:19 PDT 2002
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n nss-nt.red.iplanet.com \
         -w nss   -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:10:25 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C c \
          nss-nt.red.iplanet.com
strsclnt started at Tue Aug  6 05:10:31 PDT 2002
strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
[3] + Done(139) ?
  176	Segmentation violation	strsclnt
strsclnt completed at Tue Aug  6 05:10:44 PDT 2002
ssl.sh: Stress TLS  RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n nss-nt.red.iplanet.com \
         -w nss   -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:10:49 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C c \
          nss-nt.red.iplanet.com
strsclnt started at Tue Aug  6 05:10:56 PDT 2002
strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
[4] + Done(139) ?
  175	Segmentation violation	strsclnt
strsclnt completed at Tue Aug  6 05:11:09 PDT 2002
ssl.sh: SSL Cipher Coverage Extended test ===============================
selfserv -D -p 8443 -d ../ext_server -n nss-nt.red.iplanet.com \
         -w nss -c ABCDEFabcdefghijklmnvy -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:11:16 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../ext_client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
ssl.sh: skipping  SSL2 ciphers are independent of whether TLS is enabled or not.
We for Extended test

ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020806.1/nss-nt.1/output.log
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nss342/builds/20020808.2/booboo_Solaris8/mozilla/tests_results/security/RELIANCE.1/output.log
====================

Problems also show up in the NT QA of the NSS tip (which uses NSPR 4.2 by default).

The same problems were seen on the machines nss-nt and reliance, and on a Netscape machine.

Problems have been seen since at least August 6th in the daily QA of NSS tip
builds. The tinderboxes are fine, though they are not using anything from a
network drive.

Wan-Teh said he tested this version of NSPR against NSS 3.4.

Also, I noticed the failures are intermittent: occasionally there will be no
failures, or, as in
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nss342/builds/20020808.2/booboo_Solaris8/mozilla/tests_results/security/RELIANCE.1/results.html
only one test fails.
The results below have all stress tests failing:
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nss342/builds/20020808.2/booboo_Solaris8/mozilla/tests_results/security/nss-nt.1/results.html

I would wait for Wan-Teh to look at it, but I think it is fairly urgent, because
JSS uses this NSS/NSPR combination, and several other products are planning on
using it as well.
I'm trying to reproduce this on my NT box; no luck yet.

Here's something I notice:

> strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C c \
>           nss-nt.red.iplanet.com
> strsclnt started at Tue Aug  6 05:10:31 PDT 2002
> strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
> strsclnt: -- SSL: Server Certificate Validated.
> strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
> [3] + Done(139) ?
>   176	Segmentation violation	strsclnt

The fact that strsclnt reports "999 cache hits; 1 cache misses, ..." indicates
that the program has completed all of the requested connections.  After that
line is printed, the only remaining NSS call is NSS_Shutdown().
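
For reference, the tail end of a client like strsclnt has roughly this shape
(a sketch only; the helper name and error handling are illustrative, not
strsclnt's actual code):

    #include <stdlib.h>
    #include "nspr.h"
    #include "nss.h"

    static void shutdown_and_exit(int status)
    {
        /* The last NSS call after all connections complete. */
        if (NSS_Shutdown() != SECSuccess) {
            PR_fprintf(PR_STDERR, "NSS_Shutdown failed\n");
        }
        /* Later comments in this bug isolate the crash to here. */
        PR_Cleanup();
        exit(status);
    }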

I examined the diffs between 3.4.2 and 3.4.  The only shutdown-related change
is in softoken/pkcs11.c, where a function nscFreeAllSlots was added.  This
function frees all the entries of the slot hashtable, as well as the table
itself.  This seems to be the most likely candidate for a segv.

The hashtable is treated as static, but uses PL_HashTableLookup() to find the
entries to destroy.  However, only one thread should be calling NSS_Shutdown, so
I wouldn't suspect a race condition there.  I'll attach the diff for others to
review.

One option would be to remove the call to nscFreeAllSlots in nsc_CommonFinalize
and see if the failures stop occurring.
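
For context, freeing all entries of an NSPR PLHashTable generally looks like
the sketch below (illustrative only; free_slot_entry and free_all_slots are
stand-in names, not the actual nscFreeAllSlots code):

    #include <stdlib.h>
    #include "plhash.h"

    /* Called once per entry: frees the slot object and tells the
     * enumerator to unlink the entry from the table. */
    static PRIntn free_slot_entry(PLHashEntry *he, PRIntn index, void *arg)
    {
        free(he->value);
        return HT_ENUMERATE_REMOVE;
    }

    static void free_all_slots(PLHashTable *slot_table)
    {
        /* Safe only if no other thread can touch slot_table
         * concurrently, which is the assumption discussed above. */
        PL_HashTableEnumerateEntries(slot_table, free_slot_entry, NULL);
        PL_HashTableDestroy(slot_table);
    }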
I didn't read the diff closely enough.  That function wasn't new; it was just
moved.  I still think it may be worthwhile to see if the shutdown failure can
be isolated to that location.
One thing I have noticed is that the stress tests are fine unless run on the
network. I remember we had an NSPR problem a long time ago that only showed
up on Windows on a network drive. I'll look it up after my meeting.
I looked up the NSPR bug, #70765. I don't think there is a connection: all the
ntio.c changes from Larry's 4.1.1 patch are in 4.2, and that problem showed up
on Win2K only and was fine on NT.

> The fact that strsclnt reports "999 cache hits; 1 cache misses, ..." indicates
> that the program has completed all of the requested connections. 

I verified that the segmentation violation occurred after strsclnt reports
"999 cache hits..." in the other cases as well.

It is not reproducible from the command line so far. Maybe it's not connected,
but I noticed more frequent hangs in the stress test on the tinderboxes; today
on Reliance:
ssl.sh: SSL Stress Test  ===============================
ssl.sh: Stress SSL2 RC4 128 with MD5 ----
selfserv -D -p 8444 -d ../server -n RELIANCE.red.iplanet.com \
         -w nss   -i ../tests_pid.258  &
selfserv started at Thu Aug 15 01:54:58 MST 2002
tstclnt -p 8444 -h RELIANCE -q 
        -d ../client <
E:/tbx/nss/builds/tbx/reliance_NT/mozilla/security/nss/tests/ssl/sslreq.txt \
strsclnt -q -p 8444 -d ../client -w nss -c 1000 -C A \
          RELIANCE.red.iplanet.com
strsclnt started at Thu Aug 15 01:54:59 MST 2002
[2] + Done(137) ?
  153	Killed	strsclnt
==============================
The same happened yesterday on HP 65 charm.
I seem to remember that some time ago, when dijkstra was first updated to
OSFv5.1, we started seeing stress failures that were hard to reproduce.  I can't
find the bug, though.

At any rate, I doubt that crash is related.  First of all, it was the server
that crashed, not the client.  Secondly, the log seems to indicate resource
starvation (the number of client threads keeps being reset to a lower value). 
That could be because of machine problems, or it could be because selfserv went
bonkers.  Either way, I think it should be a separate bug, to avoid confusion.
Yes, you are right; the OSF failure looks totally different. Since it is an
officially unsupported platform, I won't file a bug for now.

The other thing: I ran 60 WinNT DBG stress tests today on two different NT
machines and only got 2 failures (full nssqa script); I saw the failure more
frequently before. I also reran tests on the 0813 build, with fewer failures
in later runs. I don't know what to make of this, and it seems weird to me,
but since I noticed it, I figured I'd put it in the bug. I'll forward the QA
reports to Ian.
It would be useful to open a bug about the OSF1 crash
and attach the stack trace in the core file.  Thanks.
This patch logs most of the critical parts of the shutdown process.

Sonja, can you try this patch with the QA and see if you can reproduce a
failure with it?  Unfortunately, there's a chance that adding the fprintfs
will change the state enough to mask the crash, and with it only occurring
every 30 QA runs, it may not be easy...
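
(The actual patch is attached to the bug; as a rough idea, the instrumentation
amounts to a trace macro of this kind, with the macro name being my own:)

    #include <stdio.h>

    /* Prints a marker for each critical step of the shutdown path. */
    #define SHUTDOWN_TRACE(msg) \
        fprintf(stderr, "shutdown: %s (%s:%d)\n", (msg), __FILE__, __LINE__)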
I can reproduce the crash on one of our network drives
here.  Unfortunately the network drive is very slow, so
running all.sh takes a long time.

I found that the strsclnt test finishes without a crash
if I remove the PR_Cleanup call.  This suggests that
NSS_Shutdown is not the problem.  The crash is very
sensitive to the addition of printf statements.

Does anyone know why Windows doesn't pop up a dialog,
with a button to debug the crash, when strsclnt gets a
segmentation violation?
Most likely because we don't build with USE_DEBUG_RTL?  Otherwise, the QA
scripts would halt anytime a memory fault occurred.

I think Sonja tried a build with USE_DEBUG_RTL, and couldn't reproduce the crash.
It appears that all.sh or a script it invokes sets the error
mode to 3 (SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX),
which is inherited by the test programs.  The default error
mode is 0.  (See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/errors_38px.asp
for more info on the error modes.)

I'll add a SetErrorMode(0) call to the beginning of the
strsclnt test to restore the error mode to 0.  Then it will
pop up a dialog when it gets a segmentation violation.
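
A minimal sketch of that change (where exactly the call goes in the test
program's main is illustrative):

    #include <stdio.h>
    #include <windows.h>

    int main(int argc, char **argv)
    {
        /* The harness set mode 3 (SEM_FAILCRITICALERRORS |
         * SEM_NOGPFAULTERRORBOX), which child processes inherit and
         * which suppresses the GP-fault dialog.  Restore the default
         * so a segmentation violation pops up the debug dialog. */
        UINT previous = SetErrorMode(0);
        printf("inherited error mode was %u\n", previous);
        /* ... run the actual test ... */
        return 0;
    }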
I have determined that this is an NSPR bug introduced
in NSPR 4.2 (by the fix for bug 30746); it only
affects the WINNT version, during PR_Cleanup.

The stack trace is:

NTDLL! 77f6830b()
MSVCRT! 7800b503()
MSVCRT! 7800a5a5()
_PR_Getfd() line 121 + 37 bytes
PR_AllocFileDesc(int 228, const PRIOMethods * 0x00373388 _pr_fileMethods) line 
129 + 5 bytes
_PR_InitIO() line 56 + 18 bytes
_PR_InitStuff() line 236
_PR_ImplicitInitialization() line 262
PR_IntervalNow() line 74
_PR_ClockInterrupt() line 344 + 5 bytes
_PR_CPU_Idle(void * 0x00931100) line 351
_PR_UserRunThread() line 523 + 13 bytes
_pr_fiber_mainline(void * 0x00000000) line 396 + 19 bytes
KERNEL32! 77f04ede()
KERNEL32! 77f05429()

In NSPR 4.2, the "CPU" thread is now a separate native
thread.  The CPU thread waits on NSPR's I/O completion
port with a 5-second timeout when it is idle.  What
happens is that the primordial thread calls PR_Cleanup
while the CPU thread wakes up from the wait and proceeds
to call PR_IntervalNow, which detects that NSPR is not
initialized (because PR_Cleanup has been called) and
starts to initialize NSPR.  It appears that the PR_Cleanup
call and the _PR_ImplicitInitialization call are occurring
simultaneously and collide in NSPR's fd cache.

The fix is for PR_Cleanup to properly shut down the CPU
threads.
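
In outline, the fix follows the usual signal-and-join pattern sketched below.
This is a self-contained illustration of the idea, not the NSPR patch itself:
the names here are invented, and the Win32 event stands in for NSPR's I/O
completion port.

    #include <stdio.h>
    #include <windows.h>

    static volatile LONG cpus_exit = 0;   /* plays the role of an exit flag */
    static HANDLE wake_event;

    static DWORD WINAPI cpu_idle_loop(void *arg)
    {
        for (;;) {
            /* Idle wait, analogous to the CPU thread's 5-second wait
             * on the I/O completion port. */
            WaitForSingleObject(wake_event, 5000);
            if (InterlockedCompareExchange(&cpus_exit, 0, 0))
                return 0;                 /* cleanup asked us to exit */
            /* ... service timers and runnable local threads ... */
        }
    }

    static void cleanup(HANDLE cpu)
    {
        InterlockedExchange(&cpus_exit, 1);  /* signal the CPU thread */
        SetEvent(wake_event);                /* cut its idle wait short */
        WaitForSingleObject(cpu, INFINITE);  /* join before teardown */
        CloseHandle(cpu);
        /* Only now is it safe to destroy locks, the fd cache, etc.,
         * because the CPU thread can no longer re-enter the runtime. */
    }

    int main(void)
    {
        wake_event = CreateEvent(NULL, FALSE, FALSE, NULL);
        HANDLE cpu = CreateThread(NULL, 0, cpu_idle_loop, NULL, 0, NULL);
        Sleep(100);
        cleanup(cpu);
        CloseHandle(wake_event);
        puts("clean shutdown");
        return 0;
    }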
Assignee: ian.mcgreer → wtc
Component: Tools → NSPR
Product: NSS → NSPR
QA Contact: bishakhabanerjee → wtc
Target Milestone: --- → 4.2.2
Version: 3.4.2 → 4.2
Earlier, I said:
> It appears that the PR_Cleanup call and the
> _PR_ImplicitInitialization call are occurring
> simultaneously and collide in NSPR's fd cache.

Actually, the stack trace above can only happen
if PR_Cleanup has completed.  That is, PR_Cleanup
and _PR_ImplicitInitialization cannot be
occurring at the same time.

I also saw this assertion failure once:
Assertion failure: 0 == _pr_fd_cache.count, at 
../../../../pr/src/io/prfdcach.c:118
Status: NEW → ASSIGNED
Priority: -- → P1
Another stack trace, which I occasionally see
with my test program (which I will attach later),
is:

NTDLL! 77f7f570()
PR_Lock(PRLock * 0x00000000) line 233 + 35 bytes
_PR_UserRunThread() line 517 + 12 bytes
_pr_fiber_mainline(void * 0x00000000) line 396 + 19 bytes
KERNEL32! 77e802ed()
KERNEL32! 77e997df()

This is the TimerManager thread.  In _PR_UserRunThread
it crashes here because PR_Cleanup has destroyed
_pr_activeLock and set it to NULL:

    /*
     * Add to list of active threads
     */
    if (!(thread->flags & _PR_IDLE_THREAD)) {
 =>     PR_Lock(_pr_activeLock);
        PR_APPEND_LINK(&thread->active, &_PR_ACTIVE_LOCAL_THREADQ());
        _pr_local_threads++;
        PR_Unlock(_pr_activeLock);
    }

        (*thread->startFunc)(thread->arg);

where thread->startFunc is TimerManager (defined in prmwait.c).
Attached patch: Proposed patch
This patch implements the solution I proposed at
the end of comment #15: PR_Cleanup shuts down the
CPU threads.

I tried to implement a simpler solution: the CPU
threads exit if they return from _PR_MD_PAUSE_CPU
and find that _pr_initialized is false.  This
solution is not thread safe, and I found that no
matter how I patch it, I can still break it.

Comments on this patch:

1. It is only implemented for WINNT and it's easy
to verify that it does not affect the non-WINNT
versions of NSPR.

2. I need a boolean flag to tell the CPU threads
that it is time to exit.  I found an unused
variable named _pr_cpus_exit and thought it was
probably intended for this purpose.  So I simply
use it.

3. A known risk of this patch is that PR_Cleanup
will hang if there is a thread created with both
the PR_SYSTEM_THREAD and PR_LOCAL_THREAD flags that
does not call any blocking NSPR functions when
PR_Cleanup is trying to stop and join with the CPU
threads.  Such a thread will tie up a CPU thread
and prevent the CPU thread from entering its idle
loop and noticing the signal to exit.

If this turns out to be a problem, I can modify
_PR_CleanupCPUs so that it does not try to stop
the CPU threads if there are any "local" "system"
threads running.
The patch has been checked in and will be in
NSPR 4.2.2.  It works (verified with NSPR 4.2.2
RC1).
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
