Closed Bug 161998 Opened 22 years ago Closed 22 years ago

stresstest failures on Windows NT

Categories

(NSPR :: NSPR, defect, P1)

x86
Windows NT
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sonja.mirtitsch, Assigned: wtc)

Details

Attachments

(5 files)

The Sun-released version of NSS 3.4.2 (default build) uses NSPR 4.1.2 and shows
no problems. The exact same NSS 3.4.2 build with NSPR 4.2 fails the stress
tests. (NT DEBUG build running on NT only; Win2K is fine.)

ssl.sh: SSL Stress Test  ===============================
ssl.sh: Stress SSL2 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n nss-nt.red.iplanet.com \
         -w nss   -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:10:05 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C A \
          nss-nt.red.iplanet.com
strsclnt started at Tue Aug  6 05:10:11 PDT 2002
strsclnt: 1 server certificates tested.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 0 cache misses, 0 cache not reusable
[2] + Done(139) ?
  199	Segmentation violation	strsclnt
strsclnt completed at Tue Aug  6 05:10:19 PDT 2002
ssl.sh: Stress SSL3 RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n nss-nt.red.iplanet.com \
         -w nss   -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:10:25 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C c \
          nss-nt.red.iplanet.com
strsclnt started at Tue Aug  6 05:10:31 PDT 2002
strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
[3] + Done(139) ?
  176	Segmentation violation	strsclnt
strsclnt completed at Tue Aug  6 05:10:44 PDT 2002
ssl.sh: Stress TLS  RC4 128 with MD5 ----
selfserv -D -p 8443 -d ../server -n nss-nt.red.iplanet.com \
         -w nss   -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:10:49 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C c \
          nss-nt.red.iplanet.com
strsclnt started at Tue Aug  6 05:10:56 PDT 2002
strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
[4] + Done(139) ?
  175	Segmentation violation	strsclnt
strsclnt completed at Tue Aug  6 05:11:09 PDT 2002
ssl.sh: SSL Cipher Coverage Extended test ===============================
selfserv -D -p 8443 -d ../ext_server -n nss-nt.red.iplanet.com \
         -w nss -c ABCDEFabcdefghijklmnvy -i ../tests_pid.179  &
selfserv started at Tue Aug  6 05:11:16 PDT 2002
tstclnt -p 8443 -h nss-nt -q 
        -d ../ext_client <
W:/nss/nsstip/builds/20020806.1/blowfish_NT4.0_Win95/mozilla/security/nss/tests/ssl/sslreq.txt
\
ssl.sh: skipping  SSL2 ciphers are independent of whether TLS is enabled or not.
We for Extended test

ftp://ftp.mozilla.org/pub/security/nss/daily_qa/20020806.1/nss-nt.1/output.log
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nss342/builds/20020808.2/booboo_Solaris8/mozilla/tests_results/security/RELIANCE.1/output.log
====================

Problems also show up in the NT QA of the NSS tip (which uses NSPR 4.2 by default).

The same problems were seen on the machines nss-nt and reliance, and on a Netscape machine.

Problems have been seen since at least August 6th in the daily QA of NSS tip
builds. The tinderboxes are fine, though they are not using anything from a
network drive.

Wan-Teh said he tested this version of NSPR against NSS 3.4.

Also, I noticed the failures are intermittent: occasionally there will be no
failures, or, as in
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nss342/builds/20020808.2/booboo_Solaris8/mozilla/tests_results/security/RELIANCE.1/results.html
only one test fails.
The results below have all stress tests failing:
http://cindercone.red.iplanet.com/share/builds/mccrel3/nss/nss342/builds/20020808.2/booboo_Solaris8/mozilla/tests_results/security/nss-nt.1/results.html

I would wait for Wan-Teh to look at it, but I think it is fairly urgent, because
JSS uses this NSS/NSPR combination, and several other products are planning on
using it as well.
I'm trying to reproduce this on my NT box; no luck yet.

Here's something I notice:

> strsclnt -q -p 8443 -d ../client -w nss -c 1000 -C c \
>           nss-nt.red.iplanet.com
> strsclnt started at Tue Aug  6 05:10:31 PDT 2002
> strsclnt: 999 cache hits; 1 cache misses, 0 cache not reusable
> strsclnt: -- SSL: Server Certificate Validated.
> strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
> [3] + Done(139) ?
>   176	Segmentation violation	strsclnt

The fact that strsclnt reports "999 cache hits; 1 cache misses, ..." indicates
that the program has completed all of the requested connections.  After that
line is printed, the only remaining NSS call is NSS_Shutdown().
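
For reference, the tail end of a client like strsclnt has roughly this shape
(a sketch only; the helper name and error handling are illustrative, not
strsclnt's actual code):

    #include <stdlib.h>
    #include "nspr.h"
    #include "nss.h"

    static void shutdown_and_exit(int status)
    {
        /* The last NSS call after all connections complete. */
        if (NSS_Shutdown() != SECSuccess) {
            PR_fprintf(PR_STDERR, "NSS_Shutdown failed\n");
        }
        /* Later comments in this bug isolate the crash to here. */
        PR_Cleanup();
        exit(status);
    }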

I examined the diffs between 3.4.2 and 3.4.  The only shutdown-related change
is in softoken/pkcs11.c, where a function nscFreeAllSlots was added.  This
function frees all the entries of the slot hashtable, as well as the table
itself.  This seems to be the most likely candidate for a segv.

The hashtable is treated as static, but uses PL_HashTableLookup() to find the
entries to destroy.  However, only one thread should be calling NSS_Shutdown, so
I wouldn't suspect a race condition there.  I'll attach the diff for others to
review.

One option would be to remove the call to nscFreeAllSlots in nsc_CommonFinalize
and see if the failures stop occurring.
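
For context, freeing all entries of an NSPR PLHashTable generally looks like
the sketch below (illustrative only; free_slot_entry and free_all_slots are
stand-in names, not the actual nscFreeAllSlots code):

    #include <stdlib.h>
    #include "plhash.h"

    /* Called once per entry: frees the slot object and tells the
     * enumerator to unlink the entry from the table. */
    static PRIntn free_slot_entry(PLHashEntry *he, PRIntn index, void *arg)
    {
        free(he->value);
        return HT_ENUMERATE_REMOVE;
    }

    static void free_all_slots(PLHashTable *slot_table)
    {
        /* Safe only if no other thread can touch slot_table
         * concurrently, which is the assumption discussed above. */
        PL_HashTableEnumerateEntries(slot_table, free_slot_entry, NULL);
        PL_HashTableDestroy(slot_table);
    }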
I didn't read the diff closely enough.  That function wasn't new; it was just
moved.  I still think it may be worthwhile to see if the shutdown failure can
be isolated to that location.
One thing I have noticed is that the stress tests are fine unless run on the
network. I remember we had an NSPR problem a long time ago that only showed
up on Windows on a network drive. I'll look it up after my meeting.
I looked up the NSPR bug, #70765. I don't think there is a connection: all the
ntio.c changes from Larry's 4.1.1 patch are in 4.2, and that problem showed up
on Win2K only and was fine on NT.

> The fact that strsclnt reports "999 cache hits; 1 cache misses, ..." indicates
> that the program has completed all of the requested connections. 

I verified that the segmentation violation occurred after strsclnt reports
"999 cache hits..." in the other cases as well.

It is not reproducible from the command line so far. Maybe it's not connected,
but I noticed more frequent hangs in the stress test on the tinderboxes; today
on Reliance:
ssl.sh: SSL Stress Test  ===============================
ssl.sh: Stress SSL2 RC4 128 with MD5 ----
selfserv -D -p 8444 -d ../server -n RELIANCE.red.iplanet.com \
         -w nss   -i ../tests_pid.258  &
selfserv started at Thu Aug 15 01:54:58 MST 2002
tstclnt -p 8444 -h RELIANCE -q 
        -d ../client <
E:/tbx/nss/builds/tbx/reliance_NT/mozilla/security/nss/tests/ssl/sslreq.txt \
strsclnt -q -p 8444 -d ../client -w nss -c 1000 -C A \
          RELIANCE.red.iplanet.com
strsclnt started at Thu Aug 15 01:54:59 MST 2002
[2] + Done(137) ?
  153	Killed	strsclnt
==============================
The same happened yesterday on HP 65 charm.
I seem to remember that some time ago, when dijkstra was first updated to
OSFv5.1, we started seeing stress failures that were hard to reproduce.  I can't
find the bug, though.

At any rate, I doubt that crash is related.  First of all, it was the server
that crashed, not the client.  Secondly, the log seems to indicate resource
starvation (the number of client threads keeps being reset to a lower value). 
That could be because of machine problems, or it could be because selfserv went
bonkers.  Either way, I think it should be a separate bug, to avoid confusion.
Yes, you are right; the OSF failure looks totally different. Since it is an
officially unsupported platform, I won't file a bug for now.

The other thing: I ran 60 WinNT DBG stress tests today on two different NT
machines and only got 2 failures (full nssqa script); I saw the failure more
frequently before. I also reran tests on the 0813 build, with fewer failures
in later runs. I don't know what to make of this, and it seems weird to me,
but since I noticed it, I figured I'd put it in the bug. I'll forward the QA
reports to Ian.
It would be useful to open a bug about the OSF1 crash
and attach the stack trace in the core file.  Thanks.
This patch logs most of the critical parts of the shutdown process.

Sonja, can you try this patch with the QA and see if you can reproduce a
failure with it?  Unfortunately, there's a chance that adding the fprintfs
will change the state enough to mask the crash, and with it only occurring
every 30 QA runs, it may not be easy...
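
(The actual patch is attached to the bug; as a rough idea, the instrumentation
amounts to a trace macro of this kind, with the macro name being my own:)

    #include <stdio.h>

    /* Prints a marker for each critical step of the shutdown path. */
    #define SHUTDOWN_TRACE(msg) \
        fprintf(stderr, "shutdown: %s (%s:%d)\n", (msg), __FILE__, __LINE__)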
I can reproduce the crash on one of our network drives
here.  Unfortunately the network drive is very slow, so
running all.sh takes a long time.

I found that the strsclnt test finishes without a crash
if I remove the PR_Cleanup call.  This suggests that
NSS_Shutdown is not the problem.  The crash is very
sensitive to the addition of printf statements.

Does anyone know why Windows doesn't pop up a dialog,
with a button to debug the crash, when strsclnt gets a
segmentation violation?
Most likely because we don't build with USE_DEBUG_RTL?  Otherwise, the QA
scripts would halt anytime a memory fault occurred.

I think Sonja tried a build with USE_DEBUG_RTL, and couldn't reproduce the crash.
It appears that all.sh or a script it invokes sets the error
mode to 3 (SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX),
which is inherited by the test programs.  The default error
mode is 0.  (See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/errors_38px.asp
for more info on the error modes.)

I'll add a SetErrorMode(0) call to the beginning of the
strsclnt test to restore the error mode to 0.  Then it will
pop up a dialog when it gets a segmentation violation.
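
A minimal sketch of that change (where exactly the call goes in the test
program's main is illustrative):

    #include <stdio.h>
    #include <windows.h>

    int main(int argc, char **argv)
    {
        /* The harness set mode 3 (SEM_FAILCRITICALERRORS |
         * SEM_NOGPFAULTERRORBOX), which child processes inherit and
         * which suppresses the GP-fault dialog.  Restore the default
         * so a segmentation violation pops up the debug dialog. */
        UINT previous = SetErrorMode(0);
        printf("inherited error mode was %u\n", previous);
        /* ... run the actual test ... */
        return 0;
    }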
I have determined that this is an NSPR bug introduced
in NSPR 4.2 (by the fix for bug 30746); it only
affects the WINNT version, during PR_Cleanup.

The stack trace is:

NTDLL! 77f6830b()
MSVCRT! 7800b503()
MSVCRT! 7800a5a5()
_PR_Getfd() line 121 + 37 bytes
PR_AllocFileDesc(int 228, const PRIOMethods * 0x00373388 _pr_fileMethods) line 
129 + 5 bytes
_PR_InitIO() line 56 + 18 bytes
_PR_InitStuff() line 236
_PR_ImplicitInitialization() line 262
PR_IntervalNow() line 74
_PR_ClockInterrupt() line 344 + 5 bytes
_PR_CPU_Idle(void * 0x00931100) line 351
_PR_UserRunThread() line 523 + 13 bytes
_pr_fiber_mainline(void * 0x00000000) line 396 + 19 bytes
KERNEL32! 77f04ede()
KERNEL32! 77f05429()

In NSPR 4.2, the "CPU" thread is now a separate native
thread.  The CPU thread waits on NSPR's I/O completion
port with a 5-second timeout when it is idle.  What
happens is that the primordial thread calls PR_Cleanup
while the CPU thread wakes up from the wait and proceeds
to call PR_IntervalNow, which detects that NSPR is not
initialized (because PR_Cleanup has been called) and
starts to initialize NSPR.  It appears that the PR_Cleanup
call and the _PR_ImplicitInitialization call are occurring
simultaneously and collide in NSPR's fd cache.

The fix is for PR_Cleanup to properly shut down the CPU
threads.
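
In outline, the fix follows the usual signal-and-join pattern sketched below.
This is a self-contained illustration of the idea, not the NSPR patch itself:
the names here are invented, and the Win32 event stands in for NSPR's I/O
completion port.

    #include <stdio.h>
    #include <windows.h>

    static volatile LONG cpus_exit = 0;   /* plays the role of an exit flag */
    static HANDLE wake_event;

    static DWORD WINAPI cpu_idle_loop(void *arg)
    {
        for (;;) {
            /* Idle wait, analogous to the CPU thread's 5-second wait
             * on the I/O completion port. */
            WaitForSingleObject(wake_event, 5000);
            if (InterlockedCompareExchange(&cpus_exit, 0, 0))
                return 0;                 /* cleanup asked us to exit */
            /* ... service timers and runnable local threads ... */
        }
    }

    static void cleanup(HANDLE cpu)
    {
        InterlockedExchange(&cpus_exit, 1);  /* signal the CPU thread */
        SetEvent(wake_event);                /* cut its idle wait short */
        WaitForSingleObject(cpu, INFINITE);  /* join before teardown */
        CloseHandle(cpu);
        /* Only now is it safe to destroy locks, the fd cache, etc.,
         * because the CPU thread can no longer re-enter the runtime. */
    }

    int main(void)
    {
        wake_event = CreateEvent(NULL, FALSE, FALSE, NULL);
        HANDLE cpu = CreateThread(NULL, 0, cpu_idle_loop, NULL, 0, NULL);
        Sleep(100);
        cleanup(cpu);
        CloseHandle(wake_event);
        puts("clean shutdown");
        return 0;
    }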
Assignee: ian.mcgreer → wtc
Component: Tools → NSPR
Product: NSS → NSPR
QA Contact: bishakhabanerjee → wtc
Target Milestone: --- → 4.2.2
Version: 3.4.2 → 4.2
Earlier, I said:
> It appears that the PR_Cleanup call and the
> _PR_ImplicitInitialization call are occurring
> simultaneously and collide in NSPR's fd cache.

Actually, the stack trace above can only happen
if PR_Cleanup has completed.  That is, PR_Cleanup
and _PR_ImplicitInitialization cannot be
occurring at the same time.

I also saw this assertion failure once:
Assertion failure: 0 == _pr_fd_cache.count, at 
../../../../pr/src/io/prfdcach.c:118
Status: NEW → ASSIGNED
Priority: -- → P1
Another stack trace, which I occasionally see
with my test program (which I will attach later),
is:

NTDLL! 77f7f570()
PR_Lock(PRLock * 0x00000000) line 233 + 35 bytes
_PR_UserRunThread() line 517 + 12 bytes
_pr_fiber_mainline(void * 0x00000000) line 396 + 19 bytes
KERNEL32! 77e802ed()
KERNEL32! 77e997df()

This is the TimerManager thread.  In _PR_UserRunThread
it crashes here because PR_Cleanup has destroyed
_pr_activeLock and set it to NULL:

    /*
     * Add to list of active threads
     */
    if (!(thread->flags & _PR_IDLE_THREAD)) {
 =>     PR_Lock(_pr_activeLock);
        PR_APPEND_LINK(&thread->active, &_PR_ACTIVE_LOCAL_THREADQ());
        _pr_local_threads++;
        PR_Unlock(_pr_activeLock);
    }

        (*thread->startFunc)(thread->arg);

where thread->startFunc is TimerManager (defined in prmwait.c).
Attached patch: Proposed patch
This patch implements the solution I proposed at
the end of comment #15: PR_Cleanup shuts down the
CPU threads.

I tried to implement a simpler solution: the CPU
threads exit if they return from _PR_MD_PAUSE_CPU
and find that _pr_initialized is false.  This
solution is not thread safe, and I found that no
matter how I patch it, I can still break it.

Comments on this patch:

1. It is only implemented for WINNT and it's easy
to verify that it does not affect the non-WINNT
versions of NSPR.

2. I need a boolean flag to tell the CPU threads
that it is time to exit.  I found an unused
variable named _pr_cpus_exit and thought it was
probably intended for this purpose.  So I simply
use it.

3. A known risk of this patch is that PR_Cleanup
will hang if there is a thread created with both
the PR_SYSTEM_THREAD and PR_LOCAL_THREAD flags that
does not call any blocking NSPR functions when
PR_Cleanup is trying to stop and join with the CPU
threads.  Such a thread will tie up a CPU thread
and prevent the CPU thread from entering its idle
loop and noticing the signal to exit.

If this turns out to be a problem, I can modify
_PR_CleanupCPUs so that it does not try to stop
the CPU threads if there are any "local" "system"
threads running.
The patch has been checked in and will be in
NSPR 4.2.2.  It works (verified with NSPR 4.2.2
RC1).
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
