Closed Bug 772226 Opened 12 years ago Closed 12 years ago

Assertion failure: lock != NULL, at prulock.c

Categories

(NSPR :: NSPR, defect)

4.9.2
x86_64
Windows XP
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: KaiE, Assigned: wtc)

Details

Attachments

(1 file)

Since around 2012/07/05 09:46:02 we have a new failure on Windows XP machine buildnss03.

Most failures are preceded by:

Assertion failure: lock != NULL, at e:/mozilla/security/tinderlight/data/buildnss03_trunk_64_DBG/mozilla/nsprpub/pr/src/threads/combined/prulock.c:198

Example logfile
http://tinderbox.mozilla.org/showlog.cgi?log=NSS/1341848422.1341855358.22814.gz&fulltext=1
That assertion failure means some NSS (or NSPR) code is passing a NULL
'lock' argument to PR_Lock:

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/nsprpub/pr/src/threads/combined/prulock.c&rev=3.15&mark=189,198#186

We should track this down.

Kai, could you get the call stack of the assertion failure?  This requires
logging into the VM and running the modutil or certutil command manually.
You may need to run the command in a debugger.
I'm having trouble reproducing this error when running the commands manually.

I saw a different error "Failed to add module ... unknown pkcs#11 error".

Because my command looked fine, I started to experiment. One of the experiments made the command work: I copied the DLL to a different, shorter path.

Is it possible that we have a limit for the path? A path of 125 characters fails. A libfile path of 88 characters works...

Currently the tests are being run from path /tinderbox/mozilla/security/tinderlight.

I'm considering moving that directory to a shorter path, letting the tinderbox test script run, and seeing what happens...
Moved to /tinderbox/tinderlight ... let's wait for the next cycles of buildnss03
Kai: thank you for trying to track this down.  I suspect the PR_Lock(null)
error occurred during NSS_Shutdown.  Perhaps NSS destroyed some lock and
then later tried to acquire the lock.  Just a wild guess.
I'm now able to reproduce the assertion when running modutil in the Windows debugger.

While I'm still confused, because I cannot reproduce the behaviour seen during the test run, I was able to reproduce the assertion using a different approach. I get it when running modutil without any argument at all, in other words, if modutil simply prints the usage output and exits. But there appears to be a race. In most scenarios the assertion warning is printed, but sometimes it's not.

Wan-Teh, yes, it's true that someone called PR_Lock with null.

I can see that modutil has two active threads. (I don't know why modutil would start a second thread when simply printing the usage information.)

The assertion happens on the secondary thread. It appears to be some kind of automatic thread. The thread name is _threadstartex. Its stack is:
msvcr90.dll  - threadstartex
             - callthreadstartex
libnspr4.dll - pr_root
             - _PR_nativeRunThread - line 391
                     (just after comment "add to list of active threads")
             - PR_Lock(0)
             - PR_Assert
stack of main thread:

modutil.exe  - main
libnspr4.dll - PR_Cleanup line 429
             - _PR_CleanupBeforeExit line 313
             - _PR_MD_CLEANUP_BEFORE_EXIT line 109
I see that _pr_activeLock gets destroyed and set to null in _PR_CleanupThreads.

I restarted and set a breakpoint in _PR_CleanupThreads.

That breakpoint gets hit first, called from PR_Cleanup line 422.
At this time, the main thread is the only thread.

I set a breakpoint for each line in PR_Cleanup after 422.
At each stop up to line 429, there was still just a single thread.

The additional thread gets created inside _PR_CleanupBeforeExit,
at the time it calls WSACleanup.
The startFunc of the secondary thread is:
  ContinuationThread
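
To make the ordering described above concrete, here is a minimal single-threaded sketch in C. It is not NSPR code: `sketch_lock`, `active_lock`, and the `sketch_*` functions are hypothetical stand-ins for the real lock type, `_pr_activeLock`, `_PR_CleanupThreads`, and the thread start path. It only models the sequence "lock destroyed and set to NULL, then a late thread tries to take it".

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an NSPR lock; not the real PRLock. */
typedef struct { int held; } sketch_lock;

/* Stand-in for the global _pr_activeLock mentioned above. */
static sketch_lock *active_lock = NULL;

/* Models library initialization: the lock exists from here on. */
static void sketch_init(void) {
    active_lock = calloc(1, sizeof *active_lock);
}

/* Models the cleanup step: the lock is destroyed and set to NULL. */
static void sketch_cleanup_threads(void) {
    free(active_lock);
    active_lock = NULL;
}

/*
 * Models a newly started thread adding itself to the list of active
 * threads. Returns 0 on success, or -1 if the lock is already gone,
 * which is the point where a debug build would report "lock != NULL".
 */
static int sketch_thread_start(void) {
    if (active_lock == NULL) {
        return -1;
    }
    active_lock->held = 1;  /* acquire */
    /* ... add this thread to the list of active threads ... */
    active_lock->held = 0;  /* release */
    return 0;
}
```

Calling `sketch_init()`, then `sketch_thread_start()`, succeeds. Calling `sketch_cleanup_threads()` first and `sketch_thread_start()` afterwards returns -1, which corresponds to a thread created during cleanup arriving after the lock has been torn down.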
It would be good if these details can help you understand what's wrong.
Please let me know if you need further information. I have stopped the continuous building/testing on buildnss03 and will wait for your feedback.
Kai: thank you for the info.  I can understand the problem you
described.  However, it is not clear whether it is the same PR_Lock(0)
call observed in the normal test run on that tinderbox.

I noticed that this is the "WINNT" build configuration, which is
why I lowered the priority and severity of this bug.

The "ContinuationThread" is an internal thread created by NSPR
for UDP support.  Since NSS doesn't have any tests that use UDP
yet, we can comment out the ContinuationThread as an experiment.

If the assertion regularly fails on the tinderbox buildnss03,
please apply this patch to the NSS source tree locally.  If
the assertion failure is gone, this will prove that the assertion
failure is related to the ContinuationThread.

Thanks.
Comment on attachment 645085 [details] [diff] [review]
Patch for debugging: comment out ContinuationThread

Kai: I have trouble testing this patch at work.
I will test this patch at home tonight.  Please do
not test this patch until I have tested it.
Wan-Teh, I guess you didn't find time since you wrote comment 11 (2 months ago).

(The affected Windows buildnss03 machine has been deactivated since that time.)

How should we proceed?
(In reply to Wan-Teh Chang from comment #10)
> 
> I noticed that this is the "WINNT" build configuration, which is
> why I lowered the priority and severity of this bug.

I don't understand. Why is the WINNT configuration a low priority?

The WINNT build configuration appears to be the default one!
We're using it on all the currently active Windows NSS build machines.
changing component to NSPR
Assignee: nobody → wtc
Component: Test → NSPR
Product: NSS → NSPR
Version: 3.14 → 4.9.2
Whiteboard: [waiting for wtc, comment 11]
I would also like to clarify that all Windows XP testing on buildnss03 has been sitting idle for 3 months, because of comment 11.

I just reenabled this build machine, but as expected it still shows this bug.

Wan-Teh, I'm not sure why you had asked me to wait for you to test this locally.

If you want, just let me know, and I can test this patch on buildnss03 (as you had proposed in comment 10, before you asked me to wait for you).
Wan-Teh clarified that Mozilla is using OS_TARGET=WIN95

I will have buildnss03 do one more cycle with WINNT and the debug patch applied, to see what we get.

Afterwards, I'll change buildnss03 to use WIN95.
It seems like the patch didn't help?

As you can see in this build log
http://tinderbox.mozilla.org/showlog.cgi?log=NSS/1351272886.1351284204.11257.gz&fulltext=1

the tree contains a patch for ntio.c but we still get the assertions.


Anyway. It's time to switch buildnss03 to WIN95 now.
We no longer get this failure on buildnss03 after the switch to WIN95.

Since Wan-Teh clarified that the WINNT configuration is outdated, I'm no longer interested in tracking this bug.

I suggest WONTFIX.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Whiteboard: [waiting for wtc, comment 11]
Hi Kai, Wan-Teh,

We are also getting this same error message on Windows 2003 and 2008 (32 bits). But the error is thrown only with debug bits. Same problem does not occur with optimized build.
Do you have any update on this issue? Is there any workaround/solution to this problem?
We are building NSPR 4.9.5 with NSS 3.14.3
> We are also getting this same error message on Windows 2003 and 2008 (32
> bits). But the error is thrown only with debug bits. Same problem does not
> occur with optimized build.
> Do you have any update on this issue? Is there any workaround/solution to
> this problem?

This problem will never occur in any optimized build because assertions are disabled in optimized builds.
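
As a rough illustration of why only debug builds print the message, here is a minimal sketch in C. The `SKETCH_ASSERT` macro, the `SKETCH_DEBUG` flag, and the functions below are hypothetical and are not NSPR's actual definitions; the sketch only shows the general pattern of an assertion macro that expands to nothing unless a debug flag is defined at compile time.

```c
#include <stddef.h>
#include <stdio.h>

/* Counts reported failures; a real implementation would abort instead. */
static int sketch_failures = 0;

/* Hypothetical failure reporter, formatted like the message in this bug. */
static void sketch_assert_fail(const char *expr, const char *file, int line) {
    fprintf(stderr, "Assertion failure: %s, at %s:%d\n", expr, file, line);
    sketch_failures++;
}

/* Hypothetical assertion macro: active only when SKETCH_DEBUG is defined. */
#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(expr) \
    ((expr) ? (void)0 : sketch_assert_fail(#expr, __FILE__, __LINE__))
#else
#define SKETCH_ASSERT(expr) ((void)0)
#endif

/*
 * Hypothetical lock entry point. The -1 return for NULL is part of the
 * sketch only, so that it is safe to call in either build mode.
 */
static int sketch_lock_acquire(void *lock) {
    SKETCH_ASSERT(lock != NULL);
    return lock != NULL ? 0 : -1;
}
```

Compiled without `-DSKETCH_DEBUG`, `sketch_lock_acquire(NULL)` prints nothing and `sketch_failures` stays at 0. Compiled with it, the same call prints an "Assertion failure" line. The underlying bad call is present in both builds; only the report differs.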

Are you using OS_TARGET=WIN95 or OS_TARGET=WINNT?

Please attach a stack trace to this bug.
We are using OS_TARGET=WINNT.

I will get back to you with a stack trace of the failure.
Please use OS_TARGET=WIN95.