Closed
Bug 164512
Opened 23 years ago
Closed 22 years ago
CERT_FindCertIssuer in >3 threads stalls under OS/2 SMP - VACPP bug
Categories
(NSS :: Libraries, defect, P4)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Future
People
(Reporter: julien.pierre, Assigned: mkaply)
Attachments
(2 files, 2 obsolete files)
While running a multi-threaded NSS CRL cache stress test on my dual Athlon OS/2
SMP machine, I encountered what seemed to be a deadlock in this function. The
process was stalled, and nothing was running on either processor anymore. I
attached to the process to find out what was going on.
My test was running with 11 threads - the one main thread waiting for 10 test
threads in a loop during certificate verification.
3 threads were stuck in SemRequest486, while 7 others were in _SemRequest of the
C library. Somehow, all 10 threads were stuck - nothing got to run anymore.
Here is the stack of a thread in SemRequest486 :
Function | Part
------------------------------------------+-----------------
SemRequest486 | OS2VACPP.OBJ
nssSession_EnterMonitor | DEVSLOT.OBJ
find_objects | DEVTOKEN.OBJ
find_objects_by_template | DEVTOKEN.OBJ
nssToken_FindCertificatesBySubject | DEVTOKEN.OBJ
nssTrustDomain_FindCertificatesBySubject | TRUSTDOMAIN.OBJ
find_cert_issuer | CERTIFICATE.OBJ
nssCertificate_BuildChain | CERTIFICATE.OBJ
NSSCertificate_BuildChain | CERTIFICATE.OBJ
CERT_FindCertIssuer | CERTVFY.OBJ
cert_VerifyCertChain | CERTVFY.OBJ
CERT_VerifyCertificate | CERTVFY.OBJ
VerifyCert | CERTUTIL.OBJ
_PR_NativeRunThread | PRUTHR.OBJ
open__7filebufFPCciT2 | cpprmi36.dll:2
0x1FFECE33 | DOSCALL1.DLL:4
Another one in the heap lock :
Function | Part
-------------------------------------+-----------------
0x1FFDE361 | DOSCALL1.DLL:3
_SemRequest | cpprmi36.dll:2
free | cpprmi36.dll:2
PR_Free | PRMEM.OBJ
PR_DestroyLock | PRULOCK.OBJ
PORT_FreeArena | SECPORT.OBJ
nss3certificate_getIssuerIdentifier | PKI3HACK.OBJ
find_cert_issuer | CERTIFICATE.OBJ
nssCertificate_BuildChain | CERTIFICATE.OBJ
NSSCertificate_BuildChain | CERTIFICATE.OBJ
CERT_FindCertIssuer | CERTVFY.OBJ
cert_VerifyCertChain | CERTVFY.OBJ
CERT_VerifyCertificate | CERTVFY.OBJ
VerifyCert | CERTUTIL.OBJ
_PR_NativeRunThread | PRUTHR.OBJ
open__7filebufFPCciT2 | cpprmi36.dll:2
0x1FFECE33 | DOSCALL1.DLL:4
Eventually - after about 20s of inactivity, the process resumed execution, and
it eventually completed.
Reporter | ||
Comment 1•23 years ago
|
||
FYI, I confirmed that this is a problem on SMP. Here are the results below when
running with SMP first, and then without (executable marked as non-SMP - all
threads run on the same CPU). As you can see, the 200 cached verifications take
4.326 seconds with 2 processors, and 0.055 seconds with a single processor.
Since I currently have a global lock around the verification operation, I expect
the time per operation to be roughly the same regardless of the number of CPUs,
maybe slightly higher due to scheduling on SMP, but not 78x higher as it is
currently.
[e:\dev\nss\36\mozilla\dist\os22.45_icc_dbg.obj\bin]revoked.cmd
Time for initial uncached verification : 0: 6.926000
Time for 200 cached verifications : 0: 4.326000
Average time per cached verification : 0: 0. 21000
E:\DEV\NSS\36\MOZILLA\DIST\OS22.45_ICC_DBG.OBJ\BIN\CERTUTIL.EXE: certificate is
invalid: Peer's Certificate has been revoked.
[e:\dev\nss\36\mozilla\dist\os22.45_icc_dbg.obj\bin]execmode certutil.exe
ExecMode v1.0 - Single processor execution mode utility
ACTIVATED OPTIONS:
SINGLE-PROCESSOR files option.
FILES MODIFIED:
certutil.exe (modified)
1 file analyzed, 1 file MODIFIED.
[e:\dev\nss\36\mozilla\dist\os22.45_icc_dbg.obj\bin]revoked.cmd
Time for initial uncached verification : 0: 6.988000
Time for 200 cached verifications : 0: 0. 55000
Average time per cached verification : 0: 0. 0
E:\DEV\NSS\36\MOZILLA\DIST\OS22.45_ICC_DBG.OBJ\BIN\CERTUTIL.EXE: certificate is
invalid: Peer's Certificate has been revoked.
[e:\dev\nss\36\mozilla\dist\os22.45_icc_dbg.obj\bin]
Reporter | ||
Comment 2•23 years ago
|
||
FYI, I found a line in os2vacpp.asm that says:
; lock ; Uncomment for SMP
So I uncommented it and rebuilt NSPR.
Unfortunately, it made things even worse.
[e:\dev\nss\36\mozilla\dist\os22.45_icc_dbg.obj\bin]revoked.cmd
Time for initial uncached verification : 0: 6.981000
Time for 200 cached verifications : 0: 5.371000
Average time per cached verification : 0: 0. 26000
Just to clarify my test protocol, the 200 verifications are happening in 10
threads, each doing 20 verifications.
Reporter | ||
Comment 3•23 years ago
|
||
Changing description.
Summary: SemRequest486 is not safe for SMP (OS/2) → SemRequest486 is extremely slow on OS/2 SMP
Reporter | ||
Comment 4•23 years ago
|
||
I found that there was a USE_RAMSEM macro in NSPR to use the assembly code. By
undefining it - and doing some cleanup in the ASM file to allow linking - I was
able to generate a version of NSPR without the RAM sems, that reverts to using the
OS/2 mutex semaphore calls. Much to my surprise, that still did not resolve the
stalling problem when running in multiprocessor mode.
Note that the test below is different from yesterday's.
Differences include :
1) times are measured more precisely, down to the microsecond
2) I am using a reader/writer lock to protect my cert verification, instead of a
global lock. This allows the code to scale close to linearly with CPUs. I have
verified that it scales 95% on a 2-way Ultrasparc machine, and doesn't stall.
However, the above differences are only between yesterday's test and today's -
all 4 results below are exactly the same test - with only 2 variables :
1) the type of semaphore used - RAM sem vs OS/2 mutex sem
2) whether the process is allowed to run on more than one CPU, as configured by
the "execmode" program. The OS/2 kernel automatically dispatches threads to
multiple CPUs. The machine is a dual Athlon MP 1500+.
--- Results with RAM sems, multi-processor mode :
Time for 5000 cached verifications : 1: 7.451067
Threads : 10. Iterations : 500.
Average time per cached verification : 0: 0. 13477
--- Results with RAM sems, single-processor mode :
Time for 5000 cached verifications : 0: 1.236268
Threads : 10. Iterations : 500.
Average time per cached verification : 0: 0. 241
--- Results with OS/2 mutex sems, multi-processor mode :
Time for initial uncached verification : 0: 6.844070
Time for 5000 cached verifications : 2: 8.772234
Threads : 10. Iterations : 500.
Average time per cached verification : 0: 0. 25747
--- Results with OS/2 mutex sems, single-processor mode :
Time for initial uncached verification : 0: 6.850681
Time for 5000 cached verifications : 0: 3.726761
Threads : 10. Iterations : 500.
Average time per cached verification : 0: 0. 738
As you can see, both types of semaphore give very bad performance when the
multi-processor mode is enabled. Actually, the entire process seemed stalled in
both cases. In one of the tests, I had to use the machine and start some other
processes to finally trigger the kernel to reschedule the threads to the running
state, after two minutes of waiting and both CPUs being idle. There is something
seriously wrong with the locking here. I'm afraid that it doesn't only affect
NSPR, though. I will try to write a simple non-NSPR multithreaded program that
increments a counter protected by a mutex, and see how it does.
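A test of the kind described above might look roughly like the following. This is only a minimal sketch, not the attached test: it assumes POSIX threads as a stand-in for the OS/2 thread and mutex calls, and the names run_counter_test, NUM_THREADS and ITERATIONS are hypothetical, chosen to mirror the 10 threads and 500 iterations used in the runs above.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 10   /* assumption: same thread count as the runs above */
#define ITERATIONS  500  /* assumption: same per-thread iteration count */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Each worker does nothing but take the mutex, bump the counter and
 * release the mutex, so lock/unlock operations are as frequent as possible. */
static void *counter_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs the stress test once and returns the final counter value.
 * A correct mutex must yield exactly NUM_THREADS * ITERATIONS. */
static long run_counter_test(void)
{
    pthread_t threads[NUM_THREADS];

    counter = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, counter_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return counter;
}
```

Calling run_counter_test() from main and timing it once in multi-processor mode and once in single-processor mode would give the two numbers to compare. The returned count only checks mutual exclusion; it says nothing about stalling, which has to be judged from the elapsed time and the CPU monitor.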
I also conducted another test, of a good certificate, where each verification
actually takes longer (about 50,000 microseconds of one CPU's worth of time, as
opposed to about 250 microseconds in the previous test) . In that case, there is
no stalling and everything scales linearly !!! Somehow the process stalling only
seems to happen when the lock/unlock operations are very frequent.
--- Results with OS/2 mutex sems, single-processor mode :
Time for initial uncached verification : 0: 6.964880
Time for 100 cached verifications : 0:10.762283
Threads : 10. Iterations : 10.
Average time per cached verification : 0: 0.107614
--- Results with OS/2 mutex sems, multi-processor mode :
Time for initial uncached verification : 0: 6.984082
Time for 100 cached verifications : 0: 5.679035
Threads : 10. Iterations : 10.
Average time per cached verification : 0: 0. 56777
Reporter | ||
Comment 5•23 years ago
|
||
I wrote a separate test, but couldn't reproduce the stalling thread problem. I
could reproduce 3x worse performance on SMP, but both CPUs were peaked at all
times during the test. I have attached the tests. I would be interested in the
results on other SMP machines. Here are the results on my dual athlon:
[e:\dev\projets\sems]semtest.cmd
IBM* C and C++ Compilers for OS/2*, AIX* and for Windows NT**, Version 3.6
(C) Copyright IBM Corp. 1991, 1997 All Rights Reserved.
* registered trademarks of IBM Corp., ** registered trademark of Microsoft Corp.
single-processor test
Threads : 10. Iterations : 500.
Clock ticks elapsed : 3272226.
Clock ticks per operation : 654.445200.
multi-processor test (if on an SMP machine)
Threads : 10. Iterations : 500.
Clock ticks elapsed : 9911315.
Clock ticks per operation : 1982.263000.
Reporter | ||
Comment 6•23 years ago
|
||
Reporter | ||
Comment 7•23 years ago
|
||
Reporter | ||
Comment 8•23 years ago
|
||
I have tried to simplify my original test case, but I haven't been successful in
recreating the same stalling process problem outside of my certificate
verification test. I believe the problem may not be in NSPR, but rather in NSS.
Reporter | ||
Comment 9•23 years ago
|
||
Changing product.
Component: NSPR → Libraries
Product: NSPR → NSS
Summary: SemRequest486 is extremely slow on OS/2 SMP → certificate verification is extremely slow on OS/2 SMP
Target Milestone: --- → 3.6
Version: 4.2 → 3.6
Reporter | ||
Comment 10•23 years ago
|
||
Mike, do you have access to any OS/2 SMP machine ? I can produce a test program
for you.
I have tried on a multi-processor Sparc, a multi-processor Mac, and the test
runs fine on all of them, with no stalling.
But on OS/2, it stalls, sometimes for minutes at a time. The only fix is to
revert the process to single-processor mode with execmode.
Attaching with the debugger you gave me shows the threads waiting for locks
either on SemRequest486 or in the C runtime's heap lock.
I tried to replace the former with DosRequestMutexSem calls, and that didn't fix
the problems. I determined that the DosRequestMutexSem semaphores work properly
on my SMP system - the stress test produced no stalling.
Therefore, the only direction I can look at is the other lock - the one in the C
runtime, on free(). Perhaps there is a problem with the SMP safety of
_SemRequest from cpprmi36.dll, which is possibly another RAMSEM. Although I see
on my stack that it goes to DOSCALL1.DLL, which means the OS/2 kernel .
This is a very strange one.
Reporter | ||
Comment 11•23 years ago
|
||
OK, I made some more progress here.
- First, I have determined that the stalling only happens if the number of
threads is greater than 2. With only 2 threads, there is no stalling. With 3
threads, it happens.
I think this may be symptomatic of a race condition in the cert cache. My
machine is very fast - it's a dual athlon MP 1500+. So perhaps it simply doesn't
happen with slower systems.
- Second, I have narrowed down the problem a little more. Until now, my worker
threads in the test were doing a full certificate verification, a process which
has many steps. One of those steps is to find the certificate's issuer. I
removed the call to CERT_VerifyCert from my worker thread code, and replaced it
with a call to CERT_FindCertIssuer . The problem still occurred !
- Just to make certain, I made a third test, where I cached the certificate
issuer at the beginning of the program, and did a CRL check of the certificate
CRL in the worker threads (SEC_CheckCRL), which calls the new CRL cache code I
was trying to test. The stalling problem did not occur in that test.
Therefore, I believe the problem is with the code that does the issuer lookup.
Somehow, it ends up stalling. What's strange is that there is no deadlock - I
can see the iterations incrementing in all 3 test threads, but they increment
extremely slowly, and the CPU monitor shows both CPUs nearly idle. When I reduce
the number of threads to 2, everything reverts to normal and there is good CPU
usage back and forth on both CPUs. The two CPUs are never fully peaked at the
same time, but that's expected due to locking.
Summary: certificate verification is extremely slow on OS/2 SMP → certificate issuer lookup stalls process on OS/2 SMP
Comment 12•23 years ago
|
||
1. OS/2 DosRequestMutexSem works properly on all levels of OS/2, FWIW. Also,
doscall1 is NOT the same as the kernel. There is a bunch of stuff going on there
in some functions -- ESPECIALLY the 16bit ramsem functions under SMP.
2. Much as I personally like AMD, we have NEVER tested Athlon MP systems. The thing
that I would guess MOST likely to need tweaking is our performance monitoring
and CPUID. This could account for your problems. I suggest you test the same
software on a P3 or P4 SMP system. (As you already know, OS/2 does not support
hyperthreading at this time. If and when I implement it, I'll post something on
comp.os.os2.bugs and maybe an ecs newsgroup).
3. I haven't poked through your source code, but it looks to me like you may have
a problem in your (I assume) home-rolled semaphore code.
Reporter | ||
Comment 13•23 years ago
|
||
Scott,
Thanks for your response. I suspected there were some odd things going on with
the semaphores on SMP.
The test code that is attached isn't really demonstrating the problem to the
extent I wanted. A more complicated program based on NSS is necessary to show
the full stalling behavior. What the test program does is look up the issuer of a
certificate in multiple threads. This generates the stalling with a threadcount
of 3 or greater.
I will produce a binary for you if you are willing to run it. I don't have
access to an SMP Intel machine with OS/2 installed at the moment, only my Athlon
SMP at home. My new Intel Xeon box has OS/2 SMP installed, but it currently only
has a single CPU. Hyperthreading unfortunately didn't fool OS/2 into thinking
there were two CPUs. I do have a second CPU on order for this machine however,
but I have no ETA for it.
I can also provide a source and build tree since NSS is open-source.
If you already have an OS/2 Mozilla build environment, it will simplify things.
Ultimately I'd need to have some sort of profiling tool to diagnose this problem
better. There used to be profiling support in VACPP 3.0, but it was removed from
VACPP 3.6 unfortunately, so now I have no good way to see exactly where the time
is being spent.
Comment 14•23 years ago
|
||
I am willing to run a *simple* testcase on my dual p3 xeon. I don't have the
time until November to do a lot more than that.
Reporter | ||
Updated•23 years ago
|
Attachment #96862 -
Attachment is obsolete: true
Reporter | ||
Updated•23 years ago
|
Attachment #96863 -
Attachment is obsolete: true
Reporter | ||
Comment 15•23 years ago
|
||
I will give you a binary to run. It will be very simple to execute - you will be
able to set the threadcount to different values. Just try with 1 and 3 threads.
See if it stalls with three threads. If it does, disable the second CPU and see
if that fixes it.
This will tell us if the problem exists on your Intel-based SMP system or if it
is specific to the Athlon MP machine I'm using at home, which is useful information.
Unfortunately at this time, I have not been able to simplify the program, so you
won't be able to gather more information without debugging deep into the code.
I'm somewhat in the dark as to where exactly the stalling happens in the
program, except that I'm confident that we aren't calling DosSleep(), so it
must be a system problem as all CPUs become idle for an extended period (minutes).
I have tested the code on a variety of architectures (Power PC SMP with Mac OS X
, Sparc SMP with Solaris 8, Intel hyperthreaded Xeon with Win2K) and none of
them shows this problem. This is why I think the problem lies in one of the OS/2
system calls under SMP. I originally suspected semaphores but I'm not sure,
since my previous program which strictly tested semaphore was unable to
reproduce stalling. Only the more complex NSS certificate checking program,
which uses the NSPR library, reproduces the problem. The problem could also lie
in NSPR, which is a portability API layer on top of the OS functions. Mostly it
doesn't implement its own functions and just relies on the OS, but there are
exceptions and SemRequest486 was one of those exceptions. This was a RAMSEM
function implemented in assembly. I tried replacing that code with a call to
OS/2 mutexes, but that didn't fix the stalling.
Reporter | ||
Comment 16•23 years ago
|
||
This test program runs a specified number of iterations of CERT_FindCertIssuer
in parallel in a specified number of threads.
On my OS/2 SMP Athlon, the program stalls with a threadcount of 3 or greater.
Disabling a CPU fixes the problem.
My test case is to run it with 10000 iterations in 3 threads.
With both CPUs enabled, it took about 2.5 minutes to execute.
With only one CPU enabled, it completed in about 10 seconds.
I provided Scott with a binary of this program to try on his Intel based
system.
I have run the test program on various other SMP architectures and don't see
the problem on any of them.
Reporter | ||
Comment 17•23 years ago
|
||
When I attach to the process with JITDBG when the problem occurs, it always
shows the threads in one of the stacks mentioned in comment #1 . I only have
control over the implementation of SemRequest486, which I already changed to use
the OS/2 native mutexes without effect. This leaves the other stacks suspicious
- it points to a _SemRequest call in the C runtime library called by free() .
Without the source to the VACPP C runtime library, it is difficult to see if
there is something wrong with that code.
I'm going to try to recompile NSS and the test program with EMX/GCC, for which
the C runtime library is available. This might give another clue as to what the
problem is. It might not even happen with the other compiler's runtime, if as I
suspect it has to do with the types of semaphores used by the VACPP runtime.
Reporter | ||
Updated•23 years ago
|
Summary: certificate issuer lookup stalls process on OS/2 SMP → CERT_FindCertIssuer in >3 threads stalls under OS/2 SMP
Comment 18•23 years ago
|
||
The last update sounds sensible (recompiling with emx as an experiment). In any
case, I will not have time to run the testcase until next Tuesday, at the
earliest.
Reporter | ||
Comment 19•23 years ago
|
||
I'm having a hell of a time trying to build NSPR & NSS with EMX. For now I'm
only trying with the old EMX 0.9d compiler, which is based on gcc2.8.1. I'm
cc'ing Henry Sobotka who is the expert on this.
I had to make several changes to get to the point I'm at right now :
1) add some #defines to mozilla/nsprpub/pr/include/md/_os2.h for some missing
socket macros
2) change coreconf to use "cp" instead of copy, since gbash doesn't understand
the copy command
3) disable RAMSEMs by undefining USE_RAMSEM . The C code wouldn't link against
the assembly . This forces the use of OS/2 mutexes
I will attach the patches. Henry, do you always have to do all that when you
build ? Or is something still wrong in my build environment ? Also, what exact
version of the compiler do you use ?
I believe I'm close to building NSS successfully - I have built everything
successfully, except SMIME3.DLL. Somehow, emxexp is called multiple times
without arguments and the DEF file gets generated incorrectly. This is odd, as
the other DLLs build successfully with EMX.
Reporter | ||
Comment 20•23 years ago
|
||
Comment 21•23 years ago
|
||
I haven't built in quite a while due to time constraints of all kinds. I do
recall having to add socket macro defines; don't recall ever hitting the cp-copy
problem; and remember having to undefine USE_RAMSEM (due to Intel asm IIRC).
The last binaries I released (last February) were built with the gcc 3.0.x.
Before that I normally used pgcc (2.95.x).
The smime3.dll build failure appears to be due to the .def file missing from the
linkage command.
Comment 22•23 years ago
|
||
Assigned the bug to Julien.
Assignee: wtc → jpierre
QA Contact: wtc → bishakhabanerjee
Target Milestone: 3.6 → Future
Reporter | ||
Comment 23•23 years ago
|
||
I now have access to an Intel SMP box, dual 2.2 GHz Xeon. I booted it to OS/2
and ran my stalling program on it. It happened just the same as on the AMD-based
SMP system I'm running at home.
This was with the binary compiled with IBM VACPP. I have not made any progress
towards compiling the code with EMX GCC.
Comment 24•23 years ago
|
||
So the good news is that it's not your hardware. The bad news is that it's your
software. Unfortunately, I, for one, am booked solid with work for at least four
weeks.
Reporter | ||
Comment 25•23 years ago
|
||
Scott,
I understand that you are booked with work. Same here, and this bug hasn't been
given a high priority. I just thought I'd test it since I added that second CPU
to my system today.
While we definitely know it's the software that's the problem, we don't
necessarily know whose software it is that's at fault yet. It could be something
wrong in :
- Mozilla's NSPR code
- Mozilla's NSS code
- something in the IBM compiler runtime
- something in the IBM OS/2 kernel
My suspicion is that it's in one of the latter two, because the code runs fine on
so many different other architectures, and I have already reviewed the little
OS/2-specific code that we have involved here in NSPR.
The easiest of the variables to check is the compiler.
The NSPR code is the next easiest to review (again).
The NSS code comes next - I have not done much review in that area as far as this
test is concerned.
Obviously the hardest to debug if we get there is the kernel. It might not be
for another 4 weeks or more, the way we are progressing on this bug.
Assignee | ||
Comment 26•23 years ago
|
||
What about using VACPP 3.0 instead of 3.65?
Reporter | ||
Comment 27•23 years ago
|
||
That might be easier - I do have both 3.0 and 3.6.5 installed on my home system.
However, I know that we have many 64-bit types now that 3.0 won't compile. I
don't think the macros were setup to handle the two versions of VACPP. It may
not be hard to tweak though.
Reporter | ||
Comment 28•23 years ago
|
||
Henry,
Part of the problem that I see is that when my DLLs are linked, the DEF file
isn't passed anywhere on the command line. The linking is done via gcc -Zomf .
Shouldn't it be done via LINK386 ?
It seems that the coreconf rules are all wrong. If you have done a Mozilla build
in the recent past (say, since september), I'd appreciate any patches you had to
make to mozilla/security/nss .
Comment 29•23 years ago
|
||
Pierre,
The missing DEF file is precisely why you're not getting a DLL. I notice that in
my coreconf/rules.mk I've added $@.def to the end of line 361. That'll make it
work for OS/2 but will likely screw up other platforms. What's needed there is
either a block for EMX with $@.def or, as in mozilla/config/rules.mk, use of a
DEF_FILE variable that's defined for EMX (and VACPP) but left empty for other
platforms.
gcc -Zomf automatically calls link386; without -Zomf it calls ld for linkage.
Reporter | ||
Comment 30•23 years ago
|
||
Henry,
I found another fix in OS2.mk for passing in the DEF file. I got DLLs. Very
simple programs seem to run. However, NSS_Initialize fails. I haven't been able
to find out why.
I used the VACPP debugger to debug the code - I was happily surprised that I
could do that and see the source without having to use gdb.
I could trace through it, but if I just "run", the program hangs and I cannot
interrupt it to find out where it hangs. Even the IBM Just-in-time debugger
which can attach to the process cannot do it in this case.
I would say there are plenty of things wrong with the EMX build at this point -
I saw tons of warnings, such as "symbol xxx undeclared, assuming int xxx(int)" .
That's what we get for using a weakly-typed language such as C ...
Henry, I wonder when is the last time you could build and actually run the
Mozilla code with EMX. This seems like almost as much effort as porting to
another platform.
Comment 31•23 years ago
|
||
Pierre,
If you're using the VAC++ 3.65 debugger, try running the old 3.0 one if you have
it. Otherwise, there's always good old printf debugging to hone in on the point
of failure.
The compiler warnings you're seeing should in theory appear on every other
platform with the same version of gcc.
I haven't built since last February but my business normally slows down around
this time of year until the spring, so I'm hoping to take a run at it in the
coming weeks.
Reporter | ||
Comment 32•23 years ago
|
||
It looks like the inline assembly implementation of atomic functions is hosed
for EMX/GCC. PR_AtomicSet doesn't return from where it was called. This gets
called the first time there is a PR_ArenaAlloc in the NSS initialization.
There is another implementation of these atomic functions in os2vacpp.asm, but
it is written for the _Optlink calling convention, which isn't supported by EMX.
Reporter | ||
Comment 33•23 years ago
|
||
I was able to go a little further by compiling NSPR with EMX without
HAVE_ATOMICS . This uses a lock to implement atomics, and is much slower.
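To illustrate why the lock-based fallback is slower: each "atomic" operation becomes a full lock/unlock pair around an ordinary read-modify-write. The following is only a sketch under assumptions, not the NSPR source. It uses a single global POSIX mutex, the names locked_atomic_set and locked_atomic_increment are hypothetical, and the return conventions (previous value for set, new value for increment) are chosen to match those documented for PR_AtomicSet and PR_AtomicIncrement. The real fallback may be organized differently.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One global lock guarding every emulated atomic operation (assumption:
 * the real fallback may use a different locking layout). */
static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stores newval into *val and returns the value that was there before. */
static int32_t locked_atomic_set(int32_t *val, int32_t newval)
{
    int32_t old;

    assert(val != NULL);
    pthread_mutex_lock(&atomic_lock);
    old = *val;
    *val = newval;
    pthread_mutex_unlock(&atomic_lock);
    return old;
}

/* Adds one to *val and returns the incremented value. */
static int32_t locked_atomic_increment(int32_t *val)
{
    int32_t result;

    assert(val != NULL);
    pthread_mutex_lock(&atomic_lock);
    result = ++(*val);
    pthread_mutex_unlock(&atomic_lock);
    return result;
}
```

With this approach every caller contends on the same lock, so heavy use of atomics from many threads serializes them, which is consistent with the slowdown noted above.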
However, I'm now seeing crashes with DBM.
[d:\nss\36emx\mozilla\dist\os22.45_gcc_dbg.obj\bin]pk12util -d . -i user123456.p12
Assertion failed: 0, file ../../../dbm/src/hash.c, line 130
Abnormal program termination
This happens even when just importing a small p12 file, not my large CRL.
*sigh*
This gave me an idea though - I'm going to disable the atomics for my VACPP
build too and see if this has any effect on my test case. There are some
instances of using atomics in the softoken so it might.
Reporter | ||
Comment 34•23 years ago
|
||
I repeated the test by disabling atomics in the VACPP build, and it made no
difference - the program still stalled on the SMP box.
However, I got another idea - I didn't need to create my cert/key database under
the EMX build for the test. So just for the hell of it, I took the one that I
had already created with the VAC build. And lo and behold, the EMX build of my
stalling program was able to read it just fine.
And it worked as expected on the dual-CPU machine, peaking both CPUs !!!
This means that this is definitely a compiler problem with VACPP 3.6.5 .
Reporter | ||
Updated•23 years ago
|
Summary: CERT_FindCertIssuer in >3 threads stalls under OS/2 SMP → CERT_FindCertIssuer in >3 threads stalls under OS/2 SMP - VACPP bug
Reporter | ||
Comment 35•23 years ago
|
||
Mike Kaply was here today in Mountain View and we looked at the problem together
for much of this afternoon. This was indeed a problem in the compiler's runtime
library. It was using a DosSleep(1) to force a thread context switch. This is
fine on a uniprocessor machine, but very bad on a multiprocessor one. After
uncompressing it using LXLITE, I was able to make a binary patch to the IBM C
runtime DLL (CPPRMI36.DLL) to change it to do a DosSleep(0), which is the right
thing to do. With this patched DLL, the problem disappeared and the test program
no longer stalled with both processors enabled.
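The difference between the two back-off choices can be sketched as follows. This is not the IBM runtime's code, which I don't have the source for. It is a hypothetical test-and-set semaphore written with C11 atomics and POSIX threads, where sched_yield() stands in for DosSleep(0) and a 1 ms nanosleep() stands in for DosSleep(1). All function names here are made up for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_THREADS 16

static atomic_flag sem = ATOMIC_FLAG_INIT;
static long shared_counter = 0;

/* Give up the rest of the time slice; stand-in for DosSleep(0). */
static void backoff_yield(void) { sched_yield(); }

/* Sleep for about one millisecond; stand-in for DosSleep(1). */
static void backoff_sleep_1ms(void)
{
    struct timespec ts = { 0, 1000000L };
    nanosleep(&ts, NULL);
}

/* Try to claim the semaphore; on collision, call the chosen back-off. */
static void sem_request(void (*backoff)(void))
{
    while (atomic_flag_test_and_set_explicit(&sem, memory_order_acquire))
        backoff();
}

static void sem_release(void)
{
    atomic_flag_clear_explicit(&sem, memory_order_release);
}

struct worker_args { void (*backoff)(void); int iterations; };

static void *sem_worker(void *p)
{
    struct worker_args *a = p;
    for (int i = 0; i < a->iterations; i++) {
        sem_request(a->backoff);
        shared_counter++;            /* very short critical section */
        sem_release();
    }
    return NULL;
}

/* Runs nthreads workers with the given back-off; returns the final count. */
static long run_with_backoff(void (*backoff)(void), int nthreads, int iterations)
{
    pthread_t threads[MAX_THREADS];
    struct worker_args args = { backoff, iterations };

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, sem_worker, &args);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    return shared_counter;
}
```

Both variants produce the same count; the point of comparison is elapsed time. With a timed sleep, a thread that loses a collision stays off the CPU for the whole sleep interval even if the holder releases the semaphore a few microseconds later, so with very short critical sections and frequent collisions most of the wall-clock time can be spent sleeping. How long such a sleep really lasts depends on the system's timer granularity, which is an assumption here and not something measured in this bug.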
Reporter | ||
Comment 36•23 years ago
|
||
This will be fixed automatically when the Mozilla OS/2 build moves to the gcc
compiler. Adding dependency.
Comment 37•22 years ago
|
||
Out of curiosity I looked at this bug. I'm pretty sure I've seen similar
problems with other software using VAC. (And since I usually use SMP machines, I
usually see these things while the guy writing the software doesn't.)
I just had a look at the _SemHandleCollision path. I'm not sure they were
wrong to use SysSleep(1). I think there is a case with a higher priority thread
(real time vs. normal for instance) where SysSleep(0) won't do it but
SysSleep(1) works, at least for UNI systems. I did a similar thing (DosSleep(0))
in an active wait loop recently; it works on SMP but not on UNI when thread
priorities are mixed.
Apparently, this code simply isn't really SMP safe. It's also using an xchg
instruction without a lock prefix to claim the ownership of the sem. I'm not
convinced that this is safe on SMP systems; I believe it might actually cause
several threads to gain the semaphore ownership.
The right thing (TM) would have been to replace this code, and probably use
different code paths for SMP and UNI systems, letting the SMP ones use the
ingenious DCE_POSTONE flag (as they are at the right level for usage of that
flag) and hence avoid any chances for races.
But, of course the better solution is to switch to GCC. :-)
Comment 38•22 years ago
|
||
XCHG is always LOCKed, whether or not you use the LOCK prefix. Also, note that
using the OS/2 32bit semaphores is a *lot* slower on a Pentium 4. All in all,
it's worth using a fast sem package -- you just need one that works....
Comment 39•22 years ago
|
||
You got me there :-)
Reporter | ||
Comment 40•22 years ago
|
||
This bug should be resolved by the gcc build of Mozilla / NSPR / NSS. Once I get
my standalone build of NSS running with gcc, I'll verify it and close it.
Reporter | ||
Comment 42•22 years ago
|
||
Re-assigning to Mike Kaply. Please resolve this bug when regular Mozilla builds
are migrated to gcc.
Assignee | ||
Comment 44•22 years ago
|
||
We have moved to GCC.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → WONTFIX