Closed Bug 328262 Opened 19 years ago Closed 18 years ago

Incorrect stress test cache statistics cause QA test failures

Categories

(NSS :: Libraries, defect, P1)

3.11
defect

Tracking

(Not tracked)

RESOLVED FIXED
3.11.1

People

(Reporter: nelson, Assigned: nelson)

References

Details

Attachments

(2 files)

A week ago, Christophe reported that he was beginning to see numerous 
stress test failures, stress testing ECC on various platforms.  
The output log for a typical failure looked like this:

ssl.sh: Stress TLS ECDH-ECDSA AES 128 CBC with SHA (no reuse) ----
[snip]
strsclnt -q -p 8444 -d ../client -B -s -w nss -c 10 -C :C004 -N \
          mace.red.iplanet.com
strsclnt started at Wed Feb 15 03:13:08 PST 2006
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 7 cache misses, 0 cache not reusable
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 9 cache misses, 0 cache not reusable
strsclnt completed at Wed Feb 15 03:13:11 PST 2006
ssl.sh: Stress TLS ECDH-ECDSA AES 128 CBC with SHA (no reuse) produced a returncode of 1, expected is 0.  FAILED

strsclnt was retuning a non-zero exit code because the sum of the 3 cache
statistics did not equal the number of connections attempted.  
The command line calls for doing 10 tests.  
There are 10 lines of output that read: SSL: Server Certificate Validated.
indicating that 10 full handshakes were done, but the sum of the cache
statistics is only 9, not 10.   This is causing nightly QA tests to report
failure, and so needs to be treated as P1.  

This problem blocks bug 324887, and blocks NSS 3.11.1.

I suspect that the explanation for the statistical error is that the cache
counters are not updated atomically.  Here are the lines of code from
ssl3con.c that update those counters:

3302:       ++ssl3stats.sch_sid_cache_not_ok;
3310:   ++ssl3stats.sch_sid_cache_hits;
3321:   ++ssl3stats.sch_sid_cache_misses;
4293:   ++ssl3stats.hsh_sid_cache_hits;
4312:   ++ssl3stats.hsh_sid_cache_not_ok;
4314:   ++ssl3stats.hsh_sid_cache_misses;
5154:       ++ssl3stats.hch_sid_cache_not_ok;
5327:   ++ssl3stats.hch_sid_cache_hits;
5389:   ++ssl3stats.hch_sid_cache_not_ok;
5394:    ++ssl3stats.hch_sid_cache_misses;
5557:    ++ssl3stats.hch_sid_cache_misses;

To test this hypothesis, I propose to change the above 11 lines of code 
to call PR_AtomicIncrement.  But there is a gotcha.  

PR_AtomicIncrement operates on PR_Int32 values only.  The counters named
above are part of a public structure, and are of type "long".  So, simply
changing each line above to call PR_AtomicIncrement will produce build
errors on 64-bit platforms where sizeof(long) != sizeof(PRInt32).

All the binary compatible solutions to this problem that occur to me require conditional compilation using a pre-processor symbol that unambiguously means 
"sizeof(long) > sizeof(PRInt32)".  
This seems to be a recurring problem for NSS.
In order to fix this bug, I will need a "feature test macro" 
(preprocessor symbol) that means "sizeof(long) > sizeof(PRInt32)".
Seems like this issue has come up many times in the last year, yet 
I do not remember the answer, if there is one.  
Anyone remember what it is?
Priority: -- → P1
This patch creates a new function, SSL_AtomicIncrementLong(long *x)
which effectively does an atomic increment on the long, whether long is
32 or 64 bits.  It uses PR_AtomicIncrement to do the hard parts.  

For the 64-bit long case, you may argue that the function is NOT atomic.
However, for programs that do lots of such increments in multiple threads
simultaneously, and which never look at the results until the end (after
all the increments are done), it can be shown that the results of this 
function are equivalent to true 64-bit atomic operators, and that's good
enough for strsclnt and selfserv.

Sorry if this gives anyone heartburn.  :)

Christophe, 
Please test this by patching a build source tree with this change.  
Trunk or branch will be OK.  Thanks.
Attachment #213432 - Flags: review?(julien.pierre.bugs)
Comment on attachment 213432 [details] [diff] [review]
patch v1 (checked in)

Nelson, the feature test macro you asked about in comment
1 can be done with NSPR's PR_BYTES_PER_LONG.

  #include "any NSPR header"

  #if PR_BYTES_PER_LONG == 4

means

  "sizeof(long) == sizeof(PRInt32)"

So you can use this compile-time test in SSL_AtomicIncrementLong.

The issue that has come up many times last year is not this
issue.  That issue is the feature test macro for "native"
(as opposed to compiler emulated) support of PRInt64 in a
32-bit build, for example, when you compile with the
-xarch=v8plus flag of the Sun compiler or the +DA2.0 flag
of the HP compiler.

Here are my review comments of your patch.

It's better to capitalize the type names tooLongStr and tooLong,
which is NSS's naming convention.  Perhaps you can name them
SSLInt64 or SSLLongLong.

I find the use of || a little hard to understand.

Your patch is a good solution.
Attachment #213432 - Flags: review+
(In reply to comment #2)
> Christophe, 
> Please test this by patching a build source tree with this change.  
> Trunk or branch will be OK.  Thanks.
> 
I have tested your patch. There is no regression and the failure in the stress test was not reproduced (tested on Solaris 9 SPARC and Solaris 10 SPARC).
I like this better, if it works.
Attachment #213432 - Flags: review?(julien.pierre.bugs) → review+
Comment on attachment 213432 [details] [diff] [review]
patch v1 (checked in)

Checking in ssl3con.c; new revision: 1.80; previous revision: 1.79
Checking in ssl3con.c; new revision: 1.76.2.4; previous revision: 1.76.2.3

I have checked in this first patch on trunk and branch to meet an 
immediate deadline.  However, I will also proceed to test the second
patch, and if it passes tests and reviews, I will back out the first 
patch and put in the second one.
Attachment #213432 - Attachment description: patch v1 → patch v1 (checked in)
*** Bug 292181 has been marked as a duplicate of this bug. ***
I believe that this bug is fixed.
Yes, I believe it is fixed, also. 
I had a slightly more elegant patch that I wanted to try as an alterantive,
but time does not permit.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: