Closed Bug 338798 Opened 15 years ago Closed 15 years ago

selfserv reports error -12272 SSL_ERROR_BAD_MAC_ALERT in QA stress tests

Categories

(NSS :: Libraries, defect, P1)

3.11.1
HP
HP-UX
defect

Tracking

(Not tracked)

RESOLVED FIXED
3.11.2

People

(Reporter: slavomir.katuscak+mozilla, Assigned: alvolkov.bgs)

References

Details

Attachments

(5 files, 1 obsolete file)

ssl.sh: Stress TLS  RC4 128 with MD5 ----
selfserv -D -p 8444 -d ../ext_server -n charm.red.iplanet.com -B -s \
         -e charm.red.iplanet.com-ec -w nss   -i ../tests_pid.22364  &
selfserv started at Sun May 21 08:28:10 PDT 2006
tstclnt -p 8444 -h charm.red.iplanet.com  -q \
        -d ../ext_client < /share/builds/mccrel3/security/securitytip/builds/20060521.1/wozzeck_Solaris8/mozilla/security/nss/tests/ssl/sslreq.dat
strsclnt -q -p 8444 -d ../ext_client  -w nss -c 1000 -C c \
          charm.red.iplanet.com
strsclnt started at Sun May 21 08:28:11 PDT 2006
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.
strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 442 cache hits; 7 cache misses, 0 cache not reusable
strsclnt completed at Sun May 21 08:28:13 PDT 2006
ssl.sh: Stress TLS  RC4 128 with MD5 produced a returncode of 1, expected is 0.  FAILED

Appeared during nightly testing securitytip 20060521.1 on HP-UXB.11.11_64_OPT.OBJ architecture.
ssl.sh: Stress TLS  ECDHE-RSA   AES 128 CBC with SHA ----
selfserv -D -p 8444 -d ../server -n charm.red.iplanet.com  \
         -e charm.red.iplanet.com-ec -w nss -c :C013 -i ../tests_pid.27075  &
selfserv started at Mon May 22 05:18:35 PDT 2006
tstclnt -p 8444 -h charm.red.iplanet.com -B -s -q \
        -d ../client < /share/builds/mccrel3/security/securitytip/builds/20060522.1/wozzeck_Solaris8/mozilla/security/nss/tests/ssl/sslreq.dat
strsclnt -q -p 8444 -d ../client -B -s -w nss -2 -c 1000 -C :C013 \
          charm.red.iplanet.com
strsclnt started at Mon May 22 05:18:36 PDT 2006
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.
strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 520 cache hits; 2 cache misses, 0 cache not reusable
strsclnt completed at Mon May 22 05:18:39 PDT 2006
ssl.sh: Stress TLS  ECDHE-RSA   AES 128 CBC with SHA produced a returncode of 1, expected is 0.  FAILED

Appeared during nightly testing securitytip 20060522.1 on
HP-UXB.11.11_64_OPT.OBJ architecture.
Summary: HDX PR_Read error -12272 in Stress Test TLS RC4 128 with MD5 → HDX PR_Read error -12272
In these logs, selfserv complains that it has received an alert from strsclnt, 
SSL_ERROR_BAD_MAC_ALERT, complaining that strsclnt has received a record 
with a bad MAC (from selfserv).

strsclnt should also have logged this error in its own output.
Where is strsclnt's log of this problem?
Is there a defect in strsclnt or in the test script, that causes the errors
reported by strsclnt not to appear in these logs?
Summary: HDX PR_Read error -12272 → selfserv reports error -12272 SSL_ERROR_BAD_MAC_ALERT in QA stress tests
First is started selfserv as backgroung process and then is started strsclnt. Outputs from both scripts (including stderr) are forwarded to the same log file.
This problem is being seen consistently in the nightly QA on the TRUNK,
and is not being seen in the nightly QA on the NSS 3_11 BRANCH.

So, I think the trunk and branch have diverged in some SIGNIFICANT way,
and we need to fix the trunk.  

Are the trunk and branch builds using different versions of NSPR? 
Priority: -- → P1
Target Milestone: --- → 3.12
Version: 3.11.1 → 3.12
Nelson,

The trunk builds are using the NSPR tip, and the branch builds are using NSPR_4_6_2_RTM . This info can be found in our nightly build emails, on the line "CVS tag(s)" . But I really doubt that NSPR is the problem. Not many changes are going on in NSPR.

If the problem is systematically reproducible, we could try putting the libsoftokn3 and libfreebl* binary libraries from the branch into the tip tree, and see if the problem goes away. That would be my first guess. If that doesn't do it, we should be able to copy the branch libssl3 and libnss3 libraries into the tip tree as well.
According to Slavomir, this problem is also being seen intermittently on 
HPUX 64 bit builds in 3.11.1. 

Alexei, please look into why the problem is being reported by the server
and not (apparently) by the client.  Any of the "xxxx_ALERT error codes
means that the process has received an SSL error record from its remote
peer.  Both sides should have logged the error.  If the server reports
a BAD_MAC_ALERT error, then the client should have reported a BAD_MAC 
error.  According to Slavomir, we're seeing the alerts from selfserv but
not the BAD_MAVs from strsclnt.  Need to get to the bottom of that.  
Could be a bug in strsclnt, not outputting error messages, or could be a
problem with the QA test scripts, not causing the stderr output from
the client to be redirected to the log file.  
Target Milestone: 3.12 → 3.11.2
Version: 3.12 → 3.11.1
Assignee: nobody → alexei.volkov.bugs
Nelson,

If you look at the output reported in the bug, first we see message from server reporting an error BAD_MAC_ALERT

selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
... some output ...

and then we see client reporting BAD_MAC_READ. 

strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.
strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.

I don't see any diviations in server/client behaviour from what you've mentioned.

The question why server gets BAD_MAC_ALERT on HP is still needed to be answered.
I guess I should correct my last posting: server receives BAD_MAC_ALERT after client encountered message with bad mac. So the question is why client gets incorrect Message Authentication Code.
I am running the ssl.sh script attached to this bug
on our HP-UX machine.  The build is HP-UXB.11.11_64_OPT.OBJ
based on NSS_3_11_BRANCH (with the special ecl-curve.h from
NSS_3_11_1_RTM), with NSS_ENABLE_ECC=1 and NSS_ECC_MORE_THAN_SUITE_B=1.

The tests are still running (at the
"SSL Stress Test - Client Bypass - with ECC" stage).
So far results.html is all green.
My ssl.sh run finished with an all-green results.html file.
I then ran all.sh six times.  All the results.html files
are all green, and none of the output.log files contain
"-1227" (the error codes -12272 and -12273).

Here is our system and compiler info:
$ uname -a
HP-UX typhoon B.11.11 U 9000/785 2015658734 unlimited-user license
$ ls -l `which cc`
lr-xr-xr-t   1 root       sys             17 Jun 11  2004 /bin/cc@ -> /opt/ansic
/bin/cc
$ ls -l /opt/ansic/bin/cc
-r-xr-xr-x   2 bin        bin         565248 Apr  4  2000 /opt/ansic/bin/cc*
$ what /opt/ansic/bin/cc
/opt/ansic/bin/cc:
         LINT A.11.01.20 CXREF A.11.01.20
        HP92453-01 A.11.01.20 HP C Compiler
         $ CUPI80_IC7  Jan  7 1999 11:20:34 $

I remember you (Sun) are using a newer version of the
compiler.

The 'top' command shows that the system has two CPUs.

Alexei, can you reproduce this problem with your NSS_3_11_1_RTM
build?
At Sun, we have:

$ what /opt/ansic/bin/cc
/opt/ansic/bin/cc:
        $Revision: 92453-07 linker linker crt0.o B.11.16.01 030316 $
        LINT B.11.11.08 CXREF B.11.11.08
        HP92453-01 B.11.11.08 HP C Compiler
         $ PATCH/11.00:PHCO_27774  Oct  3 2002 09:45:59 $ 

Our machines also have 2 CPUs (L1000 servers).
I sent Alexei our HP-UXB.11.11_64_OPT.OBJ shared libraries
and test programs.  This will tell us whether the version
of the compiler matters.
I found an interesting information on the web:

If you have installed linker patch PHSS_26559 on HP-UX 11.0 or PHSS_26560 on HP-UX
11.11, you must upgrade to HP WDB 3.x.

If you try to debug 64-bit shared programs with those linker patches, HP WDB versions
earlier than 3.x will fail, generating the following message:
Version of shared library
shared_library_name not understood;
there may be new features in the library that gdb does not support

I have updated HP WDB to version 5.4 and the error message disappeared. Alexei is currently testing this newer version of the debugger.

Note: HP WDB is gdb.
Testing binaries from Wan-teh for more when four hours. Tests are still running without any failures.
Bug 311279 is the last time the older HP C compiler
used at Red Hat generated correct code but the
newer HP C compiler used at Sun doesn't.  As I mentioned
in bug 311279 comment 12, I reported the problem to HP.
If I remember correctly, HP's reply was that the code in
question was not proper C.  However, the newer HP C
compiler didn't report a compilation error; it just
generated bad code.

I don't know where to look.  I guess you could examine
the compiler warnings...
I did number of testings today.

Wan-teh build (3.11 branch) consistently passes on our HP boxes. Wan-teh also ran our bits, and he sees some failures(I'm not quite sure what tests failed) . Christophe applied some patched to the compiler and OS today, but this didn't help to resolve the problem. It is still a mystery why our bits are failing and Wan-teh's are not.

At the same time, I've narrowed search path down a little bit. The client reports bad record MAC primary reason of which is incorrect text padding of the record. I don't have clear understanding yet why this happening.

Alexei,

Were you able to determine which shared library causes the problem, by mixing bits from the broken and working builds ? We need to make a decision about this bug today. This is the last P1 blocking the release of 3.11.2 .
Other useful information to know would be :

1) is this a client or server problem, or both ?
You can use a remote machine on a different platform with a known good build to test for this. Then run selfserv on HP and strsclnt on the remote; and strsclnt on HP and selfserv on the remote.

2) does this error happen only in multithread mode ?
You can force strsclnt to single thread with -t 1 .
For selfserv you need to change MIN_THREADS to 1 and recompile, then use -t 1 .
Just use -t 1 on both client/server sides and see if you can ever get the problem to show. If it's really the compiler putting out bad code, then I would expect that it would show even in single-thread some of the time. On the other hand, we only got the error reported with strsclnt, but not tstclnt. But that could just be probabilities.
This bug is reproducible in m-threaded environment only. This problem is on the server side. Use of -t 1 on the client side does not produce any problem.

libsoftoken.sl is the source of the failure.

I could not reproduce it on 3.10 nor from 3.11 from NSS_3_11_RTM built without ecc. But with NSS_3_11_BRANCH and ecc enabled it reproducible from the beginning
of May. 
 

The bug was reproduced on tonight during nightly testing on HP-UXB.11.11_64_DBG.OBJ build.

ssl.sh: Stress TLS  ECDHE-RSA   AES 128 CBC with SHA ----
selfserv -D -p 8444 -d ../server -n charm.red.iplanet.com -B -s \
         -e charm.red.iplanet.com-ec -w nss -c :C013 -i ../tests_pid.8176  &
selfserv started at Mon Jun 19 07:09:36 PDT 2006
tstclnt -p 8444 -h charm.red.iplanet.com  -q \
        -d ../client < /share/builds/mccrel3/security/securitytip/builds/2006061
9.1/hploan1_HPUX11i/mozilla/security/nss/tests/ssl/sslreq.dat
strsclnt -q -p 8444 -d ../client  -w nss -2 -c 1000 -C :C013 \
          charm.red.iplanet.com
strsclnt started at Mon Jun 19 07:09:36 PDT 2006
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: 0 cache hits; 1 cache misses, 0 cache not reusable
selfserv: HDX PR_Read returned error -12272:
SSL peer reports incorrect Message Authentication Code.
strsclnt: PR_Send returned error -12273:
SSL received a record with an incorrect Message Authentication Code.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
strsclnt: -- SSL: Server Certificate Validated.
Alexei, please search for the string "= {" in
lib/softoken/pkcs11c.c and look for SECItem local
variables initialized with the addresses of other
local variables like this:

            SECItem crsr   = { siBuffer, crsrdata, sizeof crsrdata };
            SECItem master = { siBuffer, key_block, SSL3_MASTER_SECRET_LENGTH};

The version of the HP C compiler you are using is
known to have problem with such initializers (bug
311279).  Also, in rev. 1.70 of lib/softoken/pkcs11c.c
Nelson fixed a similar problem (bug 274512).  In
fact, bug 274512, in NSS 3.11.1, might be the bug
fix that introduced this regression.

Replace such code with code like this:

            SECItem crsr   = { siBuffer, NULL, 0 };
            SECItem master = { siBuffer, NULL, 0};

            crsr.data = crsrdata;
            crsr.len = sizeof crsrdata;
            master.data = key_block;
            master.len = SSL3_MASTER_SECRET_LENGTH;

If this is indeed the culprit, we need to write a
regular expression to search for all such initializers
in our source tree.
Attached patch Proposed patchSplinter Review
When I reported bug 311279 to HP, HP's compiler team
said that such code is illegal and not C89.  A structure
member can only be initialized by a constant expression
in C89.  It is legal in C99.  I confirmed this with the
following sentences in the C Programming Language, 2nd Ed.,
A8.7, p. 219:

  All the expressions in the initializer for a static object
  or array must be constant expressions as described in A7.19.
  The expression in the initializer for an auto or register
  object or array must likewise be constant expressions if the
  initializer is a brace-enclosed list.
Attachment #226251 - Flags: superreview?(nelson)
Attachment #226251 - Flags: review?
Attachment #226251 - Flags: review? → review?(alexei.volkov.bugs)
Comment on attachment 226251 [details] [diff] [review]
Proposed patch

Patch looks fine. Tests, that were failing within first 5min, are running continuously for 1h+. I think this patch solves the problem.
r=alexei.volkov
Attachment #226251 - Flags: review?(alexei.volkov.bugs) → review+
Some more info.

The root cause of this bug is that the HP C compiler
silently generates incorrect code for such code that
is illegal in C89.

We can catch such code by compiling with GCC (on Linux)
with the -pedantic option and searching for the warning
"initializer element is not computable at load time".
(It's too much work to make NSS compile with -pedantic-errors.)

This web page on the HP C compiler
(http://docs.hp.com/en/5990-8153/ch10s05.html) says
that HP C compiler version B.11.11.08 supports
* C99 Mode with -AC99 Option.
* Non-Constant Initializers for Aggregates: Now it
  is possible to specify non-constant expressions as
  initializers in the initialization lists of
  aggregates, such as arrays and structures.
(See also HP C compiler, version B.11.11.08 release
notes at http://docs.hp.com/en/5187-4181/5187-4181.pdf.)

Christophe said in comment 12 that this is the version
of the HP C compiler you are using.  So you should try
the -AC99 option.  Since we are still using HP C
compiler version A.11.01.20, don't check in the -AC99
option until we upgrade our compiler.

This article talks about the new initializer related
features in C99:
http://www.ddj.com/dept/cpp/184401377.
Here are the things that trouble me :

a) If the problem is related to the HP compiler code generation, why does it show up intermittently and only in multithread mode, as if it was a race condition ?

b) If the construct is illegal C89, why does the HP compiler not produce a compilation error ? I can't see how silently generating bad code is ever going to be the right thing for anybody.

Do we know what code the compiler is generating ? Is it perhaps simply skipping the initialization ? That might be the answer to a) .
Comment on attachment 226251 [details] [diff] [review]
Proposed patch

The patch looks fine. Please checkin to NSS_3_11_BRANCH .
Attachment #226251 - Flags: review+
I checked in the proposed patch on the NSS trunk (3.12)
and the NSS_3_11_BRANCH (3.11.2).

Checking in pkcs11c.c;
/cvsroot/mozilla/security/nss/lib/softoken/pkcs11c.c,v  <--  pkcs11c.c
new revision: 1.84; previous revision: 1.83
done

Checking in pkcs11c.c;
/cvsroot/mozilla/security/nss/lib/softoken/pkcs11c.c,v  <--  pkcs11c.c
new revision: 1.68.2.14; previous revision: 1.68.2.13
done
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attached patch Additional patchSplinter Review
I compiled NSS_3_11_BRANCH on Linux with the GCC -pedantic option.
It reported three other instances of this problem.

ssl3con.c: In function `ssl3_HandleRSAClientKeyExchange':
ssl3con.c:6412: warning: initializer element is not computable at load time

testcrmf.c: In function `CreateCertRequest':
testcrmf.c:374: warning: initializer element is not computable at load time
testcrmf.c:375: warning: initializer element is not computable at load time

I recommend we fix these in NSS 3.11.2.  Please check in this
patch after it has received the necessary reviews.
Attachment #226442 - Flags: superreview?(julien.pierre.bugs)
Attachment #226442 - Flags: review?(alexei.volkov.bugs)
Comment on attachment 226442 [details] [diff] [review]
Additional patch

r=alexei.volkov
Attachment #226442 - Flags: review?(alexei.volkov.bugs) → review+
I tried to make NSS compile with the GCC -pedantic-errors
option and quickly gave up.  This patch is the changes I
made before I gave up.  It fixes the problems of casting an
object pointer to a function pointer (loader.c and dbinit.c) and
compiling an empty source file (ecdecode.c, with ECC disabled).
Attachment #226448 - Flags: review?(alexei.volkov.bugs)
Attachment #226251 - Flags: superreview?(nelson)
Comment on attachment 226448 [details] [diff] [review]
Incomplete patch to fix -pedantic warnings (for NSS trunk only)

r=alexei.volkov
Attachment #226448 - Flags: review?(alexei.volkov.bugs) → review+
Comment on attachment 226448 [details] [diff] [review]
Incomplete patch to fix -pedantic warnings (for NSS trunk only)

My only concern with this patch is that I believe the ifdef
should be in config.mk, rather than in manifest.mn.
This is mostly relevant to "DEFINES +="
Otherwise, r=nelson

>Index: mozilla/security/nss/lib/softoken/manifest.mn

>@@ -88,6 +87,7 @@
> 	$(NULL)
> 
> ifdef NSS_ENABLE_ECC
>+CSRCS += ecdecode.c
> DEFINES += -DNSS_ENABLE_ECC
> endif
>
Attachment #226448 - Flags: review+
I have a script that can pool out revisions on the source file based on date, build and run with test against a particular revision of nss libraries.
I've run this scripts last Monday. Result that it produced shows that 3.11 branch
was failing stress test from 25 of April, after back porting the fix for bug 274512.

Curios, why did we start seeing failures during our nightlys only at the end of last month? 
Attached file scripts
Comment on attachment 226442 [details] [diff] [review]
Additional patch

The patch is fine, but we really have to find a way to get the stupid HP compiler to either complain or generate proper code; otherwise we'll just have to drop the platform.
Attachment #226442 - Flags: superreview?(julien.pierre.bugs) → superreview+
Comment on attachment 226442 [details] [diff] [review]
Additional patch

I checked in this patch ("additional patch") on
the NSS trunk (3.12).

If you still allow checkins in 3.11.2, I recommend
taking this patch.
Nelson, you're right, we should not use ifdef's in
manifest.mn.  I omitted the change to compile ecdecode.c
conditionally; it's not worth the trouble.  I checked
in this patch on the NSS trunk (3.12).

Checking in freebl/loader.c;
/cvsroot/mozilla/security/nss/lib/freebl/loader.c,v  <--  loader.c
new revision: 1.29; previous revision: 1.28
done
Checking in softoken/dbinit.c;
/cvsroot/mozilla/security/nss/lib/softoken/dbinit.c,v  <--  dbinit.c
new revision: 1.29; previous revision: 1.28
done
Attachment #226448 - Attachment is obsolete: true
Wan-Teh,

We aren't taking checkins for 3.11.2 unless we find a stopper bug that would cause us to respin.
The "additional patch" contains one fix for crmftest which isn't critical. It also fixes some code in libssl which I believe is exercised by our QA and does not fail. I know it doesn't make much sense that the HP compiler doesn't misbehave all the time for this case, but it would appear that it didn't here. Thus, I think we can delay integrating this additional patch until 3.11.3 .
Duplicate of this bug: 367761
You need to log in before you can comment on or make changes to this bug.