Closed Bug 348359 Opened 18 years ago Closed 18 years ago

handshake failure using TLS/DHE-DSS/AES128-CBC/SHA suite

Categories

(NSS :: Libraries, defect, P1)

3.11
x86
Windows 2000
defect

Tracking

(Not tracked)

RESOLVED FIXED
3.11.3

People

(Reporter: tngan, Assigned: nelson)

References

Details

(Keywords: regression)

Attachments

(9 files, 4 obsolete files)

7.05 KB, text/plain
Details
14.42 KB, text/plain
Details
18.32 KB, text/plain
Details
73.36 KB, text/plain
Details
1.68 KB, patch
rrelyea
: superreview+
Details | Diff | Splinter Review
2.89 KB, patch
nelson
: review-
rrelyea
: superreview+
Details | Diff | Splinter Review
8.16 KB, patch
julien.pierre
: review+
rrelyea
: review+
Details | Diff | Splinter Review
18.44 KB, patch
julien.pierre
: review-
rrelyea
: review+
Details | Diff | Splinter Review
8.72 KB, text/plain
Details
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1b1) Gecko/20060807 BonEcho/2.0b1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1b1) Gecko/20060807 BonEcho/2.0b1

Jeremy Viegas from MSFT raised that Firefox 2 Beta 1 doesn't work with DSA server auth against the MS test site https://131.107.193.14:452
(http://131.107.193.14/ is the base page)

They claim Firefox 1.5.0.6 works fine, but something is broken in Firefox 2.0b1


Their ssltap shows client starting a second handshake while the first handshake still in progress  (see attached file)

Nelson's response:
Once I base64-decoded the attachment, and saw that it was a plain text
ssltap output, I saw the problem.  The client is starting a second
handshake, on the connection, while the first handshake is still in
progress.  This is a client error.  I do not believe this is an NSS
error.  The client probably called SSL_ResetHandshake.

Reproducible: Always

Steps to Reproduce:
1. Use Firefox 2.0 Beta to access http://131.107.193.14:452

Actual Results:  
Fails (with 2nd handshake started while the first handshake is still in process?)

Expected Results:  
Should succeed (single handshake)
Attached file ssltap output
@TO, are you able to reproduce this bug yourself? I am not.

(If only Jeremy is able to reproduce this bug, I propose he gets a bugzilla acount and cc's himself to this bug.)

When I try to connect to https://131.107.193.14:452 it occassionally "stalls" with no data being transfered at all, however, as soon as data gets transfered, I get a correct connection.

In the provided logfile I noticed that Jeremy connected to a different server, IP address 157.59.28.203 at port 8080, which I can not reach.

I also noticed that he connected to that site using a hostname "sslinterop", because during the handshake the TLS server_name extension was sent.

I conclude Jeremy connected to some other server internal at MSFT.

In order to achieve a similar test environment, I set up a host name entry to the public SSL interop server, so the TLS server_name extension got send in my test, too. But still, it works fine for me.
> Once I base64-decoded the attachment, and saw that it was a plain text
> ssltap output, I saw the problem.  The client is starting a second
> handshake, on the connection, while the first handshake is still in
> progress.  This is a client error.  I do not believe this is an NSS
> error.  The client probably called SSL_ResetHandshake.

I have no idea how this could have happened.

There is only one scenario where the client calls SSL_ResetHandshake, which is on sockets that begin with a plaintext phase. This is not true for the https protocol in general. It might get called for sockets that run over a proxy. However, I tested with a proxy myself, and did not get that behaviour when connecting to the test site.


Kai, I don't get the stalls but I cannot get a similar ssltap output as the one the MS folks sent us.

Attached is the ssltap I get consistently, which doesn't show multiple handshake connection from the client.

(I've sent email to ask Jeremy and Jiggy to register and add themselves to cc list of this bug.)
I didn't make it clear, but FF 2.0 beta on Linux works when accessing https://131.107.193.14:452/.  Only Windows build has problem.

(I'll attach the ssltap output for Linux FF for comparison)
Marking as a patch to workaround the size limit (compressed but still over 300k)
Something is wrong, but I am not yet convinced it is at the PSM level.

Nelson, as opposed to that earlier ssltap output, in my tests, the second handshake does not appear to begin as early as you suspected.

I get:
Error -5961: TCP connection reset by peer.: error on server-side socket.
Connection 1 Complete [Mon Aug 14 19:54:07 2006]
Connection #2 [Mon Aug 14 19:54:08 2006]
...
Error -5961: TCP connection reset by peer.: error on server-side socket.
Connection 2 Complete [Mon Aug 14 19:54:08 2006]

First conn is Hello v3, second conn is Hello v2 (TLS intolerant retry).

Firefox UI reports "could not establish encrypted connection because server cert has an invalid signature".

I did this test 10 about ten times. To my surprise, in one attempt it worked correctly.

Doesn't this behaviour look more like a handshake problem at the SSL lib layer, or the server layer?

PSM does not try to interfere in the middle of the handshake, IINM. 
Status: UNCONFIRMED → NEW
Ever confirmed: true
Nelson, if you want me to have a deeper look, would you like to suggest where I could add tracing in libssl?
I see no obvious NSS TLS errors in the first two above SSLTAP outputs.

Numerous comments about this bug:

1. These ssltap outputs look incomplete.  They lack information about the
end of each connection.  I suspect they contain the output to stdout but 
not the output to stderr.  I have filed a bug against ssltap, requesting
that all its output go to a single FD.  Until that bug is fixed, users of 
ssltap will need to ensure that they capture the output on both stdout and 
stderr.

2. If these ssltap outputs show different problems, they should be attached
to separate bugs.  Each bugzilla bug should show/report just one problem.

3. Please don't use the patch flag to exceed mozilla's attachment size limit.
I'm sure that bugzilla administrators will view that as abuse.
ssltap output for a single handshake will rarely exceed 20KB.  Any file 
of ssltap output that exceeds (say) 50KB must include a lot of stuff that 
is not relevant to the handshake.  Any such ssltap logs should be trimmed 
down to the essential information about the handshake.  

4. Both of the first two ssltap outputs show the connection being terminated
after the client sent its final part of the handshake to the server and is
waiting for the server to reply with its final part of the handshake. 
Kai's comment 8 above suggests that the server has dropped its end of the 
connection in both cases.  The ssltap output shows that the server did not 
send an alert record to tell us why.  That's a server fault.
Perhaps server log entries can be of some help here.  

Kai, the discussion about invalid server certs sounds like a completely
different problem.  

I see no reason to look any further into NSS or PSM based on the above two
ssltap outputs.
According to Jeremy, here's what he observed, including what's on the server side:

"I took new ssltap logs (didn’t receive yours, we had mail problems that day) –attached: dss_aes128 [ This is the first ssltap output attached to this bug ]
From the logs it appears that the last message sent is the client change_cipher_spec message.

I looked at the logs on the server:
The encrypted client change_cipher_spec message decryption fails at the server side.  This is only for the ECC enabled Firefox 2.0b1. Firefox 1.5.0.6 encrypts this message correctly. Pls take a look at the 1.5.0.6 code base, something may have changed in 2.0b1 that broke interop.

If there was a change put in to better adhere to the RFC, pls let us know."
extracted handshake portion from the ssltap output for the Linux FF, which connects successfully to the test site.
Attachment #233585 - Attachment is obsolete: true
>2. If these ssltap outputs show different problems, they should be attached
>to separate bugs.  Each bugzilla bug should show/report just one problem.

I agree with Nelson, let's focus on the first handshake and why it does not succeed. 

I looked at the server logs and the server fails the connection because it receives an invalid padding length for the message. The server does not even verify the MACit fails before it can.

001f00de  c3 54 62 a8 59 28 47 67-53 47 50 5e 3c 24 5d 31  .Tb.Y(GgSGP^<$]1
001f00ee  60 2d 7c de a8 07 e5 c3-53 3b 1b 15 30 96 36 cd  `-|.....S;..0.6.
001f00fe  f5 45 f6 3a 84 ba 57 1b-b3 45 96 03 7d c3 78 *c2*  .E.:..W..E..}.x.
001f010e  32 4e 22 c4 a0 aa 96 44-04 83 82 64 fb bd c9 cb  2N"....D...d....
001f011e  85 42 79 bf 00 32 00 00-04 00 00 00 00 0b 00 03  .By..2..........
001f012e  e7 00 03 e4 00 03 e1 30-82 03 dd 30 82 03 9c a0  .......0...0....
001f013e  03 02 01 02 02 0a 1c af-dd 2f 00 00 00 00 00 13  ........./......
001f014e  30 09 06 07 2a 86 48 ce-38 04 03 30 15 31 13 30  0...*.H.8..0.1.0

The byte in *C2* is expected to be block cipher padding length, ala pkcs#5 v 2.0
Could you check to see if padding length for the message is sent correctly?

It was not clear from the messages whether you have a consistent repro so here is one below:

Server: Longhorn
Client: Windows, Firefox 2.0b1

Repro:
- Install Firefox for Windows from http://www.mozilla.org/projects/bonecho/all-beta.html 
- Disable all DSS ciphersuites except security.ssl3.dhe_dss_aes_128_sha
- Try to connect to https://131.107.193.14:452/

Expected: Connection should succeed with dhe_dss_aes_128_sha negotiated.
Observed: Connection fails beause server resets connection.
Is there any way I can enable logging on mozilla to get the Master Secret, Key block and other security parameters calculated on the client side. The server logs are extremely bulky and may not contain all the information we need.

We might to work out a way to get the logs \ what is happening on both sides for one connection to diagnose better if simple analysis does not work.
Attached file Server Log
The server log containing the master secret and keys for one such connection.
Attachment #234740 - Attachment mime type: application/octet-stream → text/plain
Changed the bug "summary" to identify the problem we're now chasing.
Summary: ssltap shows client starting a second handshake while the first handshake still in progress → handshake failure using TLS/DHE-DSS/AES128-CBC/SHA suite
Depends on: 349966
I've added a patch to bug 349966, and a comment that explains how to use
the SSLTRACE feature.  It requires that the application not close stdout.
I don't know if FireFox closes stdout or not.  It may differ by platform
(e.g. perhaps it does on Windows, but not on Linux, or vice versa). 

Jaremy and/or Kai, Please test out those instructions (without the patch) 
to see if you get any trace output at all.  If you do, then the next step 
is to try with the patch.
I just realized that a large part of that patch applies only to the RSA case.
The patch will work for DHE-DSS but may not give all the desired info.
If not, I will work on it some more.
Kai,

In comment 0 above, mention is made of various FireFox builds, suggesting 
that Jeremy thinks there has been a regression in FireFox.  I have no way 
to correlate the build identifiers he used with versions (e.g. static tags)
of NSS.  SO I have no way to examine NSS for potential regressions.  Perhaps 
you can identify the version information (tags) of NSS used with the builds
he identified.  That would  help.

In bug 349966 I have produced a patch that will increase the amount of trace
information available from NSS debug libraries when used in programs that do
NOT close stdout.  Does FireFox close stdout on Windows?  
If not, could you build a set of NSS DEBUG shared libs with that patch 
applied for Jeremy's testing (and/or your own) ?
I ran several tests with tests clients using the NSS tip . They all succeeded .
Note that -c t specifies the SSL3 DHE DSS WITH AES 128 CBC SHA cipher suite .
I tried without -2 and -3 also to use a V2 client helo message, and got success.
In my Mozilla 1.7 browser, everything works as well.

tstclnt -c t -h 131.107.193.14 -p 452 -o -2 -3
Bad server certificate: -8179, Peer's Certificate issuer is not recognized.
subject DN: CN=DSSSrvCert
issuer  DN: CN=DSSCERTSRV
0 cache hits; 1 cache misses, 0 cache not reusable
GET /test
&#65533;&#65533;&#65533;<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>404 - File or directory not found.</TITLE>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<STYLE>
I was using 64-bit Sparc Solaris in the previous test, and my tstclnt was always succeeding.

But I found that there is an intermittent failure on x86 Solaris . It is better to use strsclnt to see these intermittent failures. Use the following command :

./strsclnt -c 100 -N -C t -o -p 452 131.107.193.14

On x86 Solaris I get the following result :

[jp96085@monstre]/export/home/nss/tip/mozilla/dist/SunOS5.10_i86pc_DBG.OBJ/bin 137 % ./strsclnt -c 100 -N -C t -o -p 452 131.107.193.14
strsclnt: PR_Send returned error -5961:
TCP connection reset by peer.
strsclnt: 0 cache hits; 8 cache misses, 0 cache not reusable
strsclnt: PR_Send returned error -5961:
TCP connection reset by peer.
strsclnt: PR_Send returned error -5961:
TCP connection reset by peer.
strsclnt: PR_Send returned error -5961:
TCP connection reset by peer.
strsclnt: 0 cache hits; 8 cache misses, 0 cache not reusable
strsclnt: NoReuse - 8 server certificates tested.

On 64-bit Sparc Solaris :

[jp96085@variation]/h/monstre/export/home/nss/tip/mozilla/dist/SunOS5.9_64_DBG.OBJ/bin 398 % ./strsclnt -c 100 -N -C t -o -p 452 131.107.193.14
strsclnt: 0 cache hits; 5 cache misses, 0 cache not reusable
strsclnt: 0 cache hits; 100 cache misses, 0 cache not reusable
strsclnt: NoReuse - 100 server certificates tested.

So, this is a platform-specific problem, apparently affecting x86.
I tried with Christophe's nightly tip bits . I also get the intermittent failure on Solaris 64-bit AMD64 . I don't get any failure on 32-bit Sparc. It might have to do with endianness.
I confirmed this is a regression in NSS (as opposed to PSM).
Firefox 1.5.0.4 originally worked, but after I dropped the
current NSS into Firefox 1.5.0.4, it also fails with the error
code -8182 (SEC_ERROR_BAD_SIGNATURE).
Nelson, in a minute I will send to you, by personal mail, two log files.

They were both produced using the same snapshot of Mozilla source code, with your patch from bug 349966 applied.

Unfortunately, on Windows, the Mozilla software suppresses any output to the console. But in a debug build we get what you want.

One log file was produced with a Firefox debug build on Windows, which failed to connect.

The other one was produced with a SeaMonkey debug build on Linux x86_64, which was able to connect successfully.

(You might see some additional connection attempts from internal product software update checks.)

Shouldn't you be able to see the broken behaviour yourself, if you use a NSS windows debug build and its command line clients and try an SSL connection to 131.107.193.14:452 ?
This is what I got in the MSVC 6.0 debugger.  This is Firefox 1.5.0.4
with the current NSS_3_11_BRANCH debug build.  I disabled TLS, so SSL 3.0
was used.

The DSA signature verification failed because it has the wrong length
(either 47 or 46 bytes).  The p, g, and y values all have the correct
length (128 bytes).

PORT_SetError(int -8182) line 189 + 9 bytes
PK11_Verify(SECKEYPublicKeyStr * 0x025f1b60, SECItemStr * 0x0157fc48, SECItemStr * 0x0157fb98, void * 0x02161ad0) line 730 + 18 bytes
ssl3_VerifySignedHashes(SSL3Hashes * 0x0157fc14, CERTCertificateStr * 0x022ab168, SECItemStr * 0x0157fc48, int 0, void * 0x02161ad0) line 923 + 21 bytes
ssl3_HandleServerKeyExchange(sslSocketStr * 0x025f79a0, unsigned char * 0x02511137, unsigned int 0) line 4841 + 34 bytes
ssl3_HandleHandshakeMessage(sslSocketStr * 0x025f79a0, unsigned char * 0x02510f81, unsigned int 438) line 7672 + 17 bytes
ssl3_HandleHandshake(sslSocketStr * 0x025f79a0, sslBufferStr * 0x025f7c00) line 7780 + 25 bytes
ssl3_HandleRecord(sslSocketStr * 0x025f79a0, SSL3Ciphertext * 0x0157fd5c, sslBufferStr * 0x025f7c00) line 8043 + 13 bytes
ssl3_GatherCompleteHandshake(sslSocketStr * 0x025f79a0, int 0) line 206 + 23 bytes
ssl_GatherRecord1stHandshake(sslSocketStr * 0x025f79a0) line 1258 + 11 bytes
ssl_Do1stHandshake(sslSocketStr * 0x025f79a0) line 149 + 13 bytes
ssl_SecureSend(sslSocketStr * 0x025f79a0, const unsigned char * 0x020c9ee0, int 472, int 0) line 1090 + 9 bytes
ssl_SecureWrite(sslSocketStr * 0x025f79a0, const unsigned char * 0x020c9ee0, int 472) line 1128 + 19 bytes
ssl_Write(PRFileDesc * 0x022b8490, const void * 0x020c9ee0, int 472) line 1413 + 21 bytes
FIREFOX! 00829181()
PR_Write(PRFileDesc * 0x02545710, const void * 0x020c9ee0, int 472) line 146 + 20 bytes
FIREFOX! 004761ff()
FIREFOX! 00491d9d()
FIREFOX! 00493967()
FIREFOX! 00491e45()
FIREFOX! 004923b6()
FIREFOX! 004777a2()
XPCOM_CORE! 6033b7c4()
pr_root(void * 0x012a4770) line 112 + 13 bytes
MSVCRT! 77c3a3b0()
KERNEL32! 7c80b6
Kai, comparing logs from two separate client runs will not be instructive
because of the random components that differ from run to run.
Comparing the client and server logs/traces from the same handshake 
with each other will reveal where the problem occurs.

There are at least two separate error behaviors being seen here.

1) the original one: client has no problem verifying the signature on the 
server key exchange, and completes the handshake, but the server is unhappy
with it and drops the connection.  I can reproduce this with NSS tstclnt
builds on Windows.

2) The client is unhappy with the DSA signature on the server key exchange
message and reports error -8182 (SEC_ERROR_BAD_SIGNATURE) locally, and
does not complete the handshake.  The browser puts up a dialog message 
that reports, incorrectly, that this is a bad signature on a certificate.
In fact it is not a signature on a certificate at all, and is misleading. :(

This second behavior is seen in browser builds, not in NSS builds (AFAIK).
That is, I see this second behavior in SeaMonkey and others see it in FF,
but I never see it with tstclnt.  With tstclnt, I see only the first 
behavior, which is the one reported by Jeremy Viegas.  I think this bug
must remain focused on Jeremy's observation (the first behavior).
I think we need a separate bug report for the second behavior.  

I still have no way of knowing what changes were committed to NSS between
the first build (the one that is believed to work OK) and the second build
(which has the problem).  Only people who built those builds can supply 
that information.  
Firefox 1.5.0.6 works, which is the most recent product release from stable MOZILLA_1_8_0_BRANCH.

The latest nightly builds of Firefox 2 beta are failing, which are produced from MOZILLA_1_8_BRANCH.

I recommend to do a
  cvs rdiff -u -r MOZILLA_1_8_0_BRANCH -r MOZILLA_1_8_BRANCH mozilla/security/nss
to view the diffs.

This tells us that NSS 3.10.2 works
while NSS 3.11.3 Beta fails.
Here is a brief analysis of the trace files Kai sent me:

trace-ssl-linux:  a single handshake occurred, 
   TLS, cipher suite 0x0032: TLS_DHE_DSS_WITH_AES_128_CBC_SHA
        no error occurred.  succeeded.

trace-ssl-windows:  3 separate handshakes occurred.
   1) TLS cipher suite 0x0035 TLS_RSA_WITH_AES_256_CBC_SHA 
        succeeded.  
   2) TLS cipher suite 0x0032 TLS_DHE_DSS_WITH_AES_128_CBC_SHA 
        failed as described in behavior 1 above.  client detected no error, 
        attempted to complete handshake. Server dropped connection.
   3) SSL 3.0 cipher suite 0x0013 SSL_DHE_DSS_WITH_3DES_EDE_CBC_SHA
        failed as described in behavior 2 above.  client detected invalid 
        signature on the server key exchange message.  client sent alert,
        and dropped connection.

I had previously reproduced behavior 1 with this command:
    tstclnt -h 131.107.193.14 -p 452 -d db -2Bfovv -c tw < stdin.txt
I attempted to reproduce behavior 2 with this command:
    tstclnt -h 131.107.193.14 -p 452 -d db -2TBfovv -c q < stdin.txt
but the test server is apparently down now.

As I previously noted, diagnosing behavior 1 will require comparing client
logs with server logs.  I have arranged to do that with Jeremy Friday morning.

I have not yet begun to diagnose behavior 2.  A separate bug is needed for
that.  I won't know if we'll need to compare client and server logs until
I have studied the log I have.  

I wonder if there is perhaps some confusion over which cipher suites are 
actually being configured, and which are being experienced in the testing.
These might be browser UI issues.  
I'm thinking softoken (Bob) or freebl (Bob or me).
I guess the SSL 3.0 DSA signature verification problem that
I reported in comment 24 is that Mozilla received an ASN.1
encoded DSA signature.  I don't have my ASN.1 book with me,
but the 47 or 46 byte length seems reasonable for an ASN.1
encoded DSA signature.
The regression is caused by the weave patch (bug 298630).
I determined this by doing a binary search along the NSS trunk.

The bug can be worked around by defining -DMP_CHAR_STORE_SLOW -DMP_IS_LITTLE_ENDIAN for Windows.  The MP_CHAR_STORE_SLOW
changes the implementation of two functions: mpi_to_weave and
weave_to_mpi, so only the default implementation of those two
functions needs to be reviewed.

Right now only Linux x86 and Linux x86_64 use the MP_CHAR_STORE_SLOW
implementation of those two functions.

Kai, you are off the hook for this bug.
Assignee: kengert → nobody
Component: Security: PSM → Libraries
Product: Core → NSS
QA Contact: libraries
Target Milestone: --- → 3.11.5
Version: Trunk → 3.11
P1, a stopper issue for some platforms.
Priority: -- → P1
Nominating for 3.11.3 Beta.
Target Milestone: 3.11.5 → 3.11.3
Attached patch Patch 1Splinter Review
It seems that we forgot to multiply by sizeof(mp_digit).
Look at now 'end' is defined and observe that 'end' and
'zero' are used in a similar way (to control the number
of iterations).

I will create another patch, which is larger but makes
the similarity more obvious.
Attachment #235499 - Flags: superreview?(rrelyea)
Attachment #235499 - Flags: review?(nelson)
Attached patch Patch 2 (do not check in) (obsolete) — Splinter Review
After I created this patch, I realized it may be illegal C
code, because the pointer 'pb' may point beyond the end of
the MP_DIGITS(&a[i]).  It works because after 'pb' goes
beyond the end of that array, we only use 'pb' as a counter
(to count how many 0's need to be stored).

So just use this patch as a help for the review of patch 1.
This patch more clearly shows the similarity of the two
loops and their iteration counts dependency on 'useda' and
'b_size'.
We should have a QA test for this cipher suite. Unfortunately, NSS doesn't have its own implementation of the server-side. However, Alexei has been working on SSL interoperability tests, including with Apache. The work needs to be completed. When it is, it should detect this type of issue.
Attached patch Patch 2 v2Splinter Review
I found that my patch 2 made mpi_to_weave more similar
to weave_to_mpi.  So I decided to go further and make
these two functions very similar.

This patch is equivalent to patch 1.  So if you can
verify the equivalence, it may be better to use patch 1
for the NSS_3_11_BRANCH (if we want to facilitate code
reviews by people outside our team) and this patch for
the trunk.
Attachment #235501 - Attachment is obsolete: true
Attachment #235508 - Flags: superreview?(rrelyea)
Attachment #235508 - Flags: review?(nelson)
Comment on attachment 235499 [details] [diff] [review]
Patch 1

r+ = relyea

To answer why this wasn't caught in other testing...

If the buffer b goes into is already zeros, not putting enough zeros will not trigger the bug. Also if useda == b_size (the likely case for most RSA ops) the bug wouldn't be triggered.

Both patches are correct. The other patch has better variable naming and looks nicer, but I don't think I like the use of marching pb passed the end of 'a'. It's safe because we don't dereference it, (we are using the pointer as nothing more than an index), but it might lead to future issues if someone tries to dereference pb.
Attachment #235499 - Flags: superreview?(rrelyea) → superreview+
Comment on attachment 235508 [details] [diff] [review]
Patch 2 v2

r+ see my previous comments about pb pointer.
Attachment #235508 - Flags: superreview?(rrelyea) → superreview+
Attached patch Alternative patch (obsolete) — Splinter Review
This alternative patch 
makes variable names consistent (b is not sometimes source and sometimes 
destination), 
it reduces the number of computations in the loops, 
it doesn't use source counters to find the end of the destination
It doesn't change the input arguments (base addresses of arrays)

I think this is a whole lot more understandable, and easier to prove correct.
Anyone agree?
Attachment #235514 - Flags: superreview?(rrelyea)
Attachment #235514 - Flags: review?(wtchang)
Comment on attachment 235514 [details] [diff] [review]
Alternative patch

I still want to rewrite a large part of the block comment, to better explain what this function does.  But for now, let's get the code right.
Attachment #235514 - Flags: review?(julien.pierre.bugs)
Comment on attachment 235514 [details] [diff] [review]
Alternative patch

It's better to follow the style of the original mpi code
(short, lowercase variable names such as pa, pb, enda, endb).
You should also change the related function weave_to_mpi the
same way.
Attachment #235514 - Flags: review?(wtchang)
Comment on attachment 235499 [details] [diff] [review]
Patch 1

The short one and two letter variable names, which are used
inconsistently (b is sometimes source, sometimes destination)
are a big part of the problem here.  They make this code
unnecessarily difficult to understand.

I'm giving this patch r- because it doesn't correct that problem..
This code has been reviewed MULTIPLE times in the past.  It was
wrong every time.  The author and the reviewers could not find
the errors.  It's time to end that cycle by using clear variable
names, used consistently, and a clear algorithm.

Each of these problems has caused a big political problem for NSS.  
The cost of another failure of this code greatly exceeds the value
of preserving the existing variable names.  I'm not willing to 
take that risk..
Attachment #235499 - Flags: review?(nelson) → review-
Comment on attachment 235508 [details] [diff] [review]
Patch 2 v2

My remarks in comment 42 apply to this patch, also.
Attachment #235508 - Flags: review?(nelson) → review-
(In reply to comment #41)
> (From update of attachment 235514 [details] [diff] [review] [edit])
> It's better to follow the style of the original mpi code
> (short, lowercase variable names such as pa, pb, enda, endb).

mpi uses letters like a and b to refer to mp_ints, items of the same type.
This code uses a and b to point to completely different types,
and uses "b" and "pb" to point to entirely different things, 
b being a destination pointer, and pb being a source pointer.  

> You should also change the related function weave_to_mpi the
> same way.

OK, I am working on a bigger patch that will replace both mpi_to_weave and
weave_to_mpi to make them consistent with each other.  I'm also rewriting
the introductory comment block.

This patch completely replaces the existing byte-at-a-time weaving 
(and de-weaving) implementation.  

My next patch will address the word-at-a-time implementation.
Attachment #235514 - Attachment is obsolete: true
Attachment #235611 - Flags: review?(julien.pierre.bugs)
Attachment #235514 - Flags: superreview?(rrelyea)
Attachment #235514 - Flags: review?(julien.pierre.bugs)
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

Julien and Bob, please review.
Attachment #235611 - Flags: review?(rrelyea)
This patch applies the same treatment to all weave and unweave code,
both byte-at-a-time and word-at-a-time.  I was able to eliminate 
two variables and several calculations from the word-at-a-time code,
which is otherwise mostly unchanged except for variable names.
Attachment #235619 - Flags: review?(julien.pierre.bugs)
Attachment #235619 - Flags: superreview?(wtchang)
Attachment #235619 - Flags: review?(rrelyea)
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

Reviewers can choose among alternatives.
Attachment #235611 - Flags: superreview?(wtchang)
Comment on attachment 235619 [details] [diff] [review]
alternative patch v3: replace all weave and unweave code

BTW, this patch does fix this bug for Win32.  I tested with:

tstclnt.exe -h 131.107.193.14 -p 452 -d db -2Bfovv -c tw < stdin.txt
I am not sure the reason why the problem doesn't occur on Sparc has anything to do with endianness anymore. On Sparc, we have assembly optimizations using floating point code. The weave code is not compiled in our libfreebl*fpu*.so libraries, which are selected on the Ultrasparc III machine I was using. So, any bug in this code cannot affect it. It may affect the Sparc libfreebl integer libraries, which would be selected on the Niagara chip, as well as on older v8 pre-Ultrasparc CPUs (32-bit libs). But I can't test this for sure because Microsoft took down their public test server. It would be very useful to have another test server, especially to check the various patches which need to be reviewed. Alexei, do you still have IIS and Apache servers setup for testing, and do they have this cipher suite turned on ?
The public test server I cited in comment 49 is still up and running.
Comment on attachment 235619 [details] [diff] [review]
alternative patch v3: replace all weave and unweave code

I agree with Nelson that we should have better variable names and comments in this code.

That said, doing this in the same patch as a bug fix makes it particularly difficult to review. It would be much easier to review as 2 patches - one which just changes the variable names without any logic change, and another one to fix the bug. It will take me more time to review the chagnes to the first mpi_to_weave for this reason.

I like the changes in both weave_to_mpi. r+ for those parts.

Your changes in the second version of mpi_to_weave are equivalent to the original. However, I the type switch leads to clarification of the comment, but complication of the code itself in the MPI_WEAVE_ONE_STEP macros. This appears twice :

-    *weaved = acc; weaved += count;
+    *(mp_weave_word *)weaved = acc; weaved += nBignums;

I don't think this is an approvement. I would prefer sticking with variable name changes and not doing a type switch that requires a cast. IMO, it's OK to have an extra variable for the cast array - just call it something like weaved_words instead of weaved, as it was originally.

And for count, you can keep the computation but assign it to a separate variable. such as:

-  count = count/sizeof(mp_weave_word);
+  mp_size nWords = nBignums/sizeof(mp_weave_word);
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

r=wtc.  The NSS_3_11_BRANCH needs a small patch like my patch 1 to
facilitate code reviews by people outside the NSS team.

I don't oppose using good variable names, but the naming convention
in this file should be familiar to Unix programmers.  It's not
necessary to rename all the function parameters and local variables
(only the loop index 'i' was not renamed by your patch).  To me,
your patch simply changed Bob's coding style to Nelson's coding
style, neither of which is my favorite coding style.  But now this
file contains two coding styles.  This is why someone said the
"entropy" of a code base increases over time.

It's also hard to prove the correctness of moving the second innner
loop's termination condition outside the outer for loop.  It depends
on the fact that nBignums >= WEAVE_WORD_SIZE.
Attachment #235611 - Flags: superreview?(wtchang) → superreview+
This log file contains the DH domain parameters (prime modulus and base)
and the client and server public keys of six connections:
NSS client -> bugzilla.mozilla.org
NSS client -> Microsoft test server
NSS client shuts down and restarts.
NSS client -> bugzilla.mozilla.org
NSS client -> Microsoft test server
NSS client shuts down and restarts.
NSS client -> bugzilla.mozilla.org
NSS client -> Microsoft test server

Some observations.
1. The prime modulus from Microsoft test server looks unusual.
It has 8 0xff bytes at the beginning and 8 0xff bytes at the
end.
2. Both servers use a base of 2.  Microsoft sends the base 2
as a 128-byte value, but after conversion to an mp_int (g),
it only has one mp_digit.
3. Both servers use the same DH domain parameters.
4. bugzilla.mozilla.org uses a new DH public key every time,
whereas Microsoft test server uses the same DH public key.
Comment on attachment 235832 [details]
DH domain parameters (prime modulus, base) and public keys from bugzilla.mozilla.org and the Microsoft test server

For all the tests we have run, the value of 'zero' in
mpi_to_weave is 0.  The value of 'zero' is also 0 when
we connect to https://bugzilla.mozilla.org, which I
guess is Apache with mod_ssl.  So it is unlikely that
we could have discovered this bug by our own testing
or by testing against Apache/mod_ssl.

When we connect to the Microsoft test server, the value
of 'zero' is nonzero as this patch shows.  (Note that
my "patch 1" has been applied, so the definition of 'zero'
is (b_zero-useda)*sizeof(mp_digit).
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

Nelson,

This looks good.

I think we should also ensure in the caller of mpi_to_weave that the output array gets pre-filled with non-zero values (a magic number such as
0xbadc0ffee0ddf00d) in the debug build. This may help detect problems of this type in the future.
Attachment #235611 - Flags: review?(julien.pierre.bugs) → review+
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

Committed this patch on trunk and 3_11 branch.
Correctly zero-fill columns in weaved array.  r=julien,wtchang. Bug 348359.
Checking in mpi/mpmontg.c; new revision: 1.20; previous revision: 1.19
Checking in mpi/mpmontg.c; new revision: 1.17.2.3; previous revision: 1.17.2.2
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

r+ = rrelyea

Use of the dest array as the counter eliminates the problem I had with wan-teh's cleaner patch.

Converting the style to more nelson-like is appropriate for this file.

I would like to see the return of the table. I don't find the changed comments any more clear, and a picture help.
Attachment #235611 - Flags: review?(rrelyea) → review+
Comment on attachment 235619 [details] [diff] [review]
alternative patch v3: replace all weave and unweave code

r+ for the trunk.

This is mostly a variable name change. The changes make the functions more consistent witht the prevailing style in mpmontg.c
Attachment #235619 - Flags: review?(rrelyea) → review+
Attached patch Workaround for NSS_3_11_BRANCH (obsolete) — Splinter Review
As I already mentioned in comment 53, the NSS_3_11_BRANCH needs
a simple patch so that we can easily convince ourselves and people
unfamiliar with the NSS source code that the change won't invalidate
all the testing we have done on the NSS_3_11_BRANCH.

I have determined that 'zero' has the value 0 for all the tests
we have run.  This implies that my "patch 1" only modifies code
paths that were not taken in the tests, and therefore does not
invalidate the test results.

But I think this patch makes a more convincing case to people
unfamiliar with the NSS source code that the change is safe.

The code reviewrs need to:
1. verify that this patch does work around the patch, and
2. verify that this patch can convince someone unfamiliar
with the code that the patch is safe, in the sense that it
can only eliminate uninitialized memory reads (UMR).

This patch should have low overhead.  First, its cost is
lower than the cost of weaving.  If the cost of weaving
is negligible, so is the cost of this patch.  Second, for
regular Firefox use (visits to popular web sites such as
bugzilla.mozilla.org, www.etrade.com, www.wellsfargo.com,
www.amazon.com, www.ebay.com, www.yahoo.com, gmail.google.com),
the size of 'powersArray' is:
- most common: 258 bytes
- occasionally: 514 bytes
- rarely: 2064 bytes

In our all.sh, the size of 2080 bytes is more common, but
the size of 258 bytes is still most common.

By the way, the DHE_RSA cipher suites are commonly used because
Firefox and Apache both support them and Firefox put
TLS_DHE_RSA_WITH_AES_256_CBC_SHA at the top of the cipher suite
list, only second to the ECC cipher suites.  So our DHE code is
being exercised a lot in practice.
Attachment #235943 - Flags: superreview?(rrelyea)
Attachment #235943 - Flags: review?(nelson)
Comment on attachment 235943 [details] [diff] [review]
Workaround for NSS_3_11_BRANCH

>+#ifndef MP_CHAR_STORE_SLOW
>+  /*
>+   * mpi_to_weave miscalculates the number of bytes it needs to zero,
>+   * so we initialize the buffer to zero bytes to work around the bug.
>+   * (Bugzilla bug 348359).
>+   */
>+  powersArray = (unsigned char *)calloc(num_powers*(nLen*sizeof(mp_digit)+1),1);
>+#else

Wan-Teh, I don't object in principle to zero'ing the array up front, rather 
than as we're filling in the columns of the array, but I do object to a big 
red banner (such as the above comment) saying "UNFIXED BUG HERE".

Let me suggest that you do this calloc (without the comment), and remove the 
loop in mpi_to_weave that does insufficient zeroing.  You can claim that 
zero'ing up front is more efficient and that the loop being removed is 
unnecessary and duplicative.  I think any competent code reviewer should be 
able to see that factoring out the array zeroing, and putting it up front is 
equivalent to doing it column by column.
Comment on attachment 235619 [details] [diff] [review]
alternative patch v3: replace all weave and unweave code

Although I don't have time to review this patch, I agree that
this patch should be checked in so that the two versions of
mpi_to_weave and weave_to_mpi have the same function parameter
names.
Comment on attachment 235943 [details] [diff] [review]
Workaround for NSS_3_11_BRANCH

If I also remove that for loop from mpi_to_weave, the code
reviewer will need to trace the code to see powersArray is
passed as the 'b' (new name 'weaved') argument to mpi_to_weave
and each byte in powersArray is only written once.  This
is hard.

I want a patch that someone can easily say, "this patch is
safe unless the program incorrectly depends on uninitialized
variable values."
Comment on attachment 235619 [details] [diff] [review]
alternative patch v3: replace all weave and unweave code

I reviewed this patch in comment 52 . The only reason for r- is that I don't like the changes in the second (word) version of mpi_to_weave. I'm fine with all 3 other functions.
Attachment #235619 - Flags: review?(julien.pierre.bugs) → review-
This P1 cannot be left assigned to nobody. Reassigning to Nelson, since he came up with the fix which was reviewed and checked in. I think this bug should be marked resolved fixed.
Assignee: nobody → nelson
Comment on attachment 235499 [details] [diff] [review]
Patch 1

Nelson, please review this patch again for NSS_3_11_BRANCH.
This small patch is more appropriate for NSS_3_11_BRANCH.
Attachment #235499 - Flags: review- → review?
I verified that NSS on Mac OS X PowerPC also has this bug.
So this bug has nothing to do with endianness.
Comment on attachment 235611 [details] [diff] [review]
Alternative byte-at-a-time patch, v2 (checked in)

Merely recording the fact (again) that this patch was committed on August 28.  See comment 57
Attachment #235611 - Attachment description: Alternative byte-at-a-time patch, v2 → Alternative byte-at-a-time patch, v2 (checked in)
I think this bug is now resolved/fixed.
Wan-Teh, do you concur?
This bug was fixed in NSS 3.11.3.  Unfortunately it left the function
parameter names different in the two implementations of mpi_to_weave
and weave_to_mpi as shown below.

The implementation for the case MP_CHAR_STORE_SLOW is not defined:

mp_err mpi_to_weave(const mp_int  *bignums,
                    unsigned char *weaved,
                    mp_size nDigits,  /* in each mp_int of input */
                    mp_size nBignums) /* in the entire source array */

mp_err weave_to_mpi(mp_int *a,                /* output, result */
                    const unsigned char *pSrc, /* input, byte matrix */
                    mp_size nDigits,          /* per mp_int output */
                    mp_size nBignums)         /* bignums in weaved matrix */

The implementation for the case MP_CHAR_STORE_SLOW is defined:

mp_err mpi_to_weave(const mp_int *a, unsigned char *b,
                                        mp_size b_size, mp_size count)

mp_err weave_to_mpi(mp_int *a, const unsigned char *b,
                                        mp_size b_size, mp_size count)
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Attachment #235943 - Attachment is obsolete: true
Attachment #235943 - Flags: superreview?(rrelyea)
Attachment #235943 - Flags: review?(nelson)
Attachment #235499 - Flags: review?
Keywords: regression
Comment on attachment 235619 [details] [diff] [review]
alternative patch v3: replace all weave and unweave code

cancelling my SR request, since this bug is now FIXED.
Attachment #235619 - Flags: superreview?(wtchang)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: