Closed Bug 88303 Opened 23 years ago Closed 23 years ago

infinite loop while "Sending Message..."

Categories

(Core Graveyard :: Security: UI, defect, P1)

1.0 Branch
x86
Windows 2000
defect

Tracking

(Not tracked)

VERIFIED FIXED
psm2.1

People

(Reporter: mitch, Assigned: inactive-mailbox)

References

Details

(Whiteboard: [ckritzer] [MacOS_X])

Attachments

(1 file)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1+)
Gecko/20010628 Netscape6/6.1b1
BuildID:    2001062804

Using SSL=always, I sometimes get caught in an infinite loop waiting for the
message to send.  GUI indicates it is Sending Message..., the animation is
working, but the mail never gets sent.  If I click the "offline" icon in the
mail thread window, I can change my SMTP prefs to use SSL=never, return on an
online state, and successfully send the message.  

Reproducible: Sometimes
Steps to Reproduce:
1. select SSL=always
2. select "use username/password"
3. send mail

Actual Results:  sometimes mail never sends

Expected Results:  mail sends

Happens both when I select "select cert manually" and when I select "select cert
automatically". 

I usually get a dialog indicating a mail server timeout at some point AFTER I go
through steps described in the "description" above.  I will capture the dialog
contents the next time it happens and attach to this bug.
I've seen this, but not in the past week.  This might be a timing problem.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: nsenterprise
Priority: -- → P2
Target Milestone: --- → 2.0
Version: 1.01 → 2.0
Target Milestone: 2.0 → 2.1
Except for the report against 0.9.1, this sounds a lot like bug 74387 where the 
SSL poll method was not returning readable when SSL had buffered write data and 
the underlying socket was writable.  That bug was triggered by sending messages 
larger than a certain size.

Ran additional tests. Seems to be independent of PlainText messages vs HTML
messages vs PlainText+HTML messages. Seems to be independent of "forwarded" vs
"reply" vs "new message". The key trigger appears to be message size, the
threshold being somewhere between 11k and 22k (size estimates being derived from
an 11k message sent successfully as HTML and failing as HTML+PlainText).
Is this against a Mozilla that was compiled in an unusual way, say with 
MOZ_NSS_AUTOCONF?
It's a standard "commercial" build downloaded from sweetlou.  I get a new copy
every day.  I don't have information regarding how those builds are compiled.  Bob?
I've been seeing this more often recently.
Are you sending bigger messages?  There seems to be a threshold dependent upon
message size that triggers this.  
Mass assigning QA to ckritzer.
QA Contact: junruh → ckritzer
*** Bug 92588 has been marked as a duplicate of this bug. ***
Still need information on how these bits were built.  It is likely the fix made 
to the official NSS static tag were not applied to the NSS autoconf branch. 
Applying the fix to the NSS autoconf branch would be the responsibility of 
whoever is maintaining the NSS autoconf branch, not the PSM team.
Neither the Mozilla nor the Netscape commercial builds should be using NSS built
under autoconf. 

The NSS autoconf branch should be dead.

This bug appears to be a dupe of 84731.  I'll let you experts make that
determination and mark if "resoved, dupe" if you concur.

*** This bug has been marked as a duplicate of 84731 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
Marking VERIFIED DUPLICATE.
Status: RESOLVED → VERIFIED
this isn't a dup of that bug. We fixed that bug it was something different. This
is a bug in the SSL transport layer. see the stack traces in 92588. This bug was
filed long after 84731 was fixed.
Status: VERIFIED → REOPENED
Resolution: DUPLICATE → ---
Ok, so let's try to get to the bottom of this.  
Assignee: ssaux → kai.engert
Status: REOPENED → NEW
Priority: P2 → P1
I spent a full day trying to reproduce this in a controlled environment, without
any luck. I saw it once, when I didn't have a debugger attached.

The only idea I have for now: I often see that the progress window stops
somewhere between 70% and 90%, and stays there for a few moments. Using traffic
sniffers, I found out that at this point our application is ready sending all
the data. The final answer from the other side is awaited.

Is is possible that sometimes the mail server on the other side needs more time
to put the new message into its local storage, run e-mail virus scans, etc?

I'm desperate for more ideas.
In a related bug (can't track it down among chain of dups!), I found that
including this image: 
http://www.jivamedia.com/SolarEclipse2001/Mid-Inner-Composite.jpg
in a mail message would always result in the apparent send message infinite
loop. That problem went away when I changed my pref to "not use OSCP for
certificate validation." Maybe this will help
Thanks for your comment. I'm continuing my research, and I have some good news
regarding reproducing it. On Windows 2000, I can!

I will invest more time, as this really is a major bug.
Status: NEW → ASSIGNED
Here are more results out of my countless hours trying to analyze and debug this.

I'm able to reproduce this in Windows.

When the application seems to block at around 94% of sending the e-mail, there
are usually 3-4 file descriptors open which are checked by the socket transport
thread.

The SSL descriptor has over 1000 bytes waiting to be sent, but it never receives
a writeable event.

One of the sockets receives a writeable event, but that doesn't help, because it
has nothing in its incoming pipe, and its mSelectFlags does not contain the
PR_POLL_WRITE bit.

If I use a direct connection to judge.mcom.com:25, I am able to reproduce this
nearly all the time.

As a next test, I used the ssltap tool from the NSS tools. I configured this to
sit in the middle between Mozilla sending SMTP/TLS traffic and
judge.mcom.com:25. With this configuration, it *always* works!

I don't know enough yet to guess where the error could be. Ideas welcome. CC'ing
mcgreer.
While the SSL layer seems to have data writeable, but doesn't receive the
writeable event, it is receiving a readable event all the time, trying to fetch
data from the socket, only to see that nothing is available. That's why the CPU
usage goes up to 100%.
adding darin, mscott to cc list as they may know more about why the writeable
events never happen.
if i understand the way smtp uses the socket transport, it writes data
synchronously, but reads data asynchronously.  this means that while the
socket is configured to be non-blocking, the Write calls made by mailnews
do not return until all of the data has been written (to the underlying io
layer, in this case PSM).  after "successfully" writing the data to the
underlying io layer, the socket transport does not PR_Poll with PR_POLL_WRITE
set... it only PR_Poll's with PR_POLL_READ set so it can "asynchronously" read
from the socket.  is PSM expecting PR_Poll to be called with PR_POLL_WRITE?
would it help if we called PR_Flush from within nsSocketBOS::Write ??
Darin, thanks for your comment, I will try what you suggested.

Finally, I was able to produce some good trace logs, maybe this helps you with
getting more insight:

0: SSL3[95584368] In ssl3_SendRecord, calling ssl_DefSend
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 2069
0: SSL[95584368]: saving 2069 bytes of data (2069 total saved so far)
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 2069
0: SSL3[95584368] Leaving SendApplicationData, sent: 2048
0: SSL3[95584368]: gather state 1 (need 5 more)
....
0: SSL3[95584368]: gather state 1 (need 5 more)
0: SSL[95584368]: sending 2069 bytes of saved data
0: SSL[95584368]: SecureSend: sending 1025 bytes
0: SSL3[95584368] Entering SendApplicationData, to send: 1025
0: SSL3[95584368] SendApplicationData, in loop, sent: 0, to go: 1025
0: SSL3[95584368] SendRecord type: application_data (23) bytes=1025
0: SSL[95584368]: Send record (plain text) [Len: 1025]
   0d 0a 4f 54 34 4e 4c 30 4e 79 5a 57 46 30 61 57   ..OT4NL0NyZWF0aW
....
   56 6d 44 54 55 77 4d 7a 63 4e 4a 53 56 46 54 30   VmDTUwMzcNJSVFT0
   59 4e 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d   YN..------------
   2d 2d 32 35 38 34 32 35 35 38 45 34 34 32 42 44   --25842558E442BD
   46 38 37 34 33 36 42 30 36 33 2d 2d 0d 0a 0d 0a   F87436B063--....
   0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d   ..--------------
   30 38 30 36 30 33 30 32 30 38 30 38 30 36 30 38   0806030208080608
   30 32 30 35 30 31 30 38 2d 2d 0d 0a 0d 0a 2e 0d   02050108--......
   0a                                                .
0: SSL3[95584368] In ssl3_SendRecord, bytes: 1025
0: SSL3[95584368] In ssl3_SendRecord, Getting SpecReadLock
0: SSL3[95584368] In ssl3_SendRecord, comuting RecordMAC
0: SSL3[95584368] In ssl3_SendRecord, calling encode
0: SSL3[95584368] In ssl3_SendRecord, Release SpecReadLock
0: SSL[95584368]: send (encrypted) record data: [Len: 1046]
   17 03 00 04 11 3f 28 f4 d6 5a 05 57 5d dc c7 ff   .....?(..Z.W]...
....
   e4 55 88 50 8c f0 7c 61 a6 f1 75 41 c8 38 27 35   .U.P..|a..uA.8'5
   59 7c b6 70 14 66                                 Y|.p.f
0: SSL3[95584368] In ssl3_SendRecord, calling ssl_DefSend
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 1046
0: SSL[95584368]: saving 1046 bytes of data (1046 total saved so far)
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 1046
0: SSL3[95584368] Leaving SendApplicationData, sent: 1025
0: SSL3[95584368]: gather state 1 (need 5 more)
0: SSL3[95584368]: gather state 1 (need 5 more)
....
last message continuously without any other events happening.
As I said, a filedescriptor is writeable.
Darin, I assume you meant PR_Sync. I added this to the suggested function.
However, it ends up in ssl_FSync, which has an assert(0), because it is not
implemented.

I will now research whether adding PR_POLL_WRITE to polling helps.
I just read Darin's comments about the SMTP protocol's read / write behavior.
Darin, SMTP actually uses async reads & writes now. People were getting mad that
we were blocking the UI when sending large messages =), so I re-wrote it a while
a go to use async write. Not sure if this helps or hinders the analysis of the
ssl logs.
mscott: thanks for the info.  it definitely invalidates my hunch.
Some more traces:

ssl3_SendRecord is called, it tries to send out immediately as much data as
possible using ssl_DefSend. If some data couldn't be sent, ssl_SaveWriteData is
called, storing the pending data in the sslSocket's sslBuffer (pendingBuf).

The worker function that later should be called again, to flush the buffer, is
again ssl_DefSend.

The function that could trigger flushing the pending buffer is ssl_SecureSend.
This function is required to be triggered by the sockettransport calling the
OnWrite method.

The SocketTransport currently doesn't check for the outgoing socket to become
writeable, so it never triggers OnWrite. But even if it did (I set the flags
manually in the debugger), this wouldn't solve the problem immediately. 

The OnWrite method checks for some pipe to have data available. I haven't
understood this completely, but my assumption is, the socket transport assumes
that the layers communicate using a pipe with each other. I'm thankful for any
explanations you could give me on this.

But as the ssl code doesn't seem to put data into a pipe, it just put it into an
internal buffer.

I saw something else: The sslSocket has a flag called "fdx", which is commented
as "simultaneous R/W threads". This flag is set to true.

When the socket transport finds out that a socket is writeable, it triggers
(indirectly) ssl_SecureRecv. Inside this function is the code that might be able
to help us. There is a call to ssl_SendSavedWriteData, which is executed when
then pendingBuf contains data. However, this method is only called when fdx is
false.
Where is the problem? Should the ssl code fill a pipe in case it can't write
immediately? Or is it an error that fdx is set to true?
Ok, setting a breakpoint in ssl_SecureRecv, and setting ss->fdx to zero, breaks
the endless loop and the SMTP operation continues.

I don't understand what the correct fix to this problem would be, so I'm
thankful for any explanation you might have for me. Thanks.
kai, i don't know the SSL code at all, but i can perhaps explain how OnWrite
works.  an async write request has an associated stream provider.  the socket
transport poll's w/ PR_POLL_WRITE, when poll returns it checks with the stream
provider for data.  usually, the client's stream provider does not run entirely
on the socket transport thread.  instead an event is posted to the client's
thread (see nsStreamProviderProxy.cpp).  when the client handles the event, it
writes some data into a pipe which the stream provider owns.  the stream
provider then notifies the socket transport to resume writing data.  the socket
transport then attempts to write data (from the pipe) to the socket.  this
continues until the pipe is empty or the client otherwise indicates end of file.
The following sequence of events is supposed to happen:

SMTP is in the write loop, selecting for PR_POLL_WRITE and writing the message 
data to the SSL layer.  When all of the data have been written, it then tries to 
read the response, selecting for PR_POLL_READ and reading the response until it 
has a complete response line.

When the message is large enough, it almost always happens that on the last 
write of message data, the SSL code is able to encrypt and wrap that data into 
an SSL record, but the lower OS TCP layer does not accept the encrypted data (it 
returns a partial write and then returns WOULDBLOCK.)  The SSL layer, however, 
returns to the SMTP code that it has accepted the written data, so the SMTP code 
then goes into its read-the-response loop, only selecting for PR_POLL_READ and 
only calling the read method of the SSL code.

To work around this, the SSL code is supposed to do two things.  One, whenever 
it is asked to read data, it is supposed to try to flush any buffered write 
data.  Two, when asked to select for PR_POLL_READ and there is buffered write 
data, it will select the underlying socket for both read and write, so if the 
underlying socket becomes writable, the calling event loop will wake up and call 
the SSL's read method to flush the buffered write data.

The last time Mozilla had this symptom, the fix was to add that second behavior. 
 Now it seems from the previous comments that there was some change which now 
prevents SSL from flushing the buffered write data on a read.  I suggest getting 
an SSL person to investigate this "fdx" member and figure out why it is being 
set.  I believe simultaneous read/write is a newly added SSL feature.

With the new knowledge from your explanations, I did some additional testing. If
I remove testing for the fdx flag, i.e. apply the patch listed below, which
causes the pending buffer to be tested whenever the ssl layer tries to read,
sending does always work for me! Sometimes I've seen an alert (assertion), that
nsSmtpUrl is not thread safe, but it still worked.

But I agree with John that advice from a ssl expert would be very helpful, i.e.
is my patch the right solution? Or should the problem be solved differently,
because checking for the fdx flag is important?
Index: sslsecur.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/lib/ssl/sslsecur.c,v
retrieving revision 1.11.2.1
diff -u -r1.11.2.1 sslsecur.c
--- sslsecur.c	2001/07/03 18:11:17	1.11.2.1
+++ sslsecur.c	2001/08/19 14:24:02
@@ -992,7 +992,7 @@
     	return PR_FAILURE;
     }
 
-    if (!ssl_SocketIsBlocking(ss) && !ss->fdx) {
+    if (!ssl_SocketIsBlocking(ss)) {
 	ssl_GetXmitBufLock(ss);
 	if (ss->pendingBuf.len != 0) {
 	    rv = ssl_SendSavedWriteData(ss, &ss->pendingBuf, ssl_DefSend);
Adding keywords
Keywords: patch, review
To aid with testing, I created a distribution using todays source and this one
line change.

If you see this bug with your current version of Mozilla, you could help
downloading the distribution and test whether its now fixed for now.

We are particulary interested, if you see any other SSL related problem that you
didn't see before.

Linux and Windows builds are now available at
http://www.kuix.de/mozilla/build-20010820-24847-88303.php
This bug is a symptom of the NSS bug 80092.  We should
figure out a good fix for the NSS bug, but PSM can
avoid the issue altogether with Kai's new patch.

The definition of the SSL_ENABLE_FDX option is:
    SSL_ENABLE_FDX tells the SSL library whether
    the application will have two threads, one
    reading and one writing, or just one thread
    doing reads and writes alternately.
(http://www.mozilla.org/projects/security/pki/nss/ref/ssl/sslfnc.html#1086543)

PSM only has one thread doing reads and writes
alternately on an SSL socket, therefore it should
not enable the SSL_ENABLE_FDX option.

r=wtc.
rs=darin
Patch checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
I've been sending email with SSL enabled and have not seen this problem.

Marking VERIFIED FIXED on:
- MacOS91 2001-08-21-04-trunk (commercial)
- LinRH62 2001-08-21-06-trunk (commercial)
- Win98SE 2001-08-21-11-trunk (commercial)

VERIFICATION PENDING on:
- MacOS_X 2001-08-21-05-trunk (commercial)

Adding note to self to re-verify on MacOS_X
Status: RESOLVED → VERIFIED
Whiteboard: [ckritzer] [MacOS_X]
Product: PSM → Core
Version: psm2.0 → 1.0 Branch
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: