88303 - infinite loop while "Sending Message..."

Reporter

Description

•

23 years ago

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1+)
Gecko/20010628 Netscape6/6.1b1
BuildID:    2001062804

Using SSL=always, I sometimes get caught in an infinite loop waiting for the
message to send.  GUI indicates it is Sending Message..., the animation is
working, but the mail never gets sent.  If I click the "offline" icon in the
mail thread window, I can change my SMTP prefs to use SSL=never, return on an
online state, and successfully send the message.  

Reproducible: Sometimes
Steps to Reproduce:
1. select SSL=always
2. select "use username/password"
3. send mail

Actual Results:  sometimes mail never sends

Expected Results:  mail sends

Happens both when I select "select cert manually" and when I select "select cert
automatically". 

I usually get a dialog indicating a mail server timeout at some point AFTER I go
through steps described in the "description" above.  I will capture the dialog
contents the next time it happens and attach to this bug.

Bob Lord

Comment 1

•

23 years ago

I've seen this, but not in the past week.  This might be a timing problem.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Keywords: nsenterprise

Priority: -- → P2

Target Milestone: --- → 2.0

Version: 1.01 → 2.0

Bob Lord

Updated

•

23 years ago

Target Milestone: 2.0 → 2.1

John G. Myers

Comment 2

•

23 years ago

Except for the report against 0.9.1, this sounds a lot like bug 74387 where the 
SSL poll method was not returning readable when SSL had buffered write data and 
the underlying socket was writable.  That bug was triggered by sending messages 
larger than a certain size.

Mitch Green

Reporter

Comment 3

•

23 years ago

Ran additional tests. Seems to be independent of PlainText messages vs HTML
messages vs PlainText+HTML messages. Seems to be independent of "forwarded" vs
"reply" vs "new message". The key trigger appears to be message size, the
threshold being somewhere between 11k and 22k (size estimates being derived from
an 11k message sent successfully as HTML and failing as HTML+PlainText).

John G. Myers

Comment 4

•

23 years ago

Is this against a Mozilla that was compiled in an unusual way, say with 
MOZ_NSS_AUTOCONF?

Mitch Green

Reporter

Comment 5

•

23 years ago

It's a standard "commercial" build downloaded from sweetlou.  I get a new copy
every day.  I don't have information regarding how those builds are compiled.  Bob?

Bob Lord

Comment 6

•

23 years ago

I've been seeing this more often recently.

Mitch Green

Reporter

Comment 7

•

23 years ago

Are you sending bigger messages?  There seems to be a threshold dependent upon
message size that triggers this.

John Unruh

Comment 8

•

23 years ago

Mass assigning QA to ckritzer.

QA Contact: junruh → ckritzer

Scott MacGregor

Comment 9

•

23 years ago

*** Bug 92588 has been marked as a duplicate of this bug. ***

John G. Myers

Comment 10

•

23 years ago

Still need information on how these bits were built.  It is likely the fix made 
to the official NSS static tag were not applied to the NSS autoconf branch. 
Applying the fix to the NSS autoconf branch would be the responsibility of 
whoever is maintaining the NSS autoconf branch, not the PSM team.

Bob Lord

Comment 11

•

23 years ago

Neither the Mozilla nor the Netscape commercial builds should be using NSS built
under autoconf. 

The NSS autoconf branch should be dead.

Mitch Green

Reporter

Comment 12

•

23 years ago

This bug appears to be a dupe of 84731.  I'll let you experts make that
determination and mark if "resoved, dupe" if you concur.

Stephane Saux

Comment 13

•

23 years ago


*** This bug has been marked as a duplicate of 84731 ***

Status: NEW → RESOLVED

Closed: 23 years ago

Resolution: --- → DUPLICATE

ckritzer (gone)

Comment 14

•

23 years ago

Marking VERIFIED DUPLICATE.

Status: RESOLVED → VERIFIED

Scott MacGregor

Comment 15

•

23 years ago

this isn't a dup of that bug. We fixed that bug it was something different. This
is a bug in the SSL transport layer. see the stack traces in 92588. This bug was
filed long after 84731 was fixed.

Status: VERIFIED → REOPENED

Resolution: DUPLICATE → ---

Stephane Saux

Comment 16

•

23 years ago

Ok, so let's try to get to the bottom of this.

Assignee: ssaux → kai.engert

Status: REOPENED → NEW

Priority: P2 → P1

Jussi-Pekka Mantere

Updated

•

23 years ago

Keywords: nsenterprise → nsenterprise+

(Don't use, old account) - Kai Engert

Assignee

Comment 17

•

23 years ago

I spent a full day trying to reproduce this in a controlled environment, without
any luck. I saw it once, when I didn't have a debugger attached.

The only idea I have for now: I often see that the progress window stops
somewhere between 70% and 90%, and stays there for a few moments. Using traffic
sniffers, I found out that at this point our application is ready sending all
the data. The final answer from the other side is awaited.

Is is possible that sometimes the mail server on the other side needs more time
to put the new message into its local storage, run e-mail virus scans, etc?

I'm desperate for more ideas.

Charles Manske

Comment 18

•

23 years ago

In a related bug (can't track it down among chain of dups!), I found that
including this image: 
http://www.jivamedia.com/SolarEclipse2001/Mid-Inner-Composite.jpg
in a mail message would always result in the apparent send message infinite
loop. That problem went away when I changed my pref to "not use OSCP for
certificate validation." Maybe this will help

(Don't use, old account) - Kai Engert

Assignee

Comment 19

•

23 years ago

Thanks for your comment. I'm continuing my research, and I have some good news
regarding reproducing it. On Windows 2000, I can!

I will invest more time, as this really is a major bug.

Status: NEW → ASSIGNED

(Don't use, old account) - Kai Engert

Assignee

Comment 20

•

23 years ago

Here are more results out of my countless hours trying to analyze and debug this.

I'm able to reproduce this in Windows.

When the application seems to block at around 94% of sending the e-mail, there
are usually 3-4 file descriptors open which are checked by the socket transport
thread.

The SSL descriptor has over 1000 bytes waiting to be sent, but it never receives
a writeable event.

One of the sockets receives a writeable event, but that doesn't help, because it
has nothing in its incoming pipe, and its mSelectFlags does not contain the
PR_POLL_WRITE bit.

If I use a direct connection to judge.mcom.com:25, I am able to reproduce this
nearly all the time.

As a next test, I used the ssltap tool from the NSS tools. I configured this to
sit in the middle between Mozilla sending SMTP/TLS traffic and
judge.mcom.com:25. With this configuration, it *always* works!

I don't know enough yet to guess where the error could be. Ideas welcome. CC'ing
mcgreer.

(Don't use, old account) - Kai Engert

Assignee

Comment 21

•

23 years ago

While the SSL layer seems to have data writeable, but doesn't receive the
writeable event, it is receiving a readable event all the time, trying to fetch
data from the socket, only to see that nothing is available. That's why the CPU
usage goes up to 100%.

Javier Delgadillo

Comment 22

•

23 years ago

adding darin, mscott to cc list as they may know more about why the writeable
events never happen.

Darin Fisher

Comment 23

•

23 years ago

if i understand the way smtp uses the socket transport, it writes data
synchronously, but reads data asynchronously.  this means that while the
socket is configured to be non-blocking, the Write calls made by mailnews
do not return until all of the data has been written (to the underlying io
layer, in this case PSM).  after "successfully" writing the data to the
underlying io layer, the socket transport does not PR_Poll with PR_POLL_WRITE
set... it only PR_Poll's with PR_POLL_READ set so it can "asynchronously" read
from the socket.  is PSM expecting PR_Poll to be called with PR_POLL_WRITE?

Darin Fisher

Comment 24

•

23 years ago

would it help if we called PR_Flush from within nsSocketBOS::Write ??

(Don't use, old account) - Kai Engert

Assignee

Comment 25

•

23 years ago

Darin, thanks for your comment, I will try what you suggested.

Finally, I was able to produce some good trace logs, maybe this helps you with
getting more insight:

0: SSL3[95584368] In ssl3_SendRecord, calling ssl_DefSend
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 2069
0: SSL[95584368]: saving 2069 bytes of data (2069 total saved so far)
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 2069
0: SSL3[95584368] Leaving SendApplicationData, sent: 2048
0: SSL3[95584368]: gather state 1 (need 5 more)
....
0: SSL3[95584368]: gather state 1 (need 5 more)
0: SSL[95584368]: sending 2069 bytes of saved data
0: SSL[95584368]: SecureSend: sending 1025 bytes
0: SSL3[95584368] Entering SendApplicationData, to send: 1025
0: SSL3[95584368] SendApplicationData, in loop, sent: 0, to go: 1025
0: SSL3[95584368] SendRecord type: application_data (23) bytes=1025
0: SSL[95584368]: Send record (plain text) [Len: 1025]
   0d 0a 4f 54 34 4e 4c 30 4e 79 5a 57 46 30 61 57   ..OT4NL0NyZWF0aW
....
   56 6d 44 54 55 77 4d 7a 63 4e 4a 53 56 46 54 30   VmDTUwMzcNJSVFT0
   59 4e 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d   YN..------------
   2d 2d 32 35 38 34 32 35 35 38 45 34 34 32 42 44   --25842558E442BD
   46 38 37 34 33 36 42 30 36 33 2d 2d 0d 0a 0d 0a   F87436B063--....
   0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d   ..--------------
   30 38 30 36 30 33 30 32 30 38 30 38 30 36 30 38   0806030208080608
   30 32 30 35 30 31 30 38 2d 2d 0d 0a 0d 0a 2e 0d   02050108--......
   0a                                                .
0: SSL3[95584368] In ssl3_SendRecord, bytes: 1025
0: SSL3[95584368] In ssl3_SendRecord, Getting SpecReadLock
0: SSL3[95584368] In ssl3_SendRecord, comuting RecordMAC
0: SSL3[95584368] In ssl3_SendRecord, calling encode
0: SSL3[95584368] In ssl3_SendRecord, Release SpecReadLock
0: SSL[95584368]: send (encrypted) record data: [Len: 1046]
   17 03 00 04 11 3f 28 f4 d6 5a 05 57 5d dc c7 ff   .....?(..Z.W]...
....
   e4 55 88 50 8c f0 7c 61 a6 f1 75 41 c8 38 27 35   .U.P..|a..uA.8'5
   59 7c b6 70 14 66                                 Y|.p.f
0: SSL3[95584368] In ssl3_SendRecord, calling ssl_DefSend
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 1046
0: SSL[95584368]: saving 1046 bytes of data (1046 total saved so far)
0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 1046
0: SSL3[95584368] Leaving SendApplicationData, sent: 1025
0: SSL3[95584368]: gather state 1 (need 5 more)
0: SSL3[95584368]: gather state 1 (need 5 more)
....
last message continuously without any other events happening.
As I said, a filedescriptor is writeable.

(Don't use, old account) - Kai Engert

Assignee

Comment 26

•

23 years ago

Darin, I assume you meant PR_Sync. I added this to the suggested function.
However, it ends up in ssl_FSync, which has an assert(0), because it is not
implemented.

I will now research whether adding PR_POLL_WRITE to polling helps.

Scott MacGregor

Comment 27

•

23 years ago

I just read Darin's comments about the SMTP protocol's read / write behavior.
Darin, SMTP actually uses async reads & writes now. People were getting mad that
we were blocking the UI when sending large messages =), so I re-wrote it a while
a go to use async write. Not sure if this helps or hinders the analysis of the
ssl logs.

Darin Fisher

Comment 28

•

23 years ago

mscott: thanks for the info.  it definitely invalidates my hunch.

(Don't use, old account) - Kai Engert

Assignee

Comment 29

•

23 years ago

Some more traces:

ssl3_SendRecord is called, it tries to send out immediately as much data as
possible using ssl_DefSend. If some data couldn't be sent, ssl_SaveWriteData is
called, storing the pending data in the sslSocket's sslBuffer (pendingBuf).

The worker function that later should be called again, to flush the buffer, is
again ssl_DefSend.

The function that could trigger flushing the pending buffer is ssl_SecureSend.
This function is required to be triggered by the sockettransport calling the
OnWrite method.

The SocketTransport currently doesn't check for the outgoing socket to become
writeable, so it never triggers OnWrite. But even if it did (I set the flags
manually in the debugger), this wouldn't solve the problem immediately. 

The OnWrite method checks for some pipe to have data available. I haven't
understood this completely, but my assumption is, the socket transport assumes
that the layers communicate using a pipe with each other. I'm thankful for any
explanations you could give me on this.

But as the ssl code doesn't seem to put data into a pipe, it just put it into an
internal buffer.

I saw something else: The sslSocket has a flag called "fdx", which is commented
as "simultaneous R/W threads". This flag is set to true.

When the socket transport finds out that a socket is writeable, it triggers
(indirectly) ssl_SecureRecv. Inside this function is the code that might be able
to help us. There is a call to ssl_SendSavedWriteData, which is executed when
then pendingBuf contains data. However, this method is only called when fdx is
false.
Where is the problem? Should the ssl code fill a pipe in case it can't write
immediately? Or is it an error that fdx is set to true?

(Don't use, old account) - Kai Engert

Assignee

Comment 30

•

23 years ago

Ok, setting a breakpoint in ssl_SecureRecv, and setting ss->fdx to zero, breaks
the endless loop and the SMTP operation continues.

I don't understand what the correct fix to this problem would be, so I'm
thankful for any explanation you might have for me. Thanks.

Darin Fisher

Comment 31

•

23 years ago

kai, i don't know the SSL code at all, but i can perhaps explain how OnWrite
works.  an async write request has an associated stream provider.  the socket
transport poll's w/ PR_POLL_WRITE, when poll returns it checks with the stream
provider for data.  usually, the client's stream provider does not run entirely
on the socket transport thread.  instead an event is posted to the client's
thread (see nsStreamProviderProxy.cpp).  when the client handles the event, it
writes some data into a pipe which the stream provider owns.  the stream
provider then notifies the socket transport to resume writing data.  the socket
transport then attempts to write data (from the pipe) to the socket.  this
continues until the pipe is empty or the client otherwise indicates end of file.

John G. Myers

Comment 32

•

23 years ago

The following sequence of events is supposed to happen:

SMTP is in the write loop, selecting for PR_POLL_WRITE and writing the message 
data to the SSL layer.  When all of the data have been written, it then tries to 
read the response, selecting for PR_POLL_READ and reading the response until it 
has a complete response line.

When the message is large enough, it almost always happens that on the last 
write of message data, the SSL code is able to encrypt and wrap that data into 
an SSL record, but the lower OS TCP layer does not accept the encrypted data (it 
returns a partial write and then returns WOULDBLOCK.)  The SSL layer, however, 
returns to the SMTP code that it has accepted the written data, so the SMTP code 
then goes into its read-the-response loop, only selecting for PR_POLL_READ and 
only calling the read method of the SSL code.

To work around this, the SSL code is supposed to do two things.  One, whenever 
it is asked to read data, it is supposed to try to flush any buffered write 
data.  Two, when asked to select for PR_POLL_READ and there is buffered write 
data, it will select the underlying socket for both read and write, so if the 
underlying socket becomes writable, the calling event loop will wake up and call 
the SSL's read method to flush the buffered write data.

The last time Mozilla had this symptom, the fix was to add that second behavior. 
 Now it seems from the previous comments that there was some change which now 
prevents SSL from flushing the buffered write data on a read.  I suggest getting 
an SSL person to investigate this "fdx" member and figure out why it is being 
set.  I believe simultaneous read/write is a newly added SSL feature.

(Don't use, old account) - Kai Engert

Assignee

Comment 33

•

23 years ago

With the new knowledge from your explanations, I did some additional testing. If
I remove testing for the fdx flag, i.e. apply the patch listed below, which
causes the pending buffer to be tested whenever the ssl layer tries to read,
sending does always work for me! Sometimes I've seen an alert (assertion), that
nsSmtpUrl is not thread safe, but it still worked.

But I agree with John that advice from a ssl expert would be very helpful, i.e.
is my patch the right solution? Or should the problem be solved differently,
because checking for the fdx flag is important?
Index: sslsecur.c
===================================================================
RCS file: /cvsroot/mozilla/security/nss/lib/ssl/sslsecur.c,v
retrieving revision 1.11.2.1
diff -u -r1.11.2.1 sslsecur.c
--- sslsecur.c	2001/07/03 18:11:17	1.11.2.1
+++ sslsecur.c	2001/08/19 14:24:02
@@ -992,7 +992,7 @@
     	return PR_FAILURE;
     }
 
-    if (!ssl_SocketIsBlocking(ss) && !ss->fdx) {
+    if (!ssl_SocketIsBlocking(ss)) {
 	ssl_GetXmitBufLock(ss);
 	if (ss->pendingBuf.len != 0) {
 	    rv = ssl_SendSavedWriteData(ss, &ss->pendingBuf, ssl_DefSend);

Keyser Sose

Comment 34

•

23 years ago

Adding keywords

Keywords: patch, review

(Don't use, old account) - Kai Engert

Assignee

Comment 35

•

23 years ago

To aid with testing, I created a distribution using todays source and this one
line change.

If you see this bug with your current version of Mozilla, you could help
downloading the distribution and test whether its now fixed for now.

We are particulary interested, if you see any other SSL related problem that you
didn't see before.

Linux and Windows builds are now available at
http://www.kuix.de/mozilla/build-20010820-24847-88303.php

(Don't use, old account) - Kai Engert

Assignee

Comment 36

•

23 years ago

Attached patch New patch — Details — Splinter Review

Wan-Teh Chang

Comment 37

•

23 years ago

This bug is a symptom of the NSS bug 80092.  We should
figure out a good fix for the NSS bug, but PSM can
avoid the issue altogether with Kai's new patch.

The definition of the SSL_ENABLE_FDX option is:
    SSL_ENABLE_FDX tells the SSL library whether
    the application will have two threads, one
    reading and one writing, or just one thread
    doing reads and writes alternately.
(http://www.mozilla.org/projects/security/pki/nss/ref/ssl/sslfnc.html#1086543)

PSM only has one thread doing reads and writes
alternately on an SSL socket, therefore it should
not enable the SSL_ENABLE_FDX option.

r=wtc.

Darin Fisher

Comment 38

•

23 years ago

rs=darin

(Don't use, old account) - Kai Engert

Assignee

Comment 39

•

23 years ago

Patch checked in.

Status: ASSIGNED → RESOLVED

Closed: 23 years ago → 23 years ago

Resolution: --- → FIXED

ckritzer (gone)

Comment 40

•

23 years ago

I've been sending email with SSL enabled and have not seen this problem.

Marking VERIFIED FIXED on:
- MacOS91 2001-08-21-04-trunk (commercial)
- LinRH62 2001-08-21-06-trunk (commercial)
- Win98SE 2001-08-21-11-trunk (commercial)

VERIFICATION PENDING on:
- MacOS_X 2001-08-21-05-trunk (commercial)

Adding note to self to re-verify on MacOS_X

Status: RESOLVED → VERIFIED

Whiteboard: [ckritzer] [MacOS_X]

timeless

Updated

•

19 years ago

Product: PSM → Core

timeless

Updated

•

16 years ago

Version: psm2.0 → 1.0 Branch

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Core → Core Graveyard