Closed
Bug 88303
Opened 23 years ago
Closed 23 years ago
infinite loop while "Sending Message..."
Categories
(Core Graveyard :: Security: UI, defect, P1)
Tracking
(Not tracked)
VERIFIED
FIXED
psm2.1
People
(Reporter: mitch, Assigned: inactive-mailbox)
References
Details
(Whiteboard: [ckritzer] [MacOS_X])
Attachments
(1 file)
808 bytes,
patch
|
Details | Diff | Splinter Review |
From Bugzilla Helper: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1+) Gecko/20010628 Netscape6/6.1b1 BuildID: 2001062804 Using SSL=always, I sometimes get caught in an infinite loop waiting for the message to send. GUI indicates it is Sending Message..., the animation is working, but the mail never gets sent. If I click the "offline" icon in the mail thread window, I can change my SMTP prefs to use SSL=never, return on an online state, and successfully send the message. Reproducible: Sometimes Steps to Reproduce: 1. select SSL=always 2. select "use username/password" 3. send mail Actual Results: sometimes mail never sends Expected Results: mail sends Happens both when I select "select cert manually" and when I select "select cert automatically". I usually get a dialog indicating a mail server timeout at some point AFTER I go through steps described in the "description" above. I will capture the dialog contents the next time it happens and attach to this bug.
I've seen this, but not in the past week. This might be a timing problem.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: nsenterprise
Priority: -- → P2
Target Milestone: --- → 2.0
Version: 1.01 → 2.0
Comment 2•23 years ago
|
||
Except for the report against 0.9.1, this sounds a lot like bug 74387 where the SSL poll method was not returning readable when SSL had buffered write data and the underlying socket was writable. That bug was triggered by sending messages larger than a certain size.
Reporter | ||
Comment 3•23 years ago
|
||
Ran additional tests. Seems to be independent of PlainText messages vs HTML messages vs PlainText+HTML messages. Seems to be independent of "forwarded" vs "reply" vs "new message". The key trigger appears to be message size, the threshold being somewhere between 11k and 22k (size estimates being derived from an 11k message sent successfully as HTML and failing as HTML+PlainText).
Comment 4•23 years ago
|
||
Is this against a Mozilla that was compiled in an unusual way, say with MOZ_NSS_AUTOCONF?
Reporter | ||
Comment 5•23 years ago
|
||
It's a standard "commercial" build downloaded from sweetlou. I get a new copy every day. I don't have information regarding how those builds are compiled. Bob?
Reporter | ||
Comment 7•23 years ago
|
||
Are you sending bigger messages? There seems to be a threshold dependent upon message size that triggers this.
Comment 10•23 years ago
|
||
Still need information on how these bits were built. It is likely the fix made to the official NSS static tag were not applied to the NSS autoconf branch. Applying the fix to the NSS autoconf branch would be the responsibility of whoever is maintaining the NSS autoconf branch, not the PSM team.
Comment 11•23 years ago
|
||
Neither the Mozilla nor the Netscape commercial builds should be using NSS built under autoconf. The NSS autoconf branch should be dead.
Reporter | ||
Comment 12•23 years ago
|
||
This bug appears to be a dupe of 84731. I'll let you experts make that determination and mark if "resoved, dupe" if you concur.
Comment 13•23 years ago
|
||
*** This bug has been marked as a duplicate of 84731 ***
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
Comment 15•23 years ago
|
||
this isn't a dup of that bug. We fixed that bug it was something different. This is a bug in the SSL transport layer. see the stack traces in 92588. This bug was filed long after 84731 was fixed.
Status: VERIFIED → REOPENED
Resolution: DUPLICATE → ---
Comment 16•23 years ago
|
||
Ok, so let's try to get to the bottom of this.
Assignee: ssaux → kai.engert
Status: REOPENED → NEW
Priority: P2 → P1
Updated•23 years ago
|
Keywords: nsenterprise → nsenterprise+
Assignee | ||
Comment 17•23 years ago
|
||
I spent a full day trying to reproduce this in a controlled environment, without any luck. I saw it once, when I didn't have a debugger attached. The only idea I have for now: I often see that the progress window stops somewhere between 70% and 90%, and stays there for a few moments. Using traffic sniffers, I found out that at this point our application is ready sending all the data. The final answer from the other side is awaited. Is is possible that sometimes the mail server on the other side needs more time to put the new message into its local storage, run e-mail virus scans, etc? I'm desperate for more ideas.
Comment 18•23 years ago
|
||
In a related bug (can't track it down among chain of dups!), I found that including this image: http://www.jivamedia.com/SolarEclipse2001/Mid-Inner-Composite.jpg in a mail message would always result in the apparent send message infinite loop. That problem went away when I changed my pref to "not use OSCP for certificate validation." Maybe this will help
Assignee | ||
Comment 19•23 years ago
|
||
Thanks for your comment. I'm continuing my research, and I have some good news regarding reproducing it. On Windows 2000, I can! I will invest more time, as this really is a major bug.
Status: NEW → ASSIGNED
Assignee | ||
Comment 20•23 years ago
|
||
Here are more results out of my countless hours trying to analyze and debug this. I'm able to reproduce this in Windows. When the application seems to block at around 94% of sending the e-mail, there are usually 3-4 file descriptors open which are checked by the socket transport thread. The SSL descriptor has over 1000 bytes waiting to be sent, but it never receives a writeable event. One of the sockets receives a writeable event, but that doesn't help, because it has nothing in its incoming pipe, and its mSelectFlags does not contain the PR_POLL_WRITE bit. If I use a direct connection to judge.mcom.com:25, I am able to reproduce this nearly all the time. As a next test, I used the ssltap tool from the NSS tools. I configured this to sit in the middle between Mozilla sending SMTP/TLS traffic and judge.mcom.com:25. With this configuration, it *always* works! I don't know enough yet to guess where the error could be. Ideas welcome. CC'ing mcgreer.
Assignee | ||
Comment 21•23 years ago
|
||
While the SSL layer seems to have data writeable, but doesn't receive the writeable event, it is receiving a readable event all the time, trying to fetch data from the socket, only to see that nothing is available. That's why the CPU usage goes up to 100%.
Comment 22•23 years ago
|
||
adding darin, mscott to cc list as they may know more about why the writeable events never happen.
Comment 23•23 years ago
|
||
if i understand the way smtp uses the socket transport, it writes data synchronously, but reads data asynchronously. this means that while the socket is configured to be non-blocking, the Write calls made by mailnews do not return until all of the data has been written (to the underlying io layer, in this case PSM). after "successfully" writing the data to the underlying io layer, the socket transport does not PR_Poll with PR_POLL_WRITE set... it only PR_Poll's with PR_POLL_READ set so it can "asynchronously" read from the socket. is PSM expecting PR_Poll to be called with PR_POLL_WRITE?
Comment 24•23 years ago
|
||
would it help if we called PR_Flush from within nsSocketBOS::Write ??
Assignee | ||
Comment 25•23 years ago
|
||
Darin, thanks for your comment, I will try what you suggested. Finally, I was able to produce some good trace logs, maybe this helps you with getting more insight: 0: SSL3[95584368] In ssl3_SendRecord, calling ssl_DefSend 0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 2069 0: SSL[95584368]: saving 2069 bytes of data (2069 total saved so far) 0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 2069 0: SSL3[95584368] Leaving SendApplicationData, sent: 2048 0: SSL3[95584368]: gather state 1 (need 5 more) .... 0: SSL3[95584368]: gather state 1 (need 5 more) 0: SSL[95584368]: sending 2069 bytes of saved data 0: SSL[95584368]: SecureSend: sending 1025 bytes 0: SSL3[95584368] Entering SendApplicationData, to send: 1025 0: SSL3[95584368] SendApplicationData, in loop, sent: 0, to go: 1025 0: SSL3[95584368] SendRecord type: application_data (23) bytes=1025 0: SSL[95584368]: Send record (plain text) [Len: 1025] 0d 0a 4f 54 34 4e 4c 30 4e 79 5a 57 46 30 61 57 ..OT4NL0NyZWF0aW .... 56 6d 44 54 55 77 4d 7a 63 4e 4a 53 56 46 54 30 VmDTUwMzcNJSVFT0 59 4e 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d YN..------------ 2d 2d 32 35 38 34 32 35 35 38 45 34 34 32 42 44 --25842558E442BD 46 38 37 34 33 36 42 30 36 33 2d 2d 0d 0a 0d 0a F87436B063--.... 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ..-------------- 30 38 30 36 30 33 30 32 30 38 30 38 30 36 30 38 0806030208080608 30 32 30 35 30 31 30 38 2d 2d 0d 0a 0d 0a 2e 0d 02050108--...... 0a . 0: SSL3[95584368] In ssl3_SendRecord, bytes: 1025 0: SSL3[95584368] In ssl3_SendRecord, Getting SpecReadLock 0: SSL3[95584368] In ssl3_SendRecord, comuting RecordMAC 0: SSL3[95584368] In ssl3_SendRecord, calling encode 0: SSL3[95584368] In ssl3_SendRecord, Release SpecReadLock 0: SSL[95584368]: send (encrypted) record data: [Len: 1046] 17 03 00 04 11 3f 28 f4 d6 5a 05 57 5d dc c7 ff .....?(..Z.W]... .... e4 55 88 50 8c f0 7c 61 a6 f1 75 41 c8 38 27 35 .U.P..|a..uA.8'5 59 7c b6 70 14 66 Y|.p.f 0: SSL3[95584368] In ssl3_SendRecord, calling ssl_DefSend 0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 1046 0: SSL[95584368]: saving 1046 bytes of data (1046 total saved so far) 0: SSL3[95584368] In ssl3_SendRecord, calling SaveWriteData (2nd) with: 1046 0: SSL3[95584368] Leaving SendApplicationData, sent: 1025 0: SSL3[95584368]: gather state 1 (need 5 more) 0: SSL3[95584368]: gather state 1 (need 5 more) .... last message continuously without any other events happening. As I said, a filedescriptor is writeable.
Assignee | ||
Comment 26•23 years ago
|
||
Darin, I assume you meant PR_Sync. I added this to the suggested function. However, it ends up in ssl_FSync, which has an assert(0), because it is not implemented. I will now research whether adding PR_POLL_WRITE to polling helps.
Comment 27•23 years ago
|
||
I just read Darin's comments about the SMTP protocol's read / write behavior. Darin, SMTP actually uses async reads & writes now. People were getting mad that we were blocking the UI when sending large messages =), so I re-wrote it a while a go to use async write. Not sure if this helps or hinders the analysis of the ssl logs.
Comment 28•23 years ago
|
||
mscott: thanks for the info. it definitely invalidates my hunch.
Assignee | ||
Comment 29•23 years ago
|
||
Some more traces: ssl3_SendRecord is called, it tries to send out immediately as much data as possible using ssl_DefSend. If some data couldn't be sent, ssl_SaveWriteData is called, storing the pending data in the sslSocket's sslBuffer (pendingBuf). The worker function that later should be called again, to flush the buffer, is again ssl_DefSend. The function that could trigger flushing the pending buffer is ssl_SecureSend. This function is required to be triggered by the sockettransport calling the OnWrite method. The SocketTransport currently doesn't check for the outgoing socket to become writeable, so it never triggers OnWrite. But even if it did (I set the flags manually in the debugger), this wouldn't solve the problem immediately. The OnWrite method checks for some pipe to have data available. I haven't understood this completely, but my assumption is, the socket transport assumes that the layers communicate using a pipe with each other. I'm thankful for any explanations you could give me on this. But as the ssl code doesn't seem to put data into a pipe, it just put it into an internal buffer. I saw something else: The sslSocket has a flag called "fdx", which is commented as "simultaneous R/W threads". This flag is set to true. When the socket transport finds out that a socket is writeable, it triggers (indirectly) ssl_SecureRecv. Inside this function is the code that might be able to help us. There is a call to ssl_SendSavedWriteData, which is executed when then pendingBuf contains data. However, this method is only called when fdx is false. Where is the problem? Should the ssl code fill a pipe in case it can't write immediately? Or is it an error that fdx is set to true?
Assignee | ||
Comment 30•23 years ago
|
||
Ok, setting a breakpoint in ssl_SecureRecv, and setting ss->fdx to zero, breaks the endless loop and the SMTP operation continues. I don't understand what the correct fix to this problem would be, so I'm thankful for any explanation you might have for me. Thanks.
Comment 31•23 years ago
|
||
kai, i don't know the SSL code at all, but i can perhaps explain how OnWrite works. an async write request has an associated stream provider. the socket transport poll's w/ PR_POLL_WRITE, when poll returns it checks with the stream provider for data. usually, the client's stream provider does not run entirely on the socket transport thread. instead an event is posted to the client's thread (see nsStreamProviderProxy.cpp). when the client handles the event, it writes some data into a pipe which the stream provider owns. the stream provider then notifies the socket transport to resume writing data. the socket transport then attempts to write data (from the pipe) to the socket. this continues until the pipe is empty or the client otherwise indicates end of file.
Comment 32•23 years ago
|
||
The following sequence of events is supposed to happen: SMTP is in the write loop, selecting for PR_POLL_WRITE and writing the message data to the SSL layer. When all of the data have been written, it then tries to read the response, selecting for PR_POLL_READ and reading the response until it has a complete response line. When the message is large enough, it almost always happens that on the last write of message data, the SSL code is able to encrypt and wrap that data into an SSL record, but the lower OS TCP layer does not accept the encrypted data (it returns a partial write and then returns WOULDBLOCK.) The SSL layer, however, returns to the SMTP code that it has accepted the written data, so the SMTP code then goes into its read-the-response loop, only selecting for PR_POLL_READ and only calling the read method of the SSL code. To work around this, the SSL code is supposed to do two things. One, whenever it is asked to read data, it is supposed to try to flush any buffered write data. Two, when asked to select for PR_POLL_READ and there is buffered write data, it will select the underlying socket for both read and write, so if the underlying socket becomes writable, the calling event loop will wake up and call the SSL's read method to flush the buffered write data. The last time Mozilla had this symptom, the fix was to add that second behavior. Now it seems from the previous comments that there was some change which now prevents SSL from flushing the buffered write data on a read. I suggest getting an SSL person to investigate this "fdx" member and figure out why it is being set. I believe simultaneous read/write is a newly added SSL feature.
Assignee | ||
Comment 33•23 years ago
|
||
With the new knowledge from your explanations, I did some additional testing. If I remove testing for the fdx flag, i.e. apply the patch listed below, which causes the pending buffer to be tested whenever the ssl layer tries to read, sending does always work for me! Sometimes I've seen an alert (assertion), that nsSmtpUrl is not thread safe, but it still worked. But I agree with John that advice from a ssl expert would be very helpful, i.e. is my patch the right solution? Or should the problem be solved differently, because checking for the fdx flag is important? Index: sslsecur.c =================================================================== RCS file: /cvsroot/mozilla/security/nss/lib/ssl/sslsecur.c,v retrieving revision 1.11.2.1 diff -u -r1.11.2.1 sslsecur.c --- sslsecur.c 2001/07/03 18:11:17 1.11.2.1 +++ sslsecur.c 2001/08/19 14:24:02 @@ -992,7 +992,7 @@ return PR_FAILURE; } - if (!ssl_SocketIsBlocking(ss) && !ss->fdx) { + if (!ssl_SocketIsBlocking(ss)) { ssl_GetXmitBufLock(ss); if (ss->pendingBuf.len != 0) { rv = ssl_SendSavedWriteData(ss, &ss->pendingBuf, ssl_DefSend);
Assignee | ||
Comment 35•23 years ago
|
||
To aid with testing, I created a distribution using todays source and this one line change. If you see this bug with your current version of Mozilla, you could help downloading the distribution and test whether its now fixed for now. We are particulary interested, if you see any other SSL related problem that you didn't see before. Linux and Windows builds are now available at http://www.kuix.de/mozilla/build-20010820-24847-88303.php
Assignee | ||
Comment 36•23 years ago
|
||
Comment 37•23 years ago
|
||
This bug is a symptom of the NSS bug 80092. We should figure out a good fix for the NSS bug, but PSM can avoid the issue altogether with Kai's new patch. The definition of the SSL_ENABLE_FDX option is: SSL_ENABLE_FDX tells the SSL library whether the application will have two threads, one reading and one writing, or just one thread doing reads and writes alternately. (http://www.mozilla.org/projects/security/pki/nss/ref/ssl/sslfnc.html#1086543) PSM only has one thread doing reads and writes alternately on an SSL socket, therefore it should not enable the SSL_ENABLE_FDX option. r=wtc.
Comment 38•23 years ago
|
||
rs=darin
Assignee | ||
Comment 39•23 years ago
|
||
Patch checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 23 years ago
Resolution: --- → FIXED
Comment 40•23 years ago
|
||
I've been sending email with SSL enabled and have not seen this problem. Marking VERIFIED FIXED on: - MacOS91 2001-08-21-04-trunk (commercial) - LinRH62 2001-08-21-06-trunk (commercial) - Win98SE 2001-08-21-11-trunk (commercial) VERIFICATION PENDING on: - MacOS_X 2001-08-21-05-trunk (commercial) Adding note to self to re-verify on MacOS_X
Status: RESOLVED → VERIFIED
Whiteboard: [ckritzer] [MacOS_X]
Updated•8 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•