Closed
Bug 163605
Opened 22 years ago
Closed 22 years ago
Use of blocking I/O for SSL in PSM stalls network activity
Categories
(Core Graveyard :: Security: UI, defect, P2)
Tracking
(Not tracked)
VERIFIED
FIXED
psm2.4
People
(Reporter: anlan, Assigned: KaiE)
References
()
Details
Attachments
(3 files, 7 obsolete files)
8.56 KB,
text/plain
|
Details | |
25.13 KB,
text/plain
|
Details | |
15.56 KB,
patch
|
KaiE
:
review+
KaiE
:
superreview+
asa
:
approval+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0.0) Gecko/20020606 Build Identifier: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0.0) Gecko/20020606 When the URL to this zope-server is used with https, it tries to load but nothing happens. After that, mozilla will not load any page at all, open imap-email or anything. It is possible to close browser windows, trying to close mail-news causes all remaining mozilla windows to hang. Reproducible: Always Steps to Reproduce: 1.https://www.ida.liu.se:9080/groups/cugs/doclib/CUGSLogo_blue.jpg 2.Abort, since it won't load. 3.Try to open anything else, try to close mail-news and the whole thing might freeze. Actual Results: It seems like networking got stuck somewhere, affecting all mozilla components. Had to kill it after it froze. Expected Results: Probably failed to load the image, and left the rest of the application in a normal shape. Loading the image with plain http:// works fine. Https to other pages at the site works, but not this combination of https and the port-number url. Unfortunately, I don't have any newer build than this 1.0 to test with here. Will try to test later with nightlys from 1.0.1 and 1.1.
Comment 2•22 years ago
|
||
confirming with win2k build 20020812.. Mozilla hangs if you try to switch to offline, File URLs are still working.
Assignee: new-network-bugs → darin
Severity: normal → critical
Status: UNCONFIRMED → NEW
Component: Networking → Networking: HTTP
Ever confirmed: true
QA Contact: benc → tever
Assignee | ||
Comment 3•22 years ago
|
||
Let me try to reproduce. I will test whether this is a regression caused by my fix to bug 87902. A previous attempt to fix 87902 resulted in a similar regression.
Assignee | ||
Comment 4•22 years ago
|
||
Confirming all described behaviour on Linux. I tested to back out 87902. This did not help, still showing same symptoms. So at least, this is not caused by 87902. I'll try to have a look what's happening. I wonder if the suspicion mentioned in bug 153769 is correct and is related to this problem.
Assignee | ||
Comment 5•22 years ago
|
||
Error code is 804b0002 for all URL load attempts after the initial failure. => NS_BINDING_ABORTED
Comment 6•22 years ago
|
||
*** Bug 167421 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 7•22 years ago
|
||
Assignee | ||
Comment 8•22 years ago
|
||
Nelson, this looks like an endlessloop in SSL, always reproducable. Could you please have a look? Please see the attached stack. Thanks.
Target Milestone: --- → mozilla1.2beta
Comment 9•22 years ago
|
||
It appears that the socket on which SSL is operating is a blocking socket. ssl was invoked via the function ssl_Write, which always sets an infinite timeout, so ssl is behaving as intended here. I think the problem is that this socket is blocking.
Comment 10•22 years ago
|
||
Re: Nelson's comment that SSL may be behaving as intended, I note that Netscape does not behave this way. Stopping Netscape on an hung SSL connection does not leave it hung up in an infinite loop.
Comment 11•22 years ago
|
||
interesting, so mailnews does blocking socket i/o on a background thread ... do we need to avoid that? can NSS be configured to have a non-infinite timeout for blocking writes?
Comment 12•22 years ago
|
||
the imap code writes data to a channel opened as follows: rv = m_channel->OpenOutputStream(0, PRUint32(-1), 0, getter_AddRefs(m_outputStream)); does that mean that we do blocking writes when we call Write() on the channel?
Assignee | ||
Comment 13•22 years ago
|
||
The issue is that PSM is currently doing a blocking initial write for any SSL connection, and that is indeed the culprit. I discussed with Nelson, and we concluded this is a bad thing (TM). The solution is to change PSM to never to a blocking call.
Comment 14•22 years ago
|
||
oh right... my mistake. even though necko hands out a blocking nsIOutputStream from OpenOutputStream, the actual socket is still configured for non-blocking mode, and we call PR_Poll to make nsIOutputStream::Write blocking. -> PSM
Assignee: darin → ssaux
Component: Networking: HTTP → Client Library
Product: Browser → PSM
QA Contact: tever → junruh
Target Milestone: mozilla1.2beta → ---
Version: other → unspecified
Comment 15•22 years ago
|
||
There is no infinite loop here. An operation with an infinite timeout has been done on a blocking socket. The progress of that thread is blocked until the operation completes.
Comment 16•22 years ago
|
||
*** Bug 170087 has been marked as a duplicate of this bug. ***
Comment 17•22 years ago
|
||
*** Bug 170428 has been marked as a duplicate of this bug. ***
Comment 18•22 years ago
|
||
So kai how hard is this to do?
Assignee: ssaux → kaie
Priority: -- → P2
Target Milestone: --- → 2.4
Assignee | ||
Comment 19•22 years ago
|
||
Stephane: I'm not sure how much work it is. It might be easy to do, by adding a state variable (state = still waiting for connect), and make PSM I/O functions pay attention to that flag in its read/write functions, and continue with the connect logic as soon as Necko calls us again, when the socket becomes readable/writeable.
Assignee | ||
Comment 20•22 years ago
|
||
I suspected this bug could be the cause for bug 167764. But they seem to be unrelated. But we now have a patch for this one, and it seems to work on Linux and Mac OS X (other platforms not yet tested).
Assignee | ||
Comment 21•22 years ago
|
||
Assignee | ||
Comment 22•22 years ago
|
||
Assignee | ||
Comment 23•22 years ago
|
||
Javi can you please review? The changes are: - no longer use blocking mode on initial write - at the time we attempt the first write, if we see PR_WOULD_BLOCK_ERROR and no bytes were written, do nothing (the socket transport will eventually let us try again) - execute the old fashion logic, if first write can not write bytes, and we see a result code different from PR_WOULD_BLOCK_ERROR
Assignee | ||
Updated•22 years ago
|
Attachment #102127 -
Attachment is obsolete: true
Assignee | ||
Comment 24•22 years ago
|
||
Comment on attachment 102128 [details] [diff] [review] Patch v1 - ignoring whitespace, larger context - please review this one I take my review request back. More testing showed the issue is more complicated. This patch breaks the TLS intolerant detection (sigh).
Attachment #102128 -
Attachment is obsolete: true
Assignee | ||
Comment 25•22 years ago
|
||
We have the following new problem: The SSL error handling within PSM is decoupled from the error handling within Mozilla's general network component (socket transport). By always using non blocking I/O in PSM, we loose the ability to detect sudden disconnects on initial write within PSM. The error will be seen by the socket transport when doing the next select. In my particular test case, PR_POLL_ERR was seen, and socket transport closed the socket. I see a workaround for now, but for an ideal solution I think I need help from both necko and NSS/SSL. PSM can remember, whether it already tried to trigger the initial write, but NSS returned PR_WOULD_BLOCK_ERROR, and therefore the reaction from the remote site is not yet known. When the socket transport closes the socket and calls PSM's I/O layer close function, we can detect the remembered state, and make the decision for TLS intolerance at that time. But the problem is: At this time, we do not have the failure reason code available (only inside necko, but not passed to us). We do not know whether a network problem occurred, and whether TLS can still be used, or whether the remote site did disconnect and we therefore should retry without TLS. We have the choice to always decide the site is TLS intolerant, but this might result in wrong TLS intolerance decisions, which means we don't use TLS in some cases, although we should. For now, I don't have a better idea how I could solve this problem on my own, without changing socket transport or NSS. I suggest I'll attach a patch to always retry the connection on failure, after having initiated the first write, and before the first write completed. However, I think we should try to minimize the TLS intolerance decisions. I see two approaches for future work: - make socket transport pass socket close reason information down to lower NSPR fd layers and/or - transfer the decision, whether a site should be treated as TLS intolerant, to a new NSS function (possibly passing available socket error codes). Maybe NSS could make a resaonable decision, based on state of the initial handshake?
Assignee | ||
Comment 26•22 years ago
|
||
Assignee | ||
Comment 27•22 years ago
|
||
Assignee | ||
Comment 28•22 years ago
|
||
Javi, can you please review?
Assignee | ||
Updated•22 years ago
|
Attachment #102139 -
Attachment is obsolete: true
Assignee | ||
Comment 29•22 years ago
|
||
Comment on attachment 102140 [details] [diff] [review] Patch v2 - diff -u10 -w for easier reviewing I found more arguments why this patch needs more work. The patch from bug 162752 (not yet checked in) is not compatible with this new logic. In general, SSL errors requiring an automatic retry because of TLS intolerance can now happen at multiple locations. Here is how I understand it: As long as the underlying SSL handshake is not yet done, NSS will return WOULD_BLOCK, even though it were able to transfer bytes as part of the handshake. It can take multiple calls to read/write before the handshake completes and finally application data can be transfered. Because necko will see that neither reading nor writing works because of WOULD_BLOCK, it calls both functions alternating. In either of both functions could the handshake complete and NSS return the SSL error to us.
Attachment #102140 -
Attachment is obsolete: true
Assignee | ||
Comment 30•22 years ago
|
||
Darin, can you confirm the following is expected, and that we can rely on that behaviour When necko/http calls a lower layer write function, it will retry, if the write function returns -1 with PR_CONNECT_RESET_ERROR. When necko/http calls a lower layer read function, this same error will not cause a retry. But I saw that returning 0 causes it to retry. Because of the additional SSL logic going on behind the scenes, while necko waits for the real application data to get transfered, and because we eventually decide that an automatic retry makes sense, PSM wants to trigger that retry, triggering using return values and error codes of read/write.
Assignee | ||
Comment 31•22 years ago
|
||
This latest patch seem to work quite well for this bug, it also fixes bug 162752, and bug 167764 seems fixed, too (although wtc suspects the problem in bug 167764 is only being masked with this fix). Reviews and comments for my previous questions welcome.
Comment 32•22 years ago
|
||
kai: actually, it is the HTTP layer that detects a premature EOF. it has configured the socket transport to poll on read and write. when poll wakes up (either for read or write), the socket transport checks for PR_POLL_EXCEPT and PR_POLL_ERR. if either of these is true, then it cancels the socket transport read and write requests with a status of NS_ERROR_NET_RESET. that error code will be trapped by the HTTP layer, and if the HTTP layer has not received any bytes for that transaction, it will retry the transaction. the HTTP layer will also retry the transaction if the first socket read returns NS_OK and zero bytes read (indicating EOF). finally, if the socket transport's calls to PR_Read or PR_Write fail with either PR_CONNECT_ABORTED_ERROR or PR_CONNECT_RESET_ERROR, then socket transport read and write requests will be canceled with a status of NS_ERROR_NET_RESET. and just like in the PR_POLL_EXCEPT/PR_POLL_ERR case, the HTTP layer will see NS_ERROR_NET_RESET. the HTTP layer will retry the transaction if it receives this error before receiving any data on the socket.
Comment 33•22 years ago
|
||
Comment on attachment 102169 [details] [diff] [review] Patch v3 r=javi nit-pick: This if statement has the potential to grow quite large become nearly impossible to read as we addd more error codes that should be retried. Perhaps a method/function that takes the necessary parameters to return true/false to determine if we wanna retry. + if ( + (!withInitialCleartext && PR_CONNECT_RESET_ERROR == err) + || + (withInitialCleartext && PR_END_OF_FILE_ERROR == err) + || + (SSL_ERROR_BAD_MAC_ALERT == err) + || + (SSL_ERROR_BAD_MAC_READ == err) + ) {
Attachment #102169 -
Flags: review+
Assignee | ||
Comment 34•22 years ago
|
||
Darin, thanks for the explanation. I see another oddity: In my test case, the failure happens reproducable at the time necko calls our read function. Regardless whether I return PR_CONNECT_RESET_ERROR or PR_CONNECT_ABORT_ERROR or the simple EOF, necko will indeed automatically retry the connection. However, there is difference. If the read function returns EOF (0 bytes read), when necko re-opens the connection, it will again call our read and write functions. However, if I return PR_CONNECT_RESET/ABORT_ERROR, necko will repeated call our open/close functions, but will never call our read/write funtions! Is that a bug? At least it makes it necessary to return EOF in read for this patch to work.
Assignee | ||
Comment 35•22 years ago
|
||
This new patch moves the tls intolerance decision to a separate function, as Javi suggested.
Comment 36•22 years ago
|
||
kai: yeah, that sounds like a necko bug then :(
Comment 37•22 years ago
|
||
In comment 29, Kai observes that > Because necko will see that neither reading nor writing works because of > WOULD_BLOCK, it calls both functions alternating. This certainly seems like a bug. Perhaps it was a hack to work around some other flaw. an http client should send the entire http request, including any post data, and then read the response. Any chance this alternating read/write behavior can be fixed?
Comment 38•22 years ago
|
||
can someone provide a socket transport log demonstrating the alternating calls to PR_Read and PR_Write. i'm surprised that we would do that. by design we do initate both a write and a read at the same time. this is done to pick up an EOF on the read request. i'll be the first to admit that our architecture is far from clean, but it should be valid to poll for both read and write at the same time.
Assignee | ||
Comment 39•22 years ago
|
||
> can someone provide a socket transport log demonstrating the alternating calls > to PR_Read and PR_Write. I've done the following: - I applied the patch from this bug - I tried to connect to https://www.e-girot.net/ewh/visafaktura - because the site is TLS intolerant, the SSL handshake will fail, and no http content will be transfered on the first connection attempt - in addition to the patch, I added some printf statements to the SSL I/O layer, the printf is the first statement in the function. Look for lines starting with => in the logfile The interesting part is up to SSL layer close. You can see the read/write functions are being called repeatedly. The second connect phase is the actual loading phase, that happens with TLS disabled, and is able to execute the http request.
Assignee | ||
Comment 40•22 years ago
|
||
I'm making one more small change to the patch, that fixes also bug 167227. (I notice the disconnect with TLS intolerance can also cause PR_END_OF_FILE_ERROR, I had previously seen this only with proxies.)
Attachment #102312 -
Attachment is obsolete: true
Assignee | ||
Comment 41•22 years ago
|
||
Comment on attachment 102447 [details] [diff] [review] Patch v5 carrying forward r=javi
Attachment #102447 -
Flags: review+
Comment 42•22 years ago
|
||
As discussed in http://bugzilla.mozilla.org/show_bug.cgi?id=106865#c32 (comments 32 and later) PR_END_OF_FILE_ERROR reports an condition that may be a "truncation attack" on SSL/TLS. It has a different meaning than EOF (e.g. read returning zero bytes). If no data has been sent, and PR_END_OF_FILE_ERROR is received, then that could be a sign of TLS intolerance. If the http request has already been sent, and PR_END_OF_FILE_ERROR is received, the client should NOT automatically retry. In fact, the client should probably not retry at all, but instead an error dialog should be displayed. So, the proper interpretation and handling of PR_END_OF_FILE_ERROR depends on knowledge of whether or not any data has been succesfully read from or written to the socket.
Assignee | ||
Comment 43•22 years ago
|
||
> If the http request has already been sent, and PR_END_OF_FILE_ERROR is > received, the client should NOT automatically retry. Nelson, thanks for the reminder. I can promise that we will never assume TLS intolerance, if we had already been able to write content data. The crypto code will know whether it has even been able to write any bytes for this socket. Only if it has not yet been able to write any byte will it try to detect TLS intolerance. This new patch just relaxes a different constraint I had introduced earlier. Without this new patch, when seeing PR_END_OF_FILE_ERROR, the code would only assume TLS intolerance, if the communication is going over a proxy. I introduced this restriction, because during my earlier testing with other TLS intolerant sites, I only saw PR_END_OF_FILE_ERROR when talking over a proxy. When not talking over a proxy, the error I saw was PR_CONNECT_RESET_ERROR. But I learn from bug 167227, that experiencing PR_END_OF_FILE_ERROR can also happen when not talking over a proxy. Still, this new detection is only used when no content data bytes have yet been transfered. As soon as the SSL level write function returns something other but -1, we'll never again try to detect TLS intolerance for this open socket. I assume you wanted to reconfirm this and the patch should be fine as is.
Assignee | ||
Comment 44•22 years ago
|
||
Darin, could you please sr ? Nelson, please veto if I didn't understand you correctly.
Assignee | ||
Comment 45•22 years ago
|
||
Darin, one question to your comment 38. You said: > by design we do initate both a write and a read at the same time. Does this mean, you call read and write simultaneously on separate threads? I guess you are not, but you meant, you are just polling for both, but will call read and write one after the other?
Comment 46•22 years ago
|
||
Re: comment 43 This explanation sounds right to me. Thanks.
Comment 47•22 years ago
|
||
Comment on attachment 102447 [details] [diff] [review] Patch v5 >Index: mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp >+void nsNSSSocketInfo::SetHandshakeInProgress(PRBool aIsIn) >+{ >+ mHandshakeInProgress = aIsIn; >+} >+ >+PRBool nsNSSSocketInfo::GetHandshakeInProgress() >+{ >+ return mHandshakeInProgress; >+} nit: these should probably be implemented inline in the header file instead. >+// Call this function to report a site that is possibly TLS intolerant. >+// This function will return true, if the given socket is currently using TLS. >+static PRBool >+rememberPossibleTLSProblemSite(PRFileDesc* fd, nsNSSSocketInfo *socketInfo) nit: most methods in this class begin with an uppercase letter. is there any reason why this should differ? >+{ >+ PRBool currentlyUsesTLS = PR_FALSE; >+ >+ SSL_OptionGet(fd->lower, SSL_ENABLE_TLS, ¤tlyUsesTLS); >+ if (currentlyUsesTLS) { >+ // Add this site to the list of TLS intolerant sites. >+ char buf[1024]; >+ PRInt32 port; >+ nsXPIDLCString host; >+ socketInfo->GetPort(&port); >+ socketInfo->GetHostName(getter_Copies(host)); >+ HASH_STRING_KEY(buf,1024,host.get(),port); nit: use sizeof(buf) instead of hardcoding 1024. i wonder if it wouldn't be better to use: nsCAutoString buf; buf = host + NS_LITERAL_CSTRING(":") + nsPrintfCString("%d", port); but, i see that you are just copying existing code, so up to you which way you want to go. >+ // We don't really wanna associate a value. If it's >+ // in the table, that means it's TLS intolerant and >+ // we don't really need to know anything else. >+ gTLSIntolerantSites->Put(&key, nsnull); you might want to use nsCStringHashSet for this. up to you. >+ case PR_CONNECT_RESET_ERROR: >+ if (!withInitialCleartext) >+ return PR_TRUE; >+ else >+ break; nit: "else" keyword is meaningless here. >+ if (-1 == bytesTransfered && handleHandshakeResultNow) { nit: maybe (0 > bytesTransfered) just to be safe. fix these nits and sr=darin
Attachment #102447 -
Flags: superreview+
Comment 48•22 years ago
|
||
re: comment #45, we poll on one thread, and then either read or write on that thread. sorry for not being clear about that!
Assignee | ||
Comment 49•22 years ago
|
||
> we poll on one thread, and then either read or write on that thread.
ah good, thanks for confirming.
Assignee | ||
Comment 50•22 years ago
|
||
> these should probably be implemented inline in the header file instead. done > most methods in this class begin with an uppercase letter. is there any > reason why this should differ? Actually, that's a function outside of any class, but static to the source file. There are a couple of such functions already, and they all start with a lowercase letter. Not changed. > i wonder if it wouldn't be better to use: > buf = host + NS_LITERAL_CSTRING(":") + nsPrintfCString("%d", port); convinced. That's a nice class. changed. > you might want to use nsCStringHashSet for this. up to you. This required more changes all over the file... But changed. > nit: "else" keyword is meaningless here. Changed. > nit: maybe (0 > bytesTransfered) just to be safe. changed.
Assignee | ||
Comment 51•22 years ago
|
||
Attachment #102447 -
Attachment is obsolete: true
Assignee | ||
Comment 52•22 years ago
|
||
Comment on attachment 102826 [details] [diff] [review] Patch v6 carrying forward
Attachment #102826 -
Flags: superreview+
Attachment #102826 -
Flags: review+
Assignee | ||
Updated•22 years ago
|
Summary: Failed SSL disables all networking → Use of blocking I/O for SSL in PSM stalls network activity
Assignee | ||
Comment 53•22 years ago
|
||
The reporter of bug 153769 said this patch also fixes bug 153769.
Blocks: 153769
Comment 54•22 years ago
|
||
Comment on attachment 102826 [details] [diff] [review] Patch v6 a=asa for checkin to 1.2beta (on behalf of drivers)
Attachment #102826 -
Flags: approval+
Assignee | ||
Comment 55•22 years ago
|
||
Although we have approval to check in for 1.2 beta, I will wait until tomorrow with checking it in. I have uploaded test builds for Linux, Win32 and Mac Carbon to: http://www.kuix.de/mozilla/163605/ I think it is sufficient to test the various SSL protocols, like SMTP/SSL, IMAP/SSL, HTTPS. If you want to help, please check that no SSL functionality is regressing by this patch. (All builds are based on the trunk from 2002/10/15, so if you find other problems in this build, please also check whether the problem is found in the builds at ftp://ftp.mozilla.org/pub/mozilla/nightly/2002-10-15-08-trunk/ .)
Reporter | ||
Comment 56•22 years ago
|
||
Since I reported this bug, I felt obliged to do some testing. I tried the linux build with IMAP/SSL without problems. I ran https against different servers with Apache, Stronghold, Netscape-Enterprise, Roxen, iPlanet or IIS5 without any glitches, so it seems fine!
Comment 57•22 years ago
|
||
Using kaie's test builds, I have found no regressions, and numerous bug fixes. I suggest the patch be checked into the trunk ASAP.
Assignee | ||
Comment 58•22 years ago
|
||
fixed on trunk
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment 60•22 years ago
|
||
*** Bug 180221 has been marked as a duplicate of this bug. ***
Comment 61•22 years ago
|
||
*** Bug 172241 has been marked as a duplicate of this bug. ***
Comment 62•22 years ago
|
||
*** Bug 181292 has been marked as a duplicate of this bug. ***
Updated•8 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•