Closed Bug 163605 Opened 22 years ago Closed 22 years ago

Use of blocking I/O for SSL in PSM stalls network activity

Categories

(Core Graveyard :: Security: UI, defect, P2)

Other Branch
defect

Tracking

(Not tracked)

VERIFIED FIXED
psm2.4

People

(Reporter: anlan, Assigned: KaiE)

References

()

Details

Attachments

(3 files, 7 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0.0) Gecko/20020606
Build Identifier: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.0.0) Gecko/20020606

When the URL to this zope-server is used with https, it tries to load but
nothing happens. After that, mozilla will not load any page at all, open
imap-email or anything. It is possible to close browser windows, trying to close
mail-news causes all remaining mozilla windows to hang.

Reproducible: Always

Steps to Reproduce:
1.https://www.ida.liu.se:9080/groups/cugs/doclib/CUGSLogo_blue.jpg
2.Abort, since it won't load.
3.Try to open anything else, try to close mail-news and the whole thing might
freeze.

Actual Results:  
It seems like networking got stuck somewhere, affecting all mozilla components.
Had to kill it after it froze.

Expected Results:  
Probably failed to load the image, and left the rest of the application in a
normal shape.

Loading the image with plain http:// works fine. Https to other pages at the
site works, but not this combination of https and the port-number url.

Unfortunately, I don't have any newer build than this 1.0 to test with here.
Will try to test later with nightlys from 1.0.1 and 1.1.
confirming using build 2002081904 on Win2k (trunk).
Hardware: Sun → All
confirming with win2k build 20020812..

Mozilla hangs if you try to switch to offline, File URLs are still working.
Assignee: new-network-bugs → darin
Severity: normal → critical
Status: UNCONFIRMED → NEW
Component: Networking → Networking: HTTP
Ever confirmed: true
QA Contact: benc → tever
Let me try to reproduce.
I will test whether this is a regression caused by my fix to bug 87902.
A previous attempt to fix 87902 resulted in a similar regression.
Confirming all described behaviour on Linux.

I tested to back out 87902. This did not help, still showing same symptoms. So
at least, this is not caused by 87902.

I'll try to have a look what's happening.

I wonder if the suspicion mentioned in bug 153769 is correct and is related to
this problem.
Error code is 804b0002 for all URL load attempts after the initial failure.

=> NS_BINDING_ABORTED
Keywords: nsbeta1
*** Bug 167421 has been marked as a duplicate of this bug. ***
Attached file SSL Endlessloop stack
Nelson, this looks like an endlessloop in SSL, always reproducable.
Could you please have a look? Please see the attached stack. Thanks.
Target Milestone: --- → mozilla1.2beta
It appears that the socket on which SSL is operating is a blocking socket.
ssl was invoked via the function ssl_Write, which always sets an infinite
timeout, so ssl is behaving as intended here.

I think the problem is that this socket is blocking. 
Re: Nelson's comment that SSL may be behaving as intended, I note that Netscape
does not behave this way.  Stopping Netscape on an hung SSL connection does not
leave it hung up in an infinite loop.
interesting, so mailnews does blocking socket i/o on a background thread ... do
we need to avoid that?  can NSS be configured to have a non-infinite timeout for
blocking writes?
the imap code writes data to a channel opened as follows:

            rv = m_channel->OpenOutputStream(0, PRUint32(-1), 0,
getter_AddRefs(m_outputStream));

does that mean that we do blocking writes when we call Write() on the channel?
The issue is that PSM is currently doing a blocking initial write for any SSL
connection, and that is indeed the culprit.

I discussed with Nelson, and we concluded this is a bad thing (TM).

The solution is to change PSM to never to a blocking call.
oh right... my mistake.  even though necko hands out a blocking nsIOutputStream
from OpenOutputStream, the actual socket is still configured for non-blocking
mode, and we call PR_Poll to make nsIOutputStream::Write blocking.

-> PSM
Assignee: darin → ssaux
Component: Networking: HTTP → Client Library
Product: Browser → PSM
QA Contact: tever → junruh
Target Milestone: mozilla1.2beta → ---
Version: other → unspecified
There is no infinite loop here.  
An operation with an infinite timeout has been done on a blocking socket.
The progress of that thread is blocked until the operation completes.
*** Bug 170087 has been marked as a duplicate of this bug. ***
*** Bug 170428 has been marked as a duplicate of this bug. ***
So kai how hard is this to do?
Assignee: ssaux → kaie
Priority: -- → P2
Target Milestone: --- → 2.4
Stephane: I'm not sure how much work it is. It might be easy to do, by adding a
state variable (state = still waiting for connect), and make PSM I/O functions
pay attention to that flag in its read/write functions, and continue with the
connect logic as soon as Necko calls us again, when the socket becomes
readable/writeable.
I suspected this bug could be the cause for bug 167764.
But they seem to be unrelated.

But we now have a patch for this one, and it seems to work on Linux and Mac OS X
(other platforms not yet tested).
Attached patch Patch v1 (obsolete) — Splinter Review
Javi can you please review? The changes are:

- no longer use blocking mode on initial write

- at the time we attempt the first write, if we see PR_WOULD_BLOCK_ERROR and no
bytes were written, do nothing (the socket transport will eventually let us try
again)

- execute the old fashion logic, if first write can not write bytes, and we see
a  result code different from PR_WOULD_BLOCK_ERROR
Attachment #102127 - Attachment is obsolete: true
Comment on attachment 102128 [details] [diff] [review]
Patch v1 - ignoring whitespace, larger context - please review this one


I take my review request back.

More testing showed the issue is more complicated. This patch breaks the TLS
intolerant detection (sigh).
Attachment #102128 - Attachment is obsolete: true
We have the following new problem:

The SSL error handling within PSM is decoupled from the error handling within
Mozilla's general network component (socket transport).

By always using non blocking I/O in PSM, we loose the ability to detect sudden
disconnects on initial write within PSM. The error will be seen by the socket
transport when doing the next select. In my particular test case, PR_POLL_ERR
was seen, and socket transport closed the socket.


I see a workaround for now, but for an ideal solution I think I need help from
both necko and NSS/SSL.


PSM can remember, whether it already tried to trigger the initial write, but NSS
returned PR_WOULD_BLOCK_ERROR, and therefore the reaction from the remote site
is not yet known.
When the socket transport closes the socket and calls PSM's I/O layer close
function, we can detect the remembered state, and make the decision for TLS
intolerance at that time.

But the problem is: At this time, we do not have the failure reason code
available (only inside necko, but not passed to us). We do not know whether a
network problem occurred, and whether TLS can still be used, or whether the
remote site did disconnect and we therefore should retry without TLS.


We have the choice to always decide the site is TLS intolerant, but this might
result in wrong TLS intolerance decisions, which means we don't use TLS in some
cases, although we should.


For now, I don't have a better idea how I could solve this problem on my own,
without changing socket transport or NSS. I suggest I'll attach a patch to
always retry the connection on failure, after having initiated the first write,
and before the first write completed.


However, I think we should try to minimize the TLS intolerance decisions.
I see two approaches for future work:
- make socket transport pass socket close reason information down to lower NSPR
fd layers
and/or
- transfer the decision, whether a site should be treated as TLS intolerant, to
a new NSS function (possibly passing available socket error codes). Maybe NSS
could make a resaonable decision, based on state of the initial handshake?
Blocks: 167764
Attached patch Patch v2 (obsolete) — Splinter Review
Javi, can you please review?
Status: NEW → ASSIGNED
Keywords: nsbeta1nsbeta1+
Attachment #102139 - Attachment is obsolete: true
Comment on attachment 102140 [details] [diff] [review]
Patch v2  - diff -u10 -w for easier reviewing


I found more arguments why this patch needs more work.

The patch from bug 162752 (not yet checked in) is not compatible with this new
logic. In general, SSL errors requiring an automatic retry because of TLS
intolerance can now happen at multiple locations.

Here is how I understand it:

As long as the underlying SSL handshake is not yet done, NSS will return
WOULD_BLOCK, even though it were able to transfer bytes as part of the
handshake. It can take multiple calls to read/write before the handshake
completes and finally application data can be transfered.

Because necko will see that neither reading nor writing works because of
WOULD_BLOCK, it calls both functions alternating.

In either of both functions could the handshake complete and NSS return the SSL
error to us.
Attachment #102140 - Attachment is obsolete: true
Darin, can you confirm the following is expected, and that we can rely on that
behaviour

When necko/http calls a lower layer write function, it will retry, if the write
function returns -1 with PR_CONNECT_RESET_ERROR.

When necko/http calls a lower layer read function, this same error will not
cause a retry. But I saw that returning 0 causes it to retry.


Because of the additional SSL logic going on behind the scenes, while necko
waits for the real application data to get transfered, and because we eventually
decide that an automatic retry makes sense, PSM wants to trigger that retry,
triggering using return values and error codes of read/write.
Attached patch Patch v3 (obsolete) — Splinter Review
This latest patch seem to work quite well for this bug, it also fixes bug
162752, and bug 167764 seems fixed, too (although wtc suspects the problem in
bug 167764 is only being masked with this fix).

Reviews and comments for my previous questions welcome.
Blocks: 162752
kai:

actually, it is the HTTP layer that detects a premature EOF.  it has configured
the socket transport to poll on read and write.  when poll wakes up (either for
read or write), the socket transport checks for PR_POLL_EXCEPT and PR_POLL_ERR.
 if either of these is true, then it cancels the socket transport read and write
requests with a status of NS_ERROR_NET_RESET.  that error code will be trapped
by the HTTP layer, and if the HTTP layer has not received any bytes for that
transaction, it will retry the transaction.

the HTTP layer will also retry the transaction if the first socket read returns
NS_OK and zero bytes read (indicating EOF).

finally, if the socket transport's calls to PR_Read or PR_Write fail with either
PR_CONNECT_ABORTED_ERROR or PR_CONNECT_RESET_ERROR, then socket transport read
and write requests will be canceled with a status of NS_ERROR_NET_RESET.  and
just like in the PR_POLL_EXCEPT/PR_POLL_ERR case, the HTTP layer will see
NS_ERROR_NET_RESET.  the HTTP layer will retry the transaction if it receives
this error before receiving any data on the socket.
Comment on attachment 102169 [details] [diff] [review]
Patch v3

r=javi

nit-pick:
This if statement has the potential to grow quite large become nearly
impossible to read as we addd more error codes that should be retried.	Perhaps
a method/function that takes the necessary parameters to return true/false to
determine if we wanna retry.
+      if (
+	   (!withInitialCleartext && PR_CONNECT_RESET_ERROR == err)
+	  ||
+	   (withInitialCleartext && PR_END_OF_FILE_ERROR == err)
+	  ||
+	   (SSL_ERROR_BAD_MAC_ALERT == err)
+	  ||
+	   (SSL_ERROR_BAD_MAC_READ == err)
+      ) {
Attachment #102169 - Flags: review+
Darin, thanks for the explanation. I see another oddity:

In my test case, the failure happens reproducable at the time necko calls our
read function.

Regardless whether I return PR_CONNECT_RESET_ERROR or PR_CONNECT_ABORT_ERROR or
the simple EOF, necko will indeed automatically retry the connection.

However, there is difference.

If the read function returns EOF (0 bytes read), when necko re-opens the
connection, it will again call our read and write functions.

However, if I return PR_CONNECT_RESET/ABORT_ERROR, necko will repeated call our
open/close functions, but will never call our read/write funtions!

Is that a bug?
At least it makes it necessary to return EOF in read for this patch to work.

Attached patch Patch v4 (obsolete) — Splinter Review
This new patch moves the tls intolerance decision to a separate function, as
Javi suggested.
kai: yeah, that sounds like a necko bug then :(
In comment 29, Kai observes that 

> Because necko will see that neither reading nor writing works because of
> WOULD_BLOCK, it calls both functions alternating.

This certainly seems like a bug.  Perhaps it was a hack to work around some
other flaw.  an http client should send the entire http request, including
any post data, and then read the response.  Any chance this alternating
read/write behavior can be fixed?
can someone provide a socket transport log demonstrating the alternating calls
to PR_Read and PR_Write.  i'm surprised that we would do that.  by design we do
initate both a write and a read at the same time.  this is done to pick up an
EOF on the read request.  i'll be the first to admit that our architecture is
far from clean, but it should be valid to poll for both read and write at the
same time.
Attached file Logfile for Darin
> can someone provide a socket transport log demonstrating the alternating
calls
> to PR_Read and PR_Write.

I've done the following:
- I applied the patch from this bug
- I tried to connect to https://www.e-girot.net/ewh/visafaktura
- because the site is TLS intolerant, the SSL handshake will fail, and no http
content will be transfered on the first connection attempt
- in addition to the patch, I added some printf statements to the SSL I/O
layer, the printf is the first statement in the function. Look for lines
starting with => in the logfile

The interesting part is up to SSL layer close. You can see the read/write
functions are being called repeatedly.

The second connect phase is the actual loading phase, that happens with TLS
disabled, and is able to execute the http request.
Blocks: 167227
Attached patch Patch v5 (obsolete) — Splinter Review
I'm making one more small change to the patch, that fixes also bug 167227. (I
notice the disconnect with TLS intolerance can also cause PR_END_OF_FILE_ERROR,
I had previously seen this only with proxies.)
Attachment #102312 - Attachment is obsolete: true
Comment on attachment 102447 [details] [diff] [review]
Patch v5

carrying forward r=javi
Attachment #102447 - Flags: review+
As discussed in http://bugzilla.mozilla.org/show_bug.cgi?id=106865#c32 
(comments 32 and later) PR_END_OF_FILE_ERROR reports an condition that
may be a "truncation attack" on SSL/TLS.  It has a different meaning
than EOF (e.g. read returning zero bytes).  

If no data has been sent, and PR_END_OF_FILE_ERROR is received, then
that could be a sign of TLS intolerance.

If the http request has already been sent, and PR_END_OF_FILE_ERROR is 
received, the client should NOT automatically retry.  In fact, the 
client should probably not retry at all, but instead an error dialog
should be displayed.  

So, the proper interpretation and handling of PR_END_OF_FILE_ERROR 
depends on knowledge of whether or not any data has been succesfully 
read from or written to the socket.
> If the http request has already been sent, and PR_END_OF_FILE_ERROR is 
> received, the client should NOT automatically retry.

Nelson, thanks for the reminder.

I can promise that we will never assume TLS intolerance, if we had already been
able to write content data.

The crypto code will know whether it has even been able to write any bytes for
this socket. Only if it has not yet been able to write any byte will it try to
detect TLS intolerance.


This new patch just relaxes a different constraint I had introduced earlier.
Without this new patch, when seeing PR_END_OF_FILE_ERROR, the code would only
assume TLS intolerance, if the communication is going over a proxy. I introduced
this restriction, because during my earlier testing with other TLS intolerant
sites, I only saw PR_END_OF_FILE_ERROR when talking over a proxy. When not
talking over a proxy, the error I saw was PR_CONNECT_RESET_ERROR.

But I learn from bug 167227, that experiencing PR_END_OF_FILE_ERROR can also
happen when not talking over a proxy.

Still, this new detection is only used when no content data bytes have yet been
transfered. As soon as the SSL level write function returns something other but
-1, we'll never again try to detect TLS intolerance for this open socket.


I assume you wanted to reconfirm this and the patch should be fine as is.
Darin, could you please sr ?

Nelson, please veto if I didn't understand you correctly.
Blocks: 166931
Darin, one question to your comment 38. You said:
> by design we do initate both a write and a read at the same time.

Does this mean, you call read and write simultaneously on separate threads?
I guess you are not, but you meant, you are just polling for both, but will call
read and write one after the other?
Re: comment 43
This explanation sounds right to me.  Thanks.
Comment on attachment 102447 [details] [diff] [review]
Patch v5

>Index: mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp

>+void nsNSSSocketInfo::SetHandshakeInProgress(PRBool aIsIn)
>+{
>+  mHandshakeInProgress = aIsIn;
>+}
>+
>+PRBool nsNSSSocketInfo::GetHandshakeInProgress()
>+{
>+  return mHandshakeInProgress;
>+}

nit: these should probably be implemented inline in the header file instead.


>+// Call this function to report a site that is possibly TLS intolerant.
>+// This function will return true, if the given socket is currently using TLS.
>+static PRBool
>+rememberPossibleTLSProblemSite(PRFileDesc* fd, nsNSSSocketInfo *socketInfo)

nit: most methods in this class begin with an uppercase letter.  is there any
reason why this should differ?


>+{
>+  PRBool currentlyUsesTLS = PR_FALSE;
>+
>+  SSL_OptionGet(fd->lower, SSL_ENABLE_TLS, &currentlyUsesTLS);
>+  if (currentlyUsesTLS) {
>+    // Add this site to the list of TLS intolerant sites.
>+    char buf[1024];
>+    PRInt32 port;
>+    nsXPIDLCString host;
>+    socketInfo->GetPort(&port);
>+    socketInfo->GetHostName(getter_Copies(host));
>+    HASH_STRING_KEY(buf,1024,host.get(),port);

nit: use sizeof(buf) instead of hardcoding 1024.

i wonder if it wouldn't be better to use:

  nsCAutoString buf;
  buf = host + NS_LITERAL_CSTRING(":") + nsPrintfCString("%d", port);

but, i see that you are just copying existing code, so up to you
which way you want to go.

>+    // We don't really wanna associate a value.  If it's
>+    // in the table, that means it's TLS intolerant and
>+    // we don't really need to know anything else.
>+    gTLSIntolerantSites->Put(&key, nsnull);

you might want to use nsCStringHashSet for this.  up to you.


>+    case PR_CONNECT_RESET_ERROR:
>+      if (!withInitialCleartext)
>+        return PR_TRUE;
>+      else
>+        break;

nit: "else" keyword is meaningless here.


>+  if (-1 == bytesTransfered && handleHandshakeResultNow) {

nit: maybe (0 > bytesTransfered) just to be safe.


fix these nits and sr=darin
Attachment #102447 - Flags: superreview+
re: comment #45, we poll on one thread, and then either read or write on that 
thread.  sorry for not being clear about that!
> we poll on one thread, and then either read or write on that thread.

ah good, thanks for confirming.
> these should probably be implemented inline in the header file instead.

done


> most methods in this class begin with an uppercase letter.  is there any
> reason why this should differ?

Actually, that's a function outside of any class, but static to the source file.
There are a couple of such functions already, and they all start with a
lowercase letter. Not changed.


> i wonder if it wouldn't be better to use:
>  buf = host + NS_LITERAL_CSTRING(":") + nsPrintfCString("%d", port);

convinced. That's a nice class. changed.


> you might want to use nsCStringHashSet for this.  up to you.

This required more changes all over the file...
But changed.


> nit: "else" keyword is meaningless here.

Changed.


> nit: maybe (0 > bytesTransfered) just to be safe.

changed.
Attached patch Patch v6Splinter Review
Attachment #102447 - Attachment is obsolete: true
Comment on attachment 102826 [details] [diff] [review]
Patch v6

carrying forward
Attachment #102826 - Flags: superreview+
Attachment #102826 - Flags: review+
Summary: Failed SSL disables all networking → Use of blocking I/O for SSL in PSM stalls network activity
The reporter of bug 153769 said this patch also fixes bug 153769.
Blocks: 153769
Comment on attachment 102826 [details] [diff] [review]
Patch v6

a=asa for checkin to 1.2beta (on behalf of drivers)
Attachment #102826 - Flags: approval+
Although we have approval to check in for 1.2 beta, I will wait until tomorrow
with checking it in.

I have uploaded test builds for Linux, Win32 and Mac Carbon to:
  http://www.kuix.de/mozilla/163605/

I think it is sufficient to test the various SSL protocols, like SMTP/SSL,
IMAP/SSL, HTTPS.

If you want to help, please check that no SSL functionality is regressing by
this patch.

(All builds are based on the trunk from 2002/10/15, so if you find other problems
in this build, please also check whether the problem is found in the builds at
ftp://ftp.mozilla.org/pub/mozilla/nightly/2002-10-15-08-trunk/ .)
Since I reported this bug, I felt obliged to do some testing. 

I tried the linux build with IMAP/SSL without problems. I ran https against
different servers with Apache, Stronghold, Netscape-Enterprise, Roxen, iPlanet
or IIS5 without any glitches, so it seems fine!
Blocks: 172791
Using kaie's test builds, I have found no regressions, and numerous bug fixes. I 
suggest the patch be checked into the trunk ASAP.
fixed on trunk
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified.
Status: RESOLVED → VERIFIED
*** Bug 180221 has been marked as a duplicate of this bug. ***
*** Bug 172241 has been marked as a duplicate of this bug. ***
*** Bug 181292 has been marked as a duplicate of this bug. ***
Product: PSM → Core
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: