Closed Bug 473197 Opened 16 years ago Closed 15 years ago

Invalid Data Transfer Interrupted error message when connecting to TLS-only site which requires a client certificate

Categories

(Core :: Security: PSM, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

Status: RESOLVED FIXED

People

(Reporter: tapio.niemi, Assigned: mayhemer)

Attachments

(4 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090112 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090112 Minefield/3.2a1pre

When connecting over a relatively slow connection to a site that requires an SSL client certificate to be presented, the error message "Data Transfer Interrupted - The connection to [site] was interrupted while the page was loading" appears, and Firefox will not connect to that site again until it is completely restarted.

It is easiest to reproduce this bug using a slow, high-latency connection such as GPRS or Flash-OFDM, but we have reproduced it everywhere we have tried by clicking the reload button very quickly and repeatedly; it just takes more effort on faster connections.

We have tested this with different server-side software (Apache and IIS) and operating systems (Windows XP, Mac OS X, Linux). The problem does not arise when using Safari on XP or OS X, nor with Internet Explorer. We have also monitored network activity and processor load on both ends of the connection, so this is not a real congestion situation, and we are certain that this is a bug in Firefox.

For your convenience, we have set up a test site requiring client authentication at https://www.wompat.fi. Simply install both the test CA certificate and the test client certificate, which I have attached, then follow the reproduction instructions using the above-mentioned site.

Reproducible: Always

Steps to Reproduce:
1. Access a site requiring client certificate authentication over a slow connection, or reload the page multiple times in quick succession.
Actual Results:  
"Data Transfer Interrupted" error message appears. Firefox won't connect to that site again until it is completely restarted.

Expected Results:  
The page should load normally.
Assignee: nobody → kaie
Component: General → Security: PSM
Product: Firefox → Core
QA Contact: general → psm
I have done some more testing, and I believe that whether SSLv3 or TLSv1 is used affects the problem. Our production servers and our wompat.fi test server are configured to accept only TLS connections. When I tested with the server configured to accept only SSLv3 connections, I could not reproduce the bug; switching back to TLS then reproduced it easily. This may of course be coincidental, but I think it is worth mentioning.
After some more testing, what I said in the last comment is now confirmed: only TLS is affected by this bug, not SSL. Looking at the server logs, I can see that subsequent requests are made over SSL after TLS fails (some kind of automatic fallback?). Without looking at the logs, this bug goes completely unnoticed on a server that accepts both SSL and TLS, whereas it shows the aforementioned behaviour on TLS-only servers. So I'm turning on SSL support server-side as a temporary workaround.
Summary: Invalid Data Transfer Interrupted error message when connecting to site which requires client SSL certificate → Invalid Data Transfer Interrupted error message when connecting to TLS-only site which requires a client certificate
Adding Nelson to this TLS-only bug (works with SSLv3).

Changing to confirmed based on the amount of testing reported.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Tapio, I believe I understand what's going on.

In the past we have encountered very badly behaving SSL servers.
There are some which simply "stall" when we try to connect in TLS mode.

Because of that we have introduced a timeout for completing a successful TLS handshake. If that timeout is hit, we conclude the server is incapable of speaking TLS and fall back to the older SSL protocol. This fallback is remembered for the remainder of a user's session.

The very slow connection could trigger the timeout.
Right now I don't have a good idea how to solve this for both failure scenarios.
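
As an illustration of the failure mode Kai describes, here is a minimal, scaled-down sketch (hypothetical host list, fake handshake, milliseconds standing in for seconds); it is not PSM's actual code, just the general shape of a timeout-triggered, remembered fallback.

#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <thread>

// Hosts marked "TLS intolerant"; in this sketch the set lives for the whole
// process, mirroring the "remembered for the remainder of the session" part.
static std::set<std::string> tlsIntolerantHosts;

// Stand-in for the real handshake; 'latency' models a slow GPRS-style link.
// Returns false when the handshake misses the deadline.
static bool FakeTlsHandshake(std::chrono::milliseconds latency,
                             std::chrono::milliseconds deadline) {
  std::this_thread::sleep_for(latency);
  return latency <= deadline;
}

static void Connect(const std::string& host, std::chrono::milliseconds latency) {
  const std::chrono::milliseconds kHandshakeDeadline(25);  // stands in for ~25 s
  if (tlsIntolerantHosts.count(host)) {
    // A TLS-only server cannot be reached this way, hence the persistent failure.
    std::cout << host << ": remembered as intolerant, falling back to SSLv3\n";
    return;
  }
  if (FakeTlsHandshake(latency, kHandshakeDeadline)) {
    std::cout << host << ": TLS handshake completed\n";
  } else {
    tlsIntolerantHosts.insert(host);  // one-way decision until restart
    std::cout << host << ": TLS handshake timed out, host marked intolerant\n";
  }
}

int main() {
  Connect("www.wompat.fi", std::chrono::milliseconds(30));  // slow link trips the timeout
  Connect("www.wompat.fi", std::chrono::milliseconds(1));   // now stuck on the fallback
}
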
I am using a proxy that allows me to set bandwidth throttling and emulate a modem. I set a round-trip delay of 2500 ms, and I think I can reproduce the problem with the test page. I am going to investigate further what we could do about this.

Kai, could you point me to the bug where the timeout was introduced, please?
Assignee: kaie → honzab.moz
Status: NEW → ASSIGNED
Someone needs to capture all of this with ssltap -sxlp.  See 
http://www.mozilla.org/projects/security/pki/nss/tools/ssltap.html
Attached file logs
Logs from ssltap and wireshark:

first-run-pcap and first-run-ssltap are logs from when FF was started and the site was loaded for the first time (no SSL session cached). I was able to load the page successfully.

refreshes-pcap and refreshes-ssltap are logs of 3 quick refreshes of the page. I regularly got Connection Interrupted in these cases. After a Client Hello with Session ID Length=0, the server closes with FIN. Looks like a wrong server configuration?

The environment:
FF is set up to use a proxy on localhost:8888 that has throttling enabled, and connects (through the proxy) to ssltap on localhost:8889, which is configured to connect to www.wompat.fi:443. Wireshark just captures everything on port 443.
Honza, Welcome back!

I studied your log files this afternoon.  All the issues here are related to
Firefox's handling of "TLS intolerant" servers.  Here is a URL for a bugzilla
search that will reveal related bugs.  

https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=intolerant&resolution=---&emailassigned_to1=1&emailreporter1=1&emailcc1=1&emaillongdesc1=1&emailtype1=exact&email1=nelson%40bolyard.me&chfieldfrom=2004-01-01&chfieldto=Now&chfield=%5BBug+creation%5D

The "refreshes" logs reveal that this server is an "SSL intolerant" server.
It is the exact opposite of TLS intolerant servers.  It ONLY accepts TLS 
client hellos, and not client hellos in sslv2 compatible format.  Presently
when PSM decides that a server is "TLS intolerant", it stops attempting to
use TLS with that server for the remainder of the process lifetime.  That 
is why, once this problem begins, it remains until the browser is restarted.
The first time that PSM concludes that an attempt to do a TLS handshake has
failed, it reverts to "TLS intolerant mode", which will never work with an
SSL-intolerant server like this one.

There are several relevant issues here.
1) Once PSM has successfully negotiated TLS with a server, for the remainder 
of that process lifetime, it should not attempt to revert to "TLS Intolerant" 
mode for that server.  It should remember that the server is TLS capable,
and leave well enough alone.  This is bug 412834.

2) Clicking the browser's stop button before a TLS connection has successfully completed negotiating a handshake is reported to cause PSM to mark the server
whose handshake had not completed as being TLS intolerant.  That is bug 498311.
I believe that is relevant to this issue.  

3) PSM's decision to mark a server as TLS Intolerant is a one-way street.  
It has been suggested that PSM ought to try TLS, and if that fails, try SSL,
but if that fails, go back to trying TLS, and keep toggling back and forth.
That is bug 412833.  That would also help with this server.
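
For illustration only, here is a tiny sketch of the toggling idea in point 3; it is hypothetical and not the actual patch proposed in bug 412833. Instead of a one-way flag, a per-host preference flips whenever the currently preferred protocol fails.

#include <iostream>
#include <map>
#include <string>

enum class Proto { TLS, SSL3 };

// Per-process memory of which protocol to try first for each host.
static std::map<std::string, Proto> preferred;

// 'serverSpeaksOnlyTls' models an SSL-intolerant (TLS-only) server like this one.
static bool TryConnect(const std::string& host, bool serverSpeaksOnlyTls) {
  Proto& p = preferred.try_emplace(host, Proto::TLS).first->second;
  const bool ok = (p == Proto::TLS) ? serverSpeaksOnlyTls : !serverSpeaksOnlyTls;
  std::cout << host << ": " << (p == Proto::TLS ? "TLS" : "SSLv3")
            << (ok ? " succeeded\n" : " failed, toggling for the next attempt\n");
  if (!ok)
    p = (p == Proto::TLS) ? Proto::SSL3 : Proto::TLS;  // not a one-way street
  return ok;
}

int main() {
  TryConnect("www.wompat.fi", true);            // TLS works
  preferred["www.wompat.fi"] = Proto::SSL3;     // pretend a spurious timeout demoted us
  TryConnect("www.wompat.fi", true);            // SSLv3 fails against a TLS-only server...
  TryConnect("www.wompat.fi", true);            // ...so the next attempt is back on TLS
}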

The "first run" logs show an interesting situation that I suspect is caused
by the testing apparatus.  While this problem is interesting, I'm not at all
certain that this is the same problem as originally experienced and reported.
It may be another problem with similar superficial symptoms.  Soon, I will 
attach a simplified (and much shorter) version
of the first run pcap log, from which the behavior may be more obvious.
Version: unspecified → Trunk
Attached file Simplified log file (obsolete)
This bug was reported against a January 12 build of Firefox trunk (shiretoko), 
IINM.  Could it be that this build does not have the fix for PSM bug 365898 ?

In the log files, we see occasional seemingly-random delays, ranging from a
fraction of a second to 21 seconds.  Despite these delays, the first connection
apparently succeeds, but takes about 30 seconds to do so.  This is very close
to PSM's TLS intolerant server timeout value (see bug 365898).  The second
connection attempt shows the handshake completing, the client sending an https request, and the server sending back a rather large reply, exceeding 128K bytes.

The client system offers a TCP "window" (effectively a buffer into which 
received data may be stored) of 64KB-1 bytes to the server.  The server sends 
this data to the client quickly.  The client OS acknowledges the receipt of this data quickly, but the data is evidently not taken from the OS's buffers for some time.  Consequently, we see the size of the window shrink down from 
64KB-1 to 0 (frames 95-165) as the buffers fill.  Then after about 2 seconds,  the data gets taken from the OS buffers, reopening the window (frames 166-169), 
and the server quickly sends another 64KB-1 bytes, causing the client OS's buffers to fill again and the window to go to zero again (frames 169-240).  
But this time, after the window closes (the buffers are full) the data is not
taken from the OS's buffers for over 5 seconds (frames 241-249), during which
time the server repeatedly asks "are you still there?" by sending keep-alives.

Finally, after 5 seconds of playing keep-alive, the server gives up and aborts
the connection to the browser.  It is at this point that the browser 
*correctly* reports that the transfer has been interrupted.  It *HAS* been 
interrupted by the server. 

Now, I would say that a 5 second keep-alive time limit for the server is WAY too short.  If this is really what was going on in the reporter's original 
experience with this problem, then I would say this bug is invalid.  The 
browser is not incorrect in reporting that the connection was interrupted. 

But I think it is possible that the behavior seen in this pcap log is quite
different from what was actually experienced by the user without the "throttle" proxy in between.  So, I would suggest that the reporter attempt to use the
pcap capture tool to capture the events he naturally experiences.  Perhaps
Honza can provide some guidance about how to do that.  

In any case, I see no evidence of any misbehavior of any TLS code on either 
the client or server side.  This is a TCP connection issue, not a TLS issue.
If the reporter only sees this with TLS (that is, with https), then I suggest
it may be that the https server has a much shorter keep alive timeout than 
the http (not TLS) server uses.  

In summary, I see a server with a short TCP keep-alive timeout time, and a 
client that is quick to declare TLS intolerance, even after having successfully negotiated TLS with the server.  But I do not see any false reports of transfer interruption.  The transfer interruption reports are correct.  

If it can be shown that the server's keep-alive timeout is causing this 
problem, even in the absence of a "throttle proxy", then IMO, this bug is 
invalid.  I would encourage the reporter to also join the CC list on the other
bugs cited above.
Oh, one more question for Honza.  What version of ssltap did you use? 
Numerous old versions of ssltap have a bug that causes it to incorrectly 
account for the sizes of large messages.  I wonder if you used one of those.
(In reply to comment #12)
> Oh, one more question for Honza.  What version of ssltap did you use? 
> Numerous old versions of ssltap have a bug that causes it to incorrectly 
> account for the sizes of large messages.  I wonder if you used one of those.

Version: $Revision: 1.13 $ ($Date: 2009/03/13 02:24:07 $) $Author: nelson%bolyard.com $

According to the cvs log, it should be the most recent one.
(In reply to comment #11)
> This bug was reported against a January 12 build of Firefox trunk (shiretoko), 
> IINM.  Could it be that this build does not have the fix for PSM bug 365898 ?
> 

If you mean this patch https://bugzilla.mozilla.org/attachment.cgi?id=257720&action=diff then it is in Shiretoko.

To explain my test environment in detail:
I have a proxy that throttles and has the round-trip latency set to 2500 ms; I am not aware of the throttle implementation. Firefox is configured to use this proxy for all connection types. I run ssltap independently, configured to connect to www.wompat.fi:443. I connect from Firefox to ssltap (with a certificate exception). The connection chain then looks like this:

Firefox --> Throttling proxy:8888 --> ssltap:8889 -(Wireshark)-> www.wompat.fi:443

(Wireshark) marks the position where I capture. So, IMO the proxy behaves by delaying reads from its input buffers (the same way ISPs throttle your download speed). It is probably not exactly what we would normally see when capturing directly.

To capture, I'm using Wireshark with an initial capture filter of "tcp port 443" on the active adapter. I can give some more directions on how to use it. I have no experience with capturing on e.g. GPRS modems.
Thanks to everybody for finally taking this bug seriously! We ourselves didn't believe it to be real, despite repeated reports from some of our users, until personally visiting one of them.

I agree that the title is misleading, now that we know better what's going on. Feel free to change it to something more descriptive.

I believe that what Kai said in comment #6 is what's happening, and what Nelson said in comment #10, point 1) is the solution. When I put the workaround of allowing SSLv2 in place, this problem vanished from production.

I can provide Wireshark/Ethereal logs produced using the original, not-so-scientific reproduction method of keep-on-clicking-reload. It produces messy, hard-to-follow logs that show new connections starting before the old ones have completely closed, but drop me a message if you really want them. It's difficult to obtain non-artificial captures, as this rarely happens in a clean success-failure sequence. It's more like: load, click a link, don't bother to wait until the load is complete, click some other link, repeat a few times - bang! Now you think of TCP problems - it sounds a lot like that, but then, why did enabling SSLv2 work so well as a workaround? And if the problem were in the operating system, wouldn't it surface in other browsers as well?

One idea I had: could this TLS-intolerance timeout be made user-configurable? That might help, too.

It's not really a big problem if this happens once in a while; people with poor connections are used to quirks. They are also used to clicking reload as a Swiss army knife for any connection problem. But now Firefox won't respond to a reload in the expected way. And that's what got our helpdesk phones ringing: people see that the rest of the internet is working, so they call the helpdesk and say that our server is not responding. That's the real-life scenario affecting our business.
I'm not interested in any results that come from clicking the stop 
button or the reload button before the previous page load has finished, 
because all such tests do is confirm bug 498311.

In summary, all the issues shown by existing attempts to reproduce this bug 
are either:
a) already known bugs (see comment 10), or 
b) for one issue, the fault of the server (process or OS).  
   (5 seconds is WAY too short for a keep alive timeout.)

The reported issue (transfer interrupted) is a server bug, not a browser bug.
We could change the title of this bug, but then it would become a duplicate
of one of the bugs cited in comment 10. So I think the best thing to do 
with THIS bug is to mark it invalid, to encourage the server's keep-alive 
timeout timer to be increased to a reasonable value.  But I'll leave the 
final resolution of this bug up to Honza.
Tapio: can you verify that there is a timeout value close to 5 seconds set on your server, to confirm Nelson's conclusion? If you find this to be true, can you reset the value to the TCP standard?

I did another test. I set up the proxy to pass only 2 kbps up and down and delay for 6000 seconds. This means almost dead connectivity, and I would expect to reproduce the problem very well. But I discovered something else: we end up with a TLS timeout after 32 secs (not sure why not after 25, as defined). We do 9 other retries at the level of the http connection. Then we finish with a connection RESET (not interrupt).

So I suspect this is a dup of bug 498311, because I don't believe people would not try to reload the page again with F5 or the reload button. I'd also say this is a con of prolonging the timeout from 8 to 25 seconds.
6000 MILLIseconds :)
A connection reset *is* an interrupt.
The original report is about "interrupted". The problem when pressing the stop/reload button is also "interrupted". What I observe in the last test is the "reset" message. Those two errors are distinct in Firefox.

http://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#3553
http://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#3647

and 

http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#179

Interrupt means we got EOF from the server (a FIN/ACK packet). This would indicate that the server received something in the client handshake that it was not willing to continue with.
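
As a small aside, the distinction can be illustrated with plain POSIX sockets (this is not the Necko code linked above): an orderly FIN shows up as a zero-byte read, while an RST shows up as a read error.

#include <cerrno>
#include <cstdio>
#include <sys/socket.h>
#include <unistd.h>

// Roughly the conditions behind the "interrupted" and "reset" error pages.
static const char* ClassifyRead(int fd) {
  char buf[4096];
  ssize_t n = recv(fd, buf, sizeof(buf), 0);
  if (n > 0)
    return "data";                            // normal traffic
  if (n == 0)
    return "interrupted (EOF: peer sent FIN)";
  if (errno == ECONNRESET)
    return "reset (peer sent RST)";
  return "other error";                       // timeout, unreachable, etc.
}

int main() {
  int sv[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
  close(sv[1]);                               // peer goes away cleanly, like a FIN
  std::printf("%s\n", ClassifyRead(sv[0]));
  close(sv[0]);
}
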
I've checked my server config. The server is a standard Apache 2.2.11, which indeed has a default KeepAliveTimeout of 5 seconds. I've changed it to 60 s and tested again, with no effect. However, if I understand correctly, Nelson is talking about TCP keep-alive and not application-layer keep-alive anyway.

Those values are quite high by default; for example, the Apache TimeOut directive is at 300 s, which is the default. (But again, I don't think it has any effect on TCP keep-alive.) The same goes for my server OS (Linux):
# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

It might of course be possible that these are hardcoded into Apache, if it is possible for an application to control them in the first place. I'm no expert on these matters, just trying to provide the requested information.

I see that there are a lot of related bugs, and I do not deny that this could be a duplicate, say of bug 412833 or bug 412834, or invalid, but that wasn't clear to me before getting more information here.
Yes, I am referring to TCP keep-alives, not to http keep-alives.
The wireshark output (see the simplified log above) shows (lines 241-249)
that the server system sent 4 TCP keep-alives (not 9), 1-3 seconds apart 
(not 75), and that each of them was answered (acknowledged) immediately,
and it gave up after ~5 seconds, not 7200.  

An application can alter some of the TCP parameters, such as the maximum 
number of *unacknowledged* keep-alives sent before giving up.  But unless 
your kernel's TCP implementation is broken, that should not be the reason 
for this, because all the keep-alives were quickly acknowledged.  

I wonder if TCP retry counts (e.g. tcp_retries1, tcp_retries2) are the issue.
I think that TCP keep-alives should not count as retries, but maybe this 
kernel is counting them that way.
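
To illustrate the point that an application can alter some of these TCP parameters, here is a minimal Linux-specific sketch that overrides the system-wide keep-alive sysctls on a single socket; neither Apache nor Firefox is claimed to actually do this here.

#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int on = 1;
  int idle = 7200;   // seconds of idle time before the first probe (tcp_keepalive_time)
  int intvl = 75;    // seconds between probes (tcp_keepalive_intvl)
  int cnt = 9;       // unanswered probes before the connection is dropped (tcp_keepalive_probes)
  // The values above simply mirror the kernel defaults Tapio listed earlier.
  setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof(on));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
  std::printf("per-socket keep-alive parameters set on fd %d\n", fd);
  close(fd);
}
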
(In reply to comment #11)
> But this time, after the window closes (the buffers are full) the data is not
> taken from the OS's buffers for over 5 seconds (frames 241-249), during which
> time the server repeatedly asks "are you still there?" by sending keep-alives.
> 
> Finally, after 5 seconds of playing keep-alive, the server gives up and aborts
> the connection to the browser.  It is at this point that the browser 
> *correctly* reports that the transfer has been interrupted.  It *HAS* been 
> interrupted by the server. 

What I see in the log is an interruption from the side of the client and not of the server. Frame 249 is a RST sent by FF.

I'd say the interruption error is caused by users who stop the load and reload the page again. This is IMO a dup of bug 498311.
Honza, thanks for noticing that the RST came from the client OS.
Somehow, the whole S->C and C->S thing got completely (but consistently) 
messed up.  It is corrected in this version.

My analysis is the same as before, except for the sending of the RST.
Offhand, at the moment, I can't think of anything an application can do to 
a connected socket that will cause the OS to send a RST without first sending
a FIN, and I didn't see the client sending any FIN.  Did I miss that? 
Or did it not happen?  That seems bogus.  But it's an OS issue, not an app 
issue.
Attachment #385703 - Attachment is obsolete: true
Nelson, please remember that what I capture is the communication between the proxy and the server, *NOT* between Firefox and the proxy.
Yes, I know, but regardless of what the application is, an OS should not 
send a RST on a connection under these circumstances.  It should send a FIN.
But, in any case, the sending of the RST is not a Firefox bug.
This is not a matter of the OS stack, it is a matter of the app.

You can call shutdown() on the socket and then close() after you get a read error (indicating an RST was received, or a timeout/hw error) or read zero bytes on the socket (indicating a FIN was received). This leads to sending just a FIN from our side and is considered a graceful shutdown. An RST should not be sent when close() is called unless we have not yet received a FIN from the other side.

Or, you can call close() directly on the socket, which forces the TCP session to close by sending an RST; from that moment, reading and writing on the socket are impossible.
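
A minimal illustration of the two shutdown styles with plain POSIX sockets (not the NSPR layer Firefox actually uses); note that on most stacks a bare close() sends a FIN unless unread data remains or lingering is disabled, so the RST case is forced explicitly here with SO_LINGER.

#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

// Graceful shutdown: stop sending, drain the peer's remaining data, then close.
// Our side emits a FIN; no RST is involved.
static void GracefulClose(int fd) {
  shutdown(fd, SHUT_WR);                       // our FIN goes out here
  char buf[1024];
  while (recv(fd, buf, sizeof(buf), 0) > 0)    // read until the peer's FIN (0) or an error
    ;
  close(fd);
}

// Abortive close: a zero linger timeout makes close() discard queued data
// and emit an RST instead of a FIN.
static void AbortiveClose(int fd) {
  struct linger lg;
  std::memset(&lg, 0, sizeof(lg));
  lg.l_onoff = 1;
  lg.l_linger = 0;
  setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
  close(fd);
}

int main() {
  int sv[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
  AbortiveClose(sv[1]);      // peer goes away first (on AF_UNIX this is only an API demo)...
  GracefulClose(sv[0]);      // ...so the drain loop sees EOF immediately
}
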
Tapio, can you try one of the builds for your platform at this URL to check whether the problem is gone? Thanks.

https://build.mozilla.org/tryserver-builds/honzab.moz@firemni.cz-bug473197/
Tapio: ping.
Sorry for the delay, Honza, I've been on summer vacation (actually still am). I just tested again on a Flash-OFDM connection with poor radio conditions, first with my old Minefield build (reproduced), then with your build - no problems. Many thanks if this really is fixed now :-)
Fixed in bug 412834 and bug 412833.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED