Closed Bug 473197 Opened 16 years ago Closed 15 years ago

Invalid Data Transfer Interrupted error message when connecting to TLS-only site which requires a client certificate

Categories

(Core :: Security: PSM, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

Status: RESOLVED FIXED

People

(Reporter: tapio.niemi, Assigned: mayhemer)

Attachments

(4 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090112 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090112 Minefield/3.2a1pre

When connecting over a relatively slow connection to a site that requires an SSL client certificate to be presented, the error message "Data Transfer Interrupted - The connection to [site] was interrupted while the page was loading" appears, and Firefox will not connect to that site again until it is completely restarted.

It is easiest to reproduce this bug using a slow, high-latency connection such as GPRS or Flash-OFDM, but we have reproduced it everywhere we have tried by clicking the reload button very quickly and repeatedly; it just takes more effort on faster connections.

We have tested this with different server-side software (Apache and IIS) and operating systems (Windows XP, Mac OS X, Linux). The problem does not arise when using Safari on XP or OS X, nor with Internet Explorer. We have also monitored network activity and processor load on both ends of the connection, so this is not a real congestion situation, and we are certain that this is a bug in Firefox.

For your convenience, we have set up a test site requiring client authentication at https://www.wompat.fi. Simply install both the test CA certificate and the test client certificate, which I have attached, then follow the reproduction instructions using the above-mentioned site.

Reproducible: Always

Steps to Reproduce:
1. Access a site requiring client certificate authentication over a slow connection, or reload the page multiple times in quick succession.
Actual Results:  
"Data Transfer Interrupted" error message appears. Firefox won't connect to that site again until it is completely restarted.

Expected Results:  
The page should load normally.
Assignee: nobody → kaie
Component: General → Security: PSM
Product: Firefox → Core
QA Contact: general → psm
I have done some more testing, and I believe that whether SSLv3 or TLSv1 is used affects the problem. Our production servers and our wompat.fi test server are configured to accept only TLS connections. When I tested with the server configured to accept only SSLv3 connections, I could not reproduce the bug; switching back to TLS then reproduced it easily. This may of course be coincidental, but I think it is worth mentioning.
After some more testing, what I said in the last comment is now confirmed: only TLS is affected by this bug, not SSL. Looking at the server logs, I can see that subsequent requests are made over SSL after TLS fails (some kind of automatic fallback?). Without looking at the logs, this bug goes completely unnoticed on a server that accepts both SSL and TLS, whereas it shows the aforementioned behaviour on TLS-only servers. So I'm turning on SSL support server-side as a temporary workaround.
Summary: Invalid Data Transfer Interrupted error message when connecting to site which requires client SSL certificate → Invalid Data Transfer Interrupted error message when connecting to TLS-only site which requires a client certificate
Adding Nelson to this TLS-only bug (works with SSLv3).

Changing to confirmed based on the amount of testing reported.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Tapio, I believe I understand what's going on.

In the past we have encountered very badly behaving SSL servers.
There are some which simply "stall" when we try to connect in TLS mode.

Because of that we have introduced a timeout for completing a successful TLS handshake. If that timeout is hit, we conclude the server is incapable of speaking TLS and fall back to the older SSL protocol. This fallback is remembered for the remainder of a user's session.

The very slow connection could trigger the timeout.
Right now I don't have a good idea how to solve this for both failure scenarios.
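
As an illustration of the failure mode Kai describes, here is a minimal, scaled-down sketch (hypothetical host list, fake handshake, milliseconds standing in for seconds); it is not PSM's actual code, just the general shape of a timeout-triggered, remembered fallback.

#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <thread>

// Hosts marked "TLS intolerant"; in this sketch the set lives for the whole
// process, mirroring the "remembered for the remainder of the session" part.
static std::set<std::string> tlsIntolerantHosts;

// Stand-in for the real handshake; 'latency' models a slow GPRS-style link.
// Returns false when the handshake misses the deadline.
static bool FakeTlsHandshake(std::chrono::milliseconds latency,
                             std::chrono::milliseconds deadline) {
  std::this_thread::sleep_for(latency);
  return latency <= deadline;
}

static void Connect(const std::string& host, std::chrono::milliseconds latency) {
  const std::chrono::milliseconds kHandshakeDeadline(25);  // stands in for ~25 s
  if (tlsIntolerantHosts.count(host)) {
    // A TLS-only server cannot be reached this way, hence the persistent failure.
    std::cout << host << ": remembered as intolerant, falling back to SSLv3\n";
    return;
  }
  if (FakeTlsHandshake(latency, kHandshakeDeadline)) {
    std::cout << host << ": TLS handshake completed\n";
  } else {
    tlsIntolerantHosts.insert(host);  // one-way decision until restart
    std::cout << host << ": TLS handshake timed out, host marked intolerant\n";
  }
}

int main() {
  Connect("www.wompat.fi", std::chrono::milliseconds(30));  // slow link trips the timeout
  Connect("www.wompat.fi", std::chrono::milliseconds(1));   // now stuck on the fallback
}
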
I am using a proxy that allows me to set bandwidth throttling and emulate a modem. I set a round-trip delay of 2500 ms, and I think I can reproduce the problem with the test page. I am going to investigate further what we could do about this.

Kai, could you point me to the bug where the timeout was introduced, please?
Assignee: kaie → honzab.moz
Status: NEW → ASSIGNED
Someone needs to capture all of this with ssltap -sxlp.  See 
http://www.mozilla.org/projects/security/pki/nss/tools/ssltap.html
Attached file logs
Logs from ssltap and wireshark:

first-run-pcap and first-run-ssltap are logs from when FF was started and the site was loaded for the first time (no SSL session cached). I was able to load the page successfully.

refreshes-pcap and refreshes-ssltap are logs of 3 quick refreshes of the page. I regularly got Connection Interrupted in these cases. After a Client Hello with Session ID Length=0, the server closes with FIN. Looks like a wrong server configuration?

The environment:
FF is set up to use a proxy on localhost:8888 that has throttling enabled, and connects (through the proxy) to ssltap on localhost:8889, which is configured to connect to www.wompat.fi:443. Wireshark just captures everything on port 443.
Honza, Welcome back!

I studied your log files this afternoon.  All the issues here are related to
Firefox's handling of "TLS intolerant" servers.  Here is a URL for a bugzilla
search that will reveal related bugs.  

https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=intolerant&resolution=---&emailassigned_to1=1&emailreporter1=1&emailcc1=1&emaillongdesc1=1&emailtype1=exact&email1=nelson%40bolyard.me&chfieldfrom=2004-01-01&chfieldto=Now&chfield=%5BBug+creation%5D

The "refreshes" logs reveal that this server is an "SSL intolerant" server.
It is the exact opposite of TLS intolerant servers.  It ONLY accepts TLS 
client hellos, and not client hellos in sslv2 compatible format.  Presently
when PSM decides that a server is "TLS intolerant", it stops attempting to
use TLS with that server for the remainder of the process lifetime.  That 
is why, once this problem begins, it remains until the browser is restarted.
The first time that PSM concludes that an attempt to do a TLS handshake has
failed, it reverts to "TLS intolerant mode", which will never work with an
SSL-intolerant server like this one.

There are several relevant issues here.
1) Once PSM has successfully negotiated TLS with a server, for the remainder 
of that process lifetime, it should not attempt to revert to "TLS Intolerant" 
mode for that server.  It should remember that the server is TLS capable,
and leave well enough alone.  This is bug 412834.

2) Clicking the browser's stop button before a TLS connection has successfully completed negotiating a handshake is reported to cause PSM to mark the server
whose handshake had not completed as being TLS intolerant.  That is bug 498311.
I believe that is relevant to this issue.  

3) PSM's decision to mark a server as TLS Intolerant is a one-way street.  
It has been suggested that PSM ought to try TLS, and if that fails, try SSL,
but if that fails, go back to trying TLS, and keep toggling back and forth.
That is bug 412833.  That would also help with this server.
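
For illustration only, here is a tiny sketch of the toggling idea in point 3; it is hypothetical and not the actual patch proposed in bug 412833. Instead of a one-way flag, a per-host preference flips whenever the currently preferred protocol fails.

#include <iostream>
#include <map>
#include <string>

enum class Proto { TLS, SSL3 };

// Per-process memory of which protocol to try first for each host.
static std::map<std::string, Proto> preferred;

// 'serverSpeaksOnlyTls' models an SSL-intolerant (TLS-only) server like this one.
static bool TryConnect(const std::string& host, bool serverSpeaksOnlyTls) {
  Proto& p = preferred.try_emplace(host, Proto::TLS).first->second;
  const bool ok = (p == Proto::TLS) ? serverSpeaksOnlyTls : !serverSpeaksOnlyTls;
  std::cout << host << ": " << (p == Proto::TLS ? "TLS" : "SSLv3")
            << (ok ? " succeeded\n" : " failed, toggling for the next attempt\n");
  if (!ok)
    p = (p == Proto::TLS) ? Proto::SSL3 : Proto::TLS;  // not a one-way street
  return ok;
}

int main() {
  TryConnect("www.wompat.fi", true);            // TLS works
  preferred["www.wompat.fi"] = Proto::SSL3;     // pretend a spurious timeout demoted us
  TryConnect("www.wompat.fi", true);            // SSLv3 fails against a TLS-only server...
  TryConnect("www.wompat.fi", true);            // ...so the next attempt is back on TLS
}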

The "first run" logs show an interesting situation that I suspect is caused
by the testing apparatus.  While this problem is interesting, I'm not at all
certain that this is the same problem as originally experienced and reported.
It may be another problem with similar superficial symptoms.  Soon, I will 
attach a simplified (and much shorter) version
of the first run pcap log, from which the behavior may be more obvious.
Version: unspecified → Trunk
Attached file Simplified log file (obsolete)
This bug was reported against a January 12 build of Firefox trunk (shiretoko), 
IINM.  Could it be that this build does not have the fix for PSM bug 365898 ?

In the log files, we see occasional seemingly-random delays, ranging from a
fraction of a second to 21 seconds.  Despite these delays, the first connection
apparently succeeds, but takes about 30 seconds to do so.  This is very close
to PSM's TLS intolerant server timeout value (see bug 365898).  The second
connection attempt shows the handshake completing, the client sending an https request, and the server sending back a rather large reply, exceeding 128K bytes.

The client system offers a TCP "window" (effectively a buffer into which 
received data may be stored) of 64KB-1 bytes to the server.  The server sends 
this data to the client quickly.  The client OS acknowledges the receipt of this data quickly, but the data is evidently not taken from the OS's buffers for some time.  Consequently, we see the size of the window shrink down from 
64KB-1 to 0 (frames 95-165) as the buffers fill.  Then after about 2 seconds,  the data gets taken from the OS buffers, reopening the window (frames 166-169), 
and the server quickly sends another 64KB-1 bytes, causing the client OS's buffers to fill again and the window to go to zero again (frames 169-240).  
But this time, after the window closes (the buffers are full) the data is not
taken from the OS's buffers for over 5 seconds (frames 241-249), during which
time the server repeatedly asks "are you still there?" by sending keep-alives.

Finally, after 5 seconds of playing keep-alive, the server gives up and aborts
the connection to the browser.  It is at this point that the browser 
*correctly* reports that the transfer has been interrupted.  It *HAS* been 
interrupted by the server. 

Now, I would say that a 5 second keep-alive time limit for the server is WAY too short.  If this is really what was going on in the reporter's original 
experience with this problem, then I would say this bug is invalid.  The 
browser is not incorrect in reporting that the connection was interrupted. 

But I think it is possible that the behavior seen in this pcap log is quite
different from what was actually experienced by the user without the "throttle" proxy in between.  So, I would suggest that the reporter attempt to use the
pcap capture tool to capture the events he naturally experiences.  Perhaps
Honza can provide some guidance about how to do that.  

In any case, I see no evidence of any misbehavior of any TLS code on either 
the client or server side.  This is a TCP connection issue, not a TLS issue.
If the reporter only sees this with TLS (that is, with https), then I suggest
it may be that the https server has a much shorter keep alive timeout than 
the http (not TLS) server uses.  

In summary, I see a server with a short TCP keep-alive timeout time, and a 
client that is quick to declare TLS intolerance, even after having successfully negotiated TLS with the server.  But I do not see any false reports of transfer interruption.  The transfer interruption reports are correct.  

If it can be shown that the server's keep-alive timeout is causing this 
problem, even in the absence of a "throttle proxy", then IMO, this bug is 
invalid.  I would encourage the reporter to also join the CC list on the other
bugs cited above.
Oh, one more question for Honza.  What version of ssltap did you use? 
Numerous old versions of ssltap have a bug that causes it to incorrectly 
account for the sizes of large messages.  I wonder if you used one of those.
(In reply to comment #12)
> Oh, one more question for Honza.  What version of ssltap did you use? 
> Numerous old versions of ssltap have a bug that causes it to incorrectly 
> account for the sizes of large messages.  I wonder if you used one of those.

Version: $Revision: 1.13 $ ($Date: 2009/03/13 02:24:07 $) $Author: nelson%bolyard.com $

According to the cvs log, it should be the most recent one.
(In reply to comment #11)
> This bug was reported against a January 12 build of Firefox trunk (shiretoko), 
> IINM.  Could it be that this build does not have the fix for PSM bug 365898 ?
> 

If you mean this patch https://bugzilla.mozilla.org/attachment.cgi?id=257720&action=diff then it is in Shiretoko.

To explain my test environment in detail:
I have a proxy that throttles and has the round-trip latency set to 2500 ms; I am not aware of the throttle implementation. Firefox is configured to use this proxy for all connection types. I run ssltap independently, configured to connect to www.wompat.fi:443. I connect from Firefox to ssltap (with a certificate exception). The connection chain then looks like this:

Firefox --> Throttling proxy:8888 --> ssltap:8889 -(Wireshark)-> www.wompat.fi:443

(Wireshark) marks the position where I capture. So, IMO the proxy behaves by delaying reads from its input buffers (the same way ISPs throttle your download speed). It is probably not exactly what we would normally see when capturing directly.

To capture, I'm using Wireshark with an initial capture filter of "tcp port 443" on the active adapter. I can give some more directions on how to use it. I have no experience with capturing on e.g. GPRS modems.
Thanks to everybody for finally taking this bug seriously! We ourselves didn't believe it to be real, despite repeated reports from some of our users, until personally visiting one of them.

I agree that the title is misleading, now that we know better what's going on. Feel free to change it to something more descriptive.

I believe that what Kai said in comment #6 is what's happening, and what Nelson said in comment #10, point 1) is the solution. When I put the workaround of allowing SSLv2 in place, this problem vanished from production.

I can provide Wireshark/Ethereal logs produced using the original, not-so-scientific reproduction method of keep-on-clicking-reload. It produces messy, hard-to-follow logs that show new connections starting before the old ones have completely closed, but drop me a message if you really want them. It's difficult to obtain non-artificial captures, as this rarely happens in a clean success-failure sequence. It's more like: load, click a link, don't bother to wait until the load is complete, click some other link, repeat a few times - bang! Now you think of TCP problems - it sounds a lot like that, but then, why did enabling SSLv2 work so well as a workaround? And if the problem were in the operating system, wouldn't it surface in other browsers as well?

One idea I had: could this TLS-intolerance timeout be made user-configurable? That might help, too.

It's not really a big problem if this happens once in a while; people with poor connections are used to quirks. They are also used to clicking reload as a Swiss army knife for any connection problem. But now Firefox won't respond to a reload in the expected way. And that's what got our helpdesk phones ringing: people see that the rest of the internet is working, so they call the helpdesk and say that our server is not responding. That's the real-life scenario affecting our business.
I'm not interested in any results that come from clicking the stop 
button or the reload button before the previous page load has finished, 
because all such tests do is confirm bug 498311.

In summary, all the issues shown by existing attempts to reproduce this bug 
are either:
a) already known bugs (see comment 10), or 
b) for one issue, the fault of the server (process or OS).  
   (5 seconds is WAY too short for a keep alive timeout.)

The reported issue (transfer interrupted) is a server bug, not a browser bug.
We could change the title of this bug, but then it would become a duplicate
of one of the bugs cited in comment 10. So I think the best thing to do 
with THIS bug is to mark it invalid, to encourage the server's keep-alive 
timeout timer to be increased to a reasonable value.  But I'll leave the 
final resolution of this bug up to Honza.
Tapio: can you verify that there is a timeout value close to 5 seconds set on your server, to confirm Nelson's conclusion? If you find this to be true, can you reset the value to the TCP standard?

I did another test. I set up the proxy to pass only 2 kbps up and down and delay for 6000 seconds. This means almost dead connectivity, and I would expect to reproduce the problem very well. But I discovered something else: we end up with a TLS timeout after 32 secs (not sure why not after 25, as defined). We do 9 other retries at the level of the http connection. Then we finish with a connection RESET (not interrupt).

So I suspect this is a dup of bug 498311, because I don't believe people would not try to reload the page again with F5 or the reload button. I'd also say this is a con of prolonging the timeout from 8 to 25 seconds.
6000 MILLIseconds :)
A connection reset *is* an interrupt.
The original report is about "interrupted". The problem when pressing the stop/reload button is also "interrupted". What I observe in the last test is the "reset" message. Those two errors are distinct in Firefox.

http://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#3553
http://mxr.mozilla.org/mozilla-central/source/docshell/base/nsDocShell.cpp#3647

and 

http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransport2.cpp#179

Interrupt means we got EOF from the server (a FIN/ACK packet). This would indicate that the server received something in the client handshake that it was not willing to continue with.
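
As a small aside, the distinction can be illustrated with plain POSIX sockets (this is not the Necko code linked above): an orderly FIN shows up as a zero-byte read, while an RST shows up as a read error.

#include <cerrno>
#include <cstdio>
#include <sys/socket.h>
#include <unistd.h>

// Roughly the conditions behind the "interrupted" and "reset" error pages.
static const char* ClassifyRead(int fd) {
  char buf[4096];
  ssize_t n = recv(fd, buf, sizeof(buf), 0);
  if (n > 0)
    return "data";                            // normal traffic
  if (n == 0)
    return "interrupted (EOF: peer sent FIN)";
  if (errno == ECONNRESET)
    return "reset (peer sent RST)";
  return "other error";                       // timeout, unreachable, etc.
}

int main() {
  int sv[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
  close(sv[1]);                               // peer goes away cleanly, like a FIN
  std::printf("%s\n", ClassifyRead(sv[0]));
  close(sv[0]);
}
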
I've checked my server config. The server is a standard Apache 2.2.11, which indeed has a default KeepAliveTimeout of 5 seconds. I've changed it to 60 s and tested again, with no effect. However, if I understand correctly, Nelson is talking about TCP keep-alive and not application-layer keep-alive anyway.

Those values are quite high by default; for example, the Apache TimeOut directive is at 300 s, which is the default. (But again, I don't think it has any effect on TCP keep-alive.) The same goes for my server OS (Linux):
# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

It might of course be possible that these are hardcoded into Apache, if it is possible for an application to control them in the first place. I'm no expert on these matters, just trying to provide the requested information.

I see that there are a lot of related bugs, and I do not deny that this could be a duplicate, say of bug 412833 or bug 412834, or invalid, but that wasn't clear to me before getting more information here.
Yes, I am referring to TCP keep-alives, not to http keep-alives.
The wireshark output (see the simplified log above) shows (lines 241-249)
that the server system sent 4 TCP keep-alives (not 9), 1-3 seconds apart 
(not 75), and that each of them was answered (acknowledged) immediately,
and it gave up after ~5 seconds, not 7200.  

An application can alter some of the TCP parameters, such as the maximum 
number of *unacknowledged* keep-alives sent before giving up.  But unless 
your kernel's TCP implementation is broken, that should not be the reason 
for this, because all the keep-alives were quickly acknowledged.  

I wonder if TCP retry counts (e.g. tcp_retries1, tcp_retries2) are the issue.
I think that TCP keep-alives should not count as retries, but maybe this 
kernel is counting them that way.
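
To illustrate the point that an application can alter some of these TCP parameters, here is a minimal Linux-specific sketch that overrides the system-wide keep-alive sysctls on a single socket; neither Apache nor Firefox is claimed to actually do this here.

#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int on = 1;
  int idle = 7200;   // seconds of idle time before the first probe (tcp_keepalive_time)
  int intvl = 75;    // seconds between probes (tcp_keepalive_intvl)
  int cnt = 9;       // unanswered probes before the connection is dropped (tcp_keepalive_probes)
  // The values above simply mirror the kernel defaults Tapio listed earlier.
  setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof(on));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
  std::printf("per-socket keep-alive parameters set on fd %d\n", fd);
  close(fd);
}
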
(In reply to comment #11)
> But this time, after the window closes (the buffers are full) the data is not
> taken from the OS's buffers for over 5 seconds (frames 241-249), during which
> time the server repeatedly asks "are you still there?" by sending keep-alives.
> 
> Finally, after 5 seconds of playing keep-alive, the server gives up and aborts
> the connection to the browser.  It is at this point that the browser 
> *correctly* reports that the transfer has been interrupted.  It *HAS* been 
> interrupted by the server. 

What I see in the log is an interruption from the side of the client and not of the server. Frame 249 is a RST sent by FF.

I'd say the interruption error is caused by users who stop the load and reload the page again. This is IMO a dup of bug 498311.
Honza, thanks for noticing that the RST came from the client OS.
Somehow, the whole S->C and C->S thing got completely (but consistently) 
messed up.  It is corrected in this version.

My analysis is the same as before, except for the sending of the RST.
Offhand, at the moment, I can't think of anything an application can do to 
a connected socket that will cause the OS to send a RST without first sending
a FIN, and I didn't see the client sending any FIN.  Did I miss that? 
Or did it not happen?  That seems bogus.  But it's an OS issue, not an app 
issue.
Attachment #385703 - Attachment is obsolete: true
Nelson, please remember that what I capture is the communication between the proxy and the server, *NOT* between Firefox and the proxy.
Yes, I know, but regardless of what the application is, an OS should not 
send a RST on a connection under these circumstances.  It should send a FIN.
But, in any case, the sending of the RST is not a Firefox bug.
This is not a matter of the OS stack, it is a matter of the app.

You can call shutdown() on the socket and then close() after you get a read error (indicating an RST was received, or a timeout/hw error) or read zero bytes on the socket (indicating a FIN was received). This leads to sending just a FIN from our side and is considered a graceful shutdown. An RST should not be sent when close() is called unless we have not yet received a FIN from the other side.

Or, you can call close() directly on the socket, which forces the TCP session to close by sending an RST; from that moment, reading and writing on the socket are impossible.
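
A minimal illustration of the two shutdown styles with plain POSIX sockets (not the NSPR layer Firefox actually uses); note that on most stacks a bare close() sends a FIN unless unread data remains or lingering is disabled, so the RST case is forced explicitly here with SO_LINGER.

#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

// Graceful shutdown: stop sending, drain the peer's remaining data, then close.
// Our side emits a FIN; no RST is involved.
static void GracefulClose(int fd) {
  shutdown(fd, SHUT_WR);                       // our FIN goes out here
  char buf[1024];
  while (recv(fd, buf, sizeof(buf), 0) > 0)    // read until the peer's FIN (0) or an error
    ;
  close(fd);
}

// Abortive close: a zero linger timeout makes close() discard queued data
// and emit an RST instead of a FIN.
static void AbortiveClose(int fd) {
  struct linger lg;
  std::memset(&lg, 0, sizeof(lg));
  lg.l_onoff = 1;
  lg.l_linger = 0;
  setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
  close(fd);
}

int main() {
  int sv[2];
  socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
  AbortiveClose(sv[1]);      // peer goes away first (on AF_UNIX this is only an API demo)...
  GracefulClose(sv[0]);      // ...so the drain loop sees EOF immediately
}
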
Tapio, can you try one of the builds for your platform at this URL to check whether the problem is gone? Thanks.

https://build.mozilla.org/tryserver-builds/honzab.moz@firemni.cz-bug473197/
Tapio: ping.
Sorry for the delay, Honza, I've been on summer vacation (actually still am). I just tested again on a Flash-OFDM connection with poor radio conditions, first with my old Minefield build (reproduced), then with your build - no problems. Many thanks if this really is fixed now :-)
Fixed in bug 412834 and bug 412833.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED