491541 - Long delay with HTTP Post to IIS7-hosted websites

Reporter

Description

•

16 years ago

User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)

I believe there is a bug in the way FireFox handles connections to IIS7 web servers. As outlined below I (and others) have experienced severe delays in page responses to IIS7-hosted websites. Viewing the same pages in other web browsers does not cause the delay, and changing the website host to IIS6 prevents the problem with FireFox. It appears to be an IIS7 (or perhaps Windows Server 2008) connection/communication problem with FireFox.

Reproducible: Sometimes

Steps to Reproduce:
1. Pull up the main www.winebid.com url
2. Click on the Search button to perform a blank search
3. Wait 60 seconds, then click the "Next" link to go to the next page of results.

Actual Results:
30 seconds to several minutes delay in page response.

Expected Results:
Quick response... a couple seconds max.

Please see the following posts for more information:
http://forums.iis.net/t/1155755.aspx
https://support.mozilla.com/tiki-view_forum_thread.php?locale=en-US&comments_parentId=282453&forumId=1

Note, it does not seem to occur if you wait for an extended period of time (i.e. several minutes) between initial post and secondary post. It also does not occur if you click through the site continuously. It only presents itself if you wait for approximately 60 secs and then attempt the secondary post.

Ian

Reporter

Updated

•

16 years ago

Version: unspecified → 3.0 Branch

Ria Klaassen (not reading all bugmail)

Comment 1

•

16 years ago

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11pre) Gecko/2009043005 GranParadiso/3.0.11pre I couldn't reproduce the problem.

Ria Klaassen (not reading all bugmail)

Comment 2

•

16 years ago

Okay, yes I see it now.

Status: UNCONFIRMED → NEW

Component: General → Networking: HTTP

Ever confirmed: true

Product: Firefox → Core

QA Contact: general → networking.http

Version: 3.0 Branch → 1.9.0 Branch

Ian

Reporter

Comment 3

•

16 years ago

FYI... I have removed all server-side code from the loop. I set up a test page that has no server-side code in it and I still get the problem. My guess is that it is related to the size of the HTTP POST variables, specifically the VIEWSTATE variable which is unusually large (although not by ASP.NET standards). See http://www.winebid.com/test.aspx I will test further to eliminate client-side code as well...

Ian

Reporter

Comment 4

•

16 years ago

Also, my coworker could not reproduce this with version 3.0.0 of FireFox, so this may only be a recently-introduced issue. I have yet to verify his claims but have reproduced the problem with all versions from 3.0.6 to 3.0.10 on Windows.

Ria Klaassen (not reading all bugmail)

Updated

•

16 years ago

Version: 1.9.0 Branch → Trunk

Ria Klaassen (not reading all bugmail)

Comment 5

•

16 years ago

This is not a (recent) regression; I see the problem in all Firefox versions I tried (I went back until Firefox 1.0.7).

Ria Klaassen (not reading all bugmail)

Comment 6

•

16 years ago

No such problem in Opera until now.

Ria Klaassen (not reading all bugmail)

Updated

•

16 years ago

Flags: blocking1.9.2?

Flags: blocking1.9.1?

Michal Novotny [:michal]

Assignee

Comment 7

•

16 years ago

I've done some tests with the server and it is definitely a server bug. When the connection is idle for more than about 130 seconds, it becomes completely dead: the server does not respond to any packet. The problem is that Firefox keeps idle connections for 300 seconds by default, so when you try to reuse a connection that was idle for more than 130 seconds, Firefox sends the request but never receives any response. The connection will time out after some time (19 minutes in my case on Linux). Opera closes idle connections after 120 seconds, which is why the bug can't be reproduced with it. I've also tried Konqueror, but I don't know the logic of its timeouts. It sometimes closes an idle connection after 60 seconds and sometimes "never" (I gave up waiting for a FIN packet after 10 minutes...). Anyway, when it doesn't close the connection, the bug can be reproduced too. We can set mIdleTimeout in nsHttpConnection to 120 or less when we detect the header "Server: Microsoft-IIS/7.0", but I doubt that this is an IIS bug. It seems to me more like a bug in the Windows TCP/IP stack.
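The interaction described in this comment can be modelled in a few lines. This is only a sketch under stated assumptions, not Firefox's actual code: `IdleConnection`, `can_reuse`, `is_silently_dead` and `hangs` are hypothetical names, the 300-second default and the roughly 130-second server cutoff are the figures quoted in the comment, and the server is modelled as simply going silent after the cutoff.

```python
from dataclasses import dataclass

CLIENT_IDLE_TIMEOUT = 300  # seconds; the Firefox default quoted in the comment
SERVER_DEAD_AFTER = 130    # seconds; approximate cutoff observed in the comment


@dataclass
class IdleConnection:
    """Hypothetical stand-in for a pooled keep-alive connection."""
    idle_seconds: float


def can_reuse(conn: IdleConnection, client_timeout: float) -> bool:
    """The client reuses a connection that has been idle for less than its own timeout."""
    return conn.idle_seconds < client_timeout


def is_silently_dead(conn: IdleConnection, server_cutoff: float = SERVER_DEAD_AFTER) -> bool:
    """Model of the observed server side: no packets at all once the cutoff has passed."""
    return conn.idle_seconds > server_cutoff


def hangs(conn: IdleConnection, client_timeout: float) -> bool:
    """A request hangs when the client reuses a connection the server has abandoned."""
    return can_reuse(conn, client_timeout) and is_silently_dead(conn)
```

With a 300-second client timeout, a connection idle for 150 seconds falls in the problem window; with a client timeout of 120 seconds or less, the window in this model disappears because the client never reuses a connection that old.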

Ian

Reporter

Comment 8

•

16 years ago

If the server is switched to IIS6 the problem goes away, so it seems to be something that is server-specific. It may be a problem with the Server 2008 TCP/IP stack as Server 2008 & IIS7 are inseparable. Either way, if there is a change that can be made to eliminate the issue from FireFox that would be worthwhile, since it is the only browser that appears to be affected by this problem.

Jason Duell

Comment 9

•

16 years ago

I can't think of any reason not to drop the timeout to 120 seconds. Seems likely there's a reason why Opera set it just below IIS 130 second roll-over-and-play-dead behavior. Can anyone else think of a reason why it'd be bad to lower the limit? CC-ing the usual suspects.

Boris Zbarsky [:bzbarsky]

Comment 10

•

16 years ago

Well, lower limit just means more connection setup costs, generally speaking. The question is how much of an effect that would be. We're talking about cases when a new load to the same site starts within 5 mins of a previous load but not within 2 mins, right? Seems like it might be rare and the cost of such non-reuse would be low, since we're talking one connection setup every two minutes. To be sure, we could add some instrumentation (say prlog) to see how common such connection reuse actually is and then do some logs while browsing around, using gmail and whatnot, etc.
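The instrumentation idea in this comment could be approximated offline by bucketing logged idle ages at the moment a new request arrives. The sketch below assumes such a log already exists as a list of idle durations in seconds; the function name, bucket labels and input format are all hypothetical and are not part of any existing Mozilla logging.

```python
from collections import Counter
from typing import Iterable


def bucket_reuse_ages(idle_ages: Iterable[float],
                      short_cutoff: float = 120,
                      long_cutoff: float = 300) -> Counter:
    """Count how many reuse attempts each timeout setting would have served.

    idle_ages: seconds each connection had been idle when the next request
    for the same host arrived (assumed to come from some logging facility).
    """
    counts = Counter()
    for age in idle_ages:
        if age < short_cutoff:
            counts["reused_either_way"] += 1          # unaffected by the change
        elif age < long_cutoff:
            counts["lost_with_shorter_timeout"] += 1  # the "2 to 5 minute" case
        else:
            counts["expired_either_way"] += 1         # too old under both settings
    return counts
```

The share of attempts landing in `lost_with_shorter_timeout` is the extra connection-setup cost being discussed; if it is small, lowering the limit costs little.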

Patrick McManus [:mcmanus]

Comment 11

•

16 years ago

when the server is not responding, does it send a tcp ack in response to the FF request? (I know it doesn't send any data bytes) and if so, does it ack the request or just reack the old sequence #? this ought to tell you if the issue is at the OS or application level. fwiw, I've worked on a number of projects which all had the client side timeout set to 120. Seems to be a common value.. amortizing the setup costs over a number that big is no big deal.

Michal Novotny [:michal]

Assignee

Comment 12

•

16 years ago

(In reply to comment #11)
> when the server is not responding, does it send a tcp ack in response to the FF
> request? (I know it doesn't send any data bytes)

I was talking about the packet layer. I.e. it doesn't send any ACK, FIN, ... Simply no packet.

Ian

Reporter

Comment 13

•

16 years ago

IIS's default HTTP-keepalive timeout value is 120 seconds. After that amount of inactivity I think the server closes the connection. When that happens is the server supposed to send some sort of notice to the client that the connection is no longer usable? I.E. how do server-side and client-side timeouts work together? How does FF know when the server is closing a connection early (before FF would consider it unusable)?

Patrick McManus [:mcmanus]

Comment 14

•

16 years ago

server should send a TCP fin when timing out, which FF would read and then remove the idle connection from its pool. Comment 12 indicates that isn't happening. This sounds more like an interaction with a FW or a content-switch of some kind than a straight IIS version level thing.. maybe the different versions have different timeouts that tickle the FW differently.. maybe I'm wrong.
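How a client notices a server-side close on an idle connection can be illustrated with plain sockets. This is a generic sketch, not Firefox's implementation; `peer_has_closed` is a hypothetical helper that polls the socket without blocking and peeks at it.

```python
import select
import socket


def peer_has_closed(sock: socket.socket) -> bool:
    """Return True if the peer has closed (FIN) or reset an idle connection."""
    readable, _, _ = select.select([sock], [], [], 0)  # zero timeout: poll only
    if not readable:
        # Nothing pending. No FIN has arrived, so the connection looks alive.
        return False
    try:
        data = sock.recv(1, socket.MSG_PEEK)  # peek so no bytes are consumed
    except ConnectionError:
        return True  # the connection was reset
    return data == b""  # an empty read means an orderly shutdown by the peer
```

If the server sends no packet at all, as reported in comment 12, a check like this keeps returning False, so the client has no way to know the connection is unusable until a request on it times out.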

Ian

Reporter

Comment 15

•

16 years ago

From this thread (which sounds very much like the issue we are dealing with) a poster mentions that IIS does not terminate connections with FIN, but rather sends an RST instead: http://www.eggheadcafe.com/conversation.aspx?messageid=34041293&threadid=33101781 The linked paper mentions IIS6 as the tested environment, so perhaps MS changed this behavior in IIS7?

Johnny Stenback (:jst)

Comment 16

•

16 years ago

Not a recent regression, won't hold 3.5 for this.

Flags: blocking1.9.1? → blocking1.9.1-

Ian

Reporter

Comment 17

•

16 years ago

The IIS support people seem to think this is a problem between Windows Server and the network card (http://forums.iis.net/t/1155755.aspx ). Doesn't seem to be sending the RST to close the connection. Either way I think the best solution from a FF perspective would be to reduce the keepalive-timeout from 5 minutes to at most 2. There is no real advantage gained in having a 5 minute window as no current server software supports (by default) such a large timeout. IIS defaults to a 2-minute timeout and Apache defaults to only 15 seconds. Since the potential benefit is so small, and the potential side effects so large (the issue, when encountered, can 'hang' FF for minutes at a time) it makes sense to change this setting. Aside from that I think this issue can be closed from a FF perspective. Thanks for the help!

Jason Duell

Comment 18

•

16 years ago

Yes, someone give me a patch to drop the keepalive timeout to 120 seconds (should be a one-liner), so I can review it and we can move this along.

Boris Zbarsky [:bzbarsky]

Comment 19

•

16 years ago

Or you can patch and I can review.

Patrick McManus [:mcmanus]

Comment 20

•

16 years ago

should be 115 imo. no need to taunt the race condition gods by making it the same on each end.

Ian

Reporter

Comment 21

•

16 years ago

For what it's worth, MSIE defaults to 60 secs.

Michal Novotny [:michal]

Assignee

Comment 22

•

16 years ago

Attached patch: patch (obsolete)

Attachment #376965 - Flags: review?(jduell.mcbugs)

Jason Duell

Comment 23

•

16 years ago

Um, wait a sec. Reading back over the actual bug report, it seems that Ian was seeing this behavior after waiting "approximately 60 seconds", so we may be barking up the wrong tree here, as that's less than the 130 second cutoff IIS allegedly uses. But we can find out easily enough. Ian (or Ria): could you go into your about:config and set "network.http.keep-alive.timeout" to 120 seconds, and see if it changes anything? If it doesn't fix things, try something smaller (say 30 seconds). If the smaller time works but the larger one doesn't, and you've got time to do a quick binary search to see where the cutoff time for the error is (at least roughly) that would be great.
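The binary search suggested here can be written down as follows. This is a sketch only: in practice each probe is a manual step (set "network.http.keep-alive.timeout", load a page, wait, click), so `works` is a hypothetical callback standing in for that manual test, and the search assumes the outcome is monotonic in the timeout value.

```python
from typing import Callable


def find_cutoff(works: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest timeout (seconds) for which works(timeout) is True.

    Assumes works(low) is True, works(high) is False, and that every value
    below the cutoff works while every value above it fails.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if works(mid):
            low = mid   # mid still works: the cutoff is at or above mid
        else:
            high = mid  # mid fails: the cutoff is below mid
    return low
```

Starting from the values mentioned in the comment (30 seconds known good, 300 seconds known bad), this needs at most nine probes to find the cutoff to the second, though a rough answer after three or four probes would already be informative.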

Ian

Reporter

Comment 24

•

16 years ago

To clarify, the reason it was happening after 60 seconds was due to a load balancer sitting in between IIS and FireFox that itself had a 60-sec timeout. Once we took that out of the loop we started experiencing the issue after 120 seconds, as mentioned above.

Ian

Reporter

Comment 25

•

16 years ago

Note that this is also another argument for having a shorter client-side timeout, as other timeouts can also come into play aside from just the web-server settings.

Ria Klaassen (not reading all bugmail)

Comment 26

•

16 years ago

(In reply to comment #23)
> But we can find out easily enough. Ian (or Ria): could you go into your
> about:config and set "network.http.keep-alive.timeout" to 120 seconds, and see
> if it changes anything?

Yes, this seemed to help. I tested this twice (waited 2.5 minutes) and the site loaded instantly.

Ria Klaassen (not reading all bugmail)

Comment 27

•

16 years ago

http://www.winebid.com/buy_wine/search_results.aspx?SearchString=&PBCT=3 also works with the default value now, so this is a bit confusing :)

Honza Bambas (:mayhemer)

Comment 28

•

16 years ago

(In reply to comment #27)
> http://www.winebid.com/buy_wine/search_results.aspx?SearchString=&PBCT=3 also
> works with the default value now, so this is a bit confusing :)

Did you restart FF? Did you drop the http cache? ...

BTW: I've seen some servers sending Connection: keep-alive and Keep-alive: timeout=<some number of seconds>. I wasn't able to find an exact spec for this header but do we consider it? Does IIS7 send it?
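Reading the timeout hint out of such a header is straightforward. The sketch below handles only the `timeout=<seconds>` form mentioned in the comment; treating the value as a comma-separated parameter list is an assumption, and `keep_alive_timeout` is a hypothetical helper, not something Firefox is known to implement.

```python
from typing import Optional


def keep_alive_timeout(header_value: str) -> Optional[int]:
    """Extract the timeout (seconds) from a Keep-Alive header value, if present.

    Accepts values such as "timeout=5, max=100". Returns None when there is
    no timeout parameter or when its value is not a plain decimal number.
    """
    for param in header_value.split(","):
        name, sep, value = param.strip().partition("=")
        if sep and name.strip().lower() == "timeout":
            value = value.strip()
            if value.isascii() and value.isdigit():
                return int(value)
    return None
```

A client that honoured this hint could cap its own idle timeout at the advertised value, but it would still need a default for servers that send nothing.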

Honza Bambas (:mayhemer)

Comment 29

•

16 years ago

Maybe this thread would help: http://www.hpl.hp.com/personal/ange/archives/archives-95/http-wg-archive/1661.html

Ian

Reporter

Comment 30

•

16 years ago

I have implemented the workaround of setting IIS to 300 seconds on the server-side, so you should not see the issue any more on our site.

Jason Duell

Comment 31

•

16 years ago

Comment on attachment 376965: patch

Patch approved with 115 second timeout value. Sounds like 120 might work, but might as well play it safe.

Attachment #376965 - Flags: superreview?(bzbarsky)

Attachment #376965 - Flags: review?(jduell.mcbugs)

Attachment #376965 - Flags: review+

Jason Duell

Comment 32

•

16 years ago

JST: this is a one-line fix. Is it worth making a blocker (assuming that's the path to getting it into 1.9.0 ASAP)?

Flags: blocking1.9.1- → blocking1.9.1?

Boris Zbarsky [:bzbarsky]

Updated

•

16 years ago

Attachment #376965 - Flags: superreview?(bzbarsky) → superreview+

Boris Zbarsky [:bzbarsky]

Comment 33

•

16 years ago

Comment on attachment 376965: patch

Please add a nice comment explaining why it's such a non-round number.

Boris Zbarsky [:bzbarsky]

Comment 34

•

16 years ago

In particular why it's not 120.

Patrick McManus [:mcmanus]

Comment 35

•

16 years ago

the fast track here makes me a little queasy. It sounds like the bug is "IIS7 persistent connection timeout handling is broken" and I'm a little skeptical the real issue is that broad.. that would be a major bug if it's that broad. is there a KB article on it if so? there is nothing wrong with the smaller timeout, but maybe we should be sure we've got the issue fully understood before declaring victory?

Jason Duell

Comment 36

•

16 years ago

> there is nothing wrong with the smaller timeout True. The current timeout value appears to be much larger than any servers out there will wait for, and it's higher than IE or Opera. So I think dropping down to anywhere between IIS/Opera's value (120 secs) or IE's (60) is fine. Patrick liked the idea of making the number slightly smaller than IIS's, hence 115. > but maybe we should be sure we've got the issue fully understood > before declaring victory? Well, yes, that would be nice. It's not really clear what's going on here. Comment #15 indicates that IIS may be using a RST rather than a FIN, but that would still shut down the socket on our end: that doesn't seem to be happening. Actually, the IIS server doesn't seem to respond at all (assuming comment #12 is true). I'd be fine with us poking around at this some more, but given that any timeout value between 60-115 appears to fix this bug, without any significant cost (and that other browsers use similar values, and that no servers honor a keepalive as long as our current 300 secs) I don't see why we shouldn't just check in the patch. Thoughts?

Flags: blocking1.9.1?

Jason Duell

Comment 37

•

16 years ago

Here's some potentially relevant stuff from the IIS bug forum: http://forums.iis.net/t/1155755.aspx?PageIndex=4

Here is Microsoft's response to the ticket I opened.

"Firefox is waiting for more than the standard 2 minutes before trying to re-use the connection. Firefox never sends "FIN" command(FIN- Finish is used during a graceful session close to show that the sender has no more data to send) to the server, so it cannot re-open the connection. IIS times the request out as expected, due to the default 2 minute ConnectionTimeout setting of HTTP.sys.

The IIS server, however should not be waiting for 9 seconds to send a reset. So we doubt that there could be some issues with the NIC or NIC drivers which initiates this waiting.

So, the part of the problem here is Firefox trying to reuse an old connection. The other problem seems to be with TCP on the server not issuing a timely RST(RST- Reset is an instantaneous abort in both directions (abnormal session disconnection)).

Recommendations:
- Let’s disable TCP chimney and/or update NIC drivers on server.
- Lets run the following command to disable the TCPChimney,
- Netsh int ip set chimney DISABLED

Unfortunately that command didn't work for me... I kept getting "command not found". So they had me add the following registry entries: ..."

Jason Duell

Comment 38

•

16 years ago

> Please add a nice comment explaining why it's such a non-round number. "Set slightly under IIS's keepalive timeout (120 secs) to avoid potential race."

Michal Novotny [:michal]

Assignee

Comment 39

•

16 years ago

Attached patch: patch v2 - added comment (obsolete)

Michal Novotny [:michal]

Assignee

Comment 40

•

16 years ago

(In reply to comment #38)
> > Please add a nice comment explaining why it's such a non-round number.
>
> "Set slightly under IIS's keepalive timeout (120 secs) to avoid potential
> race."

I've just attached a patch with a more verbose comment. Feel free to change it as you wish.

Jason Duell

Updated

•

16 years ago

Attachment #377033 - Flags: superreview?(bzbarsky)

Attachment #377033 - Flags: review+

Boris Zbarsky [:bzbarsky]

Comment 41

•

16 years ago

Comment on attachment 377033: patch v2 - added comment

"don't close the connection" and looks good.

Attachment #377033 - Flags: superreview?(bzbarsky) → superreview+

Michal Novotny [:michal]

Assignee

Comment 42

•

16 years ago

Attached patch: patch v3

Attachment #376965 - Attachment is obsolete: true

Attachment #377033 - Attachment is obsolete: true

Michal Novotny [:michal]

Assignee

Updated

•

16 years ago

Keywords: checkin-needed

Patrick McManus [:mcmanus]

Comment 43

•

16 years ago

(In reply to comment #37)
> Here is Microsoft's response to the ticket I opened.

Wow. That's a lotta questionable material stuffed into one response from MS. I think you should keep trying them until they identify a specific bug number on their side.

> "Firefox is waiting for more than the standard 2 minutes before trying to
> re-use the connection.

There is no standard 2 minutes. That's crap. Indeed, the standard says explicitly that there is no particular value - from 8.1.4 of rfc 2616 - "The use of persistent connections places no requirements on the length (or existence) of this time-out for either the client or the server." Firefox is doing nothing wrong.

> Firefox never sends "FIN" command(FIN- Finish is used during a graceful session
> close to show that the sender has no more data to send) to the server, so it
> cannot re-open the connection.

The fallacy of this is obvious. The presumption is that firefox has no more data to send - but that's not true! We have another request, which we're sending down a perfectly valid idle persistent connection (that the server has not closed).

> IIS times the request out as expected, due to the default 2 minute
> ConnectionTimeout setting of HTTP.sys.

no. If IIS timed out the connection (not request, there is no active request) it would generate a FIN towards Firefox. (well, perhaps IIS did time it out, but windows or the TOE hardware is broken - that's all the "server" as far as firefox can be concerned)

A RST, btw, is not an acceptable replacement for a fin. (not that there is even a rst at the 2 minute mark) RSTs are not subject to reliable delivery and they would abort the reliable retransmission of any portion of the prior transaction's response that had not yet reached the client. And it could also seriously call into question the length of a message body that was being determined by connection close rather than a length header or chunking frame if it were used in those situations.

Again - RFC 2616 "When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection." The wording graceful is specifically about fin vs rst. (in the spec world it is an attempt to keep the http spec open to transports other than tcp..)

> So, the part of the problem here is Firefox trying to reuse an old connection.

That's a feature, not a problem :) It saves the work of connection establishment for everyone. If the server would rather firefox did not do this they only need to close that old connection, which they do not do.

> Recommendations:
>
> - Let’s disable TCP chimney and/or update NIC drivers on server.
> - Lets run the following command to disable the TCPChimney,
> - Netsh int ip set chimney DISABLED

Ah - now the meat of the matter. It sounds like a bug with chimney. Chimney is the MS layer that connects TCP TOE functionality on ethernet cards with the OS. I have worked with TOE hardware - it can be really confusing and the hardware can (and does) have bugs too. The hw bugs are generally fixable in firmware.

Looks like a big fat hardware dependent OS bug, not really an IIS-7 one. Indeed, if I try an infinite pconn to www.iis.net I can reproduce the issue as described. But if I try such a beast to www.microsoft.com, which identifies itself as running IIS/7.5, the server does a graceful close after 4 minutes and 17 seconds. Completely in spec. In a different trial run, I checked to make sure the connection was usable after 3 minutes of idle time. It was.

09:34:50.217126 IP 192.168.16.214.54408 > 65.55.21.250.80: S 547311562:547311562(0) win 5840 <mss 1460,sackOK,timestamp 306027870 0,nop,wscale 7>
09:34:50.322780 IP 65.55.21.250.80 > 192.168.16.214.54408: S 498729028:498729028(0) ack 547311563 win 8190 <mss 1460>
09:34:50.322806 IP 192.168.16.214.54408 > 65.55.21.250.80: . ack 1 win 5840
09:34:51.617992 IP 192.168.16.214.54408 > 65.55.21.250.80: P 1:18(17) ack 1 win 5840
09:34:51.723172 IP 65.55.21.250.80 > 192.168.16.214.54408: . ack 18 win 8190
09:34:51.723200 IP 192.168.16.214.54408 > 65.55.21.250.80: P 18:45(27) ack 1 win 5840
09:34:51.828137 IP 65.55.21.250.80 > 192.168.16.214.54408: . ack 45 win 62867
09:34:51.828693 IP 65.55.21.250.80 > 192.168.16.214.54408: P 1:391(390) ack 45 win 62867
09:34:51.828703 IP 192.168.16.214.54408 > 65.55.21.250.80: . ack 391 win 6432
09:39:08.276051 IP 65.55.21.250.80 > 192.168.16.214.54408: F 391:391(0) ack 45 win 8190
09:39:08.276142 IP 192.168.16.214.54408 > 65.55.21.250.80: F 45:45(0) ack 392 win 6432
09:39:08.385123 IP 65.55.21.250.80 > 192.168.16.214.54408: . ack 46 win 8190

I did a different run and got the same graceful close after 3 minutes and 23 seconds of idle time. Also perfectly legitimate - there is no rule that says the timeout period has to be based on a constant. This makes some sense. Closing the connections is just meant to preserve resources - if you don't have any resource pressure why close it at all? I wrote a tool once that only closed idle connections when their fixed size pool was full (and then it did it fifo style).

But in some ways this gets to the silliness of us changing the timeout in firefox. The 120 number is configurable, there are often proxy steps introduced as part of load balancing, and so on.. So 115 might not mean you can sleep well.

But please do note that the session from microsoft.com is closed with an F, not an R. Any talk that RSTs are a normal way to close the session (or that IIS normally does that) is just someone running their mouth/keyboard - RSTs are by definition error conditions, and the closing of a persistent connection is not an error.

.....

I just needed to understand the issue a little better - please don't take this as an argument against changing the timeout value. It will help in some cases, and the harm is relatively minor. But treating the issue as "solved" is probably not right either.
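The idle period in the packet trace in this comment can be checked from its timestamps. The snippet below is a small sketch that subtracts the time of the client's last ACK (09:34:51.828703) from the time of the server's FIN (09:39:08.276051); `idle_seconds` is a hypothetical helper, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"  # tcpdump's default time-of-day stamp


def idle_seconds(last_activity: str, fin: str) -> float:
    """Seconds between the last packet of the exchange and the server's FIN."""
    start = datetime.strptime(last_activity, TIME_FORMAT)
    end = datetime.strptime(fin, TIME_FORMAT)
    return (end - start).total_seconds()


gap = idle_seconds("09:34:51.828703", "09:39:08.276051")
print(f"idle for {gap:.3f} s before the graceful close")
```

The result is a little over 256 seconds, that is, slightly more than four and a quarter minutes, which is in line with the graceful close reported for www.microsoft.com and well past the two-minute mark.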

Ian

Reporter

Comment 44

•

16 years ago

"That's a feature, not a problem :) It saves the work of connection establishment for everyone." I can understand your point. Looking at it from that perspective then the question is what is the benefit of this feature if it takes several minutes for FF to sort out that the connection is actually dead and not reusable? It may be beneficial to server-operators and network administrators, who save the aggregate connection-creation overhead... but to the end user it is a major headache. Perhaps, in addition to the reduction in timeout, there is further investigation that can be done to figure out a faster way to recover from "dead" connections in general, regardless of the cause (IIS ****, hardware failures, proxy issues...)? That seems like a much larger project though.

Dão Gottwald [:dao]

Updated

•

16 years ago

Assignee: nobody → michal

Jason Duell

Comment 45

•

16 years ago

Comment on attachment 377136: patch v3

Looks fine to go.

Attachment #377136 - Flags: superreview?(bzbarsky)

Attachment #377136 - Flags: review+

Boris Zbarsky [:bzbarsky]

Comment 46

•

16 years ago

Comment on attachment 377136: patch v3

This really was all set as far as I was concerned... ;)

Attachment #377136 - Flags: superreview?(bzbarsky) → superreview+

Boris Zbarsky [:bzbarsky]

Comment 47

•

16 years ago

I think we should take this on branch for a dot-release.

Flags: wanted1.9.1.x?

Boris Zbarsky [:bzbarsky]

Comment 48

•

16 years ago

Pushed http://hg.mozilla.org/mozilla-central/rev/ad10f4f7b8f7

Status: NEW → RESOLVED

Closed: 16 years ago

Flags: in-testsuite?

Resolution: --- → FIXED

Boris Zbarsky [:bzbarsky]

Updated

•

16 years ago

Keywords: checkin-needed

Benjamin Smedberg

Updated

•

16 years ago

Flags: blocking1.9.2?

patch, 16 years ago, Michal Novotny [:michal], 1.28 KB, patch, jduell.mcbugs : review+, bzbarsky : superreview+
patch v2 - added comment, 16 years ago, Michal Novotny [:michal], 1.62 KB, patch, jduell.mcbugs : review+, bzbarsky : superreview+
patch v3, 16 years ago, Michal Novotny [:michal], 1.62 KB, patch, jduell.mcbugs : review+, bzbarsky : superreview+