Closed Bug 491541 Opened 13 years ago Closed 13 years ago

Long delay with HTTP Post to IIS7-hosted websites

Categories

(Core :: Networking: HTTP, defect)

x86
Windows Vista
defect
Not set
major

RESOLVED FIXED

People

(Reporter: iana, Assigned: michal)

Details

Attachments

(1 file, 2 obsolete files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)

I believe there is a bug in the way FireFox handles connections to IIS7 web servers. As outlined below I (and others) have experienced severe delays in page responses to IIS7-hosted websites. Viewing the same pages in other web browsers does not cause the delay, and changing the website host to IIS6 prevents the problem with FireFox. It appears to be an IIS7 (or perhaps Windows Server 2008) connection/communication problem with FireFox.

Reproducible: Sometimes

Steps to Reproduce:
1. Pull up the main www.winebid.com url
2. Click on the Search button to perform a blank search
3. Wait 60 seconds, then click the "Next" link to go to the next page of results.
Actual Results:  
30 seconds to several minutes delay in page response.

Expected Results:  
Quick response... a couple seconds max. 

Please see the following posts for more information:
http://forums.iis.net/t/1155755.aspx
https://support.mozilla.com/tiki-view_forum_thread.php?locale=en-US&comments_parentId=282453&forumId=1

Note, it does not seem to occur if you wait for an extended period of time (i.e. several minutes) between initial post and secondary post. It also does not occur if you click through the site continuously. It only presents itself if you wait for approximately 60 secs and then attempt the secondary post.
Version: unspecified → 3.0 Branch
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11pre) Gecko/2009043005 GranParadiso/3.0.11pre

I couldn't reproduce the problem.
Okay, yes I see it now.
Status: UNCONFIRMED → NEW
Component: General → Networking: HTTP
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → networking.http
Version: 3.0 Branch → 1.9.0 Branch
FYI... I have removed all server-side code from the loop. I set up a test page
that has no server-side code in it and I still get the problem. My guess is
that it is related to the size of the HTTP POST variables, specifically the
VIEWSTATE variable, which is unusually large (although not by ASP.NET
standards).
See http://www.winebid.com/test.aspx

I will test further to eliminate client-side code as well...
Also, my coworker could not reproduce this with version 3.0.0 of FireFox, so this may only be a recently-introduced issue. I have yet to verify his claims but have reproduced the problem with all versions from 3.0.6 to 3.0.10 on Windows.
Version: 1.9.0 Branch → Trunk
This is not a (recent) regression; I see the problem in all Firefox versions I tried (I went back until Firefox 1.0.7).
No such problem in Opera until now.
Flags: blocking1.9.2?
Flags: blocking1.9.1?
I've done some tests with the server and it is definitely a server bug. When the connection is idle for more than about 130 seconds, it becomes completely dead: there is no response from the server to any packet. The problem is that Firefox keeps idle connections for 300 seconds by default. So when you try to reuse a connection that was idle for more than 130 seconds, Firefox sends the request but never receives any response. The connection will time out after some time (19 minutes in my case on Linux).

Opera closes idle connections after 120 seconds, so that's why the bug can't be reproduced with it. I've also tried Konqueror but I don't know the logic of its timeouts. It sometimes closes an idle connection after 60 seconds and sometimes "never" (I gave up waiting for the FIN packet after 10 minutes...). Anyway, when it doesn't close the connection the bug can be reproduced too.

We can set mIdleTimeout in nsHttpConnection to 120 or less when we detect the header "Server: Microsoft-IIS/7.0", but I doubt that this is an IIS bug. It seems to me more like a bug in the Windows TCP/IP stack.
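To make the timing described above concrete, here is a minimal Python sketch of the reuse decision. It is an illustrative model only, not the actual nsHttpConnection logic; the names `can_reuse`, `will_hang`, `CLIENT_IDLE_TIMEOUT` and `SERVER_DEAD_AFTER` are hypothetical, and the 300 and 130 second values are simply the ones reported in this comment.

```python
# Illustrative model only; not the actual nsHttpConnection code.
CLIENT_IDLE_TIMEOUT = 300  # seconds Firefox keeps an idle connection by default
SERVER_DEAD_AFTER = 130    # approx. idle seconds after which the server goes silent


def can_reuse(idle_seconds, client_timeout=CLIENT_IDLE_TIMEOUT):
    """Client-side view: reuse the connection while it is younger than the timeout."""
    return idle_seconds < client_timeout


def will_hang(idle_seconds, client_timeout=CLIENT_IDLE_TIMEOUT,
              server_dead_after=SERVER_DEAD_AFTER):
    """True when the client reuses a connection the server has silently dropped."""
    return can_reuse(idle_seconds, client_timeout) and idle_seconds > server_dead_after


if __name__ == "__main__":
    # Compare the default timeout with a lowered one for a few idle periods.
    for idle in (60, 125, 150, 299, 301):
        print(idle, will_hang(idle), will_hang(idle, client_timeout=120))
```

With the default of 300 seconds there is a window (roughly 130 to 300 seconds of idle time) in which a dead connection gets reused; with a client timeout at or below the server's cutoff that window is empty.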
If the server is switched to IIS6 the problem goes away, so it seems to be something that is server-specific. It may be a problem with the Server 2008 TCP/IP stack as Server 2008 & IIS7 are inseparable. Either way, if there is a change that can be made to eliminate the issue from FireFox that would be worthwhile, since it is the only browser that appears to be affected by this problem.
I can't think of any reason not to drop the timeout to 120 seconds.  Seems likely there's a reason why Opera set it just below IIS's 130-second roll-over-and-play-dead behavior.

Can anyone else think of a reason why it'd be bad to lower the limit?  CC-ing the usual suspects.
Well, lower limit just means more connection setup costs, generally speaking.  The question is how much of an effect that would be.  We're talking about cases when a new load to the same site starts within 5 mins of a previous load but not within 2 mins, right?  Seems like it might be rare and the cost of such non-reuse would be low, since we're talking one connection setup every two minutes.

To be sure, we could add some instrumentation (say prlog) to see how common such connection reuse actually is and then do some logs while browsing around, using gmail and whatnot, etc.
when the server is not responding, does it send a tcp ack in response to the FF request? (I know it doesn't send any data bytes)

and if so, does it ack the request or just reack the old sequence #?

this ought to tell you if the issue is at the OS or application level.

fwiw, I've worked on a number of projects which all had the client side timeout set to 120. Seems to be a common value.. amortizing the setup costs over a number that big is no big deal.
(In reply to comment #11)
> when the server is not responding, does it send a tcp ack in response to the FF
> request? (I know it doesn't send any data bytes)

I was talking about packet layer. I.e. it doesn't send any ACK, FIN, ... Simply no packet.
IIS's default HTTP-keepalive timeout value is 120 seconds. After that amount of inactivity I think the server closes the connection. When that happens is the server supposed to send some sort of notice to the client that the connection is no longer usable? I.E. how do server-side and client-side timeouts work together? How does FF know when the server is closing a connection early (before FF would consider it unusable)?
server should send a TCP fin when timing out, which FF would read and then remove the idle connection from its pool. Comment 12 indicates that isn't happening.

This sounds more like an interaction with a FW or a content-switch of some kind than a straight IIS version level thing.. maybe the different versions have different timeouts that tickle the FW differently.. maybe I'm wrong.
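As a rough illustration of the FIN handling described here, the following Python sketch checks whether the peer has closed an idle socket before it is reused. It is not Firefox code; the helper name `peer_has_closed` is hypothetical, and it only uses standard `socket` calls.

```python
import socket


def peer_has_closed(sock):
    """Return True if the peer has closed (FIN) or reset an idle socket."""
    previous_timeout = sock.gettimeout()
    sock.setblocking(False)
    try:
        # MSG_PEEK leaves any unexpected data in the receive buffer.
        # An empty read means the peer sent FIN (orderly close).
        return sock.recv(1, socket.MSG_PEEK) == b""
    except (BlockingIOError, InterruptedError):
        return False  # nothing to read: still open as far as we can tell
    except ConnectionError:
        return True   # RST or similar error: connection is unusable
    finally:
        sock.settimeout(previous_timeout)
```

Note that in the failure reported in this bug the server sends neither a FIN nor an RST, so a check like this would still report the connection as open, which is why the client cannot tell that it is dead.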
From this thread (which sounds very much like the issue we are dealing with) a poster mentions that IIS does not terminate connections with FIN, but rather sends an RST instead:
http://www.eggheadcafe.com/conversation.aspx?messageid=34041293&threadid=33101781

The linked paper mentions IIS6 as the tested environment, so perhaps MS changed this behavior in IIS7?
Not a recent regression, won't hold 3.5 for this.
Flags: blocking1.9.1? → blocking1.9.1-
The IIS support people seem to think this is a problem between Windows Server and the network card (http://forums.iis.net/t/1155755.aspx). It doesn't seem to be sending the RST to close the connection. Either way, I think the best solution from a FF perspective would be to reduce the keepalive timeout from 5 minutes to at most 2. There is no real advantage gained in having a 5-minute window, as no current server software supports (by default) such a large timeout. IIS defaults to a 2-minute timeout and Apache defaults to only 15 seconds. Since the potential benefit is so small, and the potential side effects so large (the issue, when encountered, can 'hang' FF for minutes at a time), it makes sense to change this setting. Aside from that I think this issue can be closed from a FF perspective. Thanks for the help!
Yes, someone give me a patch to drop the keepalive timeout to 120 seconds (should be a one-liner), so I can review it and we can move this along.
Or you can patch and I can review.
should be 115 imo.

no need to taunt the race condition gods by making it the same on each end.
For what it's worth, MSIE defaults to 60 secs.
Attached patch patch (obsolete)
Attachment #376965 - Flags: review?(jduell.mcbugs)
Um, wait a sec.

Reading back over the actual bug report, it seems that Ian was seeing this behavior after waiting "approximately 60 seconds", so we may be barking up the wrong tree here, as that's less than the 130 second cutoff IIS allegedly uses.

But we can find out easily enough. Ian (or Ria):  could you go into your about:config and set "network.http.keep-alive.timeout" to 120 seconds, and see if it changes anything?  If it doesn't fix things, try something smaller (say 30 seconds).  If the smaller time works but the larger one doesn't, and you've got time to do a quick binary search to see where the cutoff time for the error is (at least roughly) that would be great.
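The binary search suggested here could be sketched in Python as follows. The probe `works_after_idle` is hypothetical: in practice it would mean waiting that many seconds and then loading the page again to see whether it responds promptly. The bounds and tolerance are assumptions chosen for illustration.

```python
def find_cutoff(works_after_idle, low=30, high=300, tolerance=5):
    """Binary-search the idle time (seconds) at which reuse starts failing.

    works_after_idle(seconds) is a hypothetical probe: wait that long,
    then issue a request on the idle connection and report whether it
    returned promptly. Assumes `low` works and `high` fails.
    """
    while high - low > tolerance:
        mid = (low + high) // 2
        if works_after_idle(mid):
            low = mid    # still fine: the cutoff is later
        else:
            high = mid   # hangs: the cutoff is earlier
    return low, high
```

The function returns a pair bracketing the cutoff: the longest idle time known to work and the shortest known to hang. Each manual probe costs as many seconds as the idle time being tested, so a coarse tolerance keeps the experiment short.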
To clarify, the reason it was happening after 60 seconds was due to a load balancer sitting in between IIS and FireFox that itself had a 60-sec timeout. Once we took that out of the loop we started experiencing the issue after 120 seconds, as mentioned above.
Note that this is also another argument for having a shorter client-side timeout, as other timeouts can also come into play aside from just the web-server settings.
(In reply to comment #23)

> But we can find out easily enough. Ian (or Ria):  could you go into your
> about:config and set "network.http.keep-alive.timeout" to 120 seconds, and see
> if it changes anything? 

Yes, this seemed to help. I tested this twice (waited 2.5 minutes) and the site loaded instantly.
http://www.winebid.com/buy_wine/search_results.aspx?SearchString=&PBCT=3 also works with the default value now, so this is a bit confusing :)
(In reply to comment #27)
> http://www.winebid.com/buy_wine/search_results.aspx?SearchString=&PBCT=3 also
> works with the default value now, so this is a bit confusing :)

Did you restart FF? Did you drop the http cache?

...

BTW: I've seen some servers sending Connection: keep-alive and Keep-alive: timeout=<some number of seconds>. I wasn't able to find an exact spec for this header but do we consider it? Does IIS7 send it?
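For illustration, a client could read such a header along these lines. This is a Python sketch under assumptions: since no exact spec was found, it assumes the value is a comma-separated list of `name=value` parameters (for example `timeout=5, max=100`), and the function name `keep_alive_timeout` is hypothetical. Whether IIS7 sends the header at all is not established in this bug.

```python
def keep_alive_timeout(header_value):
    """Extract the timeout (seconds) from a Keep-Alive header value, or None.

    Assumes a comma-separated list of name=value parameters.
    """
    for part in header_value.split(","):
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "timeout":
            try:
                return int(value.strip())
            except ValueError:
                return None  # malformed value: ignore the hint
    return None
```

A client honoring this hint would drop the idle connection slightly before the advertised number of seconds, for the same race-avoidance reason given for choosing 115 rather than 120.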
I have implemented the workaround of setting IIS to 300 seconds on the server-side, so you should not see the issue any more on our site.
Comment on attachment 376965 [details] [diff] [review]
patch

Patch approved with 115 second timeout value.  Sounds like 120 might work, but might as well play it safe.
Attachment #376965 - Flags: superreview?(bzbarsky)
Attachment #376965 - Flags: review?(jduell.mcbugs)
Attachment #376965 - Flags: review+
JST:  this is a one-line fix.  Is it worth making a blocker (assuming that's the path to getting it into 1.9.0 ASAP)?
Flags: blocking1.9.1- → blocking1.9.1?
Attachment #376965 - Flags: superreview?(bzbarsky) → superreview+
Comment on attachment 376965 [details] [diff] [review]
patch

Please add a nice comment explaining why it's such a non-round number.
In particular why it's not 120.
the fast track here makes me a little queasy. It sounds like the bug is "IIS7 persistent connection timeout handling is broken" and I'm a little skeptical the real issue is that broad.. that would be a major bug if it's that broad.

is there a KB article on it if so?

there is nothing wrong with the smaller timeout, but maybe we should be sure we've got the issue fully understood before declaring victory?
> there is nothing wrong with the smaller timeout

True.  The current timeout value appears to be much larger than any servers out there will wait for, and it's higher than IE's or Opera's.  So I think dropping down to anywhere between IIS/Opera's value (120 secs) and IE's (60) is fine.  Patrick liked the idea of making the number slightly smaller than IIS's, hence 115.

> but maybe we should be sure we've got the issue fully understood 
> before declaring victory?

Well, yes, that would be nice.  It's not really clear what's going on here.  Comment #15 indicates that IIS may be using a RST rather than a FIN, but that would still shut down the socket on our end:  that doesn't seem to be happening.  Actually, the IIS server doesn't seem to respond at all (assuming comment #12 is true). 

I'd be fine with us poking around at this some more, but given that any timeout value between 60-115 appears to fix this bug, without any significant cost (and that other browsers use similar values, and that no servers honor a keepalive as long as our current 300 secs) I don't see why we shouldn't just check in the patch.

Thoughts?
Flags: blocking1.9.1?
Here's some potentially relevant stuff from the IIS bug forum:

    http://forums.iis.net/t/1155755.aspx?PageIndex=4
    
Here is Microsoft's response to the ticket I opened. 

"Firefox is waiting for more than the standard 2 minutes before trying to re-use the connection.

Firefox never sends "FIN" command(FIN- Finish is used during a graceful session close to show that the sender has no more data to send) to the server, so it cannot re-open the connection.

IIS times the request out as expected, due to the default 2 minute ConnectionTimeout setting of HTTP.sys.

The IIS server, however should not be waiting for 9 seconds to send a reset. So we doubt that there could be some issues with the NIC or NIC drivers which initiates this waiting.

So, the part of the problem here is Firefox trying to reuse an old connection.

The other problem seems to be with TCP on the server not issuing a timely RST(RST- Reset is an instantaneous abort in both directions (abnormal session disconnection)).

Recommendations:

- Let’s  disable TCP chimney and/or update NIC drivers on server.
- Lets run the following command to disable the TCPChimney,
- Netsh int ip set chimney DISABLED


Unfortunately that command didn't work for me... I kept getting "command not found". So they had me add the following registry entries:

..."
> Please add a nice comment explaining why it's such a non-round number.

"Set slightly under IIS's keepalive timeout (120 secs) to avoid potential race."
Attached patch patch v2 - added comment (obsolete)
(In reply to comment #38)
> > Please add a nice comment explaining why it's such a non-round number.
> 
> "Set slightly under IIS's keepalive timeout (120 secs) to avoid potential
> race."

I've just attached patch with more verbose comment. Feel free to change it as you wish.
Attachment #377033 - Flags: superreview?(bzbarsky)
Attachment #377033 - Flags: review+
Comment on attachment 377033 [details] [diff] [review]
patch v2 - added comment

"don't close the connection" and looks good.
Attachment #377033 - Flags: superreview?(bzbarsky) → superreview+
Attached patch patch v3
Attachment #376965 - Attachment is obsolete: true
Attachment #377033 - Attachment is obsolete: true
Keywords: checkin-needed
(In reply to comment #37)
> 
> Here is Microsoft's response to the ticket I opened. 

Wow. That's a lotta questionable material stuffed into one response from MS. I think you should keep trying them until they identify a specific bug number on their side.

> 
> "Firefox is waiting for more than the standard 2 minutes before trying to
> re-use the connection.

There is no standard 2 minutes. That's crap. Indeed, the standard says explicitly that there is no particular value - from 8.1.4 of rfc 2616 - 

"The use of persistent
   connections places no requirements on the length (or existence) of
   this time-out for either the client or the server."

Firefox is doing nothing wrong.

> 
> Firefox never sends "FIN" command(FIN- Finish is used during a graceful session
> close to show that the sender has no more data to send) to the server, so it
> cannot re-open the connection.

The fallacy of this is obvious. The presumption is that firefox has no more data to send - but that's not true! We have another request, which we're sending down a perfectly valid idle persistent connection (that the server has not closed).

> 
> IIS times the request out as expected, due to the default 2 minute
> ConnectionTimeout setting of HTTP.sys.

no.

If IIS timed out the connection (not request, there is no active request) it would generate a FIN towards Firefox. (well, perhaps IIS did time it out, but windows or the TOE hardware is broken - that's all the "server" as far as firefox can be concerned)

A RST, btw, is not an acceptable replacement for a fin. (not that there is even a rst at the 2 minute mark)  RSTs are not subject to reliable delivery and they would abort the reliable retransmission of any portion of the prior transaction's response that had not yet reached the client. And it could also seriously call into question the length of a message body that was being determined by connection close rather than a length header or chunking frame if it were used in those situations.

Again - RFC 2616
" When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection."

The wording graceful is specifically about fin vs rst. (in the spec world it is an attempt to keep the http spec open to transports other than tcp..)

> So, the part of the problem here is Firefox trying to reuse an old connection.

That's a feature, not a problem :) It saves the work of connection establishment for everyone.

If the server would rather firefox did not do this they only need to close that old connection, which they do not do.

> Recommendations:
> 
> - Let’s  disable TCP chimney and/or update NIC drivers on server.
> - Lets run the following command to disable the TCPChimney,
> - Netsh int ip set chimney DISABLED
> 

Ah - now the meat of the matter. It sounds like a bug with chimney. Chimney is the MS layer that connects TCP TOE functionality on ethernet cards with the OS. I have worked with TOE hardware - it can be really confusing and the hardware can (and does) have bugs too. The hw bugs are generally fixable in firmware.

Looks like a big fat hardware dependent OS bug, not really an IIS-7 one.

Indeed, if I try an infinite pconn to www.iis.net I can reproduce the issue as described.

But if I try such a beast to www.microsoft.com, which identifies itself as running IIS/7.5, the server does a graceful close after 4 minutes and 17 seconds. Completely in spec.

In a different trial run, I checked to make sure the connection was usable after 3 minutes of idle time. It was.

09:34:50.217126 IP 192.168.16.214.54408 > 65.55.21.250.80: S 547311562:547311562(0) win 5840 <mss 1460,sackOK,timestamp 306027870 0,nop,wscale 7>
09:34:50.322780 IP 65.55.21.250.80 > 192.168.16.214.54408: S 498729028:498729028(0) ack 547311563 win 8190 <mss 1460>
09:34:50.322806 IP 192.168.16.214.54408 > 65.55.21.250.80: . ack 1 win 5840
09:34:51.617992 IP 192.168.16.214.54408 > 65.55.21.250.80: P 1:18(17) ack 1 win 5840
09:34:51.723172 IP 65.55.21.250.80 > 192.168.16.214.54408: . ack 18 win 8190
09:34:51.723200 IP 192.168.16.214.54408 > 65.55.21.250.80: P 18:45(27) ack 1 win 5840
09:34:51.828137 IP 65.55.21.250.80 > 192.168.16.214.54408: . ack 45 win 62867
09:34:51.828693 IP 65.55.21.250.80 > 192.168.16.214.54408: P 1:391(390) ack 45 win 62867
09:34:51.828703 IP 192.168.16.214.54408 > 65.55.21.250.80: . ack 391 win 6432
09:39:08.276051 IP 65.55.21.250.80 > 192.168.16.214.54408: F 391:391(0) ack 45 win 8190
09:39:08.276142 IP 192.168.16.214.54408 > 65.55.21.250.80: F 45:45(0) ack 392 win 6432
09:39:08.385123 IP 65.55.21.250.80 > 192.168.16.214.54408: . ack 46 win 8190
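
The idle period in the trace above can be checked directly from its timestamps. A small Python sketch, using only the two timestamps copied from the trace (the client's last ACK of the response and the server's F packet); the variable names are illustrative:

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"
last_data_ack = datetime.strptime("09:34:51.828703", FMT)  # client ACKs the response
server_fin = datetime.strptime("09:39:08.276051", FMT)     # server sends F (FIN)

# Time the connection sat idle before the server closed it gracefully.
idle = (server_fin - last_data_ack).total_seconds()
minutes, seconds = divmod(idle, 60)
print(f"idle before FIN: {int(minutes)} min {seconds:.1f} s")
```

The result is a little under 257 seconds, in line with the roughly 4 minutes and 17 seconds quoted above and well past the 120 second default discussed elsewhere in this bug.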

I did a different run and got the same graceful close after 3 minutes and 23 seconds of idle time. Also perfectly legitimate - there is no rule that says the timeout period has to be based on a constant. 

This makes some sense. Closing the connections is just meant to preserve resources - if you don't have any resource pressure why close it at all? I wrote a tool once that only closed idle connections when their fixed size pool was full (and then it did it fifo style).

But in some ways this gets to the silliness of us changing the timeout in firefox. The 120 number is configurable, there are often proxy steps introduced as part of load balancing, and so on..

So 115 might not mean you can sleep well.

But please do note that the session from microsoft.com is closed with an F, not an R. Any talk that RSTs are a normal way to close the session (or that IIS normally does that) is just someone running their mouth/keyboard - RSTs are by definition error conditions, and the closing of a persistent connection is not an error.

.....

I just needed to understand the issue a little better - please don't take this as an argument against changing the timeout value. It will help in some cases, and the harm is relatively minor. But treating the issue as "solved" is probably not right either.
"That's a feature, not a problem :) It saves the work of connection
establishment for everyone."

I can understand your point. Looking at it from that perspective then the question is what is the benefit of this feature if it takes several minutes for FF to sort out that the connection is actually dead and not reusable? It may be beneficial to server-operators and network administrators, who save the aggregate connection-creation overhead... but to the end user it is a major headache. 

Perhaps, in addition to the reduction in timeout, there is further investigation that can be done to figure out a faster way to recover from "dead" connections in general, regardless of the cause (IIS ****, hardware failures, proxy issues...)? That seems like a much larger project though.
Assignee: nobody → michal
Comment on attachment 377136 [details] [diff] [review]
patch v3

Looks fine to go.
Attachment #377136 - Flags: superreview?(bzbarsky)
Attachment #377136 - Flags: review+
Comment on attachment 377136 [details] [diff] [review]
patch v3

This really was all set as far as I was concerned...  ;)
Attachment #377136 - Flags: superreview?(bzbarsky) → superreview+
I think we should take this on branch for a dot-release.
Flags: wanted1.9.1.x?
Pushed http://hg.mozilla.org/mozilla-central/rev/ad10f4f7b8f7
Status: NEW → RESOLVED
Closed: 13 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Flags: blocking1.9.2?