Closed Bug 947391 Opened 6 years ago Closed 6 years ago

HTTP connections (exc. XHR, SPDY) should have a response timeout

Categories

(Core :: Networking: HTTP, defect)

25 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla29

People

(Reporter: sworkman, Assigned: sworkman)

References

Details

(Keywords: dev-doc-needed, site-compat)

Attachments

(3 files, 2 obsolete files)

Currently, HTTP connections wait indefinitely for initial responses from the server. We should enable a timeout to drop the connection if responses take too long (specific time to be decided).
Jason, we had discussed this as timing out half open connections before. However, I assume that you didn't mean nsHalfOpenSockets, as these are only used to setup the initial TCP connection, and we had discussed the HTTP server not responding. Moreover, there is a timeout for the TCP connection not being established. So, most of the changes refer to nsHttpConnection, where the transaction is activated once the socket is writeable.

I've setup a timeout for 10 mins, but that can be changed.

Also, this should drop connections (non XHR and SPDY anyway) after a sleep/suspend for longer that 10 mins. A good thing, no?

Let me know if I'm going in the right direction.
Attachment #8343934 - Flags: feedback?(jduell.mcbugs)
Updated based on test results (test patch up next).
Attachment #8343934 - Attachment is obsolete: true
Attachment #8343934 - Flags: feedback?(jduell.mcbugs)
Attachment #8344996 - Flags: review?(jduell.mcbugs)
Pat, these patches set an HTTP response timeout of 10 minutes. Note that this is only for the initial response bytes after a request; long intervals between multipart responses should not be affected. Also, it's disabled by default for XHR (which has its own timeout facility), so long-polled requests should be safe too.

Thoughts?
Looks like WebSockets have their own timeout: WebSocketChannel::mOpenTimeout - defaults to 20s, set using network.websocket.timeout.open.

So, this patch disables the HTTP response timeout for WebSockets.
Attachment #8345034 - Flags: review?(jduell.mcbugs)
Attachment #8344996 - Flags: review?(jduell.mcbugs) → review?(mcmanus)
Attachment #8344998 - Flags: review?(jduell.mcbugs) → review?(mcmanus)
Attachment #8345034 - Flags: review?(jduell.mcbugs) → review?(mcmanus)
Jason suggested that Pat would be a better reviewer for this.

Let me know what you think of the patches.
Will downloading huge files fail if it takes more than 10 minutes?
(In reply to Masatoshi Kimura [:emk] from comment #7)
> Will downloading huge files fail if it takes more than 10 minutes?

Well, let me clarify. The timeout is for the length of time until the first byte is downloaded. So, if the download doesn't start until 10 mins (or more) after the initial HTTP request was sent, then it will fail. But if the first byte is received within the first 10 mins, then the duration of the download does not matter; at least as far as this patch is concerned.
Comment on attachment 8345034 [details] [diff] [review]
v1.0 Disable HTTP response timeouts for WebSockets

Review of attachment 8345034 [details] [diff] [review]:
-----------------------------------------------------------------

I actually think we don't want any version this patch. The timeout in question is the first byte of the HTTP response, and in the case of websockets that response is the expected 101 from the upgrade request - which we certainly expect to come back quickly.

I don't think the logic in the other patch would cause timeouts once websockets is setup.. would it steve? (hand test?)
Attachment #8345034 - Flags: review?(mcmanus) → review-
Attachment #8344998 - Flags: review?(mcmanus) → review+
Comment on attachment 8344996 [details] [diff] [review]
v1.1 Add a timeout for initial HTTP responses

Review of attachment 8344996 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/src/init/all.js
@@ +1035,5 @@
>  // We set the timeout a little shorter to keep a reserve for cases when
>  // the packet is lost or delayed on the route.
>  pref("network.http.keep-alive.timeout", 115);
>  
> +// Timeout connections if an initial response is not received after 10 mins.

let's cut it down to 5 and see what happens.. they both too big to be super helpful, but 5 is better than 10.

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +969,5 @@
> +            LOG(("canceling transaction: no response for %ums: timeout is %dms\n",
> +                 PR_IntervalToMilliseconds(initialResponseDelta),
> +                 PR_IntervalToMilliseconds(gHttpHandler->ResponseTimeout())));
> +
> +            // This will also close the connection

mResponseTimeoutEnabled = false;

@@ +1385,5 @@
>      PRIntervalTime now = PR_IntervalNow();
>      PRIntervalTime delta = now - mLastReadTime;
>  
> +    // Reset mResponseTimeoutEnabled to stop response timeout checks.
> +    if (!mSpdySession) {

I don't think you need the spdy check.. if nothing has happened for mumble minutes on the spdy channel when it first gets setup then we don't need it.

::: netwerk/protocol/http/nsIHttpChannelInternal.idl
@@ +181,5 @@
>       */
>      attribute boolean loadUnblocked;
>  
> +    /**
> +     * This attribute en/disables the timeout for an HTTP response.

"timeout for the first byte of an HTTP response"
Attachment #8344996 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #9)
> Comment on attachment 8345034 [details] [diff] [review]
> v1.0 Disable HTTP response timeouts for WebSockets
> 
> Review of attachment 8345034 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I actually think we don't want any version this patch. The timeout in
> question is the first byte of the HTTP response, and in the case of
> websockets that response is the expected 101 from the upgrade request -
> which we certainly expect to come back quickly.
> 
> I don't think the logic in the other patch would cause timeouts once
> websockets is setup.. would it steve? (hand test?)

Yeah, I think you're right. I think I was being over-zealous. Let's leave this one out.
Comment on attachment 8345034 [details] [diff] [review]
v1.0 Disable HTTP response timeouts for WebSockets

Not going to use this patch; obsoleting it to hide it.
Attachment #8345034 - Attachment is obsolete: true
Think I've fixed it now. Two problems:
1. The port I was using in the xpcshell test wasn't ok on Android. I'm now starting the HTTP server with port = -1, then getting the port it sets up after.
2. Fennec was creating a pipelined connection to the HTTP server, and my patch wasn't fully setup to enable response timeouts for pipelining. I've added ResponseTimeoutEnabled() to SPDY and pipelining headers. Note, this is based on Pat's comment 10, which suggests we should not disable the timeout for both of those. I needed to add those functions and return 'true' to fully enable it.

Try run looks good - some B2G mozharness failures; unrelated:
https://tbpl.mozilla.org/?tree=Try&rev=218f929ce5c0

Inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/08a000ab3a2a
https://hg.mozilla.org/integration/mozilla-inbound/rev/1d193264de9d
https://hg.mozilla.org/mozilla-central/rev/08a000ab3a2a
https://hg.mozilla.org/mozilla-central/rev/1d193264de9d
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Ooops. The patch chunk that canceled the timeout got skipped when I was removing bitrot.

On inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b9534b65143d
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
bug 957831 is caused by this bug 947391 ?
https://hg.mozilla.org/mozilla-central/rev/b9534b65143d
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
(In reply to pal-moz from comment #18)
> bug 957831 is caused by this bug 947391 ?

can you retest with the cset in comment 19?
Flags: needinfo?(bugmozz)
(In reply to Patrick McManus [:mcmanus] from comment #20)
> (In reply to pal-moz from comment #18)
> > bug 957831 is caused by this bug 947391 ?
> 
> can you retest with the cset in comment 19?

tested with the build, cset:99afe134bc7a
and seems to be fixed. (I can download)

thanks.
Flags: needinfo?(bugmozz)
Apologies for the failure, pal-moz - thanks for verifying.
This behavior change is a bit invasive : in our (tomcat) web applications we have some pages that sometimes are taking more than 5 minutes before any reply... So now firefox users will have the timeout.
(In reply to Xavier Poinsard from comment #23)
> This behavior change is a bit invasive : in our (tomcat) web applications we
> have some pages that sometimes are taking more than 5 minutes before any
> reply... So now firefox users will have the timeout.

Hi Xavier,

Sorry you're seeing some problems here. The intention of this bug was to timeout unresponsive HTTP servers that establish TCP connections but for some reason don't respond to HTTP requests. We specifically avoided XHRs to allow for long-polled requests. Now, you say you have pages that take longer than 5 mins to load - can you confirm that these are HTML document loads and not XHRs? If so, please file a new bug (cc'ing me) with as much information as you can provide so we can figure out a way forward.
Depends on: 1005808
Depends on: 979268
Depends on: 1008950
No longer depends on: 1008950
See also: bug #1024015.
This bug introduced API changes and therefore should have been subject
to super-review.  Better late than never.  Sincere apologies if I am
reading policy wrong.

Some applications which used to display their pages to all Firefox
users must now change API calls in one of the following ways:

  * Altering the timing of the API calls that send HTTP responses.
  * Substituting XHR for "regular" HTTP requests.


Mozilla began with the promise that Firefox would be a platform for
the web.  Firefox would be an agnostic way to access remote computers
via HTTP.

The promise did not include caveats; "unless access takes more than 5
minutes", or "so long as your application uses a progress bar", or
"but sometimes the web server must deliver the HTTP response in a
special way", or even "but Firefox must be reconfigured if it is to
display _every_ web page".

The change "HTTP connections now have a response timeout",
https://developer.mozilla.org/en-US/Firefox/Releases/29/Site_Compatibility#Networking,
introduces all of the above caveats into the Firefox platform.

Some things just take more than N number of minutes to happen.


Proposals for alternative patches:

I can envision a number of alternatives to the existing solution.
(But see below regards my understanding of the problem.)

Do not send any new traffic out on "response-timed-out" connections,
but keep them open and continue to wait for a response.  Do not count
"response-timed-out" connections toward the total connection limit.
Upon receiving inbound traffic on a "response-timed-out" connection
change it's state so that it is not "response-timed-out".

Or...

Give the user a dialog box when the server is taking a long time to
respond.  Instead of forcing an abort have the dialog box give the
user the option to continue to wait.

Either of these alternatives would restore the Firefox platform
promise.  In the meantime you might consider a default
network.http.response.timeout of 0.


My understanding of the problem:

I have searched Mozilla mailing lists and queried the patch authors in
bug reports but remain ignorant of the essential nature of this bug.
This makes it hard to discuss solutions.  Perhaps I have not done a
good enough job in this because I've been unable to find any
discussion of what should be done upon discovery of TCP connections
that silently eat packets and the pros and cons of different
approaches.  I apologize if my searching is inadequate.

The worry is of open TCP connections that are un-usable, due to
network intermediaries that do not send TCP RST but begin to block a
connection's packets.  I guess that the perceived problem is in the
Firefox user.  I presume the thought is that when HTTP responses take
a long time the user will not close or reuse the tab/window and
thereby close the TCP connection.  (I assume here that Firefox is
clever enough to close TCP connections when windows/tabs are closed or
re-used.)  So, useless TCP connections will hang around.  I do not
know the full impact of such open, useless, TCP connections.
(Obviously, attempts to communicate on a TCP connection that eats
packets will silently fail.)  All I know for sure is that after 5
minutes the existing solution forces a TCP RST and notifies the user.

(In my opinion, unless unusable but open TCP connections impact
something other than the site involved, this is bloat.  People are
perfectly capable of recognizing that things are not working and
aborting on their own initiative.  All setting an arbitrary timeout
does is put an upper limit on the amount of time things remain broken.
It does not prevent breakage.  An expert should be consulted here but,
given what I know of people's attention span, after more than 10 or 20
seconds of breakage most people will typically abandon the site
anyway.)

Thank you for your attention and for Firefox.

(Attachement quoted above.)
Attachment #8440369 - Flags: superreview?(bzbarsky)
Karl,

There's a OS limit on how may file descriptors the browser can open.  Every "hung" TCP connection that we don't timeout will wind up clogging up that count, and eventually we can't open up new sockets and the browser looks "hung".  Before that point it may limit parallelism (i.e. fewer HTTP connections are able to be open), so that perf is reduced in a way that's not easy to pin down the cause for.

In our experience trying to add UI and let the user figure things out for something like this is simply confusing for a plurality of users.  It's also extremely rare for valid HTTP responses to take >5 minutes.  We've made the decision that the tradeoff of breaking a tiny handful of sites is worth getting recovery from bad network conditions.  Yes, we could have tried to engineer this so that we only close connections if they seem to be degrading performance, but that's a much more complex solution that would take a lot of time and code to get right, and we have limited engineering resources.  Supporting HTTP responses that take >5 min is just not enough of a priority to justify that work.

That said, now that we also use TCP keepalives on our sockets, I might personally be persuaded that the 5 minutes timeout is less important for detecting errant sockets than it used to be.  OTOH we may not always get keepalive support turned on for all platforms (esp phones), and those often have lower file descriptor counts.

The superreview process at Mozilla has frankly gotten a little fuzzy--it was primarily for changes to IDLs (for addon compatibility).  Now that we're willing to break addons when needed, it's used a lot less.  I've never seen it used for something like this.  HTTP behavior isn't really part of a "contract" the same way (otherwise we wouldn't have been able to do things like open speculative "extra" connections to web sites).  This decision really belongs in the hands of the networking module owner, so I'm needinfo'ing :mcmanus to make that call.  I very much doubt :bz will disapprove but if he does he can jump in.

Karl: thanks for the eloquent appeal for your cause--I'm not sure I agree that this is important enough to change course, but you've made your case well and politely and with a good understanding of how to move things through the "appeal" process :)
Flags: needinfo?(mcmanus)
Attachment #8440369 - Flags: superreview?(bzbarsky)
(In reply to Jason Duell (:jduell) from comment #28)

> The superreview process at Mozilla has frankly gotten a little fuzzy--it was
> primarily for changes to IDLs (for addon compatibility).  Now that we're
> willing to break addons when needed, it's used a lot less.  I've never seen
> it used for something like this.  HTTP behavior isn't really part of a
> "contract" the same way (otherwise we wouldn't have been able to do things
> like open speculative "extra" connections to web sites).

Ok.  Thanks.  I understand.

(It sure does feel like a API change, since the recommened solution is to re-write the application using a different API.)

I understand the problem regards parallelism.  It seems to me that you might be able solve this pretty easily and make everybody happy by adopting the notion I outline above regards putting connections into a "response-timed-out" state and putting a small upper limit on the number of connections that are allowed to exist in this state.  A TCP RST would be sent on a fifo basis to the oldest if the limit is exceeded.

I can't argue this approach since I don't understand why it's important to time out "reglar" HTTP connections and not time out others (XHR, etc.).  I'm not clear why this bias exists.  It seems to me that all HTTP connections, regardless of the API used to generate them, suffer from the "evil firewall" problem.  If so the above appraoch could be applied regardless of HTTP API and solve the problem in all cases.  If Mozilla is unable to devote engineering resources to this it would be helpful if this thread contained a response to this proposal.  Someone else might be willing to take up the coding if there is a possiblity that the solution would be acceptable.

I would also, without having any answers, like to question the assumption that sites which take more than 5 minutes to generate a HTTP response are more rare than network paths which eat TCP traffic because they break connections without sending a TCP RST.  Both would seem to be pretty rare.  But you may not be considering all use cases.  When researching this the first or second hit I came across involved problems created by network.http.response.timeout involving Senelium automated testing.  There may be other usage idioms, beyond normal web browsing, that you have not considered in your analysis.

Thanks for all the consdieration you've given to this.
Karl, superreview is generally about API changes that affect how different parts of Gecko interact with each other, or APIs exposed to extensions.  

For things like this, where it's a matter of standards and implementation thereof, the decision is up to the people most familiar with the relevant standard.  In this case that would probably be Patrick.
I just wanted to acknowledge this bug activity - I'm away from keyboard for a few days. but we'll talk about this again.

Karl, I do think one of the things you need to consider is that the timeout path that you object to is a state that already exists in your application that you need to deal with to operate correctly. i.e. the connection can be torn down independent of the client or server anyhow (e.g. nats that do send a FIN - they come in a variety of frustrating flavors :)), so this shouldn't be adding any logic to a complete application - just changing the frequency of some things.

otoh I've never loved this patch - though it does address a very real and ugly problem. We've added some KA support and :bagder is also adding some logic to deal with network transitions which I think underlie a fair chunk of the real issue here. After that goes in we might be able to back this out.
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #31)

> Karl, I do think one of the things you need to consider is that the timeout
> path that you object to is a state that already exists in your application
> that you need to deal with to operate correctly. i.e. the connection can be
> torn down independent of the client or server anyhow (e.g. nats that do send
> a FIN - they come in a variety of frustrating flavors :)), so this shouldn't
> be adding any logic to a complete application - just changing the frequency
> of some things.

Yes.  This is true.  Although most of the time there is not an issue
because the application in question is for use within a single
institution.  Hence, there are people to complain to about brokenness.
Long HTTP response times are not something I'd want the general public
to experience, but when it comes to slapping a web interface on some
app that does some heavy crunching, well, the user's going to have to
wait no matter what.

Occasionally there are problems.  Usually they are easily addressed by
putting some sort of network encapsulation into the mix, like an ssh
tunnel.

It is also worth noting that the application is not always under my
control.  E.g. I have users that occasionally run SQL queries
overnight in phpPgAdmin.

Ultimately there are always work-arounds.  But, as noted, there is
always also limited budget.  :-)
Hello,
I'm not sure if this ticket is exactly the cause of setting  http.response.timeout to 300...refer me to the Ticket if its not.

Can this setting be reversed. or is there another Ticket to reverse this back to 0
 http.response.timeout 

Firefox cannot be used anymore when using reporting applications like microsoft reporting servies (or as prior comment phpMyAdmin..etc). We use firefox here at a small insurance company in Chicago. It took years for me to convert everybody to Firefox and Thunderbird, and now we cannot run reports on firefox while connecting to "Microsoft Reporting Services". Some reports do take more then 5min, some take 40min. Firefox timeouts the connection and users are forced to go back to IE, or are frustrated with reports not knowing its firefox that causes a timeout.

Isn't there another way to see if the service is not responding? vs assuming that waiting for report to generate means it timed out. 

This ticket causes a critical bug for us for 18 people at our company that utilize some types of reports to get stats on our business.

Thank you
Lucas
Flags: needinfo?(sworkman)
Lucas, you can work around on your end by simply changing the http.response.timeout preference on the relevant computers, right?

> Isn't there another way to see if the service is not responding?

Not really, unfortunately.
(In reply to Boris Zbarsky [:bz] from comment #34)
> > Isn't there another way to see if the service is not responding?
> 
> Not really, unfortunately.

Actually, there is, sorta.  Per bug 1024015 the new plan is to
test for TCP Keepalive's.  If that succeeds then the lack of
server reponse is moot.  If TCP Keepalives fail then Firefox
will fall back to the 300 second network.http.reponse.timeout
and kill the connection upon timeout.

(I'm still unclear on why normal http requests are
subject to timeout but XHR requests are not so I can't say
just what "kinds" of http reqeuests are affected
and which are not.)
(In reply to Boris Zbarsky [:bz] from comment #34)
> Lucas, you can work around on your end by simply changing the
> http.response.timeout preference on the relevant computers, right?

This can be a non-trival operation.  Not only timewise, especially regards
the "relevant" criteria and potential issues of geography, but especially socially.
In my environment the collective response has been to stop using Firefox.  This is
emergent behavior and fighting the tide is tough.
(In reply to Lukasz Szybalski from comment #33)
> Hello,
> I'm not sure if this ticket is exactly the cause of setting 
> http.response.timeout to 300...refer me to the Ticket if its not.

This was the original bug, yes.

> Can this setting be reversed. or is there another Ticket to reverse this
> back to 0 - http.response.timeout

Karl's comment #35 is correct. I have just landed a patch which effectively disables the response timeout by default. It will only be enabled if TCP keepalives for HTTP connections are disabled. You can follow the progress of the patch by adding yourself to the CC list on bug 1024015.

> Firefox cannot be used anymore when using reporting applications like
> microsoft reporting servies (or as prior comment phpMyAdmin..etc). We use
> firefox here at a small insurance company in Chicago. It took years for me
> to convert everybody to Firefox and Thunderbird, and now we cannot run
> reports on firefox while connecting to "Microsoft Reporting Services". Some
> reports do take more then 5min, some take 40min. Firefox timeouts the
> connection and users are forced to go back to IE, or are frustrated with
> reports not knowing its firefox that causes a timeout.

I'm sorry that this has been a pain point for you and others; our intention was to make Firefox better by detecting unresponsive connections (multiple comments have more details on the rationale if you would like to know more about the background - comment #24, comment #28, comment #31). As :bz said in comment #34, you can work around this by setting network.http.response.timeout to 0 manually. And bug 1024015 will disable the timeout and rely on TCP keepalives instead.

(In reply to Karl O. Pinc from comment #35)
> (I'm still unclear on why normal http requests are 
> subject to timeout but XHR requests are not so I can't say
> just what "kinds" of http reqeuests are affected
> and which are not.)

Karl, XHR requests already have a timeout API that can be set on a per application basis. The requests that were being timed out here are page loads, resource loads etc.
Flags: needinfo?(sworkman)
(In reply to Steve Workman [:sworkman] from comment #37)
> Karl, XHR requests already have a timeout API that can be set on a per
> application basis.

Thanks.
(In reply to Xavier Poinsard from comment #23)
> This behavior change is a bit invasive : in our (tomcat) web applications we
> have some pages that sometimes are taking more than 5 minutes before any
> reply... So now firefox users will have the timeout.

I understand the idea of this configuration, but...

I have same problem in Apache Tomcat using the Birt Webreport application. In reports PDF generator, the Birt open a new web page (at new window) and wait for more five minutes to finalize this page. Since Firefox version 29, the reports when have more five minutes to generate, ocurrer timeout request. But in this application I have hundreds users using the FF, to change this configuration manually or create add-on for fix this is unfeasible for me.

If it is kept, I need change dramatically my application. :-(
Thanks!
(In reply to Maicon Schelter from comment #39)

Hi Maicon! Sorry you're having trouble here. We decided to disable the timeout in bug 1024015; it should be fixed in Firefox 31.0 which is currently in beta.
Depends on: 1063358
You need to log in before you can comment on or make changes to this bug.