Closed Bug 1269055 Opened 8 years ago Closed 8 years ago

RFE - auto retry failed POST

Categories

(Core :: Networking: HTTP, defect)

46 Branch
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox46 blocking wontfix
firefox47 + fixed
firefox48 + fixed
firefox49 + fixed

People

(Reporter: thomas.baugh, Assigned: dragana, NeedInfo)

References

Details

(4 keywords, Whiteboard: [necko-active])

Attachments

(2 files)

Attached file rfc2616-8.2.4.pl
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160309193552

Steps to reproduce:

See the attached perl script that starts a simple webserver on port 8080 that will render a page triggering the problem on POSTing via form action.
 
Simply run the script, then browse to http://localhost:8080 and then click the 'Maximum Go' button. You can safely send a TERM signal to the perl process running the script after you are done testing.


Actual results:

On Firefox 46+ for desktop users the user will see a 'connection reset' page.


Expected results:

See https://tools.ietf.org/html/rfc2616#section-8.2.4
When requests terminate unexpectedly on requests like those you accomplish with POST form action, the default action should be to automatically retry the request.

On Google Chrome the page will reload successfully when using the same provided script.
Note that even though the bug submission form automatically appended the user agent I filed this bug with, etc., the builds I am finding this on are exclusively 46.0 - 49.0a1 (2016-04-29). Version 45 works just fine on these requests.
Component: Untriaged → Networking: HTTP
Product: Firefox → Core
Do you have a live testcase?
Flags: needinfo?(thomas.baugh)
Are you asking me to run the perl script I attached to the original comment? The provided script should start up a perl HTTP server process on port 8080 that you can immediately browse to and view the problem by clicking on the html page the server process is setup to render. That is the test case. Please see the attachment to the first comment, let me know if there's still something else needed here.
I am tempting to set this as WONTFIX.
The POST method is not safe to retry, I am not sure we should retry it automatically. ff45 retries automatically without informing users that a POST method is retried. The retry is happening after a TCP reset but we cannot know if the server actually has consumed our data. 


The change that cause this mismatch at:
http://hg.mozilla.org/mozilla-central/annotate/tip/netwerk/protocol/http/nsHttpTransaction.cpp#l979
Flags: needinfo?(mcmanus)
This is the "double pizza" problem. The most relevant text is probably 
https://tools.ietf.org/html/rfc7230#section-6.3.1

"6.3.1.  Retrying Requests

   Connections can be closed at any time, with or without intention.
   Implementations ought to anticipate the need to recover from
   asynchronous close events.

   When an inbound connection is closed prematurely, a client MAY open a
   new connection and automatically retransmit an aborted sequence of
   requests if all of those requests have idempotent methods (Section
   4.2.2 of [RFC7231]).  A proxy MUST NOT automatically retry
   non-idempotent requests.

   A user agent MUST NOT automatically retry a request with a non-
   idempotent method unless it has some means to know that the request
   semantics are actually idempotent, regardless of the method, or some
   means to detect that the original request was never applied.  For
   example, a user agent that knows (through design or configuration)
   that a POST request to a given resource is safe can repeat that
   request automatically.  Likewise, a user agent designed specifically
   to operate on a version control repository might be able to recover
   from partial failure conditions by checking the target resource
   revision(s) after a failed connection, reverting or fixing any
   changes that were partially applied, and then automatically retrying
   the requests that failed."

The text says MUST NOT retry non-idempotent (such as POST) - but then gives lots of wiggle room. And we do use that wiggle room from time to time, as does Chrome.

In general we won't retry posts if there are the first transaction on a connection - we might do so later under the presumption that a closed persistent connection is the issue. But again, its interesting that chrome takes on the risk here of resending. I certainly think that's the trend (and booting unqiueness to the application layer completely)

What interests me most is that comment 1 suggests a change in behavior between 45 and 46.. that sounds vaguely familiar to me but I can't put my finger on it.
Flags: needinfo?(mcmanus)
Summary: RFC 2616 8.2.4 not obeyed on POST connection resets → RFE - auto retry failed POST
(In reply to Patrick McManus [:mcmanus] from comment #5)
> 
> What interests me most is that comment 1 suggests a change in behavior
> between 45 and 46.. that sounds vaguely familiar to me but I can't put my
> finger on it.

Bug 1236277
There were no check before and you ask me to add one :)
ah. thanks. that means the reporter must be reusing a pconn somehow.

what do you think of adding || !resuedConn back into the mix. It seems compatible
(In reply to Patrick McManus [:mcmanus] from comment #7)
> ah. thanks. that means the reporter must be reusing a pconn somehow.
> 
> what do you think of adding || !resuedConn back into the mix. It seems
> compatible

I will add that, so that in this case behaves as before plus retry if it is safe.
Assignee: nobody → dd.mozilla
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [necko-active]
Blocks: 1236277
Keywords: regression
Attachment #8748112 - Flags: review?(mcmanus)
Comment on attachment 8748112 [details] [diff] [review]
bug_1269055.patch

Review of attachment 8748112 [details] [diff] [review]:
-----------------------------------------------------------------

uplifts too. thanks!
Attachment #8748112 - Flags: review?(mcmanus) → review+
Keywords: checkin-needed
Comment on attachment 8748112 [details] [diff] [review]
bug_1269055.patch

Approval Request Comment
[Feature/regressing bug #]: regression from bug 1236277. 
[User impact if declined]: if a connection breaks we use to try again if the connection is reused. We decided not to do it for a not-safe requests (non-idempotent method), like POST (1236277). But some applications rely on firefox retrying also POST. See bug 1269482. This will be rarely triggered.
[Describe test coverage new/current, TreeHerder]: there is a test attached in this bug
[Risks and why]: very low, reverting the old behavior.
[String/UUID change made/needed]: none
Attachment #8748112 - Flags: approval-mozilla-release?
Attachment #8748112 - Flags: approval-mozilla-beta?
Attachment #8748112 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/454eb0d09444
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Comment on attachment 8748112 [details] [diff] [review]
bug_1269055.patch

Fix for regression from 46, includes tests.
Attachment #8748112 - Flags: approval-mozilla-beta?
Attachment #8748112 - Flags: approval-mozilla-beta+
Attachment #8748112 - Flags: approval-mozilla-aurora?
Attachment #8748112 - Flags: approval-mozilla-aurora+
Tracking for 46+.  
I can keep an eye on it but it seems unlikely this will be fixed before 47 comes out (June 7th)
This impacts nginx pretty severely - it would be nice to put it on release in case there was a ride along
The duplicate bug 1268775 "Cannot establish connection when using HTTP/2 and POST after idle time" gives more information about relation of this bug to nginx releases. It works fine with nginx 1.9.12, and does not work with 1.10 (and bug 1268775 mentions even 1.9.15 does not work). Just adding the comment for people hit by the same problem.
See Also: → 1269170
See Also: 1269170
Liz: Given 7 duplicates and low fix risk, I believe we should fix this ASAP with Firefox 46.0.2. One of the affected sites reported in Bug 1269170 is a very (the most?) popular Flash game in Japan, and many people are complaining about the issue on Twitter.
Flags: needinfo?(lhenry)
This was suggested as a ridealong if we had another dot release a couple of weeks ago, but I didn't realize the problem was bigger than one game, until now. Thanks for bringing it back to my attention!   I can look at the duplicates and talk about this tomorrow morning.  So, this seems related to the april 26th ngnix 1.10 release? Or is it a combination of that and something new we introduced in 46?
No, this was a major change in POST behavior in Firefox 46, and it can happen with any HTTP server. Among other things, if the browser makes at least two requests to a server, and waits longer than the server's keep-alive timeout (which could easily be only a few seconds or less) between them, if the request after the delay is a POST, it will fail. As far as I know, this is a change from how Firefox has handled POST requests for many years, and quite possibly forever. There's some discussion in comment 5 about what the correct behavior really should be, but given the history of this behavior this is a pretty serious regression in Firefox 46 that would affect many websites.
I suspect this bug is causing random crashes of gmail as well.
(In reply to Mahks from comment #31)
> I suspect this bug is causing random crashes of gmail as well.

I'm not sure what you mean by crash - but assuming you mean firefox crashing then that's not a possible impact of this bug.

liz, the nginx thing was unfortunate timing as their implementation hit this squarely on the nose.. other implementations are more of an unusual race condition. The behavior restored by the patch is actually non-standard, and in general most well functioning sites don't rely on it.. but clearly a few do.
What I would like is some way to measure the impact and how widespread this problem is, but going by the number of duplicate bugs a lot of people are running into it. This may drive a release now rather than waiting for June 7 and Firefox 47.

Patrick, if this is from non standard configurations, is there anything you can advise as a workaround for server admins?
Flags: needinfo?(lhenry)
One more question. People have said that the issue is now fixed in beta 47. Can we reproduce the problem and make sure to test it on a release build? Do we know for sure this is a good fix? Thanks.
this would have been perfect ride along material. oh well.

Given that I know dot releases (well all releases) annoy the userbase, I wouldn't cut a release for this at this stage.

well designed sites are going to be robust to this anyhow - so if there are a handful of high profile sites with the issue we can evangelize an easy fix for them:

"Adjust your server side persistent connection timeout to be about 5 seconds longer than the firefox value. Firefox uses 115 seconds for HTTP/1.x and 180 seconds for HTTP/2"

does that help? I can't think of any useful telemetry here.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #35)
> One more question. People have said that the issue is now fixed in beta 47.
> Can we reproduce the problem and make sure to test it on a release build? Do
> we know for sure this is a good fix? Thanks.

a reporter verified the fix over here
https://bugzilla.mozilla.org/show_bug.cgi?id=1269170#c59
Thank you Patrick, that's super helpful. I'll respond in the other bugs to let people know the workaround. I can also find it in support pages and answer there.
Lithium community software is subject to this problem. A lot of high profile companies use it. I'll aim to feedback Patrick's comments to the developers - http://www.lithium.com/why-lithium/customer-success/
(In reply to Patrick McManus [:mcmanus] from comment #36)
> well designed sites are going to be robust to this anyhow - so if there are
> a handful of high profile sites with the issue we can evangelize an easy fix
> for them:
> 
> "Adjust your server side persistent connection timeout to be about 5 seconds
> longer than the firefox value. Firefox uses 115 seconds for HTTP/1.x and 180
> seconds for HTTP/2"

I'll post a site compatibility doc for this shortly.
Thanks, this fix works for me!
Just want to mention that you have to tune linux TCP/IP stack for large timeouts as well.
For example, I had to comment out this setting in sysctl.conf:
#net.ipv4.tcp_keepalive_time = 120
I wonder if this is the reason I am getting Secure Connection Failed messages on Wordpress.org when taking awhile (more than 3 minutes always) to post a reply.  Created a post there: https://wordpress.org/support/topic/this-forum-generating-secure-connection-failed-when-post?replies=45

Doesn't happen with 47beta7. Does with 46.0.1.
(In reply to MarkRH from comment #46)
> I wonder if this is the reason I am getting Secure Connection Failed
> messages on Wordpress.org when taking awhile (more than 3 minutes always) to
> post a reply.  Created a post there:
> https://wordpress.org/support/topic/this-forum-generating-secure-connection-
> failed-when-post?replies=45
> 
> Doesn't happen with 47beta7. Does with 46.0.1.

It's bug 1271301 and probably fixed by this current bug fix.
It's pretty absurd you just assume "Oh it's a duplicate" and close the bug without any reply then do snide remarks like "well designed sites are going to be robust to this anyhow".

My bug that I filed, https://bugzilla.mozilla.org/show_bug.cgi?id=1270434 is NOT a timeout issue because it errors out faster than 115s into the request. Which you would know if you'd taken the time to read through the nginx log I posted.
(In reply to henrik from comment #50)
> It's pretty absurd you just assume "Oh it's a duplicate" and close the bug
> without any reply then do snide remarks like "well designed sites are going
> to be robust to this anyhow".
> 
> My bug that I filed, https://bugzilla.mozilla.org/show_bug.cgi?id=1270434 is
> NOT a timeout issue because it errors out faster than 115s into the request.
> Which you would know if you'd taken the time to read through the nginx log I
> posted.

Correction, https://bugzilla.mozilla.org/show_bug.cgi?id=1274606 – I'm getting so much e-mail from all the linked bugs now that I lost track.
(In reply to henrik from comment #51)
> 
> Correction, https://bugzilla.mozilla.org/show_bug.cgi?id=1274606 – I'm
> getting so much e-mail from all the linked bugs now that I lost track.

that bug, especially as logged ff46 and nginx 1.10 and POST, reads exactly like a dup of this bug (which is fixed on 47). The issue isn't dismissed, its simply likely the same one and needs only to be tracked once.

comment 2 in that bug asked you to help confirm that by testing beta-47 which you haven't replied to. We're still interested in that even though the bug was marked dup'd - I apologize if there was confusion over that.
I have replied to it. I've just gotten a bit annoyed at the statement above about well designed sites; please show me what I've done wrong and I'll fix it. It's a damn well designed site if I may say so myself and we've even taken care to use the best-practise SSL guidance, staying up to date on even this **** HTTP2 protocol that has caused 2 days downtime for some browsers for us. So don't say it's not a "well designed site".

I know you sent a comment and excuse me if I'm trying to be the CEO of the company as well as be your testing staff. I answered your query with another one: how do I test FF on Android if it's in beta?
Comment on attachment 8748112 [details] [diff] [review]
bug_1269055.patch

Clearing the m-r flag as 46 will EOL in a week and this fix is already in Beta47.
Attachment #8748112 - Flags: approval-mozilla-release?
Hi henrik. If you would like to test with Firefox for Android (beta) you can download it from the Play Store (search for Firefox Beta). I know this took a while to fix, and some back and forth with site admins. The bugmail can certainly get to be a lot.  If you test and there are still problems, you can let us know and we can reopen the closed bug(s) if there's still work to do.  Thanks for your help, it is appreciated. 

Sounds to me like this bug is wontfix for 46 and fixed in 47.
I have tested this on two different computers with Firefox 47 that came out today. The bug is fixed
You need to log in before you can comment on or make changes to this bug.