Closed Bug 1393691 Opened 2 years ago Closed 2 years ago

Firefox retries TLS connection for many times if "client hello" packet was not ACK'ed

Categories

(Core :: Networking: HTTP, defect, P1)

54 Branch
RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: duanyao.ustc, Assigned: dragana)

Details

(Keywords: dev-doc-complete, Whiteboard: [necko-triaged])

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0
Build ID: 20170608105825

Steps to reproduce:

When I access a Stack Overflow page (e.g. https://stackoverflow.com/questions/22276149/game-server-tcp-networking-sockets-fairness ), page loading blocks on a script ( https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js ) for more than 10 minutes. The script eventually fails to load, and the page is shown without it.

I don't know how to set up an HTTPS server that simulates this issue locally, but the client-side network traffic for that script is as follows:

1) client: SYN
2) server: SYN,ACK
3) client: ACK
4) client: TLS Client Hello
5) client: TCP Retransmission
6) client: TCP Retransmission
...(client retransmits the Client Hello many times)
7) server: RST

...(repeat 1-7 for ~15min)

You can download the attached firefox-tls-googleapis-2.pcapng for the full traffic (172.16.0.163 is the client).

Note that the client never receives an ACK from the server for its Client Hello packet; it may be blocked by an intermediate node.
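
A rough local stand-in for this situation, as a minimal Python sketch: a server that completes the TCP handshake and then never sends anything, so the TLS handshake never progresses. This is only an approximation of the capture above, because the local kernel still ACKs the Client Hello at the TCP level, so no TCP retransmissions will appear. The function name `start_silent_server` is made up for illustration.

```python
import socket
import threading


def start_silent_server(host="127.0.0.1", port=0):
    """Accept TCP connections and never send a byte back.

    Hypothetical helper: the TCP handshake completes, but a TLS client
    never receives a Server Hello, so its handshake stalls.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))  # port=0 lets the OS pick a free port
    listener.listen(16)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop():
        while True:
            try:
                conn, _addr = listener.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)  # never read from or write to the connection

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, listener.getsockname()[1]


if __name__ == "__main__":
    _listener, chosen_port = start_silent_server()
    print(f"silent server on https://127.0.0.1:{chosen_port}/")
    threading.Event().wait()  # block until interrupted
```

Pointing a browser at the printed https:// URL should leave the TLS handshake pending until the client gives up.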


Actual results:

Firefox retries the TLS connection many times (for up to 15 min) if the "client hello" packet is not ACK'ed.


Expected results:

Firefox should fail the HTTPS request much faster (e.g. < 60s).
I tested Google Chrome 60 on Linux x86_64: it tried only once and failed the request in ~30s in this case.
Can you reproduce this reliably, or does it occur only from time to time? This may be a TLS 1.3 blocking issue.
Can you change security.tls.version.max to 3 and try?


Maybe we should fail earlier if the TLS handshake cannot be finished.
The value of security.tls.version.max was already 3.

I can reproduce this reliably, at least today. It seems that in my region (Beijing, China) https://ajax.googleapis.com is blocked from time to time, and when that happens, Stack Overflow becomes difficult to load in Firefox.
Assignee: nobody → dd.mozilla
Whiteboard: [necko-active]
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Keeping at P1 in case this is a widespread issue. Let's investigate more ASAP.
Flags: needinfo?(dd.mozilla)
Whiteboard: [necko-active]
We rely on the kernel value in /proc/sys/net/ipv4/tcp_retries2, whose default on my machine is 15, i.e. 15 retransmissions. This is long. We could try to set TCP_USER_TIMEOUT to something lower. At the same time, we should not go too low for requests in general.

We could also leave TCP_USER_TIMEOUT alone in general and instead close connections that take too long from the channel listener when the resource is not absolutely critical. This would affect only resources that block rendering but that the page can still be rendered without (so not images, for example). This would give us a better solution, because necko does not know what is absolutely important and what is not.
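
To put a number on "this is long", here is a small Python sketch of the retransmission arithmetic. It assumes the Linux defaults of a 200 ms minimum and 120 s maximum retransmission timeout with plain doubling in between; real connections derive the timeout from the measured RTT, so treat the result as a lower bound rather than an exact figure. The helper names are illustrative, and `socket.TCP_USER_TIMEOUT` is only available on Linux.

```python
import socket


def retransmission_budget(retries=15, rto_min=0.2, rto_max=120.0):
    """Seconds Linux waits before giving up on an unacknowledged segment.

    Assumes the RTO starts at rto_min and doubles after every timeout,
    capped at rto_max.
    """
    total, rto = 0.0, rto_min
    for _ in range(retries + 1):  # the initial wait plus one wait per retransmission
        total += rto
        rto = min(rto * 2, rto_max)
    return total


def set_user_timeout(sock, milliseconds):
    """Cap how long unacknowledged data may stay in flight on this socket."""
    if not hasattr(socket, "TCP_USER_TIMEOUT"):
        raise OSError("TCP_USER_TIMEOUT is not available on this platform")
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, milliseconds)


print(round(retransmission_budget(), 1))  # 924.6
```

With tcp_retries2 = 15 this comes to about 924.6 seconds, which is in line with the ~15 minutes reported above.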
Flags: needinfo?(dd.mozilla)
I will move this to dom to ask what they think.
Assignee: dd.mozilla → nobody
Component: Networking → DOM
Hmm, not sure ... Boris?
Flags: needinfo?(bzbarsky)
We can certainly communicate more information about whether requests block or not to necko.

It would be a really good idea to actually standardize this stuff, though.  Not sure what the right venue for that is.
Flags: needinfo?(mcmanus)
Flags: needinfo?(bzbarsky)
Flags: needinfo?(annevk)
dragana, I don't think the issue is the kernel-level TCP params, as that actually succeeds in this case. And in any event:

we do truncate connections before the OS does. That's governed by network.http.connection-timeout (90 seconds).

It makes sense to apply this to the TLS handshake as well as the TCP layer. And if Chrome is using 30 seconds, it's OK to align with that (when we did this, Chrome just had the OS timeouts), as it shouldn't cause breakage beyond what they are seeing.
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #10)
> dragana, I don't think the issue is the kernel level tcp params as that
> actually succeeds in this case.. and in any event -
> 
> we do truncate connections before the OS does.. that's governed by
> network.http.connection-timeout (90 seconds)
> 
> It makes sense to apply this to the tls handshake as well as the tcp layer.
> And if chrome is using 30 seconds its ok to align with that (When we did
> this, chrome just had the OS timeouts) as it shouldn't cause breakage beyond
> what they are seeing.

I forgot about that.

I will take the bug.
Assignee: nobody → dd.mozilla
Status: UNCONFIRMED → ASSIGNED
Component: DOM → Networking: HTTP
Ever confirmed: true
Flags: needinfo?(annevk)
Whiteboard: [necko-triaged]
Chrome times out the TCP handshake after about 90s and the TLS handshake after 30s.

This patch adds a timeout for the TLS handshake.
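
For readers who want to see the idea outside the browser, here is a minimal Python sketch of giving the TCP connect and the TLS handshake separate deadlines. It is not the attached patch (that lives in Firefox's C++ networking code); the function name `open_tls` is made up, and the 90 s and 30 s defaults simply reuse the values quoted in this bug.

```python
import socket
import ssl


def open_tls(host, port, connect_timeout=90.0, handshake_timeout=30.0):
    """Return a TLS socket, or raise TimeoutError naming the stalled phase."""
    try:
        # Phase 1: TCP connection setup, bounded by connect_timeout.
        raw = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout as exc:
        raise TimeoutError("TCP connect timed out") from exc

    context = ssl.create_default_context()
    # Phase 2: the handshake runs inside wrap_socket and inherits this timeout.
    raw.settimeout(handshake_timeout)
    try:
        return context.wrap_socket(raw, server_hostname=host)
    except socket.timeout as exc:
        raw.close()
        raise TimeoutError("TLS handshake timed out") from exc
```

Against a peer that accepts the TCP connection but never answers the Client Hello, the second TimeoutError should fire once handshake_timeout has elapsed, instead of waiting for the OS to give up.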
Attachment #8916533 - Flags: review?(mcmanus)
I guess we don't really want to standardize timeout values, but it would be good to have them documented for developers.
Keywords: dev-doc-needed
Comment on attachment 8916533 [details] [diff] [review]
bug_1393691_v1.patch

Review of attachment 8916533 [details] [diff] [review]:
-----------------------------------------------------------------

thanks
Attachment #8916533 - Flags: review?(mcmanus) → review+
Pushed by dd.mozilla@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/694d1d7837b5
timeout connection if tls takes too long. r=mcmanus
https://hg.mozilla.org/mozilla-central/rev/694d1d7837b5
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
(In reply to Anne (:annevk) from comment #13)
> I guess we don't really want to standardize timeout values, but it would be
> good to have them documented for developers.

I've had a go at documenting this, although I'm not 100% sure this is what you want:

https://developer.mozilla.org/en-US/docs/Web/Security/Transport_Layer_Security#TLS_handshake_timeout_values

How best can I work out the timeout values for other browsers? What else do you think is needed?

I also added a note to the Fx58 rel notes:

https://developer.mozilla.org/en-US/Firefox/Releases/58#HTTP
Dragana can probably say how to test this, but note that comment 12 does give you the values for Chrome already.
(In reply to Chris Mills (Mozilla, MDN editor) [:cmills] from comment #17)
> (In reply to Anne (:annevk) from comment #13)
> > I guess we don't really want to standardize timeout values, but it would be
> > good to have them documented for developers.
> 
> I've had a go at documenting this, although I'm not 100% sure this is what
> you want:
> 
> https://developer.mozilla.org/en-US/docs/Web/Security/
> Transport_Layer_Security#TLS_handshake_timeout_values
> 
> How best can I work out the timeout values for other browsers? What else do
> you think is needed?
> 


A connection setup timeout can be triggered by trying to connect to a non-existent IP address. Just type a random IP address; you will find a non-existent one easily.

For the TLS handshake timeout, I needed to block packets that do not have the SYN flag. So I set up a server on a Linux host and used iptables to drop packets that do not contain the SYN flag, then used https:// to connect to the host. That way TCP establishes a connection, but the TLS ClientHello is dropped.