Last Comment Bug 782570 - FF14 does not close network connections correctly (no TCP FIN packet sent)
: FF14 does not close network connections correctly (no TCP FIN packet sent)
Status: VERIFIED FIXED
:
Product: Core
Classification: Components
Component: Networking: HTTP (show other bugs)
: 14 Branch
: All All
: -- normal (vote)
: mozilla17
Assigned To: Patrick McManus [:mcmanus]
:
: Patrick McManus [:mcmanus]
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-08-14 01:47 PDT by dpp
Modified: 2012-11-05 06:41 PST (History)
5 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
wontfix
wontfix
wontfix
verified


Attachments
ff14-tcp-fin-bug.zip (1.07 KB, application/octet-stream)
2012-08-14 01:47 PDT, dpp
no flags Details
patch 0 (1.79 KB, patch)
2012-08-17 18:18 PDT, Patrick McManus [:mcmanus]
jduell.mcbugs: review+
Details | Diff | Splinter Review

Description dpp 2012-08-14 01:47:35 PDT
Created attachment 651664 [details]
ff14-tcp-fin-bug.zip

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.57 Safari/537.1

Steps to reproduce:

open connection (XMLHttpRequest or EventSource)
wait for data
close connection
>> TCP FIN packet should be sent <<

to demostrate


Actual results:

Firefox 14
root@dpp-thinkpad /home/dpp/nginx/html> tcpdump -n -i lo 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
12:38:13.268730 IP 127.0.0.1.46503 > 127.0.0.1.80: Flags [P.], seq 49322342:49322678, ack 1914234413, win 256, options [nop,nop,TS val 697211607 ecr 697200172], length 336
12:38:13.308646 IP 127.0.0.1.80 > 127.0.0.1.46503: Flags [.], ack 336, win 256, options [nop,nop,TS val 697211647 ecr 697211607], length 0
12:38:14.271288 IP 127.0.0.1.80 > 127.0.0.1.46503: Flags [P.], seq 1:265, ack 336, win 256, options [nop,nop,TS val 697212609 ecr 697211607], length 264
12:38:14.271314 IP 127.0.0.1.46503 > 127.0.0.1.80: Flags [.], ack 265, win 256, options [nop,nop,TS val 697212609 ecr 697212609], length 0
12:38:15.270952 IP 127.0.0.1.80 > 127.0.0.1.46503: Flags [P.], seq 265:301, ack 336, win 256, options [nop,nop,TS val 697213609 ecr 697212609], length 36
12:38:15.270966 IP 127.0.0.1.46503 > 127.0.0.1.80: Flags [.], ack 301, win 256, options [nop,nop,TS val 697213609 ecr 697213609], length 0
12:38:15.271175 IP 127.0.0.1.46503 > 127.0.0.1.80: Flags [R.], seq 336, ack 301, win 256, options [nop,nop,TS val 697213609 ecr 697213609], length 0
^C
7 packets captured
14 packets received by filter
0 packets dropped by kernel
root@dpp-thinkpad /home/dpp/nginx/html>


Expected results:

Firefox 13
root@dpp-thinkpad /home/dpp/nginx/html> tcpdump -n -i lo 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
12:39:42.232148 IP 127.0.0.1.46548 > 127.0.0.1.80: Flags [P.], seq 4236839640:4236839976, ack 4085474941, win 256, options [nop,nop,TS val 697300570 ecr 697293162], length 336
12:39:42.271645 IP 127.0.0.1.80 > 127.0.0.1.46548: Flags [.], ack 336, win 256, options [nop,nop,TS val 697300610 ecr 697300570], length 0
12:39:43.235369 IP 127.0.0.1.80 > 127.0.0.1.46548: Flags [P.], seq 1:265, ack 336, win 256, options [nop,nop,TS val 697301573 ecr 697300570], length 264
12:39:43.235394 IP 127.0.0.1.46548 > 127.0.0.1.80: Flags [.], ack 265, win 256, options [nop,nop,TS val 697301573 ecr 697301573], length 0
12:39:43.729816 IP 127.0.0.1.46548 > 127.0.0.1.80: Flags [F.], seq 336, ack 265, win 256, options [nop,nop,TS val 697302068 ecr 697301573], length 0
12:39:43.730045 IP 127.0.0.1.80 > 127.0.0.1.46548: Flags [F.], seq 265, ack 337, win 256, options [nop,nop,TS val 697302068 ecr 697302068], length 0
12:39:43.730084 IP 127.0.0.1.46548 > 127.0.0.1.80: Flags [.], ack 266, win 256, options [nop,nop,TS val 697302068 ecr 697302068], length 0
^C
7 packets captured
14 packets received by filter
0 packets dropped by kernel
root@dpp-thinkpad /home/dpp/nginx/html>
Comment 1 Patrick McManus [:mcmanus] 2012-08-14 07:05:36 PDT
Thanks for the report - that's interesting. The issue isn't really the lack of fin, but rather the rst that prempts it, yes?

almost certainly firefox is closing the connection in 14 and finding some unconsumed data on the socket - that generates the resest. two questions: what is the data, and why is ff14 initiating the close instead of the server as in your ff13 trace.

it would help to have a plug and play str (test case) for this as the server data seems to play a role.
Comment 2 dpp 2012-08-14 09:20:40 PDT
This is really about lack of FIN - back-end can't detect that connection is actually closed. RST is a result of back-end writing to closed socket.

In FF13 client is initiating close as you can see in logs above:
12:39:43.729816 IP 127.0.0.1.46548 > 127.0.0.1.80: Flags [F.], seq 336, ack 265, win 256, options [nop,nop,TS val 697302068 ecr 697301573], length 0

As you can see in attached sample of nodejs backend - data is just a ping message once a second. Live example: http://my.dpp.su/badoo/ff14-fin-bug/
This is minimal sample to reproduce the problem we found streaming data to client.

Internally, we have `one connection` limit in application, that's why when client calls close() then open() it stops working in FF14 because back-end discards second connection (not knowing that previous one is actually closed) due to app logic.
Comment 3 dpp 2012-08-15 22:19:06 PDT
related bug https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/1031750 (but this affects all OS)
By the way, FF15beta does not have this bug
Comment 4 Patrick McManus [:mcmanus] 2012-08-16 10:47:25 PDT
xhr is to a never ending chunked response and the js contains this:

setTimeout(function(){t.abort();}, 1500);

in ff13/15/17 it seems to work and generates a normal fin from firefox at that t=~1500. in ff14 that close doesn't happen and when new data comes in from the server at t=2100 a close is triggered but that new data is apparently not fully consumed which makes the rst happen.

q 1: what's up with the timer and is it already really fixed?

q 2: can we drain data off of http sockets before actively closing them to lower the chances of a rst. Note that this is a common, but racy, strategy and if you've got a never ending response stream there is always going to be some chance of a rst.
Comment 5 Patrick McManus [:mcmanus] 2012-08-16 12:15:03 PDT
(In reply to Patrick McManus [:mcmanus] from comment #4)

> q 1: what's up with the timer and is it already really fixed?

Even though this is WFM on 15, I'm glad I looked into it - there's a real bug here.

It is related to the wider use of pipeline datastructures in FF14.

The change from 760608 (included in FF15) means we use those datastructures a little less - your XHR code is one of those cases so you saw the regression disappear. But the bug is still present in the pipeline code - it probably has been there for years - and we still use those data structures in lots of places in >= 15 (yes, even when pipelining is disabled) (just not with xhr or some other similar things). So there is a fix that needs to be made. Thanks!

The issue is basically that a canceled transaction from a pipeline lets the connection hang until activity if that cancelation reduces the pipeline to 0 transactions. (i.e. if it consists only of 1 transaction to start with).

I'll put together a patch.
Comment 6 dpp 2012-08-17 14:51:19 PDT
thanks for the participation!
we'll have to fix it some how on backend - 150mln worldwide instant messagening network is affected...
instead of closing new connections as we do for correct situations we will ping old ones and wait about a second for rst to understand which connection is really alive
Comment 7 Patrick McManus [:mcmanus] 2012-08-17 18:03:48 PDT
https://tbpl.mozilla.org/?tree=Try&rev=d26a4a79c1ac
Comment 8 Patrick McManus [:mcmanus] 2012-08-17 18:18:55 PDT
Created attachment 653004 [details] [diff] [review]
patch 0
Comment 9 Patrick McManus [:mcmanus] 2012-08-21 05:40:01 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/a1d9dc563731
Comment 10 Ryan VanderMeulen [:RyanVM] 2012-08-21 19:10:30 PDT
https://hg.mozilla.org/mozilla-central/rev/a1d9dc563731
Comment 11 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-09-18 15:45:38 PDT
@dpp, can you please confirm this is fixed with the latest Firefox 17 builds from aurora.mozilla.org? Thanks.
Comment 12 dpp 2012-09-20 02:29:03 PDT
yes, it is fixed (i have checked firefox-17.0a2.en-US.linux-x86_64)
Comment 13 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-09-20 09:01:06 PDT
(In reply to dpp from comment #12)
> yes, it is fixed (i have checked firefox-17.0a2.en-US.linux-x86_64)

Thanks a lot for helping with the verification.

Note You need to log in before you can comment on or make changes to this bug.