Closed Bug 936979 Opened 11 years ago Closed 10 years ago

websocket will never connected after a lot of failure

Product/Component: Core :: Networking: WebSockets
Type: defect
Version: 25 Branch
Platform: x86_64, Linux
Priority: Not set
Severity: major
Status: RESOLVED FIXED
Target Milestone: mozilla29

Tracking Status
firefox25 --- affected
firefox26 --- affected
firefox27 --- affected
firefox28 --- affected

Reporter: fatmck
Assigned: jduell.mcbugs
Attachments: 3 files

First, sorry for my poor English.

My code uses setInterval to check the connection state of a WebSocket (the delay is 3 seconds).
In the setInterval callback, if I find that the WebSocket is not connected, I close it and create a new WebSocket.

To reproduce the bug, shut down the server so that every connection attempt fails.
After a number of failed connections (more than about 8 on my machine; you can just wait for one minute), start the server again: no connection is made.
I used tcpdump to print the TCP packets: after about 8 failures, Firefox sends no more TCP SYN packets. (A TCP SYN means a client is trying to connect to a server.)
Once the bug appears, refreshing the page does not help: connection attempts still fail. You have to wait a long time before a connection succeeds.

Env: Ubuntu 12.04 64-bit + Firefox 25 (also reproducible on Ubuntu 13.04 32-bit)

The attachment contains the following files:
1. test.html: the HTML file running the WebSocket client (it tries to connect to 127.0.0.1 port 1026).
2. tcpdump.txt: the output of tcpdump, in which you will see 8 SYN packets, each followed by a RST packet. Lines marked [S] are TCP SYN packets sent by the client side, which is the Firefox WebSocket. Lines marked [R.] are TCP RST (reset) packets sent by the server machine, which means no server-side application is listening on port 1026.

To confirm this bug you don't even need a server. Just run tcpdump with the following command: sudo tcpdump -ilo tcp and port 1026
This prints any TCP packets on 127.0.0.1:1026.
Then open test.html in Firefox. You will see SYN and RST packets only in the first few seconds (on my machine, 8 SYN packets in 24 seconds), and then nothing. That means Firefox stops making connection attempts after a number of failures. Refreshing the web page does not help: still no connection attempt is made.

The same code runs perfectly on Chromium 30.
Component: General → Networking: WebSockets
Product: Firefox → Core
echo, this is an interesting case and indeed a bug.

Basically, Firefox has some logic to back off our connection rate when there are failed connects; RFC 6455 section 7.2.3 encourages that. After some time goes by, we reduce the backoff.

Your test essentially closes the websocket and starts a new one every 3 seconds. The bug comes into play when your test closes the socket from JavaScript during that backoff timeout: we interpret that as a further failure and back off even more. The process repeats every 3 seconds, and the result is that we never end up with a backoff value of less than 3 seconds, so your test always cancels the attempt before it is made. Deadlock.

The fix appears simple - when we fail to connect during the self-imposed backoff period (probably because js closed the websocket), don't use that as input into extending/increasing the backoff period.
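
The deadlock and the proposed fix can be illustrated with a small simulation. This is only a toy model, not Firefox's code: the function names are hypothetical, and the constants (200 ms initial delay, 1.5x growth, 60 s cap, 60 s expiry) are assumptions chosen for illustration. The client polls every 3 seconds as in the report, and the server comes up after one minute.

```python
from typing import Optional, Tuple

# All constants are illustrative assumptions, not values taken from this bug.
POLL_MS = 3_000           # the test page closes and reopens every 3 seconds
INITIAL_DELAY_MS = 200.0  # assumed first backoff delay
GROWTH = 1.5              # assumed growth factor per counted failure
MAX_DELAY_MS = 60_000.0   # assumed cap on the backoff delay
EXPIRY_MS = 60_000        # assumed time after which a failure record is dropped


def grow(delay: float) -> float:
    """Return the next backoff delay after one more counted failure."""
    return INITIAL_DELAY_MS if delay == 0 else min(delay * GROWTH, MAX_DELAY_MS)


def simulate(server_up_at_ms: int, horizon_ms: int,
             count_cancelled_as_failure: bool) -> Tuple[int, Optional[int]]:
    """Return (number of SYNs sent, time of successful connect or None)."""
    delay = 0.0                          # 0 means "no backoff recorded"
    last_failure: Optional[float] = None
    syn_count = 0
    pending = False                      # an attempt is waiting out its backoff
    for now in range(0, horizon_ms, POLL_MS):
        if pending:
            # The page calls close() on the socket that is still waiting.
            pending = False
            if count_cancelled_as_failure:   # the buggy behaviour
                delay, last_failure = grow(delay), now
        if last_failure is not None and now - last_failure > EXPIRY_MS:
            delay, last_failure = 0.0, None  # backoff record expired
        if delay >= POLL_MS:
            pending = True               # will be cancelled before any SYN
            continue
        attempt_at = now + delay
        syn_count += 1
        if attempt_at >= server_up_at_ms:
            return syn_count, int(attempt_at)
        delay, last_failure = grow(delay), attempt_at  # real failure (RST)
    return syn_count, None


if __name__ == "__main__":
    # Server comes up after one minute; watch for ten minutes.
    print(simulate(60_000, 600_000, count_cancelled_as_failure=True))   # (8, None)
    print(simulate(60_000, 600_000, count_cancelled_as_failure=False))  # (9, 84000)
```

Under these assumptions the buggy variant sends 8 SYNs and then never connects, because every cancelled attempt refreshes and enlarges the backoff. When cancelled attempts are not counted, the failure record eventually expires and the next attempt goes through.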
bug 936979 -  websocket will never connected after a lot of failure r?jduell
Attachment #832318 - Flags: review?(jduell.mcbugs)
Wow, I was so happy to see this patch when I got up this morning. Thank you very much.

So this patch will probably ship with Firefox 26? I am currently using Chromium for development because of this bug.
I think this patch does a more complete fix.

The problem with filtering just on CONNECTING_DELAYED is that we can hit the same JS close() call when we're in CONNECTING_QUEUED (if a 1st websocket is trying to connect, and a second is launched with the same "close after 3 seconds" logic), or in CONNECTING_IN_PROGRESS if the timing is right (we're starting to connect but the timeout/close happens before we're done).  It can even happen in NOT_CONNECTING (AsyncOpen does a DNS lookup: if the timer/close happens before DNS calls OnLookupComplete, we're still in NOT_CONNECTING state).

We can be fairly certain that rv == NS_ERROR_NOT_CONNECTED means JS has called close while mTransport == null (we don't call StopSession with that error code anywhere else), and that captures all of these cases:

   http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/websocket/WebSocketChannel.cpp#2802

Patrick, let me know if you agree.
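
The difference between the two filters discussed in this comment can be sketched as follows. This is a minimal illustration, not the patch itself: the state names and NS_ERROR_NOT_CONNECTED are taken from the comment above, while the function names, the string stand-ins for error codes, and the use of NS_ERROR_CONNECTION_REFUSED as the example of a real network failure are assumptions.

```python
from enum import Enum, auto
from typing import Callable, List


class ConnectState(Enum):
    """Connection states named in the comment above."""
    NOT_CONNECTING = auto()
    CONNECTING_QUEUED = auto()
    CONNECTING_DELAYED = auto()
    CONNECTING_IN_PROGRESS = auto()


# Plain strings stand in for the real error codes (an assumption).
NS_ERROR_NOT_CONNECTED = "NS_ERROR_NOT_CONNECTED"
NS_ERROR_CONNECTION_REFUSED = "NS_ERROR_CONNECTION_REFUSED"

Rule = Callable[[ConnectState, str], bool]


def counts_as_failure_by_state(state: ConnectState, error: str) -> bool:
    """First approach: ignore a stop only while waiting out the backoff delay."""
    return state is not ConnectState.CONNECTING_DELAYED


def counts_as_failure_by_error(state: ConnectState, error: str) -> bool:
    """Second approach: ignore any stop caused by close() before connecting."""
    return error != NS_ERROR_NOT_CONNECTED


def miscounted_close_states(rule: Rule) -> List[str]:
    """States in which a JS close() would still extend the backoff."""
    return [s.name for s in ConnectState if rule(s, NS_ERROR_NOT_CONNECTED)]


if __name__ == "__main__":
    print(miscounted_close_states(counts_as_failure_by_state))
    print(miscounted_close_states(counts_as_failure_by_error))
```

In this sketch the state-based rule still counts a close() as a failure in three of the four states, while the error-code rule counts it in none. Both rules still count a refused connection as a failure.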
Attachment #8358660 - Flags: review?(mcmanus)
Comment on attachment 8358660 [details] [diff] [review]
936979.closedelay.v2

Review of attachment 8358660 [details] [diff] [review]:
-----------------------------------------------------------------

yes; better.
Attachment #8358660 - Flags: review?(mcmanus) → review+
Attachment #832318 - Flags: review?(jduell.mcbugs)
https://hg.mozilla.org/mozilla-central/rev/9ba11d59bf3f
Assignee: nobody → jduell.mcbugs
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
echo, can you please verify that this bug is fixed for you in Firefox 29?
Flags: needinfo?(fatmck)
Just replying to clear the needinfo request.
Flags: needinfo?(fatmck)