Closed Bug 1816258 Opened 1 year ago Closed 7 months ago

Websocket opening takes ages after repeated failures to same address

Categories

(Core :: Networking: WebSockets, defect, P2)

Firefox 110
defect

Tracking

RESOLVED FIXED
122 Branch
Tracking Status
firefox122 --- fixed

People

(Reporter: waclaw66, Assigned: acreskey)

Details

(Whiteboard: [necko-triaged][necko-priority-queue], [wptsync upstream])

Attachments

(8 files)

Attached file server.py

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0

Steps to reproduce:

I'm having an issue with WebSocket opening on a certain domain: it takes a long time (~40 s).
I've created a demo client and server project (attached) to debug it.
The JavaScript client is available at:
https://pong.bolesiny.net/
https://pong.waclaw.cz/
Check the console; there is no visible HTML output.

Actual results:

When Firefox starts, the WebSocket opens immediately. After a few minutes of browsing with about 30 tabs open, opening that WebSocket in the client demo takes a long time. Once the WebSocket has been opened, reopening it is fast again. After a few minutes of inactivity, opening it takes ages again. The weird thing is that this happens only on one particular domain, bolesiny.net. I've tried other domains (e.g. waclaw.cz) reverse-proxied to the same server and they work without any problem. All of these domains use Let's Encrypt certificates generated the same way.
You can check the profile of the slow WebSocket opening at https://share.firefox.dev/3HM8Ps0; it may help trace the problem.

I've tried disabling the antivirus; it didn't help. It works fine with a clean Firefox profile, and in Chrome as well. It depends on a specific Firefox state, probably a high number of tabs or WebSockets on a particular domain.

Expected results:

WebSockets should open quickly on any domain.

The Bugbug bot thinks this bug should belong to the 'Core::Security: PSM' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Security: PSM
Product: Firefox → Core
Component: Security: PSM → Networking: WebSockets

I think I've found the cause and know how to reproduce it. It's caused by many unsuccessful WebSocket opening attempts.
One of the (pinned) tabs mentioned above is Grafana, which uses a WebSocket for live updates. I had unintentionally left WebSocket upgrade disabled in the nginx reverse proxy for its subdomain. Grafana tries to connect to the WebSocket every 15 s; after a few minutes, connecting to WebSockets on that domain and all its other subdomains takes ages (20-100 s).
Once I enabled WebSocket upgrade for the Grafana subdomain, WebSockets for the whole domain started working flawlessly.
This seems to be an issue in Firefox. Does Firefox somehow recycle WebSockets for a domain, or is there a WebSocket pool limit?

I have prepared a simple demo that demonstrates the problem:
https://pong.bolesiny.net/proof/
I tried Firefox 110b9 Developer Edition and Nightly 111a1; the second connection always took >30 s in my case. Tested on different machines, always the same problem.
Chrome has no problem.

I've tested Firefox with mozregression as far back as I could (limited by old TLS capabilities), and it looks like Firefox has had this problem since the beginning.

Attached image ChromeFailingWebSocket

Hi Sunil, can you filter for "connected" and post all connection times?

Attached image ChromeConnectionTimes

I was able to reproduce this locally using this link https://pong.bolesiny.net/proof/.

Apart from the huge difference in connection time compared to Chrome, I also noticed that Firefox connects only once, whereas Chrome reconnects after a few failures.

Whiteboard: [necko-triaged][necko-priority-review]

That's weird. I've tested it on three different PCs with different internet access, on Windows 10 and Fedora 37, and it's always the same :/
Although you are on macOS, right?

Attached image ws_win10.jpg
Attached image ws_fedora37.jpg
Severity: -- → S3
Priority: -- → P2

(In reply to Václav Nováček from comment #10)

That's weird. I've tested it on three different PCs with different internet access, on Windows 10 and Fedora 37, and it's always the same :/
Although you are on macOS, right?

That's correct!

Comment on attachment 9317445 [details]
ChromeFailingWebSocket

Oh wait, now I see: you tested Chrome :D That's why I was surprised that it's working.

(In reply to Sunil Mayya from comment #9)

I was able to reproduce this locally using this link https://pong.bolesiny.net/proof/.

Apart from the huge difference in connection time compared to Chrome, I also noticed that Firefox connects only once, whereas Chrome reconnects after a few failures.

Please check the JavaScript behind it; you may have misunderstood the demo.
There are two kinds of connections. The first connects every 60 s to the proper WebSocket endpoint. The second tries to connect to a false WebSocket endpoint, so it fails. The failures of the second connection cause the huge reconnection times of the first connection.

I'll have a look at whether we're implementing the spec correctly.

Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-next]
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-queue]
Assignee: nobody → acreskey

This is interesting; thanks for logging it, Václav.

It looks like we keep track of WebSocket connect failures by address and port.

In this test we attempt to connect to two URLs, both at the same address and port:

wss://pong.bolesiny.net/ws (valid)
wss://pong.bolesiny.net/  (invalid)

We repeatedly attempt the invalid URL, which fails and thus advances the exponential backoff for that address/port.

From RFC 6455: "clients SHOULD use some form of backoff when trying to reconnect after abnormal closures as described in this section."

But if we keyed the errors by the full URL and port, it seems that would prevent this seemingly odd behaviour and still minimize reconnects after failure.

Note that the test provided in this bug works gracefully in both Chrome (as noted) and in Safari (i.e. connections to the valid WebSocket URL are not delayed).
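
To make the difference between the two keying schemes concrete, here is a minimal Python sketch. It is not Firefox code: the `BackoffTable` class, the `key_by_address` / `key_by_url` helpers, and all delay constants are hypothetical and chosen only for illustration. It models repeated failures to the invalid path `/` and then asks what delay a connection to the valid path `/ws` would inherit.

```python
from dataclasses import dataclass, field


@dataclass
class BackoffTable:
    """Tracks connect failures per key and the delay for the next attempt.

    All constants are illustrative assumptions, not Firefox's real values.
    """
    initial_delay: float = 0.2  # seconds (hypothetical)
    multiplier: float = 1.5     # growth factor per failure (hypothetical)
    max_delay: float = 60.0     # cap in seconds (hypothetical)
    delays: dict = field(default_factory=dict)

    def record_failure(self, key):
        # First failure starts the backoff; later failures grow it up to the cap.
        current = self.delays.get(key)
        if current is None:
            self.delays[key] = self.initial_delay
        else:
            self.delays[key] = min(self.max_delay, current * self.multiplier)

    def delay_for(self, key):
        # Keys with no recorded failure connect without delay.
        return self.delays.get(key, 0.0)


def key_by_address(host, port, path):
    """Key failures by host and port only; the path is ignored."""
    return (host, port)


def key_by_url(host, port, path):
    """Key failures by host, port, and path."""
    return (host, port, path)


def simulate(key_fn, failures=20):
    """Fail `failures` times on "/" and return the delay applied to "/ws"."""
    table = BackoffTable()
    for _ in range(failures):
        table.record_failure(key_fn("pong.bolesiny.net", 443, "/"))
    return table.delay_for(key_fn("pong.bolesiny.net", 443, "/ws"))


if __name__ == "__main__":
    print("keyed by address/port:", simulate(key_by_address))
    print("keyed by full URL:    ", simulate(key_by_url))
```

Under these assumed constants, keying by address and port makes the valid `/ws` connection inherit the backoff accumulated by the failing `/` attempts, while keying by the full URL leaves it undelayed.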

Summary: Websocket opening takes ages → Websocket opening takes ages after repeated failures to same address

Hmm, although we key WebSocket connections off the address, so implementation-wise, keying failures off the full URL would introduce some complexity.
Valentin -- thoughts?

Although our implementation follows the intention of rfc6455#section-4.1, we can handle the described situation more gracefully by prioritizing new WS connections to paths that have not previously failed.

We can discuss in the patch, but what's happening is that we need to serialize attempts to connect to a given host/port pair.
https://datatracker.ietf.org/doc/html/rfc6455#section-4.1
And the repeated failed attempts fill up the queue to that host.
So in this implementation I allow WS connections to a path that has not failed yet to get priority in the queue.
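
The prioritization idea can be sketched roughly as follows. This is a simplified Python model and not the actual patch: the `HostQueue` class and its methods are hypothetical names, and it assumes one serialized queue per host/port pair in which a path with no recorded failure is placed ahead of queued attempts to paths that have already failed.

```python
from collections import deque


class HostQueue:
    """Serialized connect queue for one host/port pair (hypothetical model)."""

    def __init__(self):
        self.pending = deque()     # paths waiting to connect, in order
        self.failed_paths = set()  # paths that have failed at least once

    def record_failure(self, path):
        self.failed_paths.add(path)

    def enqueue(self, path):
        # Paths that already failed simply wait at the back of the queue.
        if path in self.failed_paths:
            self.pending.append(path)
            return
        # A path with no failures goes ahead of the first queued failed path,
        # which keeps FIFO order among the paths that have not failed.
        for index, queued in enumerate(self.pending):
            if queued in self.failed_paths:
                self.pending.insert(index, path)
                return
        self.pending.append(path)

    def next_to_connect(self):
        # Returns None when nothing is waiting.
        return self.pending.popleft() if self.pending else None


if __name__ == "__main__":
    queue = HostQueue()
    queue.record_failure("/")
    queue.enqueue("/")    # retry of the failing path
    queue.enqueue("/")    # another retry
    queue.enqueue("/ws")  # fresh path, should be served first
    print(queue.next_to_connect())
```

In this model the retries to `/` still wait their turn, so backoff toward the failing endpoint is preserved, but they no longer block the connection to `/ws`.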

Pushed by acreskey@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dee1e05a2cca
Websocket opening takes ages after repeated failures to same address r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/27bfc748bec2
Add wpt tests for Websocket repeated failures to same address r=necko-reviewers,kershaw
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/43181 for changes under testing/web-platform/tests
Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged][necko-priority-queue], [wptsync upstream]
Regressions: 1864922
Upstream PR was closed without merging
Pushed by acreskey@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8e259edabb62
Websocket opening takes ages after repeated failures to same address r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/872aa271adfd
Add wpt tests for Websocket repeated failures to same address r=necko-reviewers,kershaw
Flags: needinfo?(acreskey)
Upstream PR was closed without merging
Pushed by acreskey@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/36f007cf29d0
Websocket opening takes ages after repeated failures to same address r=necko-reviewers,valentin
https://hg.mozilla.org/integration/autoland/rev/180501fca227
Add wpt tests for Websocket repeated failures to same address r=necko-reviewers,kershaw
Status: UNCONFIRMED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → 122 Branch
Upstream PR merged by moz-wptsync-bot
Flags: needinfo?(acreskey)