TURN/TCP with hostnames doesn't work on Linux

RESOLVED FIXED in Firefox 49

Status

()

P1
normal
Rank:
19
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: fippo, Assigned: drno)

Tracking

({regression})

47 Branch
mozilla51
x86_64
Linux
regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox48 wontfix, firefox49+ fixed, firefox50 fixed, firefox51 fixed)

Details

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

2 years ago
str
Created attachment 8775882 [details]
about-webrtc

TURN/TCP on Linux doesn't work for me when I am using it with a TURN server specified as a URL. It works when using an ip address instead of a hostname.

To reproduce (easier on talky.io than on appear.in but I can reproduce it regardless of turn server used):
1. go to https://talky.io
2. grab turn credentials by pasting this into the console:
  var ice = app.xmpp.jingle.config.peerConnectionConfig.iceServers[2]
   Verify that ice.url looks is a turn/tcp server with an ip address
  (ya, I need to get this to move to .urls)
3. gather candidates by pasting this into the console:
  var pc = new RTCPeerConnection({iceServers: [ice]}); pc.onicecandidate = function(event) { console.log(event.candidate ? event.candidate.candidate : null); }; pc.createOffer({offerToReceiveAudio: 1}).then((offer) => pc.setLocalDescription(offer));
4. watch for relay candidates. This should work.
5. set
  ice.url = 'turn:turn1.talky.io:80?transport=tcp
6. repeat (3) and (4). Result: no relay candidate.

The about:webrtc log is attached. As well a a pcap which shows that Firefox connects to the TURN server. But nothing is ever sent

Looking at a limited dataset from the appear.in platform metrics there seems to be a pattern on linux:
   select count(*), browsertype, browseros, gatheredturntcp from features_permanent
   where (iceconnectedorcompleted = 't' or icefailure = 't') and browsertype = 'moz' group by browsertype, browseros, gatheredturntcp order by browseros, gatheredturntcp, count desc;
count | browsertype |             browseros              | gatheredturntcp 
-------+-------------+------------------------------------+-----------------
   138 | moz         | Fedora 64-bit                      | f
     9 | moz         | Fedora 64-bit                      | t
   327 | moz         | Linux 64-bit                       | f
   108 | moz         | Linux 64-bit                       | t
     4 | moz         | Linux i686                         | f
     9 | moz         | Linux i686                         | t
    74 | moz         | Ubuntu                             | f
    36 | moz         | Ubuntu                             | t
  1963 | moz         | Ubuntu 64-bit                      | f
   165 | moz         | Ubuntu 64-bit                      | t

The number of failures to gather a turn/tcp candidate is much higher than the number of successes (apart from the 'Linux i686 platform but that is not statistically relevant). And it works _sometimes_ which suggests this is some weird race condition.

Impact is low, this only affects firefox users that absolutely require a turn/tcp connection. Working around this by handing out ip addresses instead of hostnames is possible but that is not how we deploy turn servers using dns+geolocation
(Reporter)

Comment 1

2 years ago
Created attachment 8775883 [details]
pcap showing the tcp connection
nils, can you set the priority on this?  thanks
Flags: needinfo?(drno)

Comment 3

2 years ago
I've had no luck getting this to fail on my linux box. I wonder what happens if we try in non-e10s mode.

Comment 4

2 years ago
I also wonder if maybe bug 1287874 is showing up on linux here. Seems like that could be causing a premature timeout.
(Assignee)

Updated

2 years ago
Rank: 19
Flags: needinfo?(drno)
Priority: -- → P1
(Assignee)

Comment 5

2 years ago
Created attachment 8776291 [details]
Bug 1290365: added extensive logging for TURN TCP

Review commit: https://reviewboard.mozilla.org/r/68152/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/68152/
(Assignee)

Comment 6

2 years ago
I don't have easy access to a Linux machine myself until Monday. I'm suspicious if whats going on here is that we for some unknown reason cancel the read and/or write callbacks on the TCP sockets. All the code depends on these to proceed in case of TCP.
Fippo: I kicked off a try build so that you can download a build with the increased logging I added. I'll give it a try on Monday in the office myself.
(Reporter)

Comment 7

2 years ago
that build doesn't have the problem anymore which suggests this may have been fixed by bug 1287874!

\o/

Comment 8

2 years ago
Seems like a low-risk uplift to me, but let's verify that this was indeed the fix.
(Assignee)

Comment 9

2 years ago
I can repro with 47 Ubuntu build, but not with my local 50 Nightly build (which is from way before bug 1287874 landed).
I'll try to track down what makes this reproducible...
Assignee: nobody → drno
(Assignee)

Comment 10

2 years ago
Has nothing to do with the versions. The problem exists only in non-e10s mode. As Nightly and Aurora have e10s enabled by default the problem appears to be gone, until you turn e10s off.
(Assignee)

Comment 11

2 years ago
So I just build locally with the fix for 1287874 and the extra logging, and as soon as I turn off e10s the problem is still reproducible.
(Assignee)

Comment 12

2 years ago
So we get PR_POLL_HUP events on these TCP sockets (and then declare these sockets internally as dead http://searchfox.org/mozilla-central/source/media/﷒0﷓ )
(Assignee)

Comment 13

2 years ago
Note to myself: I'm pretty sure ICE TCP sockets suffer from the same problem.
(Assignee)

Updated

2 years ago
See Also: → bug 1291176
(Assignee)

Updated

2 years ago
backlog: --- → webrtc/webaudio+
Depends on: 1224845
Keywords: regression
(Assignee)

Comment 14

2 years ago
(In reply to Nils Ohlmeier [:drno] from comment #13)
> Note to myself: I'm pretty sure ICE TCP sockets suffer from the same problem.

Verified that our ICE TCP implementation actually handles this correctly.
(Assignee)

Comment 15

2 years ago
Created attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.

Review commit: https://reviewboard.mozilla.org/r/68750/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/68750/
Attachment #8777134 - Flags: review?(docfaraday)
(Assignee)

Updated

2 years ago
Attachment #8776291 - Attachment is obsolete: true

Comment 16

2 years ago
Comment on attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.

https://reviewboard.mozilla.org/r/68750/#review65764

Looks fine to me.
Attachment #8777134 - Flags: review?(docfaraday) → review+
(Assignee)

Comment 17

2 years ago
Created attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

The idea here is that the patch in mozreview is the proper fix which should go into mozilla-central and ride the trains. But this is the way easier work-around to avoid hitting the HUP problem which can be uplifted, until the real fix hits release.
Attachment #8777138 - Flags: review?(docfaraday)

Comment 18

2 years ago
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------

I'm confused. Does ignoring HUP on the socket allow it to work later?
(Assignee)

Comment 19

2 years ago
(In reply to Byron Campen [:bwc] from comment #18)
> Comment on attachment 8777138 [details] [diff] [review]
> bug1290365_beta.patch
> 
> Review of attachment 8777138 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm confused. Does ignoring HUP on the socket allow it to work later?

Yes it does. What happens is that in when we create a UDP socket we want to start listening on it right away. And so far we do the same with TCP socket. But technically you can't listen on a TCP socket which is not connected or you are listening on.
So in case of a DNS name for a TURN server we create the socket and include it immediately in the poll set for read, but can't connect as we don't have an IP address. And so the Linux kernel return the HUP to "inform" us that we are doing something which is not legit.
Before bug 1224845 landed we simply ignored the HUP. And if we do then NSPR still connects the socket once we make the connect() call with the IP address from DNS.

But I don't think that ignoring HUP should be the long term solution, as it re-opens the door for bug 1224845. And bug 1291176 shows that we are still having paths in our code which result in HUP signals.

My thinking here is that changing the way he handle incoming data and the size of the patch don't make the patch in mozreview a good candidate for uplift.
The second patch in attachment 8777138 [details] [diff] [review] is a simple workaround, with known side-effects, which I feel comfortable to uplift to DE and Beta. I'm open to discuss uplifting the real patch in mozreview to DE if anyone feels that is the better solution.

Comment 20

2 years ago
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, looks good.
Attachment #8777138 - Flags: review?(docfaraday) → review+
(Assignee)

Comment 21

2 years ago
Comment on attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/68750/diff/1-2/

Comment 22

2 years ago
Pushed by drno@ohlmeier.org:
https://hg.mozilla.org/integration/autoland/rev/0f1333d8cf91
add TURN TCP socket to read poll after connect. r=bwc
(Assignee)

Comment 23

2 years ago
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Approval Request Comment
[Feature/regressing bug #]: The patch from bug 1224845 (landed in 47) added error handling which exposed the underlying problem which existed for a much longer time.

[User impact if declined]: Depending on how the WebRTC calling service is configured (DNS vs. IP address) and in which kind of network the Firefox user is this can result in no call getting established (where a Firefox <= 46 would have worked under the same conditions).

[Describe test coverage new/current, TreeHerder]: Only manual testing as apparently the delay introduce by a real DNS server is only triggering this bug.

[Risks and why]: The risk for new problems is pretty small as it only restores the behavior from FF <= 46. With this patch applied FF will use more CPU while establishing the call. The patch which solves both problems just landed in 51 (but the risk of uplifting that would be a lot higher, then reverting to the old, known behavior).

[String/UUID change made/needed]: N/A
Attachment #8777138 - Flags: approval-mozilla-beta?
Attachment #8777138 - Flags: approval-mozilla-aurora?

Comment 24

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/0f1333d8cf91
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox51: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla51

Updated

2 years ago
status-firefox49: --- → affected
status-firefox50: --- → affected
(Assignee)

Updated

2 years ago
status-firefox48: --- → wontfix
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------

This patch fixes a regression for WebRTC. Take it in 49 beta and aurora.
Attachment #8777138 - Flags: approval-mozilla-beta?
Attachment #8777138 - Flags: approval-mozilla-beta+
Attachment #8777138 - Flags: approval-mozilla-aurora?
Attachment #8777138 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/dafd5dce166f
https://hg.mozilla.org/releases/mozilla-beta/rev/60ae30593a0b

Backed out from release branches for permacrashing (mda) tests like https://treeherder.mozilla.org/logviewer.html#?job_id=1430555&repo=mozilla-beta
status-firefox49: fixed → affected
status-firefox50: fixed → affected
Flags: needinfo?(drno)
(Assignee)

Comment 28

2 years ago
I overlooked in the logs why these test are actually failing:
  Assertion `!sock->connected' failed.
The fix for that landed in bug 1293206 already.

I'll start test runs on Aurora to see if that fixes the problem.
Flags: needinfo?(drno)
(Assignee)

Comment 29

2 years ago
Confirmed. The fix from bug 1293206 is needed to get the tests working for the either of the two patches in this bug report.
Depends on: 1293206
(Assignee)

Comment 31

2 years ago
Created attachment 8780992 [details] [diff] [review]
bug1290365_aurora.patch

This patch combines the POLL_HUP change for this bug, with the connected write cancel from bug 1293206.
Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1237290b083

Although I guess the cleaner solution here would be to request uplift in bug 1293206 and then re-land the patches in here.
Looks like Nils landed this but forgot to update the bug.
https://hg.mozilla.org/releases/mozilla-aurora/rev/fdc2f1d67e3977e4fbc79d0acbb32259eca0f524
status-firefox50: affected → fixed
Nils, do you have a new patch for beta now that we are uplifting the patch in bug 1293206? Thanks!
tracking-firefox49: --- → +
Flags: needinfo?(drno)

Comment 34

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/36cdcd06bf6f
status-firefox49: affected → fixed
(Assignee)

Comment 35

2 years ago
:RyanVM did already the right thing. Thanks!
Flags: needinfo?(drno)
(Assignee)

Updated

2 years ago
Blocks: 1296475
You need to log in before you can comment on or make changes to this bug.