TURN/TCP with hostnames doesn't work on Linux

RESOLVED FIXED in Firefox 49

Status

()

defect
P1
normal
Rank:
19
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: fippo, Assigned: drno)

Tracking

({regression})

47 Branch
mozilla51
x86_64
Linux
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox48 wontfix, firefox49+ fixed, firefox50 fixed, firefox51 fixed)

Details

Attachments

(5 attachments, 1 obsolete attachment)

Posted file about-webrtc
TURN/TCP on Linux doesn't work for me when I am using it with a TURN server specified as a URL. It works when using an ip address instead of a hostname.

To reproduce (easier on talky.io than on appear.in but I can reproduce it regardless of turn server used):
1. go to https://talky.io
2. grab turn credentials by pasting this into the console:
  var ice = app.xmpp.jingle.config.peerConnectionConfig.iceServers[2]
   Verify that ice.url looks is a turn/tcp server with an ip address
  (ya, I need to get this to move to .urls)
3. gather candidates by pasting this into the console:
  var pc = new RTCPeerConnection({iceServers: [ice]}); pc.onicecandidate = function(event) { console.log(event.candidate ? event.candidate.candidate : null); }; pc.createOffer({offerToReceiveAudio: 1}).then((offer) => pc.setLocalDescription(offer));
4. watch for relay candidates. This should work.
5. set
  ice.url = 'turn:turn1.talky.io:80?transport=tcp
6. repeat (3) and (4). Result: no relay candidate.

The about:webrtc log is attached. As well a a pcap which shows that Firefox connects to the TURN server. But nothing is ever sent

Looking at a limited dataset from the appear.in platform metrics there seems to be a pattern on linux:
   select count(*), browsertype, browseros, gatheredturntcp from features_permanent
   where (iceconnectedorcompleted = 't' or icefailure = 't') and browsertype = 'moz' group by browsertype, browseros, gatheredturntcp order by browseros, gatheredturntcp, count desc;
count | browsertype |             browseros              | gatheredturntcp 
-------+-------------+------------------------------------+-----------------
   138 | moz         | Fedora 64-bit                      | f
     9 | moz         | Fedora 64-bit                      | t
   327 | moz         | Linux 64-bit                       | f
   108 | moz         | Linux 64-bit                       | t
     4 | moz         | Linux i686                         | f
     9 | moz         | Linux i686                         | t
    74 | moz         | Ubuntu                             | f
    36 | moz         | Ubuntu                             | t
  1963 | moz         | Ubuntu 64-bit                      | f
   165 | moz         | Ubuntu 64-bit                      | t

The number of failures to gather a turn/tcp candidate is much higher than the number of successes (apart from the 'Linux i686 platform but that is not statistically relevant). And it works _sometimes_ which suggests this is some weird race condition.

Impact is low, this only affects firefox users that absolutely require a turn/tcp connection. Working around this by handing out ip addresses instead of hostnames is possible but that is not how we deploy turn servers using dns+geolocation
nils, can you set the priority on this?  thanks
Flags: needinfo?(drno)
I've had no luck getting this to fail on my linux box. I wonder what happens if we try in non-e10s mode.
I also wonder if maybe bug 1287874 is showing up on linux here. Seems like that could be causing a premature timeout.
Rank: 19
Flags: needinfo?(drno)
Priority: -- → P1
I don't have easy access to a Linux machine myself until Monday. I'm suspicious if whats going on here is that we for some unknown reason cancel the read and/or write callbacks on the TCP sockets. All the code depends on these to proceed in case of TCP.
Fippo: I kicked off a try build so that you can download a build with the increased logging I added. I'll give it a try on Monday in the office myself.
that build doesn't have the problem anymore which suggests this may have been fixed by bug 1287874!

\o/
Seems like a low-risk uplift to me, but let's verify that this was indeed the fix.
I can repro with 47 Ubuntu build, but not with my local 50 Nightly build (which is from way before bug 1287874 landed).
I'll try to track down what makes this reproducible...
Assignee: nobody → drno
Has nothing to do with the versions. The problem exists only in non-e10s mode. As Nightly and Aurora have e10s enabled by default the problem appears to be gone, until you turn e10s off.
So I just build locally with the fix for 1287874 and the extra logging, and as soon as I turn off e10s the problem is still reproducible.
So we get PR_POLL_HUP events on these TCP sockets (and then declare these sockets internally as dead http://searchfox.org/mozilla-central/source/media/mtransport/nr_socket_prsock.cpp#367 )
Note to myself: I'm pretty sure ICE TCP sockets suffer from the same problem.
See Also: → 1291176
backlog: --- → webrtc/webaudio+
Depends on: 1224845
Keywords: regression
(In reply to Nils Ohlmeier [:drno] from comment #13)
> Note to myself: I'm pretty sure ICE TCP sockets suffer from the same problem.

Verified that our ICE TCP implementation actually handles this correctly.
Attachment #8776291 - Attachment is obsolete: true
Comment on attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.

https://reviewboard.mozilla.org/r/68750/#review65764

Looks fine to me.
Attachment #8777134 - Flags: review?(docfaraday) → review+
The idea here is that the patch in mozreview is the proper fix which should go into mozilla-central and ride the trains. But this is the way easier work-around to avoid hitting the HUP problem which can be uplifted, until the real fix hits release.
Attachment #8777138 - Flags: review?(docfaraday)
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------

I'm confused. Does ignoring HUP on the socket allow it to work later?
(In reply to Byron Campen [:bwc] from comment #18)
> Comment on attachment 8777138 [details] [diff] [review]
> bug1290365_beta.patch
> 
> Review of attachment 8777138 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm confused. Does ignoring HUP on the socket allow it to work later?

Yes it does. What happens is that in when we create a UDP socket we want to start listening on it right away. And so far we do the same with TCP socket. But technically you can't listen on a TCP socket which is not connected or you are listening on.
So in case of a DNS name for a TURN server we create the socket and include it immediately in the poll set for read, but can't connect as we don't have an IP address. And so the Linux kernel return the HUP to "inform" us that we are doing something which is not legit.
Before bug 1224845 landed we simply ignored the HUP. And if we do then NSPR still connects the socket once we make the connect() call with the IP address from DNS.

But I don't think that ignoring HUP should be the long term solution, as it re-opens the door for bug 1224845. And bug 1291176 shows that we are still having paths in our code which result in HUP signals.

My thinking here is that changing the way he handle incoming data and the size of the patch don't make the patch in mozreview a good candidate for uplift.
The second patch in attachment 8777138 [details] [diff] [review] is a simple workaround, with known side-effects, which I feel comfortable to uplift to DE and Beta. I'm open to discuss uplifting the real patch in mozreview to DE if anyone feels that is the better solution.
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, looks good.
Attachment #8777138 - Flags: review?(docfaraday) → review+
Comment on attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/68750/diff/1-2/
Pushed by drno@ohlmeier.org:
https://hg.mozilla.org/integration/autoland/rev/0f1333d8cf91
add TURN TCP socket to read poll after connect. r=bwc
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Approval Request Comment
[Feature/regressing bug #]: The patch from bug 1224845 (landed in 47) added error handling which exposed the underlying problem which existed for a much longer time.

[User impact if declined]: Depending on how the WebRTC calling service is configured (DNS vs. IP address) and in which kind of network the Firefox user is this can result in no call getting established (where a Firefox <= 46 would have worked under the same conditions).

[Describe test coverage new/current, TreeHerder]: Only manual testing as apparently the delay introduce by a real DNS server is only triggering this bug.

[Risks and why]: The risk for new problems is pretty small as it only restores the behavior from FF <= 46. With this patch applied FF will use more CPU while establishing the call. The patch which solves both problems just landed in 51 (but the risk of uplifting that would be a lot higher, then reverting to the old, known behavior).

[String/UUID change made/needed]: N/A
Attachment #8777138 - Flags: approval-mozilla-beta?
Attachment #8777138 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/0f1333d8cf91
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch

Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------

This patch fixes a regression for WebRTC. Take it in 49 beta and aurora.
Attachment #8777138 - Flags: approval-mozilla-beta?
Attachment #8777138 - Flags: approval-mozilla-beta+
Attachment #8777138 - Flags: approval-mozilla-aurora?
Attachment #8777138 - Flags: approval-mozilla-aurora+
I overlooked in the logs why these test are actually failing:
  Assertion `!sock->connected' failed.
The fix for that landed in bug 1293206 already.

I'll start test runs on Aurora to see if that fixes the problem.
Flags: needinfo?(drno)
Confirmed. The fix from bug 1293206 is needed to get the tests working for the either of the two patches in this bug report.
Depends on: 1293206
This patch combines the POLL_HUP change for this bug, with the connected write cancel from bug 1293206.
Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1237290b083

Although I guess the cleaner solution here would be to request uplift in bug 1293206 and then re-land the patches in here.
Nils, do you have a new patch for beta now that we are uplifting the patch in bug 1293206? Thanks!
Flags: needinfo?(drno)
:RyanVM did already the right thing. Thanks!
Flags: needinfo?(drno)
Blocks: 1296475
You need to log in before you can comment on or make changes to this bug.