Closed
Bug 1290365
Opened 9 years ago
Closed 9 years ago
TURN/TCP with hostnames doesn't work on Linux
Categories
(Core :: WebRTC: Networking, defect, P1)
Tracking
()
RESOLVED
FIXED
mozilla51
backlog | webrtc/webaudio+ |
People
(Reporter: fippo, Assigned: drno)
References
Details
(Keywords: regression)
Attachments
(5 files, 1 obsolete file)
3.25 KB,
text/plain
|
Details | |
2.88 KB,
application/x-pcapng
|
Details | |
58 bytes,
text/x-review-board-request
|
bwc
:
review+
|
Details |
821 bytes,
patch
|
bwc
:
review+
gchang
:
approval-mozilla-aurora+
gchang
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
2.20 KB,
patch
|
Details | Diff | Splinter Review |
TURN/TCP on Linux doesn't work for me when I am using it with a TURN server specified as a URL. It works when using an ip address instead of a hostname.
To reproduce (easier on talky.io than on appear.in but I can reproduce it regardless of turn server used):
1. go to https://talky.io
2. grab turn credentials by pasting this into the console:
var ice = app.xmpp.jingle.config.peerConnectionConfig.iceServers[2]
Verify that ice.url looks is a turn/tcp server with an ip address
(ya, I need to get this to move to .urls)
3. gather candidates by pasting this into the console:
var pc = new RTCPeerConnection({iceServers: [ice]}); pc.onicecandidate = function(event) { console.log(event.candidate ? event.candidate.candidate : null); }; pc.createOffer({offerToReceiveAudio: 1}).then((offer) => pc.setLocalDescription(offer));
4. watch for relay candidates. This should work.
5. set
ice.url = 'turn:turn1.talky.io:80?transport=tcp
6. repeat (3) and (4). Result: no relay candidate.
The about:webrtc log is attached. As well a a pcap which shows that Firefox connects to the TURN server. But nothing is ever sent
Looking at a limited dataset from the appear.in platform metrics there seems to be a pattern on linux:
select count(*), browsertype, browseros, gatheredturntcp from features_permanent
where (iceconnectedorcompleted = 't' or icefailure = 't') and browsertype = 'moz' group by browsertype, browseros, gatheredturntcp order by browseros, gatheredturntcp, count desc;
count | browsertype | browseros | gatheredturntcp
-------+-------------+------------------------------------+-----------------
138 | moz | Fedora 64-bit | f
9 | moz | Fedora 64-bit | t
327 | moz | Linux 64-bit | f
108 | moz | Linux 64-bit | t
4 | moz | Linux i686 | f
9 | moz | Linux i686 | t
74 | moz | Ubuntu | f
36 | moz | Ubuntu | t
1963 | moz | Ubuntu 64-bit | f
165 | moz | Ubuntu 64-bit | t
The number of failures to gather a turn/tcp candidate is much higher than the number of successes (apart from the 'Linux i686 platform but that is not statistically relevant). And it works _sometimes_ which suggests this is some weird race condition.
Impact is low, this only affects firefox users that absolutely require a turn/tcp connection. Working around this by handing out ip addresses instead of hostnames is possible but that is not how we deploy turn servers using dns+geolocation
Reporter | ||
Comment 1•9 years ago
|
||
Comment 3•9 years ago
|
||
I've had no luck getting this to fail on my linux box. I wonder what happens if we try in non-e10s mode.
Comment 4•9 years ago
|
||
I also wonder if maybe bug 1287874 is showing up on linux here. Seems like that could be causing a premature timeout.
Assignee | ||
Updated•9 years ago
|
Rank: 19
Flags: needinfo?(drno)
Priority: -- → P1
Assignee | ||
Comment 5•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/68152/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/68152/
Assignee | ||
Comment 6•9 years ago
|
||
I don't have easy access to a Linux machine myself until Monday. I'm suspicious if whats going on here is that we for some unknown reason cancel the read and/or write callbacks on the TCP sockets. All the code depends on these to proceed in case of TCP.
Fippo: I kicked off a try build so that you can download a build with the increased logging I added. I'll give it a try on Monday in the office myself.
Reporter | ||
Comment 7•9 years ago
|
||
that build doesn't have the problem anymore which suggests this may have been fixed by bug 1287874!
\o/
Comment 8•9 years ago
|
||
Seems like a low-risk uplift to me, but let's verify that this was indeed the fix.
Assignee | ||
Comment 9•9 years ago
|
||
I can repro with 47 Ubuntu build, but not with my local 50 Nightly build (which is from way before bug 1287874 landed).
I'll try to track down what makes this reproducible...
Assignee: nobody → drno
Assignee | ||
Comment 10•9 years ago
|
||
Has nothing to do with the versions. The problem exists only in non-e10s mode. As Nightly and Aurora have e10s enabled by default the problem appears to be gone, until you turn e10s off.
Assignee | ||
Comment 11•9 years ago
|
||
So I just build locally with the fix for 1287874 and the extra logging, and as soon as I turn off e10s the problem is still reproducible.
Assignee | ||
Comment 12•9 years ago
|
||
So we get PR_POLL_HUP events on these TCP sockets (and then declare these sockets internally as dead http://searchfox.org/mozilla-central/source/media/mtransport/nr_socket_prsock.cpp#367 )
Assignee | ||
Comment 13•9 years ago
|
||
Note to myself: I'm pretty sure ICE TCP sockets suffer from the same problem.
Assignee | ||
Updated•9 years ago
|
Assignee | ||
Comment 14•9 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #13)
> Note to myself: I'm pretty sure ICE TCP sockets suffer from the same problem.
Verified that our ICE TCP implementation actually handles this correctly.
Assignee | ||
Comment 15•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/68750/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/68750/
Attachment #8777134 -
Flags: review?(docfaraday)
Assignee | ||
Updated•9 years ago
|
Attachment #8776291 -
Attachment is obsolete: true
Comment 16•9 years ago
|
||
Comment on attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.
https://reviewboard.mozilla.org/r/68750/#review65764
Looks fine to me.
Attachment #8777134 -
Flags: review?(docfaraday) → review+
Assignee | ||
Comment 17•9 years ago
|
||
The idea here is that the patch in mozreview is the proper fix which should go into mozilla-central and ride the trains. But this is the way easier work-around to avoid hitting the HUP problem which can be uplifted, until the real fix hits release.
Attachment #8777138 -
Flags: review?(docfaraday)
Comment 18•9 years ago
|
||
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch
Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------
I'm confused. Does ignoring HUP on the socket allow it to work later?
Assignee | ||
Comment 19•9 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #18)
> Comment on attachment 8777138 [details] [diff] [review]
> bug1290365_beta.patch
>
> Review of attachment 8777138 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I'm confused. Does ignoring HUP on the socket allow it to work later?
Yes it does. What happens is that in when we create a UDP socket we want to start listening on it right away. And so far we do the same with TCP socket. But technically you can't listen on a TCP socket which is not connected or you are listening on.
So in case of a DNS name for a TURN server we create the socket and include it immediately in the poll set for read, but can't connect as we don't have an IP address. And so the Linux kernel return the HUP to "inform" us that we are doing something which is not legit.
Before bug 1224845 landed we simply ignored the HUP. And if we do then NSPR still connects the socket once we make the connect() call with the IP address from DNS.
But I don't think that ignoring HUP should be the long term solution, as it re-opens the door for bug 1224845. And bug 1291176 shows that we are still having paths in our code which result in HUP signals.
My thinking here is that changing the way he handle incoming data and the size of the patch don't make the patch in mozreview a good candidate for uplift.
The second patch in attachment 8777138 [details] [diff] [review] is a simple workaround, with known side-effects, which I feel comfortable to uplift to DE and Beta. I'm open to discuss uplifting the real patch in mozreview to DE if anyone feels that is the better solution.
Comment 20•9 years ago
|
||
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch
Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------
Ok, looks good.
Attachment #8777138 -
Flags: review?(docfaraday) → review+
Assignee | ||
Comment 21•9 years ago
|
||
Comment on attachment 8777134 [details]
Bug 1290365: add TURN TCP socket to read poll after connect.
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/68750/diff/1-2/
Comment 22•9 years ago
|
||
Pushed by drno@ohlmeier.org:
https://hg.mozilla.org/integration/autoland/rev/0f1333d8cf91
add TURN TCP socket to read poll after connect. r=bwc
Assignee | ||
Comment 23•9 years ago
|
||
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch
Approval Request Comment
[Feature/regressing bug #]: The patch from bug 1224845 (landed in 47) added error handling which exposed the underlying problem which existed for a much longer time.
[User impact if declined]: Depending on how the WebRTC calling service is configured (DNS vs. IP address) and in which kind of network the Firefox user is this can result in no call getting established (where a Firefox <= 46 would have worked under the same conditions).
[Describe test coverage new/current, TreeHerder]: Only manual testing as apparently the delay introduce by a real DNS server is only triggering this bug.
[Risks and why]: The risk for new problems is pretty small as it only restores the behavior from FF <= 46. With this patch applied FF will use more CPU while establishing the call. The patch which solves both problems just landed in 51 (but the risk of uplifting that would be a lot higher, then reverting to the old, known behavior).
[String/UUID change made/needed]: N/A
Attachment #8777138 -
Flags: approval-mozilla-beta?
Attachment #8777138 -
Flags: approval-mozilla-aurora?
Comment 24•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox51:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Updated•9 years ago
|
status-firefox49:
--- → affected
status-firefox50:
--- → affected
Assignee | ||
Updated•9 years ago
|
status-firefox48:
--- → wontfix
Comment 25•9 years ago
|
||
Comment on attachment 8777138 [details] [diff] [review]
bug1290365_beta.patch
Review of attachment 8777138 [details] [diff] [review]:
-----------------------------------------------------------------
This patch fixes a regression for WebRTC. Take it in 49 beta and aurora.
Attachment #8777138 -
Flags: approval-mozilla-beta?
Attachment #8777138 -
Flags: approval-mozilla-beta+
Attachment #8777138 -
Flags: approval-mozilla-aurora?
Attachment #8777138 -
Flags: approval-mozilla-aurora+
Comment 26•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/dafd5dce166f
https://hg.mozilla.org/releases/mozilla-beta/rev/60ae30593a0b
Backed out from release branches for permacrashing (mda) tests like https://treeherder.mozilla.org/logviewer.html#?job_id=1430555&repo=mozilla-beta
Assignee | ||
Comment 28•9 years ago
|
||
I overlooked in the logs why these test are actually failing:
Assertion `!sock->connected' failed.
The fix for that landed in bug 1293206 already.
I'll start test runs on Aurora to see if that fixes the problem.
Flags: needinfo?(drno)
Assignee | ||
Comment 29•9 years ago
|
||
Confirmed. The fix from bug 1293206 is needed to get the tests working for the either of the two patches in this bug report.
Depends on: 1293206
Assignee | ||
Comment 30•9 years ago
|
||
Assignee | ||
Comment 31•9 years ago
|
||
This patch combines the POLL_HUP change for this bug, with the connected write cancel from bug 1293206.
Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1237290b083
Although I guess the cleaner solution here would be to request uplift in bug 1293206 and then re-land the patches in here.
Comment 32•9 years ago
|
||
Looks like Nils landed this but forgot to update the bug.
https://hg.mozilla.org/releases/mozilla-aurora/rev/fdc2f1d67e3977e4fbc79d0acbb32259eca0f524
Comment 33•9 years ago
|
||
Nils, do you have a new patch for beta now that we are uplifting the patch in bug 1293206? Thanks!
tracking-firefox49:
--- → +
Flags: needinfo?(drno)
Comment 34•9 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 35•9 years ago
|
||
:RyanVM did already the right thing. Thanks!
Flags: needinfo?(drno)
You need to log in
before you can comment on or make changes to this bug.
Description
•