Closed Bug 1416220 Opened 7 years ago Closed 3 years ago

WebRTC problem with TURNS over TCP and a proxy (OK with TURN and proxy)

Categories

(Core :: WebRTC: Networking, defect, P3)

56 Branch
defect

Tracking


RESOLVED FIXED
87 Branch
Tracking Status
firefox87 --- fixed

People

(Reporter: borschneck, Assigned: bwc)

References

(Regressed 1 open bug)

Details

(Whiteboard: [needinfo to drno on 2017/11/13] )

Attachments

(10 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
Build ID: 20171024165158

Steps to reproduce:

One partner runs Firefox with a proxy.pac.
This proxy.pac points to the proxy:
IP: 192.168.245.49
Port: 8080
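
For illustration, a minimal proxy.pac of this shape (an illustrative sketch, not the partner's actual file) would be:

// Illustrative sketch only: send everything through the proxy described above.
function FindProxyForURL(url, host) {
  return "PROXY 192.168.245.49:8080";
}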

There is no problem
 * when negotiating WebRTC with a TURN server over TCP: it correctly uses the proxy IP:port (see attached "TURN proxy ok.jpg");
 * when using Chrome with the same proxy.pac: both TURN and TURNS are OK.

But when negotiating over TURNS (TLS over TCP), the Firefox WebRTC stack does not seem to use the proxy IP:port, but instead
<TURNS server IP>:<proxy port>
I think it's a bug (see "TURNs direct + proxy port.jpg").
Here 149.202.202.213 is our TURN server's IP address ... and 8080 is the proxy port number!

I reproduced it using the online tool
https://webrtc.github.io/samples/src/content/peerconnection/trickle-ice/
by adding our TURN server and testing, then removing it, adding our TURNS server, and testing again
(note: these have a TURN username and password, which I won't copy here ;) )
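
For reference, the two configurations exercised in the trickle-ice tool look roughly like this (server hostname and credentials are placeholders, not our real values):

// TURN over TCP (works through the proxy):
const turnConfig = {
  iceServers: [{
    urls: "turn:turn.example.com:3478?transport=tcp",
    username: "user",
    credential: "secret"
  }]
};
// TURNS, i.e. TURN over TLS over TCP (fails through the proxy):
const turnsConfig = {
  iceServers: [{
    urls: "turns:turn.example.com:443?transport=tcp",
    username: "user",
    credential: "secret"
  }]
};
// e.g. new RTCPeerConnection(turnsConfig), then watch which relay candidates gather.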


Actual results:

TURNS negotiation is not possible with a proxy in our partner's network from Firefox 56.


Expected results:

Works, just as TURN does ;)
Attached image TURN proxy ok.jpg
Component: Untriaged → WebRTC: Networking
Product: Firefox → Core
Nils, is this a known issue?
Flags: needinfo?(drno)
Whiteboard: [needinfo to drno on 2017/11/13]
Hi,

Any news?

With a test at another client (a bank) that has the same problem with TURN and Firefox, we found something odd.  This bank has
 - a proxy
 - a very restrictive corporate DNS (only sites trusted by the bank are resolved by their corporate DNS).

We noticed that
 * there is no problem with Chrome...
 * but Firefox does not seem to ask the PROXY to do the TURN DNS resolution; it first tries to resolve locally (here, against that very restrictive DNS).

Looking at network traces, Chrome asks the PROXY to do the DNS resolution => OK.
But Firefox tries to resolve it with the PC's DNS ... and this doesn't work.

=> Should Firefox also ask the PROXY to do the DNS resolution?  Or?
Byron, perhaps you can check this out sooner than Nils?
Flags: needinfo?(docfaraday)
Let me look into it.
Assignee: nobody → docfaraday
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
Yeah, looking at the proxy code in nICEr, there's no way it can handle establishing a TLS connection through an HTTP proxy, because the way we implement this is:

1. Take the socket, and override its remote address/port to point at the proxy.
2. Tell the socket to connect.
3. Do a proxy handshake over the socket once it is connected.
4. Hand the resulting socket to the TURN code.

When we're starting out with a TURN TLS socket, this doesn't quite work; step 2 ignores the IP address override and instead uses the FQDN for the TURN server. I don't see any quick fix for this, because in this case we do not want a TLS handshake at step 2, we want it between steps 3 and 4.

I think the right way to fix this is to let Necko handle all of the proxy stuff for us under the hood, so that it just looks like we're using plain old TCP/TLS sockets.
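
To make the intended ordering concrete, here is a minimal Node-style sketch of the sequence (hostnames and ports are illustrative; this is a model of the desired behavior, not the Necko implementation). Note that the proxy, not the client, resolves the TURN server's FQDN, which also addresses the DNS point raised earlier:

const http = require("http");
const tls = require("tls");

// Steps 1-2: plain TCP to the proxy, asking it to open a tunnel to the
// TURN server by name -- the proxy does the DNS resolution.
const req = http.request({
  host: "192.168.245.49",            // the proxy
  port: 8080,
  method: "CONNECT",
  path: "turn.example.com:5349",     // the TURNS endpoint, as an FQDN
});
req.end();

// Steps 3-4: only once the tunnel exists do we run the TLS handshake,
// over the tunneled socket, and hand the result to the TURN code.
req.on("connect", (res, socket) => {
  const tlsSocket = tls.connect(
    { socket, servername: "turn.example.com" },
    () => {
      // The TURN Allocate request would be sent over tlsSocket here.
    }
  );
});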
We should also not resolve the TURN URL when a proxy is in use. That allows filter criteria to be applied when deciding whether the proxy should be used. And there is another bug report which states that this should work on systems without any DNS resolver, because the proxy can do the DNS resolution for them.

The bigger issue though is that I think reworking this should wait until we have moved mtransport into a Necko process.
Status: UNCONFIRMED → NEW
Rank: 25
Ever confirmed: true
Priority: -- → P3
The CONNECT-only flag for http channels will now perform ssl setup for https
uris.  The http connection will now request the transaction to reset after the
proxy tunnel has been established.  An http transaction will only do a partial
reset since the http request head is still needed but no more http data will be
transmitted.  After ssl setup the http connection will close the transaction to
complete the http request.

The xpcshell test requires a raw tcp socket to pipe data through the proxy to
setup ssl.  A node http server has been added which accepts specific CONNECT
requests to facilitate this pipe.  CONNECT requests are accepted if the host has
been registered in the source file and the port matches a listening port.  The
server has http and https listeners.  These ports are recorded in environment
variables.
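
A minimal sketch of such a CONNECT-handling node server (the allow-list, port, and host names here are placeholders; the actual test server registers its hosts and listening ports as described above):

const http = require("http");
const net = require("net");

// Hypothetical allow-list; the real test server registers hosts in its source.
const allowedHosts = new Set(["example.test"]);

const proxy = http.createServer();
proxy.on("connect", (req, clientSocket, head) => {
  const [host, port] = req.url.split(":");
  if (!allowedHosts.has(host)) {
    clientSocket.end("HTTP/1.1 502 Bad Gateway\r\n\r\n");
    return;
  }
  // Open a raw TCP pipe to the requested host so the client can run its
  // own TLS handshake through the tunnel.
  const serverSocket = net.connect(Number(port), host, () => {
    clientSocket.write("HTTP/1.1 200 Connection Established\r\n\r\n");
    serverSocket.write(head);
    serverSocket.pipe(clientSocket);
    clientSocket.pipe(serverSocket);
  });
});
proxy.listen(8080);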
This adds a flag to WebrtcProxyChannel and related IPDL classes/helpers to
indicate a secure connection should be established to the endpoint.

When using a double tunnel to the TURN server, the streams are tls filters
wrapped around a raw socket stream.  The stream callbacks do not receive the
tls filter stream but the raw socket stream.

Depends on D13037
Depends on: 1528472

What is the status of this bug?
We are still affected by this, as we use a proxy that is only allowed to connect to the WAN on ports 80, 8080, 443, and 8443.
So only TURNS is possible, but the request never hits the proxy.

Firefox 82.0.2 is still affected, regardless of the OS used.

ni? self to look into this. Maybe this is easier to fix now, given the socket process work?

Flags: needinfo?(docfaraday)

So the patches on this bug have been overtaken by events. At this point, we'll need to do the following:

  1. Change this member to be used for FQDNs in general, and fill it in unconditionally:

https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/dom/media/webrtc/transport/third_party/nICEr/src/net/transport_addr.h#72
https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/dom/media/webrtc/transport/third_party/nICEr/src/ice/ice_component.c#577-582

  2. Modify this code to grab the FQDN instead of an IP address for the remote end, if that FQDN is present. This will need to happen regardless of whether this is a TLS candidate or not; all TCP sockets will end up working the same way.

https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/dom/media/webrtc/transport/nr_socket_tcp.cpp#103-106

  3. Stop running this code for TCP candidates, and instead do something similar to what we do when an IP address is used:

https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/dom/media/webrtc/transport/third_party/nICEr/src/ice/ice_candidate.c#623-653

  4. Scrub through nICEr and find places that assume nr_transport_addr in an initialized TURN candidate has a resolved IP address.

I think the code on the other side of this call will do the right thing if we feed it an FQDN already, but of course there could be problems in practice:

https://searchfox.org/mozilla-central/rev/16d30bafd4e5276d6d3c632fb52a6c71e739cc44/dom/media/webrtc/transport/nr_socket_tcp.cpp#124-125

Flags: needinfo?(docfaraday)
See Also: → 1680771

I think the problem I am tripping over here relates to the fact that the linux testers do not have a resolvable hostname configured, so we are forced to use "localhost" for the test TURN server on those machines. By moving the DNS resolution into the WebrtcTCPSocket class, nICEr does not know what to use for the local IP address (loopback vs. a real one), and it defaults to a real one, which cannot be used to connect to a TURN server running on a localhost address. I could fix that, but it would be kind of ugly and would only be useful for making this stuff work on the testers: in the real world it is extremely unlikely that anyone would run a TURN server on a localhost address, and if they did, it would be acceptable to tell them to attach it to the real local address instead.

Joel, is there any way we could ensure that the linux testers have a real hostname configured that is resolvable through DNS? That would get me over this hurdle. (See comment 18 for the reason this would be helpful)

Flags: needinfo?(jmaher)

thanks for asking :bwc. I think your best bet will be to put this on our hardware workers that run perf tests in our datacenter. These all have DNS already. There are 2 concerns:

  1. load (we might overload these boxes), if too much we could create a media-turn job that just runs these specific tests
  2. os version/package versions. Currently these run ubuntu 16.04 (in the next 6 weeks should all be running 18.04). These machines are not in parity with the docker worker we run unittests on, so there could be packages that need installation.

I would recommend:

  1. forcing to run on existing hardware. here is an example of running mochitest-webgpu on windows hardware: https://searchfox.org/mozilla-central/source/taskcluster/ci/test/mochitest.yml#549, you would need to do something similar for linux+mochitest-media.

  2. running on an 18.04 machine that is in staging. When you are ready for that, ask :aerickson for what to set it to. Ideally you would be able to run something like ./mach try fuzzy linux !debug mochitest-media !fis --worker-override t-talos-1604=t-talos-1804 where you put the correct values in for the worker-override.

  3. if packages are missing on 18.04, this would be a great time to get them installed :)

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #20)

> thanks for asking :bwc. I think your best bet will be to put this on our hardware workers that run perf tests in our datacenter. These all have DNS already. There are 2 concerns:
>
>   1. load (we might overload these boxes), if too much we could create a media-turn job that just runs these specific tests
>   2. os version/package versions. Currently these run ubuntu 16.04 (in the next 6 weeks should all be running 18.04). These machines are not in parity with the docker worker we run unittests on, so there could be packages that need installation.
>
> I would recommend:
>
>   1. forcing to run on existing hardware. here is an example of running mochitest-webgpu on windows hardware: https://searchfox.org/mozilla-central/source/taskcluster/ci/test/mochitest.yml#549, you would need to do something similar for linux+mochitest-media.
>
>   2. running on an 18.04 machine that is in staging. When you are ready for that, ask :aerickson for what to set it to. Ideally you would be able to run something like ./mach try fuzzy linux !debug mochitest-media !fis --worker-override t-talos-1604=t-talos-1804 where you put the correct values in for the worker-override.
>
>   3. if packages are missing on 18.04, this would be a great time to get them installed :)

Do we have an idea how much additional load we're talking about here, percentage-wise? Are we sure that we cannot modify the image for the testers? It seems we can already do DNS lookups on them; they just don't have their hostname configured (it is a somewhat short random hex string that doesn't resolve to anything, as opposed to something like "ec2-13-56-250-44.us-west-1.compute.amazonaws.com", which does).

I am not too familiar with AWS and Docker; it would seem that we would need to coordinate across AWS and into Docker for the DNS entry. Possibly there are some ifconfig routes to add that could make something work.

One option if you want to play with things is to consider running a command in the pre-flight script:
https://searchfox.org/mozilla-central/source/testing/mozharness/configs/unittests/linux_unittest.py#250

That would allow the docker image to be setup with whatever you do there before running any harness scripts.

So I don't see anything in the preflight for windows that looks like it is setting up hostname stuff; I wonder how that hostname is set on windows, and whether we're just missing a bit of configuration like that on linux? Who might know more about this?

I thought this was for linux specifically, not windows. Do you need this both for windows and linux? I think windows will be more difficult as we run it in a VM in a hacky way on an instance not via docker. If this is needed for both, do we need it for macosx and android also?

I'm saying that the hostname is set properly when running Windows tests in AWS, and that we may be missing some bit of config for linux that would do the same thing for linux tests in AWS.

oh, I overlooked that, much clearer now.
:grenade

  1. can you explain how we set the hostname for windows @AWS to resolve via DNS?
  2. will azure offer the same support as we switch there next quarter?
  3. any thoughts on how to get our docker image working with dns name resolution @AWS like windows is?
Flags: needinfo?(rthijssen)

> if packages are missing on 18.04, this would be a great time to get them installed :)

I audited the new 1804 puppet configuration and it should have every package that's present in the 1804 docker image.

> So I don't see anything in the preflight for windows that looks like it is setting up hostname stuff; I wonder how that hostname is set on windows, and whether we're just missing a bit of configuration like that on linux? Who might know more about this?

I don't think we have windows docker workers. These linux testers are running under docker-worker (https://docs.taskcluster.net/docs/reference/workers/docker-worker). Taskcluster-team maintains the linux docker-worker instances.

The linux talos workers Joel recommended are metal/hardware instances.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #26)

> oh, I overlooked that, much clearer now.
> :grenade
>
>   1. can you explain how we set the hostname for windows @AWS to resolve via DNS?
>   2. will azure offer the same support as we switch there next quarter?
>   3. any thoughts on how to get our docker image working with dns name resolution @AWS like windows is?
  1. on windows we just modify the local hosts file (c:\windows\system32\drivers\etc\hosts) so that the hostname points to 127.0.0.1
  2. yes
  3. it's pretty hacky but the same trick of modifying the local hosts file should work.

sorry about the slow response.

Flags: needinfo?(rthijssen)
  • Modified nr_transport_addr to allow it to represent an
    fqdn/ip-version/protocol tuple, and taught nICEr to handle the
    fqdn case appropriately.
  • Since nr_transport_addr can represent an fqdn, nr_ice_stun_server
    did not need this ability anymore, and was significantly simplified.
  • Taught NrIceCtx to handle creation of a V4/V6 pair of nr_ice_stun_server
    when an fqdn is used, instead of having nICEr create pairs later.

Depends on D101657

Attachment #9027783 - Attachment is obsolete: true
Attachment #9027784 - Attachment is obsolete: true
Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a2069e46d965
Fix a typo (srvflx -> srflx) in this test. r=ng
https://hg.mozilla.org/integration/autoland/rev/d6ec5692e05e
Ensure that TCP is not blocked in this test, and allow loopback to be used because that's what happens in CI sometimes. r=ng
https://hg.mozilla.org/integration/autoland/rev/d4cadef3d55d
Make the log module for WebrtcTCPSocket be what one would expect it to be. r=mjf
https://hg.mozilla.org/integration/autoland/rev/8720dfa5a6e0
Modify mtransport to delegate DNS lookups for TCP sockets to the socket class, instead of using nr_resolver. r=mjf
https://hg.mozilla.org/integration/autoland/rev/8e9a13c508c6
Prevent WebrtcTCPSocket from using an IP version that is inconsistent with the requested local address. r=mjf
https://hg.mozilla.org/integration/autoland/rev/cf60a787b5ad
Add/improve some logging, and add an assertion. r=mjf

Ok, there are multiple problems here. I've worked around most of them, but one still remains: the OS X testers do not seem to be able to reach Google's STUN servers. Flaws in the tests had prevented this problem from causing failures until now.

Joel, any idea why the OS X testers wouldn't be able to reach Google's publicly available STUN servers?

Flags: needinfo?(docfaraday) → needinfo?(jmaher)

ok, the one difference between windows/linux and macosx is that macosx runs on real machines in a datacenter while the others are vm/docker images at AWS.

There must be one of two things going on:

  1. network ports are blocked from the datacenter
  2. the osx machines have some firewall rules

I suspect #1.

:bwc, which ip addresses do we need? the failures look to be on false stun servers, not live ones.

Flags: needinfo?(jmaher) → needinfo?(docfaraday)

So, we cannot predict the remote IP address. Right this second, stun.l.google.com resolves to 173.194.200.127, but that will always be subject to change.

What do you mean by "false stun servers"?

Flags: needinfo?(docfaraday) → needinfo?(jmaher)

I see the test failures being named: TestGatherDNSStunBogusHostnameTcp, VerifyTestStunServerV6FQDN.

Possibly I don't understand things. So basically we need stun*.*.google.com resolved?

:dhouse, do you have a way to determine if these are allowed hosts on the MDC OSX machines?

Flags: needinfo?(jmaher) → needinfo?(dhouse)

:bwc, what can I test for on the mac testers? (Or can the tests be adjusted to log the actual failure, or to test whether the stun host names resolve, if that is the cause of this problem?)

I've tested resolving 'stun.l.google.com' with just ping from a few of the production gecko-t-osx-1014 workers (r7 mac minis in the mozilla datacenters), and they are all able to resolve it (currently to 209.85.144.127):

$ for (( I=0; I<=472; I+=RANDOM%100+1 )); do ssh -o StrictHostKeyChecking=no -o ConnectTimeout=3 -o UserKnownHostsFile=/dev/null t-mojave-r7-$(printf "%03d" $I).test.releng.mdc$(( 2-I/236 )).mozilla.com 'hostname; ping -c1 stun.l.google.com | grep -o "from [0-9\.]*"' 2>/dev/null; done
t-mojave-r7-041.test.releng.mdc2.mozilla.com
from 209.85.144.127
t-mojave-r7-099.test.releng.mdc2.mozilla.com
from 209.85.144.127
t-mojave-r7-256.test.releng.mdc1.mozilla.com
from 209.85.144.127
t-mojave-r7-351.test.releng.mdc1.mozilla.com
from 209.85.144.127
t-mojave-r7-407.test.releng.mdc1.mozilla.com
from 209.85.144.127
Flags: needinfo?(dhouse) → needinfo?(docfaraday)

We aren't having any trouble resolving, but it does appear that we aren't reaching the STUN port (in this case, 19305). We send (UDP) STUN packets, but never see a response. The linux/windows testers get responses just fine.

Flags: needinfo?(docfaraday) → needinfo?(dhouse)

(In reply to Byron Campen [:bwc] from comment #45)

> We aren't having any trouble resolving, but it does appear that we aren't reaching the STUN port (in this case, 19305). We send (UDP) STUN packets, but never see a response. The linux/windows testers get responses just fine.

I tested reaching out on port 19305 with netcat, and it does appear to be blocked. Is 19305 the port that will always be used for STUN, or is there a standard port we can allow through the network? We can ask netops to allow the specific traffic.

Are the windows and linux tests being done from the perf/talos workers within the datacenters? (netops would be able to compare the rules for those to see how to match it for the macs)

Flags: needinfo?(dhouse) → needinfo?(docfaraday)

:dhouse, the linux/windows tests are run at AWS not at MDC1, so there might not be working flows setup for networking.

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #47)

> :dhouse, the linux/windows tests are run at AWS not at MDC1, so there might not be working flows setup for networking.

Thanks! That explains why they aren't getting blocked on linux/windows.

The firewalls in the datacenter do deep packet inspection, and so we'll need to capture and re-test some failed connections to verify we are allowing the traffic (since the firewall needs to know what the traffic does and looks like and not just an allow on the port). If there is a range of ports that could be used, then we can allow that traffic for more than one port.

I tested a stun connection with the nodejs stun library (https://github.com/nodertc/stun), and confirmed that is blocked by the firewall on port 19305:

[dhouse@t-mojave-r7-256.test.releng.mdc1.mozilla.com ~]$ node <<<"require('stun').request('stun.l.google.com:19305',(e,r)=>{console.log(e||r.getXorAddress())});"
Error: timeout
[...]
house@home:~$ node <<<"require('stun').request('stun.l.google.com:19305',(e,r)=>{console.log(e||r.getXorAddress())});"
{ port: 60803, family: 'IPv4', address: 'my_ip_addr' }

Port 19305 is what Google's stun servers have been using for a long time, so it probably won't change on us. I am not sure why they chose that port; 3478 is supposed to be the standard port. If Google were to change that port, I would think that it would change to 3478, so ensuring 3478 is also not blocked would be a good idea (this would also allow us to switch our tests over to a different STUN server, if it became necessary).
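
For illustration, the difference only shows up in the port spelled out in the STUN URL (these entries are illustrative, not the test configuration):

const iceServers = [
  { urls: "stun:stun.l.google.com:19305" },    // the port Google has used for a long time
  { urls: "stun:stun.stunprotocol.org:3478" }  // the standard STUN port
];
// A urls entry with no port, e.g. "stun:stun.example.org", implies the default 3478.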

Flags: needinfo?(docfaraday)

:bwc, :ctb added a firewall policy to allow this, and I tested stun.l.google.com:19305 from a mac tester in both datacenters; it works now (I also tried another server on port 3478, which works as well).
Can you re-try the tasks?

[dhouse@t-mojave-r7-256.test.releng.mdc1.mozilla.com ~]$ node <<<"require('stun').request('stun.l.google.com:19305',(e,r)=>{console.log(e||r.getXorAddress())});"                             
{ port: 38160, family: 'IPv4', address: '63.245.208.129' }
[dhouse@t-mojave-r7-256.test.releng.mdc1.mozilla.com ~]$ node <<<"require('stun').request('stun.stunprotocol.org:3478',(e,r)=>{console.log(e||r.getXorAddress())});"
{ port: 32902, family: 'IPv4', address: '63.245.208.129' }
Flags: needinfo?(docfaraday)
Flags: needinfo?(docfaraday)
  • Allow nr_resolver to be used for TCP when not running in e10s/socket process mode.
  • Init IPv6 STUN/TURN servers appropriately.
  • Fix bug that was preventing STUN server hostname from being configured.
  • Disable some tests that relied on STUN TCP that hasn't been available for a long time.
    (This went unnoticed due to the previous problem)
  • A small logging improvement.

Depends on D101660

Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b32d3600aab9
Fix a typo (srvflx -> srflx) in this test. r=ng
https://hg.mozilla.org/integration/autoland/rev/86a4f3321529
Ensure that TCP is not blocked in this test, and allow loopback to be used because that's what happens in CI sometimes. r=ng
https://hg.mozilla.org/integration/autoland/rev/8bd218b64217
Make the log module for WebrtcTCPSocket be what one would expect it to be. r=mjf
https://hg.mozilla.org/integration/autoland/rev/dfae2287a9e9
Modify mtransport to delegate DNS lookups for TCP sockets to the socket class, instead of using nr_resolver. r=mjf
https://hg.mozilla.org/integration/autoland/rev/fc28a9d072fa
Prevent WebrtcTCPSocket from using an IP version that is inconsistent with the requested local address. r=mjf
https://hg.mozilla.org/integration/autoland/rev/66cd9bf00b3d
Add/improve some logging, and add an assertion. r=mjf
https://hg.mozilla.org/integration/autoland/rev/6ee3e9718f05
Get ice_unittest working. r=mjf
Regressions: 1705563