Open Bug 1461606 Opened 6 years ago Updated 1 year ago

Creating ServerSocket doesn't fail if port is bound by external application via multicast address

Categories

(Remote Protocol :: Marionette, defect, P3)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned)

References

Details

Today I discovered that when another application on the system has already bound itself via a multicast address to the port 2828 which Marionette uses by default, Firefox still creates a ServerSocket instance and doesn't fail with an `NS_ERROR_SOCKET_ADDRESS_IN_USE` error. The error is only visible if another instance of Firefox already bound to that port via a unicast address.

Steps to reproduce:
1) Run `ncat -k -l 2828`
2) Run `mach marionette test -vv --gecko-log -`

This bound ncat to:

> % lsof -nP -iTCP:2828
> COMMAND   PID   USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
> ncat    13963 henrik    3u  IPv6 0x66002f962daf1099      0t0  TCP *:2828 (LISTEN)
> ncat    13963 henrik    4u  IPv4 0x66002f961ea01e81      0t0  TCP *:2828 (LISTEN)

In the gecko log output you will see:

> 1526370899262	Marionette	DEBUG	New connections are accepted
> 1526370899264	Marionette	INFO	Listening on port 2828

Using a unicast address for ncat instead (`ncat -k -l localhost 2828`) results in the expected behavior:

> % lsof -nP -iTCP:2828
> COMMAND   PID   USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
> ncat    15006 henrik    3u  IPv4 0x66002f961f3a97e1      0t0  TCP 127.0.0.1:2828 (LISTEN)

The same also applies to the Marionette client which doesn't fail in `is_port_available()` because it uses the `socket.SO_REUSEADDR` flag. Removing that flag results in the expected IOError.

Now starting the Marionette tests with the port 2828 in `LISTEN` status for a multicast address Marionette hangs for about 60s during each start of Firefox before new connections from the client are getting accepted:

> 1526374256640	Marionette	DEBUG	New connections are accepted
> 1526374256641	Marionette	INFO	Listening on port 2828
> [..]
> 1526374316179	Marionette	DEBUG	Accepted connection 0 from 127.0.0.1:62943
> 1526374316206	Marionette	DEBUG	Closed connection 0
> 1526374316309	Marionette	DEBUG	Accepted connection 1 from 127.0.0.1:62944
> 1526374316318	Marionette	DEBUG	Closed connection 1
> 1526374316319	Marionette	DEBUG	Accepted connection 2 from 127.0.0.1:62945
> 1526374316336	Marionette	TRACE	2 -> [0,1,"WebDriver:NewSession",{}]

So I wonder if we should use the same logic as geckodriver to find a random free port, and get rid of the usage of `SO_REUSEADDR`. It would allow us to ensure that we really connect to an instance of Firefox under the control of Marionette and not to a probably inappropriately closed socket in `TIME_WAIT` state. A solution like stated on bug 1240830 would be better but still wouldn't be that easy to implement.

A socket handling explanation which I found useful can be found here:
https://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t

For Windows that is important:

A socket with SO_REUSEADDR can always bind to exactly the same source address and port as an already bound socket, even if the other socket did not have this option set when it was bound. This behavior is somewhat dangerous because it allows an application "to steal" the connected port of another application.

Andreas, IMO this is an important thing to fix. So what do you think?
Flags: needinfo?(ato)
Blocks: 1461608
Priority: -- → P3
My understanding is, with the big caveat that my knowledge of
networking is limited, that multicast in TCP is not a thing.  The
examples you gave above use TCP exclusively, but differentiate in
that they use AF_INET and AF_INET6.  Let me try to explain.

Most modern operating systems, with the exception of OpenBSD,
implement ‘dual-stacked’ IP architectures that support both IPv4
and IPv6 simultaneously.  It is important to keep two things in
mind: firstly that TCP ports and UDP ports are entirely separate,
and that they are features of the transport layer protocols, not
of the IP layer as such; and secondly, that multicasting is a
one-to-many broadcasting system that practically does not make sense
in a stream-oriented one-to-one protocol such as TCP.  Consequently,
talking about unicasting vs. multicasting here is misleading.

Ports being a feature of the protocol means you it is possible to
bind one TCP port as well as one UDP port to the same number.  In
dual-stacked IP architectures you can further multiply this by
binding two of them to the IPv4 layer only and the other two to
IPv6 only.  This means you can have four distinct ports listening
on two layers, for two different protocols.

So when you ask (nmap’s) ncat to listen on port 2828, it relies on
a set of defaults that are different to those of a standard socket.
Your example showed that it bound to both IPv4 and IPv6 ports on
TCP.  I can confirm this on Debian:

> % ncat -l 2828 &
> [1] 15017
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> ncat    15017  ato    3u  IPv6 1302799      0t0  TCP *:2828 (LISTEN)
> ncat    15017  ato    4u  IPv4 1302800      0t0  TCP *:2828 (LISTEN)

However, if you ask ncat to listen on a specific host, it will use
the type of the address as the basis for the protocol it chooses.
Listening on 127.0.0.1 will open a TCP port on IPv4:

> % ncat -l 127.0.0.1 2828 &
> [1] 15389
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> ncat    15389  ato    3u  IPv4 1305864      0t0  TCP localhost:2828 (LISTEN)

In your example you bind to "localhost", but what "localhost" means
is system-dependent due to host overrides in e.g. /etc/hosts.  On
my Debian system it happens to look up ::1, but it may not on other
systems.  A quick test I did shows that "ncat -l localhost 2828"
binds to both IPv4 and v6 on Debian, but only to IPv6 on my macOS
system, and apparently only IPv4 on your system.

The same principle goes for the _type of address_ you ask it to
listen for.  If you explicitly listen on an IPv6 address it will
apply the IPV6_V6ONLY socket option, similar to passing the -6 flag:

> % ncat -l ::1 2828 &
> [1] 15417
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> ncat    15417  ato    3u  IPv6 1312899      0t0  TCP localhost:2828 (LISTEN)

Since "ncat -l 2828", out of no other conviction than its own, binds
on both layers we will see the expected IOError when trying to run
the Mn tests:

> % ncat -l 2828 &
> [1] 15463
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> ncat    15463  ato    3u  IPv6 1312971      0t0  TCP *:2828 (LISTEN)
> ncat    15463  ato    4u  IPv4 1312972      0t0  TCP *:2828 (LISTEN)
> % ./mach marionette-test -z
> […]
> IOError: Port localhost:2828 is unavailable.
> […]

The same goes if we explicitly pick the IPv4 layer through -4 or
choosing 127.0.0.1:

> % ncat -l 127.0.0.1 2828 &
> [1] 15499
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> ncat    15499  ato    3u  IPv4 1305957      0t0  TCP localhost:2828 (LISTEN)
> % ./mach marionette-test -z
> […]
> IOError: Port localhost:2828 is unavailable.
> […]

On the other hand the tests will run just fine when ncat does not
reserve the IPv4 port, because the Python client specifically checks
"localhost:2828" in
https://searchfox.org/mozilla-central/rev/2b9779c59390ecc47be7a70d99753653d8eb5afc/testing/marionette/client/marionette_driver/marionette.py#563:

> % ncat -l ::1 2828 &
> [1] 15537
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> ncat    15537  ato    3u  IPv6 1313864      0t0  TCP localhost:2828 (LISTEN)
> % ./mach marionette-test -z
> […]
>  0:02.63 SUITE_START: marionette-test - running 84 tests
>  0:02.63 TEST_START: testing/marionette/harness/marionette_harness/tests/unit/test_window_minimize.py
>  0:02.63 TEST_END: SKIP
> […]

Since the client prevents us from reaching the server, I looked
into how the server behaves, and nsIServerSocket’s default also
appears to be to listen on IPv4-only:

> % ./mach run -marionette --headless &
> [1] 16263
> %  0:00.33 /home/ato/src/gecko/obj-x86_64-pc-linux-gnu/dist/bin/firefox -marionette --headless -no-remote -profile /home/ato/src/gecko/obj-x86_64-pc-linux-gnu/tmp/profile-default
> *** You are running in headless mode.
> 1526393456143	Marionette	INFO	Listening on port 2828
> % lsof -i :2828
> COMMAND   PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
> firefox 16295  ato   80u  IPv4 1308171      0t0  TCP localhost:2828 (LISTEN)

Since this bug is starting to look more and more like
https://bugzilla.mozilla.org/show_bug.cgi?id=1257719, the right
thing to do here would probably be to make use of IN6ADDR_SETV4MAPPED,
which is a technique for making IPv6 sockets also listen on IPv4
and map IPv4 connections to the IPv6 layer using the ::ffff:*
namespace.  An approach to this can be found in
https://stackoverflow.com/a/16763601.

(In reply to Henrik Skupin (:whimboo) from comment #0)

> The same also applies to the Marionette client which doesn't fail
> in `is_port_available()` because it uses the `socket.SO_REUSEADDR`
> flag.  Removing that flag results in the expected IOError.

SO_REUSEADDR on the “is this port available” check probably does
make some sense, since the intention is not to fail because there
might have been previous connections open when the server process
was killed.  But using it on the _client connection_ to Marionette
in transport.py I agree does indeed not make any sense.

> So I wonder if we should use the same logic as geckodriver to find
> a random free port, and get rid of the usage of `SO_REUSEADDR`. It
> would allow us to ensure that we really connect to an instance
> of Firefox under the control of Marionette and not to a probably
> inappropriately closed socket in `TIME_WAIT` state. A solution
> like stated on bug 1240830 would be better but still wouldn't be
> that easy to implement.

Yes, I do think using a randomly chosen port would do away with
some of the problems you describe with regards to the TIME_WAIT
period, or at least make them less prominent.

This naturally does come with some minuscule chance of raciness due
to the fact that the port is atomically allocated, then freed, then
bound again, but seems overall like it would solve the particular
problem you are describing.

> Andreas, IMO this is an important thing to fix. So what do you
> think?

IMHO it is not so important to fix, but we do appear to have
pinpointed a couple of orthogonal problems:

  (1) SO_REUSEADDR on client connections are probably not necessary.

  (2) Having the system atomically allocate a free port would
      increase reliability due to socket timeout durations.
  
      Since we cannot have true atomicity due to the client picking
      the port, using a geckodriver-esque approach for the time
      being makes sense as it has proven sufficiently stable.

  (3) nsIServerSocket doesn’t appear to support IPv6, and we don’t
      know if it will support IPv6-mapped-IPv4-addresses.
Flags: needinfo?(ato)
Severity: normal → S3
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.