Closed Bug 1448846 Opened 2 years ago Closed 2 years ago
Temporary ICMP error causing fatal error in ICE UDP socket?
From the dev-media list: Hello all, I am facing a very strange error with ICE negotiation during WebRTC call setup. Firefox(58.0.2 - 64bit) fails to respond to STUN request only on few machines. Initially I suspected it could be related to specific OS. But right now, I see this issue only on few Windows machines, while it works perfectly in other machines with Windows/Suse/Mac. Setup - - Firefox version - 58.0.2 - 64bit - No STUN or TURN severs configured - Calls are established using only host candidates - There exists a media server between 2 parties Scenario - A(Firefox-Windows) calls B - Internally - A--------Media Server---------B Result - ICE negotiation fails for A with an error - ICE failed, add a STUN server and see about:webrtc for more details Observation - When A triggers an outgoing call, media server gathers ICE candidates and open its UDP ports before it triggers ICE connectivity checks from its end. However, Firefox sends STUN request before ports are actually opened on the media server. And gets back an ICMP Destination unreachable. After few milliseconds(~10 ms), media server opens its ports and sends STUN request to Firefox. Firefox does not respond nor does it retry STUN requests from its end. ICE negotiation fails on both side. When the exact same scenario is tested on Firefox running on Suse or Mac or few other Windows machines, I see the same behavior as above, except that the Firefox does respond to STUN request from media server. Also it retries STUN requests from its end, even though it received ICMP destination unreachable before. pcap and ICE connection logs for both error(Windows 7) and success(Suse Leap 42.3) case - https://drive.google.com/open?id=1u8g8uq8WPtcOZxekxP1lzkr17CX0lajd Why doesn't Firefox respond to STUN request and why doesn't it retry STUN request from its end either? If I am using the same Firefox version, and if OS is not an issue, why is the behavior different on different machines? I also tested this scenario on the latest 59.0.1 (64-bit) on Windows and I still see the same issue. Thanks in advance!!
I'm pretty sure this is not the first bug report we have received on this topic. I think the problem is most likely around this code: https://searchfox.org/mozilla-central/rev/003262ae12ce937950ffb8d3b0fa520d1cc38bff/netwerk/base/nsUDPSocket.cpp#463 Where it makes sense to close TCP socket when they encounter errors, it only makes sense for connected UDP sockets. Problem so far was that I had no idea how to fix the code without being able to reproduce this. But two things just came to my mind: - #1 feed an ICE candidate for a closed port from another machine on the network - #2 create a unit test here https://searchfox.org/mozilla-central/source/netwerk/test/TestUDPSocket.cpp and/or here https://searchfox.org/mozilla-central/source/dom/network/tests/test_udpsocket.html One big question from my point of view is if we: - just go and ignore any errors on UDP sockets (which would be a change in behavior, but I don't think anything else besides WebRTC actually uses the UdpSocket code) - ignore UDP socket errors only on unconnected UDP sockets - let the UDP socket child (= the content process) specify via a parameter what behavior it wants for the UDP socket to happen
So, looking at those errors: PR_POLL_ERR: Corresponds to POLLERR, man pages are unclear on when we should expect to see this on a UDP socket. Some limited reading indicates that some platforms will set POLLERR when ICMP errors happen, but others do not. I am not sure if there are other cases where POLLERR would be set for UDP. Maybe if the network interface was brought down? PR_POLL_HUP: Corresponds to POLLHUP, doesn't make a whole lot of sense for UDP. I guess I could see an implementation converting ICMP errors to this if the UDP socket is connected? Totally unsure here. PR_POLL_NVAL: Corresponds to POLLNVAL, means the socket is not open. It seems that the meaning of these flags for UDP sockets is not specified very well, and inconsistent across platforms. Maybe we could get away with ignoring POLLERR for UDP, although we might lose the ability to detect things like the network interface going down/breaking. Thoughts?
Flags: needinfo?(docfaraday) → needinfo?(mcmanus)
dragana can probably support you quicker than I will be able to..
Flags: needinfo?(mcmanus) → needinfo?(dd.mozilla)
We can get http logging to confirm this if you like. We will only need nsSocketTransport:5. What about implementing version 3 from comment 1: specify via a parameter whether to ignore error or not. In that way you can decide in webRTC code if you want to ignore error only at a beginning and do not ignore it later (e.g. network change events)
In my opinion, there's just too much variation from platform to platform for POLLERR to be useful; it is noise that we should be ignoring.
I hope it's all right that I assign you on this Byron, so we don't drop the ball.
Assignee: nobody → docfaraday
(In reply to Byron Campen [:bwc] from comment #7) > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=d9945bd6cf87d2c73bcb24d8283b68fbc5222c6e Yes I think this is the right thing to do (for now).
Comment on attachment 8993515 [details] Bug 1448846: Ignore POLLERR on UDP sockets. https://reviewboard.mozilla.org/r/258220/#review265644
Attachment #8993515 - Flags: review?(dd.mozilla) → review+
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/fe9c353b57a1 Ignore POLLERR on UDP sockets. r=dragana
You need to log in before you can comment on or make changes to this bug.