Open Bug 1898696 Opened 4 months ago Updated 28 days ago

Webrtc ICE candidate selection incorrect

Categories

(Core :: WebRTC, defect)

Firefox 126
defect

Tracking

()

UNCONFIRMED

People

(Reporter: kventers, Unassigned)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0

Steps to reproduce:

Firefox (ICE controlling) succeeds on connectivity checks for two interesting candidates and nominates the highest priority one:

Line 7359: nominated pair is pqc7|IP4:194.69.105.213:56566/UDP|IP4:52.113.143.19:3480/UDP(prflx|candidate:1 1 UDP 54001663 52.113.143.19 3480 typ relay raddr 10.0.144.33 rport 3480)
Line 7360: cancelling all pairs but pqc7

At this step: Firefox expects data via prflx 194.69.105.213:56566

However, there is also higher-priority IPV6 TLS host candidate which appears to be allocated the same IPV4 address as one of the UDP candidates
Line 49: [trickling candidate candidate:6 1 TCP 2105524479 fec0::a39a 9 typ host tcptype active
Line 7308: TURN(relay(IP6:[fec0::a39a]:0/TLS|IP6:[::]:443/TLS)): Succesfully allocated addr IP4:20.202.140.193:52661/UDP lifetime=600

SEND INDICATION succeeds for this TLS path (higher priority than relay path)
Following agressive nomination rules, remote party starts sending data via 20.202.140.193:52661

However, for firefox this seems to be used in a successful lower priority candidate pair:
Line 7329: CAND-PAIR(hm5X): Pairing candidate IP4:20.202.140.193:52661/UDP (7f1fff):IP4:52.113.143.19:3480/UDP (337ffff) priority=35782502227378174 (7f1fff066ffffe)
Line 7397: setting pair to state SUCCEEDED: hm5X|IP4:20.202.140.193:52661/UDP|IP4:52.113.143.19:3480/UDP(turn-relay(IP6:[fec0::a39a]:0/TLS|IP4:20.202.140.193:52661/UDP)|candidate:1 1 UDP 54001663 52.113.143.19 3480 typ relay raddr 10.0.144.33 rport 3480)

Actual results:

Firefox disconnects after 15 seconds of trying to send consent packets
STUN-CLIENT(consent): Timed out
Thu May 23 2024, 23:7:57:657 RTCPeerConnection oniceconnectionstatechange iceConnectionState: connected
Thu May 23 2024, 23:8: RTCPeerConnection oniceconnectionstatechange iceConnectionState: disconnected

Expected results:

Firefox uses the higher priority path which succeeded/

Chromium can handle this same network setup, seemingle because it ignores the ipv4-ipv6 allocation and does not send out a SEND INDICATION packet

We can reliably reproduce in our test environment.
I can also provide network capture if needed, configure logging as needed, compare with chrome etc.

The Bugbug bot thinks this bug should belong to the 'Core::WebRTC' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → WebRTC
Product: Firefox → Core
Flags: needinfo?(docfaraday)

I'm going to have to see a copy of about:webrtc; I see a couple of candidate priority values, and one candidate pair priority value in the description, but I would need to see the priority value for the other pair.

Flags: needinfo?(docfaraday) → needinfo?(kventers)

Thanks for checking!
The issue is reproduced in automated test runs, so accessing about:webrtc is difficult.
Could you please suggest what's missing from the logging configuration to triage this, or a good way to capture the contents of about:webrtc (for example, due to expected security restrictions it's blocked from embedding in an iframe).

We're running firefox with logging configured as signaling:5,mtransport:5,MediaManager:4,GetUserMedia:4,webrtc_trace:5 and env variables R_LOG_DESTINATION: 'stderr', R_LOG_LEVEL: '3', R_LOG_VERBOSE: '1'.
Also, is there a way to trigger dumping ICE logs in the successful cases? Looking at Firefox source code, it looks like ICE logs are only dumped in case of failures.

Flags: needinfo?(docfaraday)

I'd set R_LOG_LEVEL to 5 to debug this, if I could not get a copy of about:webrtc. The rest of the R_LOG env variables look like they'd work.

Flags: needinfo?(docfaraday)
No longer blocks: webrtc-triage
Status: UNCONFIRMED → RESOLVED
Closed: 2 months ago
Resolution: --- → INCOMPLETE

Apologies about the delay in providing additional information. Did some test runs with increased logging verbosity.

Attaching the files with logs:
2024_07_16_firefox_debug_0.log.child-2.moz_log
2024_07_16_firefox_debug_0.log.child-5.moz_log
2024_07_16_network_capture.pcap

What I think is happening:

From wireshark network capture I see that firefox client is ICE-CONTROLLING and sending out the following STUN packets:
Filter: (stun || (ipv6.src == fec0::e979 && tcp.port == 10752))
Decode TCP port 10752 as STUN

Packet 21798 2024-07-16 20:07:57,135809484 fec0::e979 2603:1063:118::2f0 STUN 158 Binding Request user: NOD8:5f07ce45
Priority: 1853817087 == 0x6e7f00ff
This request fails

Packet 21822 2024-07-16 20:07:57,157129405 10.11.129.66 52.112.140.223 STUN 138 Binding Request user: NOD8:5f07ce45
Priority: 1853751551 == 0x6e7e00ff
This request succeeds, and Firefox nominates this network path (as evidenced from logs below)

However:
Packet 21869 2024-07-16 20:07:57,199912027 fec0::e979 fec0::aa64:0:14ca:4460 STUN 214 Send Indication XOR-PEER-ADDRESS: 52.112.140.223:3480
Packet 21987 2024-07-16 20:07:57,300879416 fec0::e979 fec0::aa64:0:14ca:4460 STUN 214 Send Indication XOR-PEER-ADDRESS: 52.112.140.223:3480
Value contains Priority: 1853817087 == 0x6e7f00ff and USE-CANDIDATE
Ice-controlled client considers this as higher priority candidate and tries to use this media path as specified by aggressive nomination.
However, I can't find anything definitive about this in Firefox logs.

Reviewing ICE logs:

Ipv6 address fec0::e979 appears to be connected the pair 'fgRI' which succeeds but shows priority '7f1fff', which is expectedly low for a relay candidate.
Note that there is also an allocation from ipv6 to ipv4 at port 51469, which is the port ice-controlled client tries to use
2024-07-16 17:08:32.860000 UTC - [Socket 8996: Socket Thread]: D/mtransport (turn/INFO) TURN(relay(IP6:[fec0::e979]:0/TCP|IP6:[::]:443/TCP)): Succesfully allocated addr IP4:20.202.68.89:51469/UDP lifetime=600
2024-07-16 17:08:32.860000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0))/CAND-PAIR(fgRI): Pairing candidate IP4:20.202.68.89:51469/UDP (7f1fff):IP4:52.112.140.223:3480/UDP (337ffff) priority=35782502227378174 (7f1fff066ffffe)
2024-07-16 17:08:32.861000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE-PEER(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0):default)/CAND-PAIR(fgRI): setting pair to state SUCCEEDED: fgRI|IP4:20.202.68.89:51469/UDP|IP4:52.112.140.223:3480/UDP(turn-relay(IP6:[fec0::e979]:0/TCP|IP4:20.202.68.89:51469/UDP)|candidate:1 1 UDP 54001663 52.112.140.223 3480 typ relay raddr 10.0.147.236 rport 3480)\

Nominated pair 'ys2b' has priority '6e7e00ff' which is the highest as expected for prflx candidate:
2024-07-16 17:08:32.860000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0))/CAND-PAIR(ys2b): Pairing candidate IP4:194.69.105.213:9331/UDP (6e7e00ff):IP4:52.112.140.223:3480/UDP (337ffff) priority=231935380222116351 (337ffffdcfc01ff)
2024-07-16 17:08:32.861000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE-PEER(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0):default)/CAND-PAIR(ys2b): setting pair to state SUCCEEDED: ys2b|IP4:194.69.105.213:9331/UDP|IP4:52.112.140.223:3480/UDP(prflx|candidate:1 1 UDP 54001663 52.112.140.223 3480 typ relay raddr 10.0.147.236 rport 3480)
2024-07-16 17:08:32.860000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE-PEER(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0):default)/STREAM(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0) transport-id=transport_0 - 5f07ce45:9281193c3088050178f225f88588b77c)/COMP(1)/CAND-PAIR(ys2b): nominated pair is ys2b|IP4:194.69.105.213:9331/UDP|IP4:52.112.140.223:3480/UDP(prflx|candidate:1 1 UDP 54001663 52.112.140.223 3480 typ relay raddr 10.0.147.236 rport 3480)
2024-07-16 17:08:32.860000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE-PEER(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0):default)/STREAM(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0) transport-id=transport_0 - 5f07ce45:9281193c3088050178f225f88588b77c)/COMP(1)/CAND-PAIR(ys2b): cancelling all pairs but ys2b|IP4:194.69.105.213:9331/UDP|IP4:52.112.140.223:3480/UDP(prflx|candidate:1 1 UDP 54001663 52.112.140.223 3480 typ relay raddr 10.0.147.236 rport 3480)

Can you comment why Firefox is issuing STUN Send Indication with higher priority than the nominated pair? ICE-CONTROLLED client therefore selects incorrect network path and transport disconnects

Flags: needinfo?(kventers)

Network logs were too large to attach, exported the interesting packets only and noteworthy packets are #44, #46, #50, #56

Status: RESOLVED → UNCONFIRMED
Resolution: INCOMPLETE → ---

I think this is a consequence of our aggressive nomination code, but I can look at the pcap.

In aggressive nomination, USE-CANDIDATE is put in all ICE checks, and the controlled agent is supposed to use the highest priority one that was ever nominated:

From https://datatracker.ietf.org/doc/html/rfc5245#section-8.1.1.2

   With aggressive nomination, the controlling agent includes the USE-
   CANDIDATE attribute in every check it sends.  Once the first check
   for a component succeeds, it will be added to the valid list and have
   its nominated flag set.  When all components have a nominated pair in
   the valid list, media can begin to flow using the highest priority
   nominated pair.  However, because the agent included the USE-
   CANDIDATE attribute in all of its checks, another check may yet
   complete, causing another valid pair to have its nominated flag set.
   ICE always selects the highest-priority nominated candidate pair from
   the valid list as the one used for media.  Consequently, the selected
   pair may actually change briefly as ICE checks complete, resulting in
   a set of transient selections until it stabilizes.

From the PCAP, it appears that Firefox is trying to use the higher priority pair, since it is sending consent checks for that and nothing else. However, those consent checks are going unanswered. This might be because the stray lower-priority ICE check with USE-CANDIDATE is being used by the controlled agent despite being a lower priority than the other nominated candidate pairs.

Can you double check the logs on your end to see what the controlled agent thinks the active pair is?

Flags: needinfo?(kventers)

Thanks for checking!

The controlled agent is using these pairs. First, the prflx pair that firefox is also using as the nominated pair
2024-07-16 17:07:57.240 CheckDtlsHandshakeStart: DTLS path changed IceCandidatePair{ IceCandidate{ Rtp:{PeerDerived, {IP:194.69.105.x:9331}, base:194.69.105.x:9331, rel:194.69.105.x:9331, p:0x6e7e00ff, pipe:UDP },

But 5s later the controlled agent switches to using a higher priority pair (the one received from STUN Send Indication):
2024-07-16 17:08:02.619 CheckDtlsHandshakeStart: DTLS path changed IceCandidatePair{ IceCandidate{ Rtp:{PeerDerived, {IP:20.202.68.x:51469}, base:20.202.68.x:51469, rel:20.202.68.x:51469, p:0x6e7f00ff, pipe:UDP},

The controlled agent receives responses for consent requests on this pair:

2024-07-16 17:08:02.784 CONSENT Ping response received on pair IceComponent{{PeerDerived, {IP:20.202.68.x:51469}, base:20.202.68.x:51469, rel:20.202.68.x:51469, p:0x6e7f00ff, pipe:UDP}, PipeData: Peer:20.202.68.x:51469, {0101002c2112a4429ea34b52b12a883d1daed499...}
2024-07-16 17:08:07.889 CONSENT Ping response received on pair, 0101002c2112a4425b181492b3eff63c6bece6a9...}}
32024-07-16 17:08:13.199 CONSENT Ping response received on pair, 0101002c2112a4421d3633020928d89ae4468ac0...}}

Flags: needinfo?(kventers)

(In reply to kventers from comment #11)

Thanks for checking!

The controlled agent is using these pairs. First, the prflx pair that firefox is also using as the nominated pair
2024-07-16 17:07:57.240 CheckDtlsHandshakeStart: DTLS path changed IceCandidatePair{ IceCandidate{ Rtp:{PeerDerived, {IP:194.69.105.x:9331}, base:194.69.105.x:9331, rel:194.69.105.x:9331, p:0x6e7e00ff, pipe:UDP },

But 5s later the controlled agent switches to using a higher priority pair (the one received from STUN Send Indication):
2024-07-16 17:08:02.619 CheckDtlsHandshakeStart: DTLS path changed IceCandidatePair{ IceCandidate{ Rtp:{PeerDerived, {IP:20.202.68.x:51469}, base:20.202.68.x:51469, rel:20.202.68.x:51469, p:0x6e7f00ff, pipe:UDP},

So, the ICE check in the indication is expected to have the same priority value as prflx. From https://datatracker.ietf.org/doc/html/rfc5245#section-7.1.2.1:

   An agent MUST include the PRIORITY attribute in its Binding request.
   The attribute MUST be set equal to the priority that would be
   assigned, based on the algorithm in Section 4.1.2, to a peer
   reflexive candidate, should one be learned as a consequence of this
   check (see Section 7.1.3.2.1 for how peer reflexive candidates are
   learned).  This priority value will be computed identically to how
   the priority for the local candidate of the pair was computed, except
   that the type preference is set to the value for peer reflexive
   candidate types.

That priority only gets used if the receiving ICE agent has never been told about that candidate before. Looking at the trickle ice RFC and the discussion around it on IETF mailing lists, it looks like they anticipated an issue where the two sides computed different priorities because one side interpreted the remote candidate as prflx (because the ICE check arrived before the candidate was trickled), while the other side interpreted that same candidate as something different. It appears that they attempted to resolve this problem in draft-ietf-ice-trickle-20, but I think the resulting behavior is broken for relay candidates. I'm trying to find someplace where this issue has been addressed. This might end up being a spec bug.

I believe this might be the same difficulty you're having:

https://github.com/w3c/webrtc-pc/issues/400

A similar issue was addressed in libwebrtc https://issues.webrtc.org/issues/42223384

(In reply to Byron Campen [:bwc] from comment #12)

(In reply to kventers from comment #11)

The controlled agent is using these pairs. First, the prflx pair that firefox is also using as the nominated pair
2024-07-16 17:07:57.240 CheckDtlsHandshakeStart: DTLS path changed IceCandidatePair{ IceCandidate{ Rtp:{PeerDerived, {IP:194.69.105.x:9331}, base:194.69.105.x:9331, rel:194.69.105.x:9331, p:0x6e7e00ff, pipe:UDP },

But 5s later the controlled agent switches to using a higher priority pair (the one received from STUN Send Indication):
2024-07-16 17:08:02.619 CheckDtlsHandshakeStart: DTLS path changed IceCandidatePair{ IceCandidate{ Rtp:{PeerDerived, {IP:20.202.68.x:51469}, base:20.202.68.x:51469, rel:20.202.68.x:51469, p:0x6e7f00ff, pipe:UDP},

So, the ICE check in the indication is expected to have the same priority value as prflx. From https://datatracker.ietf.org/doc/html/rfc5245#section-7.1.2.1:

Thanks for commenting! I understand that this linked issue https://datatracker.ietf.org/doc/html/rfc5245#section-7.1.2.1 is about candidates that controlling agent considers peer reflexive, but are routed through TURN server. However, in this case, it appears that Firefox already knows that the candidate pair is relay-relay, so I still don't get why
the priority that is sent for the first prflx-relay candidate: 6e7e00ff / 1853751551
is lower
than priority sent for the second relay-relay candidate: 6e7f00ff / 1853817087
and why doesn't the second value appear anywhere in Firefox logs.
In logs, it appears that this second pair should have a much lower priority
Line 8898: 2024-07-16 17:08:32.860000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0))/CAND-PAIR(fgRI): Pairing candidate IP4:20.202.68.89:51469/UDP (7f1fff):IP4:52.112.140.223:3480/UDP (337ffff) priority=35782502227378174 (7f1fff066ffffe)
Line 8957: 2024-07-16 17:08:32.861000 UTC - [Socket 8996: Socket Thread]: D/mtransport (ice/INFO) ICE-PEER(PC:{2b030457-c489-4bad-92ed-e4a1ba966afc} 1721149674207000 (id=6442450946 url=https://latest-webclient.skype.com/?userId=0):default)/CAND-PAIR(fgRI): setting pair to state SUCCEEDED: fgRI|IP4:20.202.68.89:51469/UDP|IP4:52.112.140.223:3480/UDP(turn-relay(IP6:[fec0::e979]:0/TCP|IP4:20.202.68.89:51469/UDP)|candidate:1 1 UDP 54001663 52.112.140.223 3480 typ relay raddr 10.0.147.236 rport 3480)

Is there some way to get more insight into the internals and logic of how Firefox sends these STUN indications and sets priorities on them? It appears that some outgoing STUN requests are not logged at all.

Flags: needinfo?(docfaraday)

You might get more by setting R_LOG_LEVEL=7 in the environment, but most of the time when debugging nICEr I end up using a pcap. The issue you're encountering here is a bug in RFC 8838. I'm talking with the authors in preparation to file some errata on that spec, and determine the appropriate fix.

Flags: needinfo?(docfaraday)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: