Closed Bug 1525802 Opened 5 years ago Closed 5 years ago

ICE restart fails (if invoked twice)

Categories

(Core :: WebRTC: Networking, defect, P2)

67 Branch
defect

Tracking

()

RESOLVED DUPLICATE of bug 1526477
Tracking Status
firefox67 --- affected

People

(Reporter: ibc, Unassigned)

References

(Depends on 1 open bug)

Details

Strange issue has it happens when the "ICE restart" procedure is invoked twice into a PeerConnection.

Scenario:

Two PeerConnections, one just for sending audio and video, and another just for receiving N audio and video tracks.

ICE restart in the sending PeerConnection:

  • Call pc.createOffer({ iceRestart: true }) and pc.setLocalDescription(newOffer).
  • Call pc.setRemoteDescription(answer) with an answer SDP that contains new ICE ufrag/pwd.

It works. FF still sends media to the remote.

ICE restart in the receiving PeerConnection:

  • Call pc.setRemoteDescription(answer) with a remote offer SDP that contains new ICE ufrag/pwd.
  • Call pc.createAnswer() and pc.setLocalDescription(newAnswer).

It works. FF still receives media from the remote.

Now, if steps above are repeated, FF stops sending media in the first pc and stops receiving media in the second pc.

100% reproducible here: https://demo.mediasoup.org (just click on the "Wi-Fi" button at the left of the screen, which performs the procedures above in both PeerConnections).

Hi Iñaki, I see the problem when I follow your steps. But it's hard to tell what's going on.

I tried to reproduce it in https://jsfiddle.net/jib1/sy82fmbg/ based on comment 0, but it works for me. Would you be able to modify the fiddle to show the problem?

Flags: needinfo?(ibc)

Hi Jan, I think the issue happens when the remove ICE ufrag/pwd changes. This is, when calling setRemoteDescription() with a re-offer or re-answer that have different ICE ufrag/pwd. Is the jsfiddle supposed to also do that? I think so. However, given that it's local host, there is no chance for receiving a ICE 401 (wrong ufrag/pwd), which does happen in my environment. May be related?

Flags: needinfo?(ibc)

Byron, wdyt?

Flags: needinfo?(docfaraday)

The demo does not seem to work at all for me. I see 4 PCs created. 2 are not used at all. One of the remaining has audio and video m-sections, both sendonly. The other one has only an audio m-section, also sendonly.

Edit: This is on first load, no ICE restarts have happened.

Flags: needinfo?(docfaraday)

(In reply to Iñaki Baz Castillo from comment #2)

Hi Jan, I think the issue happens when the remove ICE ufrag/pwd changes. This is, when calling setRemoteDescription() with a re-offer or re-answer that have different ICE ufrag/pwd. Is the jsfiddle supposed to also do that? I think so. However, given that it's local host, there is no chance for receiving a ICE 401 (wrong ufrag/pwd), which does happen in my environment. May be related?

That fiddle does perform an ICE restart, so the ufrag/pwd should be changing.

The demo does not seem to work at all for me. I see 4 PCs created. 2 are not used at all. One of the remaining has audio and video m-sections, both sendonly. The other one has only an audio m-section, also sendonly.

What does i mean "does not seem to work at all"? The app creates 3 pcs:

  • An initial one just to inspect device RTC capabilities (it's closed soon).
  • A pc for just sending audio/video (so a=sendonly in all m= sections).
  • A pc for just receiving N audio/video tracks from remote participants (so a=recvonly in all m= sections).

This is how it works.

That fiddle does perform an ICE restart, so the ufrag/pwd should be changing.

I know. But it runs in localhost so there is no chance to get a ICE 401 "usernameFrag/pwd mismatch". In my demo such a ICE 401 does happen (because the remote usernameFrag/pwd may change while there is an ongoing STUN Binding request from FF. That cannot happen in localhost.

I just wonder if Firefox stops the transport upon receipt of a ICE 401 reply.

(In reply to Iñaki Baz Castillo from comment #6)

The demo does not seem to work at all for me. I see 4 PCs created. 2 are not used at all. One of the remaining has audio and video m-sections, both sendonly. The other one has only an audio m-section, also sendonly.

What does i mean "does not seem to work at all"? The app creates 3 pcs:

  • An initial one just to inspect device RTC capabilities (it's closed soon).
  • A pc for just sending audio/video (so a=sendonly in all m= sections).
  • A pc for just receiving N audio/video tracks from remote participants (so a=recvonly in all m= sections).

This is how it works.

That's not what is happening in the SDP negotiation when I try using it. I wonder what has gone wrong...

That fiddle does perform an ICE restart, so the ufrag/pwd should be changing.

I know. But it runs in localhost so there is no chance to get a ICE 401 "usernameFrag/pwd mismatch". In my demo such a ICE 401 does happen (because the remote usernameFrag/pwd may change while there is an ongoing STUN Binding request from FF. That cannot happen in localhost.

I just wonder if Firefox stops the transport upon receipt of a ICE 401 reply.

That can cause the stun transaction to fail if it happens more than once. Getting a 401 would have to involve a candidate mixup though, right? Because an ICE restart is going to use a new set of ports, and I would not expect the old ones (and their STUN contexts) to know anything about the new ufrag/pwd.

That's not what is happening in the SDP negotiation when I try using it. I wonder what has gone wrong...

I hope this is irrelevant, unless something is failing for you (I mean without even trying the ICE restart stuff), is it failing?

That can cause the stun transaction to fail if it happens more than once. Getting a 401 would have to involve a candidate mixup though, right? Because an ICE restart is going to use a new set of ports, and I would not expect the old ones (and their STUN contexts) to know anything about the new ufrag/pwd.

The server does not change its ports. It's a ICE Lite server. When we do "ICE restart" we tell the server to just change its ICE usernameFrag/pwd, nothing else. So this is what happens in client side (FF):

Sending PC

  1. FF calls pc.createOffer({ iceRestart: true }) => newOffer.

  2. FF calls pc.setLocalDescription(newOffer).

  3. FF calls pc.setRemoteDescription(newAnswer). Such a newAnswer is the very same as before (same ICE candidates) but different a=ice-uFrag and a=ice-pwd.

Receiving PC

  1. FF calls pc.setRemoteDescription(newOffer). Such a newOffer is the very same as before (same ICE candidates) but different a=ice-uFrag and a=ice-pwd.

  2. FF calls pc.createAnswer() => newAnswer.

  3. FF calls pc.setLocalDescription(newAnswer).

So, yes, it may happen that, during the ICE restart procedure, FF gets some STUN 401 in a valid candidate pair. This also happens in Chrome/Safari, but Chrome/Safari do NOT "stop" the transport due to those eventual STUN 401 responses.

(In reply to Iñaki Baz Castillo from comment #8)

That's not what is happening in the SDP negotiation when I try using it. I wonder what has gone wrong...

I hope this is irrelevant, unless something is failing for you (I mean without even trying the ICE restart stuff), is it failing?

It does not work from the very beginning, no ICE restart required.

That can cause the stun transaction to fail if it happens more than once. Getting a 401 would have to involve a candidate mixup though, right? Because an ICE restart is going to use a new set of ports, and I would not expect the old ones (and their STUN contexts) to know anything about the new ufrag/pwd.

The server does not change its ports. It's a ICE Lite server. When we do "ICE restart" we tell the server to just change its ICE usernameFrag/pwd, nothing else. So this is what happens in client side (FF):

Ah, I see.

Sending PC

  1. FF calls pc.createOffer({ iceRestart: true }) => newOffer.

  2. FF calls pc.setLocalDescription(newOffer).

  3. FF calls pc.setRemoteDescription(newAnswer). Such a newAnswer is the very same as before (same ICE candidates) but different a=ice-uFrag and a=ice-pwd.

Receiving PC

  1. FF calls pc.setRemoteDescription(newOffer). Such a newOffer is the very same as before (same ICE candidates) but different a=ice-uFrag and a=ice-pwd.

  2. FF calls pc.createAnswer() => newAnswer.

  3. FF calls pc.setLocalDescription(newAnswer).

So, yes, it may happen that, during the ICE restart procedure, FF gets some STUN 401 in a valid candidate pair. This also happens in Chrome/Safari, but Chrome/Safari do NOT "stop" the transport due to those eventual STUN 401 responses.

So, 401s on the old streams shouldn't cause the transport to fail, which is what we're seeing here right? The new streams aren't getting 401ed or anything?

It does not work from the very beginning, no ICE restart required.

Oh, this is super tested code. It works in Firefox for more than a year. I've just tested it right now with FF stable and nightly.

I know this is not the place to ask this, but may you please set localStorage.setItem('debug', '*'), refresh the page, and tell me which errors you get?

So, 401s on the old streams shouldn't cause the transport to fail, which is what we're seeing here right? The new streams aren't getting 401ed or anything?

I don't know what you mean with "the new streams". If you mean about new added remote streams/tracks, they are not received/rendered given that they all use a single Bundled transport.

By testing I see the following:

Running the ICE restart procedures described above (by clicking into the "Wi-Fi" icon) sometimes FF sending and/or receiving media freezes. Not always, but typically after the first or second "click" on the ICE restart button.

I strongly think this is due the eventual STUN 401 received by FF. This is what IMHO is happening:

  • FF pc1 is sending audio/video and periodic ICE binding requests.
  • FF pc2 is receiving audio/video and is sending periodic ICE binding requests.
  • FF pc1 sends another ICE binding request (no response received yet).
  • FF pc2 sends another ICE binding request (no response received yet).
  • ICE restart procedures done, so the server transport has changed its userFrag/pwd.
  • The ongoing ICE binding requests reach the server.
  • The server replies 401 to them (because its user/pwd have changed).
  • FF does not like it and assumes "broken ICE/transport" => transport becomes unusable.

(In reply to Iñaki Baz Castillo from comment #10)

It does not work from the very beginning, no ICE restart required.

Oh, this is super tested code. It works in Firefox for more than a year. I've just tested it right now with FF stable and nightly.

I know this is not the place to ask this, but may you please set localStorage.setItem('debug', '*'), refresh the page, and tell me which errors you get?

I'll get back to you later today on that.

So, 401s on the old streams shouldn't cause the transport to fail, which is what we're seeing here right? The new streams aren't getting 401ed or anything?

I don't know what you mean with "the new streams". If you mean about new added remote streams/tracks, they are not received/rendered given that they all use a single Bundled transport.

The "new streams" are the new ICE streams established due to the ICE restart. The way we implement ICE restart is by marking the old ICE streams "obsolete" (ie; they won't send consent refresh, and won't keep running ICE if they are not connected yet, but are still able to send/recv packets if they already were connected), and creating a brand new set of ICE streams to take over for the old ones. This allows traffic to continue flowing for a short time while the new ICE runs. So, 401s landing on the old streams shouldn't hurt anything.

OkK. However something is happening and I'll try to reduce the scenario as follows (assuming you get your FF working in the demo app):

  • Open https://v2demo.mediasoup.org/?roomId=FOO in Chrome and enable webcam.

  • Open https://v2demo.mediasoup.org/?roomId=FOO&spy=true (note the spy=true). It won't send anything but just receive media. So it will just run a single peerconnection (a=recvonly) and the SDP offerer will always be the server (although remote SDPs are generated locally by the JS lib).

  • Now, if you click on the "Wi-Fi" icon at the left, ICE restart will be done as follows:

  1. FF requests the server (via signaling) to change its ICE ufrag/pwd and retrieves the new values.
  2. FF calls pc.setRemoteDescription(newOffer). newOffer is the very same as before (same ICE candidates) but with the new remote a=ice-uFrag and a=ice-pwd.
  3. FF calls pc.createAnswer() => newAnswer.
  4. FF calls pc.setLocalDescription(newAnswer).

So there is no even pc.createOffer({ restartIce:true }) here.

Do we agree that this scenario should not fail in FF and the transport should still receive media?

What happens when running steps in my previous comment is that:

  • First click on "ICE restart": may be nothing wrong happens and FF still receives audio/video from Chrome via the SFU (an STUN 401 "wrong authentication in STUN Binding Request" may have happened, but media still flows).

  • Second click on "ICE restart": remote audio/video freezes in FF (an STUN 401 "wrong authentication in STUN Binding Request" may have happened, but even if not, media becomes frozen).

So it seems not to be related to the 401 as you indicated.

NOTE: I do log those 401 in the SFU:

"RTC::IceServer::ProcessStunMessage() | wrong authentication in STUN Binding Request => 401"

So, first "ICE restart" works. Second fails. Honestly no idea. In the server side this is proven (you can do those ICE restarts in Chrome 1000 times and will work).

Oh, this is neat.

This demo is still not quite working for me, but it looks like a media rendering problem; packets still flow, and I can see them stop flowing in the way you describe.

So you remember me saying that the old streams are kept around until the ICE restart succeeds?

The ICE restarts never succeed here. So the first time we try to restart, and it never works, we continue using the old streams. The second time we restart, we detect that an ICE restart is already in progress, and discard the old streams.

I think this is mainly an ICE lite bug maybe? I see a succeeded pair, but nothing has nominated it.

What is fascinating to me is that the new ICE for the restart is never timing out, it's just in limbo. Very odd.

Ah! That's really neat. When we time out, we look for components without succeeded pairs, but do not check whether those pairs have been nominated. That's how we end up in limbo.

See Also: → 1526477

I think this is mainly an ICE lite bug maybe? I see a succeeded pair, but nothing has nominated it.

That is some just related to FF, right? I mean, the server just replies to ICE binding request. I know that FF does not properly support ICE Lite endpoints (it tries to do aggressive nomination anyway) but our SFU does not care. Anyway, since FF does not support media over TCP, there is a single remote ICE candidate (UDP IPv4 in this case) so ICE should connect no matter FF uses aggressive nomination (this is mainly tested).

Well, I see you have identified the bug :)

Is there anything else you need for me to help here?

Well, I have identified at least one bug here. I am not totally sure what I think about the limbo we get into if the controller never nominates; it has certainly obfuscated the real issue here. Seems like a bug also, but could be debated.

See Also: → 1526485

I've got net traces (did them many months ago) but AFAIR the flow is as follows:

  • Initially FF sends ICE Binding request with USE-CANDIDATE (since FF ignores ICE Lite rules).

  • The server replies with ok response.

  • Eventually ICE restart is done in the server (so the server changes its user/pwd and tells about them to FF within a SDP re-offer with the very same remote candidates as before, but new a=ice-xxxx).

  • FF sends a new ICE Binding request with USE-CANDIDATE and new ICE credentials.

  • The server replies with ok response.

  • Now the FF issue takes place.

AFAIU nothing else happens, am I wrong?

Depends on: 1526477, 1526485
Priority: -- → P5
Priority: P5 → P2
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.