Closed Bug 1629565 Opened 4 months ago Closed 4 months ago

null candidate fired ahead of SLD(answer) on SRD w/rollback or double SRD (regression)

Categories

(Core :: WebRTC: Signaling, defect, P2)

74 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla77
Tracking Status
firefox-esr68 --- unaffected
firefox74 --- wontfix
firefox75 --- wontfix
firefox76 --- wontfix
firefox77 --- fixed

People

(Reporter: jib, Assigned: bwc)

References

(Regression)

Details

(Keywords: regression)

Attachments

(11 files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

Doing SRD twice on the answerer (which is allowed), causes ICE gathering to jump the gate, firing answer candidates ahead of answer.

This is against spec.

STRs:

  1. Open https://jsfiddle.net/jib1/gjcms01p/ and share camera

Expected results:

pc2.SRD DONE
pc2.SRD TWICE DONE
stable
pc2.SLD DONE
pc2.onicecandidate in stable
pc2.onicecandidate in stable

Actual results:

pc2.SRD DONE
pc2.SRD TWICE DONE
pc2.onicecandidate in have-remote-offer BOOOH!
pc1.addIce: InvalidStateError: Cannot add ICE candidate when there is no remote SDP
stable
pc2.SLD DONE
pc2.onicecandidate in stable

Regression range:

45:19.27 INFO: Narrowed inbound regression window from [442f57db, f829ca2e] (4 builds) to [fde54097, f829ca2e] (2 builds) (~1 steps left)
45:19.27 INFO: No more inbound revisions, bisection finished.
45:19.27 INFO: Last good revision: fde540977b4e52dd21031777aa35906ff81d6e63
45:19.27 INFO: First bad revision: f829ca2e1ad6d637eb01f5cbe50876f22fe53c37
45:19.27 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=fde540977b4e52dd21031777aa35906ff81d6e63&tochange=f829ca2e1ad6d637eb01f5cbe50876f22fe53c37

Flags: needinfo?(docfaraday)

Yeah, this is just another race opened up by bug 1591199.

Flags: needinfo?(docfaraday)
Priority: -- → P3
Summary: Early pc.onicecandidate before SLD on double SRD (regression) → Answer candidates fired ahead of answer on double SRD (regression)

Seems like a logic problem, not a race. Note the

console.log("pc2.SRD DONE");
await wait(1000);

in the fiddle, which is not needed to reproduce btw, but shows it isn't timing related. I.e. we can wait any number of seconds after the first SRD and NO candidates fire prematurely.

It specifically takes a second SRD to trigger ICE, at which point it's 100% reproducible (no SLD needed). No timing involved. The other end will have no answer to apply these candidates to, likely causing the call to fail.

I think investigating why this happens should be a P2 in case there are other ways to trigger this logic error.

In fact, it appears to be happening on SRD implicit rollback: https://jsfiddle.net/jib1/gjcms01p/28/ which may miss an important candidate, causing calls to fail to connect.

Priority: P3 → P2
Flags: needinfo?(docfaraday)
Summary: Answer candidates fired ahead of answer on double SRD (regression) → Answer candidates fired ahead of answer on SRD w/rollback or double SRD (regression)

Hi Nico, is there currently someone working on this? Thank you?

Flags: needinfo?(na-g)

(In reply to Byron Campen [:bwc] from comment #3)

This test case widens the window of opportunity for the race, even without any waiting. The problem is that the handling for state changes for the sLD(answer) has been delayed by bug 1591199 enough that the candidates can arrive before those state changes are carried out, provided the ICE stack is given enough head start.

I don't understand this explanation, because this reproduces even without calling sLD(answer) https://jsfiddle.net/jib1/gjcms01p/39/

I also verified this is not fixed by the patches in bug 1622384.

Flags: needinfo?(docfaraday)

Ok, that's weird. I'll look into it.

Assignee: nobody → docfaraday
Flags: needinfo?(docfaraday)

iceGatheringState goes to "complete" for some reason on second SRD(offer) - https://jsfiddle.net/jib1/gjcms01p/48/

Flags: needinfo?(na-g)
Summary: Answer candidates fired ahead of answer on SRD w/rollback or double SRD (regression) → null candidate fired ahead of answer on SRD w/rollback or double SRD (regression)
Summary: null candidate fired ahead of answer on SRD w/rollback or double SRD (regression) → null candidate fired ahead of SLD(answer) on SRD w/rollback or double SRD (regression)

I've verified it's the null candidate that's being fired on single SRD() with implicit rollback as well. https://jsfiddle.net/jib1/gjcms01p/50/

Ok, this is due to the implicit rollback itself; rolling back a sLD shuts down gathering, which means we emit an end-of-candidates. Looking at the spec now to determine what it says should happen on rollback.

Yeah, webrtc-pc says that whenever gathering transitions to complete for all ice transports, we fire this null candidate. I wonder if this should apply when we're rolling back to a state where there are no ICE transports at all? The ICE gathering state should be "new" in that case, probably?

Ah, unlike comment 8 which said pc2.onicecandidate in have-remote-offer null BOOOH!,
comment 9 actually says pc2.onicecandidate null in have-local-offer BOOOH! which is quite normal. I'll remove the BOOOH! there.

But one problem remains with the comment 9 fiddle updated with onicegatheringstatechange: It looks like we're firing "complete" without ever having fired "gathering", which the spec indicates we should have done "When the ICE Agent indicates that it began gathering" which sans a state diagram clearly has to have happened before "When the ICE Agent is finished gathering". That seems wrong (in Firefox).

Yeah, webrtc-pc says that whenever gathering transitions to complete for all ice transports, we fire this null candidate. I wonder if this should apply when we're rolling back to a state where there are no ICE transports at all? The ICE gathering state should be "new" in that case, probably?

I think it's a stretch to say ice gathering is "complete" when it got aborted by its transport being rolled back and vaporized.

After all, the spec says: "When the ICE Agent is finished gathering a generation of candidates for an RTCIceTransport, and those candidates have been surfaced to the application". I'd argue it wasn't allowed to finish in this case.

ICE seems supportive of this interpretation: "ICE processing across all media streams also has a state associated with it. This state is equal to Running while ICE processing is under way. The state is Completed when ICE processing is complete and Failed if it failed without success."

But regardless of interpretation, I see no support anywhere for the behavior in comment 8 (without any SLD on pc2) and earlier, because the phrase "and those candidates have been surfaced to the application" above is a reference to [[EarlyCandidates]], which means we must not ever fire "complete" before SLD(answer).

Yeah, I think in this case we should be going back to "new". As for going directly from "new" to "complete", there are corner-cases like failure to start gathering that look under-specified (although that isn't what is happening here).

As for comment 8, I think maybe we're triggering the ICE agent rollback stuff with sRD somehow? Looking into it.

Depends on D72242

Depends on D72243

Depends on D72244

Try looks good.

Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e2f411ffa079
Test-case for bug. r=jib
https://hg.mozilla.org/integration/autoland/rev/5e5e7b27d296
Fix bug where repeated sRD(offer) could start gathering. r=mjf
https://hg.mozilla.org/integration/autoland/rev/673ef75f8fe6
Test that we do not transition to "gathering" on a simple reoffer. r=jib
https://hg.mozilla.org/integration/autoland/rev/bee9068e5a39
Do not transition back to "gathering" if no gathering is needed. r=mjf
https://hg.mozilla.org/integration/autoland/rev/41ac898a5d97
Test that we do not transition to "gathering" on rollback if we were already done gathering. r=jib
https://hg.mozilla.org/integration/autoland/rev/3a5601c68d76
Fix bug where rollback of local offer would stomp the pre-existing transport with a new one with the same ufrag/pwd. r=mjf
https://hg.mozilla.org/integration/autoland/rev/40c9fc4fa292
Test-cases for transitioning to "new" gathering state. r=jib
https://hg.mozilla.org/integration/autoland/rev/065ddaaefe60
Transition gathering state to "new" if a negotiation leaves us with no transports. r=mjf
https://hg.mozilla.org/integration/autoland/rev/09af49103607
Some logging that was helpful. r=mjf
https://hg.mozilla.org/integration/autoland/rev/53ba1301dfdd
Mochitest fixes. r=jib
https://hg.mozilla.org/integration/autoland/rev/7917d7e0a471
Another test we should probably have. Passes. r=jib
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/23274 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Regressions: 1633533
Upstream PR merged by moz-wptsync-bot
Regressions: 1633780
Regressions: 1633595
You need to log in before you can comment on or make changes to this bug.