Closed Bug 1965831 Opened 6 months ago Closed 5 months ago

Sites using Twilio Video SDK broke in Firefox 138

Categories

(Web Compatibility :: Site Reports, defect, P1)

Tracking

(Webcompat Priority:P2, Webcompat Score:5, firefox-esr128 unaffected, firefox138 wontfix, firefox139+ fixed, firefox140+ verified)

VERIFIED FIXED
140 Branch

People

(Reporter: pehrsons, Assigned: pehrsons)

References

(Regression)

Details

(Keywords: regression, webcompat:platform-bug, webcompat:site-report)

User Story

platform:windows,mac,linux,android
impact:workflow-broken
configuration:general
affects:all
branch:release
diagnosis-team:video-conferencing
user-impact-score:160

This was filed with Twilio as https://github.com/twilio/twilio-video.js/issues/2101
We have been able to reproduce this issue using https://networktest.twilio.com/

136 and 137 are passing all tests, whereas Nightly fails the last two tests which are for testing video using Twilio's TURN servers.

Severity: -- → S2
Priority: -- → P2
Flags: needinfo?(dbaker)
User Story: (updated)
Webcompat Priority: --- → P2
Webcompat Score: --- → 5
Priority: P2 → P1

Set release status flags based on info from the regressing bug 1949282

Hello, this is Luis.
I'm part of the team working on the Twilio Video SDK. After compiling the source locally and doing some investigation, I identified the commit that appears to be causing the issue:
https://phabricator.services.mozilla.com/rMOZILLACENTRAL4de509660d334ea7eb0746f300177ee419f62171
Please let me know if you need any additional information or if there's anything I can help clarify.

Set release status flags based on info from the regressing bug 1949282

(In reply to Luis Rivas from comment #3)

> Hello, this is Luis.
> I'm part of the team working on the Twilio Video SDK. After compiling the source locally and doing some investigation, I identified the commit that appears to be causing the issue:
> https://phabricator.services.mozilla.com/rMOZILLACENTRAL4de509660d334ea7eb0746f300177ee419f62171
> Please let me know if you need any additional information or if there's anything I can help clarify.

Thanks Luis for your effort on this. That's the same regressor we found. I believe I have caught the bug on https://networktest.twilio.com with rr and Pernosco, so we should be able to figure this out shortly.

If you do employ a workaround for this issue, it'd be great to keep some test page for us to verify a fix against. We should also be able to add an automated test case for this issue, but end-to-end verification is always good to have in addition.

(In reply to Andreas Pehrson [:pehrsons] from comment #5)

> Thanks Luis for your effort on this. That's the same regressor we found. I believe I have caught the bug on https://networktest.twilio.com with rr and Pernosco, so we should be able to figure this out shortly.
>
> If you do employ a workaround for this issue, it'd be great to keep some test page for us to verify a fix against. We should also be able to add an automated test case for this issue, but end-to-end verification is always good to have in addition.

Absolutely, we can keep that site as is so you can use it for validations. While we considered workarounds for Twilio Video, supporting the Mozilla team in addressing the issue seems best. A quick fix might cause problems with other use cases and future Firefox versions, as it's hard to pinpoint the exact issue.

Depends on: 1965960
Depends on: 1966185

(In reply to Luis Rivas from comment #7)

> Absolutely, we can keep that site as is so you can use it for validations. While we considered workarounds for Twilio Video, supporting the Mozilla team in addressing the issue seems best. A quick fix might cause problems with other use cases and future Firefox versions, as it's hard to pinpoint the exact issue.

Thank you Luis. Please note we have confirmed that three different fixes, at different levels, all address the issue of not receiving video. Not all of them are up yet.

During our investigation on https://networktest.twilio.com we found there are three offer/answer exchanges taking place. We don't think that's relevant but haven't finished a minimal test case yet.

The final local description for Firefox, an answer, contains two video m-lines, one inactive and one recvonly. This is relevant and triggers the bug. A number of things have to hold for this bug to trigger:

  • a video m-section A must have been the only m-section, and active with a recv direction, when negotiated
  • m-section A may at no point have had an a=ssrc line
  • a renegotiation must happen where A is inactive and another video m-section B is active with a recv direction, and with the same payload types that A was configured for when active
  • m-sections A and B must be combined with BUNDLE
  • no MID RTP header extension, at least for m-section A

If all this holds and packets destined for m-section B are received, our code gets confused and routes them to A instead (which is inactive, so the packets end up being ignored), and B gets reconfigured internally for some other recv SSRC that we generate on the fly.

edit May 14: added the bit on MID
edit May 15: rewritten with the bits on a=ssrc lines and renegotiation
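
The conditions listed in this comment that are visible in a single SDP can be checked mechanically. The following is a minimal sketch in Python, not Firefox code: the function names and the example SDP are made up for illustration. The comment does not say which description each condition is read from, so the sketch simply inspects the one SDP string it is handed. It cannot see negotiation history, so the conditions about earlier negotiations are not covered, and the payload type condition is approximated by comparing the payload types currently listed on the two m-lines.

```python
MID_EXT_URI = "urn:ietf:params:rtp-hdrext:sdes:mid"
DIRECTIONS = {"a=sendrecv", "a=sendonly", "a=recvonly", "a=inactive"}


def parse_sdp(sdp):
    """Return (bundled_mids, m_sections) for one SDP string."""
    bundled, sections, cur = set(), [], None
    for line in (raw.strip() for raw in sdp.splitlines()):
        if line.startswith("m="):
            fields = line[2:].split()
            cur = {"kind": fields[0], "pts": set(fields[3:]), "mid": None,
                   "direction": "sendrecv", "has_ssrc": False,
                   "has_mid_ext": False}
            sections.append(cur)
        elif line.startswith("a=group:BUNDLE"):
            bundled.update(line.split()[1:])
        elif cur is None:
            continue  # other session-level lines are not needed here
        elif line.startswith("a=mid:"):
            cur["mid"] = line[len("a=mid:"):]
        elif line in DIRECTIONS:
            cur["direction"] = line[2:]
        elif line.startswith("a=ssrc:"):
            cur["has_ssrc"] = True
        elif line.startswith("a=extmap:") and MID_EXT_URI in line.split()[1:]:
            cur["has_mid_ext"] = True
    return bundled, sections


def risky_pairs(sdp):
    """List (inactive_mid, active_mid) pairs matching the static conditions."""
    bundled, sections = parse_sdp(sdp)
    videos = [s for s in sections
              if s["kind"] == "video" and s["mid"] in bundled]
    pairs = []
    for a in videos:
        # A: inactive, no a=ssrc line, no MID header extension.
        if a["direction"] != "inactive" or a["has_ssrc"] or a["has_mid_ext"]:
            continue
        for b in videos:
            # B: a different bundled video section that receives and
            # shares at least one payload type with A.
            if (b is not a and b["direction"] in ("recvonly", "sendrecv")
                    and a["pts"] & b["pts"]):
                pairs.append((a["mid"], b["mid"]))
    return pairs


# Hypothetical answer with one inactive and one recvonly video m-section.
EXAMPLE_ANSWER = "\n".join([
    "v=0",
    "o=- 0 0 IN IP4 127.0.0.1",
    "s=-",
    "t=0 0",
    "a=group:BUNDLE 0 1",
    "m=video 9 UDP/TLS/RTP/SAVPF 96 97",
    "a=mid:0",
    "a=inactive",
    "a=rtpmap:96 VP8/90000",
    "m=video 9 UDP/TLS/RTP/SAVPF 96 97",
    "a=mid:1",
    "a=recvonly",
    "a=rtpmap:96 VP8/90000",
])

print(risky_pairs(EXAMPLE_ANSWER))  # [('0', '1')]
```

A non-empty result only means the SDP has the shape described above. Whether the bug actually triggers also depends on the earlier negotiations.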

Flags: needinfo?(dbaker)

:dbaker, since you are the author of the regressor, bug 1949282, could you take a look?


Flags: needinfo?(dbaker)

Looking more into this situation with MID, I think Twilio is in violation of RFC8843 section 9.1 here.
On MID with BUNDLE it says:

The RTP MID header extension MUST be enabled, by including an SDP 'extmap' attribute [RFC8285], with a 'urn:ietf:params:rtp-hdrext:sdes:mid' URI value, in each bundled RTP-based "m=" section in every offer and answer.

I see a=group:BUNDLE 0 1 application0 audio0 video0 and no MID extension.

Luis, FYI.
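
The RFC 8843 requirement quoted in this comment can be checked on an SDP with a few lines of code. This is a rough sketch in Python under stated assumptions: the function name and the example SDP are hypothetical, and an m-section is treated as RTP-based when its protocol field contains an `RTP` token. It reports the mids of bundled RTP m-sections that carry no `extmap` line for the MID URI.

```python
MID_EXT_URI = "urn:ietf:params:rtp-hdrext:sdes:mid"


def bundled_rtp_sections_missing_mid(sdp):
    """Return mids of bundled RTP m-sections without the MID extmap."""
    bundled = set()
    sections = []
    for line in (raw.strip() for raw in sdp.splitlines()):
        if line.startswith("a=group:BUNDLE"):
            bundled.update(line.split()[1:])
        elif line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            proto = line.split()[2]
            sections.append({"mid": None,
                             "is_rtp": "RTP" in proto.split("/"),
                             "has_mid_ext": False})
        elif sections and line.startswith("a=mid:"):
            sections[-1]["mid"] = line[len("a=mid:"):]
        elif (sections and line.startswith("a=extmap:")
              and MID_EXT_URI in line.split()[1:]):
            sections[-1]["has_mid_ext"] = True
    return [s["mid"] for s in sections
            if s["is_rtp"] and s["mid"] in bundled and not s["has_mid_ext"]]


# Made-up SDP fragment: audio0 enables the MID extension, video0 does not,
# and application0 is a data channel (not RTP), so the rule does not apply.
EXAMPLE_SDP = "\n".join([
    "v=0",
    "a=group:BUNDLE audio0 video0 application0",
    "m=audio 9 UDP/TLS/RTP/SAVPF 111",
    "a=mid:audio0",
    "a=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid",
    "m=video 9 UDP/TLS/RTP/SAVPF 96",
    "a=mid:video0",
    "m=application 9 UDP/DTLS/SCTP webrtc-datachannel",
    "a=mid:application0",
])

print(bundled_rtp_sections_missing_mid(EXAMPLE_SDP))  # ['video0']
```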

Flags: needinfo?(lrivas)

(In reply to Andreas Pehrson [:pehrsons] from comment #10)

> Looking more into this situation with MID, I think Twilio is in violation of RFC8843 section 9.1 here.
> On MID with BUNDLE it says:
>
> The RTP MID header extension MUST be enabled, by including an SDP 'extmap' attribute [RFC8285], with a 'urn:ietf:params:rtp-hdrext:sdes:mid' URI value, in each bundled RTP-based "m=" section in every offer and answer.
>
> I see a=group:BUNDLE 0 1 application0 audio0 video0 and no MID extension.
>
> Luis, FYI.

Hi, Andreas

Thank you very much for the heads up. I will report that mismatch internally so we can review it. Is your team planning to apply stricter rules for this in the coming versions? We offer another type of configuration where SDP is not modified, and I confirmed that the missing MID extension you mentioned does not occur there, but it is still impossible to establish a call between Firefox 137 and Firefox 138 if Firefox 137 connects first. The only way to set up a video call correctly was to connect Firefox 138 first and then join the call using Firefox 137 or any other browser, and that ordering is out of our control.

That said, we decided not to use a workaround for Firefox 138 since we considered it would have more downsides down the road. However, we would be more than happy to discuss how we could collaborate on this matter going forward to offer a good experience for Firefox users.

Flags: needinfo?(lrivas) → needinfo?(apehrson)

The bug is marked as tracked for firefox139 (beta) and tracked for firefox140 (nightly). We have limited time to fix this; the soft freeze is in 7 days. However, the bug still isn't assigned.

:denschub, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.


Flags: needinfo?(dschubert)

Assigning Andreas to make the bot happy - but please feel free to assign someone else if needed.

Assignee: nobody → apehrson
Flags: needinfo?(dschubert)

(In reply to Luis Rivas from comment #11)

> Hi, Andreas
>
> Thank you very much for the heads up. I will report that mismatch internally so we can review it. Is your team planning to apply stricter rules for this in the coming versions? We offer another type of configuration where SDP is not modified, and I confirmed that the missing MID extension you mentioned does not occur there, but it is still impossible to establish a call between Firefox 137 and Firefox 138 if Firefox 137 connects first. The only way to set up a video call correctly was to connect Firefox 138 first and then join the call using Firefox 137 or any other browser, and that ordering is out of our control.
>
> That said, we decided not to use a workaround for Firefox 138 since we considered it would have more downsides down the road. However, we would be more than happy to discuss how we could collaborate on this matter going forward to offer a good experience for Firefox users.

No stricter rules planned, that seems risky wrt breaking sites.

Note I have finished making an automated test case for this issue. The trigger case is even more narrow than I made out in comment 8. In addition to

  • BUNDLE
  • No MID extension
  • Inactive and active (receiving) video transceiver present

you also need

  • No a=ssrc line for the inactive transceiver, also for any earlier negotiation involving that transceiver
  • The inactive transceiver must have been active when finishing an earlier negotiation, when it also must have been the only active receiving transceiver with the payload type that later gets sent to the other active transceiver

That's a lot of stars aligning -- you should be able to figure out a workaround fairly easily. Let me know if I can assist further in that.

See my test case on https://phabricator.services.mozilla.com/D249545.

Do you have a test page where I can reproduce the bug with the other type of configuration for myself? I can look for bugs. Or check whether my slew of fixes addresses it.

Flags: needinfo?(apehrson) → needinfo?(lrivas)

I seem to have encountered this issue in Firefox 138 (it was working fine in Firefox 137). I've created a simple demo here, could you please help me resolve it?

This issue causes our RTC service to become unusable in Firefox 138 (received video renders as black screen).

In the past few days, we have received thousands of user reports, and after investigation, we have pinpointed this issue. I hope this problem can be resolved quickly (is it possible to revert the recent commit?) to minimize the impact on more users.

Thanks...

(In reply to xuanshu from comment #16)

> This issue causes our RTC service to become unusable in Firefox 138 (received video renders as black screen).
>
> In the past few days, we have received thousands of user reports, and after investigation, we have pinpointed this issue. I hope this problem can be resolved quickly (is it possible to revert the recent commit?) to minimize the impact on more users.
>
> Thanks...

Thank you for that test case. It does indicate that neither absence of the MID RTP header extension, nor absence of a=ssrc, is required to reproduce. I'll take a look to see how this failure mode is different from the one we previously found. We are working on a fix for Firefox 139. For a shorter-term fix than that, you'll need a workaround. I'll try to come up with something.

Here's a profile for the test case in comment 15. The failure mode is indeed different. This seems like a regression due to the packet filter now learning about new SSRCs even when it already knows about some.

(In reply to Andreas Pehrson [:pehrsons] from comment #18)

> Here's a profile for the test case in comment 15. The failure mode is indeed different. This seems like a regression due to the packet filter now learning about new SSRCs even when it already knows about some.

Hi, Andreas

Thank you for your response.

Given the critical and potentially devastating impact this issue is having on our operations, we kindly request an expedited resolution—even a temporary workaround—at your earliest convenience.

To explain further: many of our clients rely on the third-party SDK we provide. Implementing a Web SDK workaround would require extensive client-side upgrades, which is unfortunately not feasible in the short term due to deployment complexities. For this reason, a server-side or browser-level fix for Firefox would be invaluable to mitigate the issue immediately.

We sincerely appreciate your understanding and support in prioritizing this matter.

Best regards,

shu

Hi all.

Our service is currently experiencing this issue in Firefox 138. The minimal reproduction is here.

This reproduction contains two transceivers, both of which negotiate normally. In certain scenarios, the first transceiver may not start sending RTP, which causes the second one to render nothing. This scenario is quite common when connected to an SFU.

In my opinion, this is a critical bug that has already impacted several use cases, some of which are quite common. I strongly recommend reverting this change as soon as possible.

In latest Nightly, https://networktest.twilio.com and https://wpj5jv.csb.app/ (comment 15) are now working.
We'll have to do something more for the repro case in comment 20. I'll file another bug.

Depends on: 1967189

(In reply to Andreas Pehrson [:pehrsons] from comment #21)

> In latest Nightly, https://networktest.twilio.com and https://wpj5jv.csb.app/ (comment 15) are now working.
> We'll have to do something more for the repro case in comment 20. I'll file another bug.

Thanks Andreas!! Could you kindly confirm the estimated timeline for merging this fix into the codebase and which Firefox version it will be included in?

(In reply to xuanshu from comment #22)

> (In reply to Andreas Pehrson [:pehrsons] from comment #21)
>
> > In latest Nightly, https://networktest.twilio.com and https://wpj5jv.csb.app/ (comment 15) are now working.
> > We'll have to do something more for the repro case in comment 20. I'll file another bug.
>
> Thanks Andreas!! Could you kindly confirm the estimated timeline for merging this fix into the codebase and which Firefox version it will be included in?

We hope to get it into 139 which releases in about a week.

Hi, Andreas!

(In reply to Andreas Pehrson [:pehrsons] from comment #14)

> Do you have a test page where I can reproduce the bug with the other type of configuration for myself? I can look for bugs. Or check whether my slew of fixes addresses it.

We wanted to let you know we were planning to provide you with a private deployment so you could test it, but today our QE team confirmed that the most recent build (140.0a1) also fixes the issues in that configuration.

Flags: needinfo?(lrivas)

(In reply to Luis Rivas from comment #24)

> Hi, Andreas!
>
> (In reply to Andreas Pehrson [:pehrsons] from comment #14)
>
> > Do you have a test page where I can reproduce the bug with the other type of configuration for myself? I can look for bugs. Or check whether my slew of fixes addresses it.
>
> We wanted to let you know we were planning to provide you with a private deployment so you could test it, but today our QE team confirmed that the most recent build (140.0a1) also fixes the issues in that configuration.

Great, thank you for confirming.

We found so far that bug 1965960 fixes most issues. Bug 1967189 should fix the rest. Bug 1966185 will be for some tests, and cleanup of non-critical paths.

The primary "Depends on" bugs, Bug 1965960 and Bug 1967189, have been uplifted to Fx139. Fx139 is considered "fixed".
Bug 1966185 will follow in Fx140.

We don't have to track Bug 1966185 here.

Flags: needinfo?(dbaker)
See Also: → 1915079
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → 140 Branch

Based on comment #0

> This was filed with Twilio as https://github.com/twilio/twilio-video.js/issues/2101
> We have been able to reproduce this issue using https://networktest.twilio.com/
>
> 136 and 137 are passing all tests, whereas Nightly fails the last two tests which are for testing video using Twilio's TURN servers.

Verified, it passes all the tests on https://networktest.twilio.com/

Tested with:

  • Browser / Version: Firefox 140.0-candidate build 1
  • Operating System: Windows 10
Status: RESOLVED → VERIFIED