Closed Bug 1337810 Opened 3 years ago Closed 3 years ago

Jitsi video freezes after a few minutes of conversation

Categories

(Core :: WebRTC: Audio/Video, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla55
Tracking Status
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- verified
firefox55 --- verified

People

(Reporter: JuliaC, Assigned: jesup)

References

()

Details

(Keywords: regression, regressionwindow-wanted)

Attachments

(3 files, 2 obsolete files)

[Note]:
- This issue affects both Firefox-Firefox calls and Chrome-Firefox calls

[Affected versions]:
- latest Nightly 54.0a1 (2017-02-08)
- latest Aurora 53.0a2 (2017-02-08)
- 52.0b4 build1 (20170206101855)
- 51.0.1 build3 (20170125094131)


[Affected platforms]:
- Windows 10 x64
- Windows 7 x64
- Ubuntu 14.04 x86
- Mac OS X 10.11

[Steps to reproduce]:
1. Launch Firefox 
2. Go to https://meet.jit.si/, create a room, provide the camera and microphone permissions and join it
3. Join the same room from another station
4. Pay attention at the video/audio playback

[Expected result]:
- The call is successfully initiated for both participants
- The audio/video streams are properly rendered 

[Actual result]:
- After a few minutes from the call initiation, video freezes on the host side for ~2-3 minutes
- Video can be often resumed by switching between self view and guest view

[Regression range]:
- It is difficult to find a precise regression range, considering that a manual search is required
- On 49.0a2 (2016-07-04) the issue is still reproducible, but the freeze lasts just several seconds
Nils -- I'd like you to investigate and reach out Emil when it makes sense.  Since this reproduces with Fx 49, it's likely due to a recent change on Jitsi's end.  Thanks!
Assignee: nobody → drno
Rank: 15
Priority: -- → P1
So Jitsi is not aware of any such reports or anything which they have deployed which could have caused this. I'll have a look.
I was able to repro the problem. Call initiator receiving video froze for at least 30s. Video on the other end froze also but only for 5s or so.

RTP stats on about:webrtc kept increasing through the freeze. So it's probably not a actual RTP transport issue itself. Maybe some RTCP problem?
Looking at some NSPR and webrtc.org log files I see that the first couple of seconds (in this case) everything looks normal. Then suddenly Firefox sends three RTCP on the receiving video stream. And from there on at least every second received video RTP packet gets "answered" by Firefox with an RTCP packet.

So that looks like Firefox is requesting some form of retransmission, but is not happy with what it gets from the SFU.
Looking at Wireshark trace with logs combined: a single missing RTP video packet appear to generate the picture freeze and RTCP storm.
I wonder if this is https://bugs.chromium.org/p/webrtc/issues/detail?id=7143
You could try commenting out     bytes_sent += SendPadData(bytes - bytes_sent, false, 0, 0);
 in rtp_sender.cc in TimeToSendPadding(), and replace it with "bytes_sent = bytes".  Note: this is on the sending side...

Also: seeing the logs might be useful.
Yes I also wondered if bug 1339270 is the reason. Especially as I was only able to repro this with Nightly so far, but not with release (although missing a single packet is obviously not very deterministic way to repro something).
See Also: → 1339270
Verified that bug 1339270 is not causing this. Which would not have made too much sense as I test with a local build and release on the other end, but the video freezes on Nightly (which would be the one sending the padding in this case).
See Also: 1339270
Attached file mozLogs.zip
Here are logs from such a frozen call; local buid; I shut down Firefox while the video was frozen.
If you grep for ac2f53d5-b310-49df-bd72-1f339bedec02 in the child log file you will see that up to '2017-02-14 18:30:05.808549 UTC' we receive lots of video RTP packets and only send out occasionally an RTCP packet. But after that timestamp Firefox starts to send a lot of RTCP. Unfortunately error log messages indicate why.
So I did verify today that Fx is sending out NACKs. But what the logging showed me is that it sends out the NACKs for some time and then stops sending NACKs even though the video is still frozen. Once the video recovers I saw NACK being send again.

Randel does that sound like a behavior to be expected?
Flags: needinfo?(rjesup)
At first though yes, I would expect it to stop trying to use NACK to recover after a very short time (<1sec), since after that it's cheaper/faster/better to just get an iframe (send PLI or FIR).  I'll look into exactly what the trigger to stop sending NACKs is (I know when the queue of packets waiting for a NACK to unblock them overflows it will ask for an iframe)
Flags: needinfo?(rjesup)
Flags: needinfo?(rjesup)
So even more logging statements show that at the time of the video freeze Fx starts to switch from sending NACK's to sending out PLI's. But apparently as the video stays frozen it never receives an I frame.
So the new RTP clear text logger from Bug 1343640 tells me that the Fx on both sides apparently send PLI's, but never receive any PLI from the SFU.

One things which sticks out about the PLIs is that the sender SSRC in the receiver report and the PLI is not used in any of the streams. Although the source SSRC correctly points to the stream with the missing packets.

Interestingly I can reproduce similar problems with in our SFU room on appear.in. But I haven't looked at the clear text RTCP for that case.
Attached patch use sender SSRC for receivers (obsolete) — Splinter Review
This appears to fix the wrong-SSRC for receviers - give this a try.   https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea77d78b045337653d94a22b3268bd725eccee87
Attachment #8843546 - Flags: review?(drno)
Assignee: drno → rjesup
Status: NEW → ASSIGNED
Thanks Randell. I'll give it a try on Monday to check if this makes a difference.
had some oranges on linux only; suspect it's uninitialized RTT values if we don't exchange RCTP before it's measured.  Also added a debug
Attachment #8843602 - Flags: review?(drno)
Attachment #8843546 - Attachment is obsolete: true
Attachment #8843546 - Flags: review?(drno)
Today I mostly did not get the video working reliably at all. As it turned out the problem was that when you use iptables to randomly drop UDP packets sometime NSPR will report that as a permission error, which then terminates the connection at the transportlayer.
With attachment #8843602 [details] [diff] [review] applied plus a hack to ignore the NSPR permission errors I no longer saw any video freezes. Just a few short hangs which quickly (within 1-2s) recovered.
I did not have the time to test again without attachment #8843602 [details] [diff] [review] to check if the combination fixed things ultimately. I'll try to do that soon, because last week the video freezes recovered after a long time, where todays video freezes never recovered (I gave up waiting after >= 30s).
Just verified that in the RTP logs from the run with all patches and hacks PLI's appear and stop after big video packets are send/received.
Attachment #8843602 - Flags: review?(drno) → review+
Reminder to nominate for 54 and 53
Flags: needinfo?(rjesup)
See Also: → 1325447
Blocks: 1335469
https://hg.mozilla.org/mozilla-central/rev/cfe4b63ed4f7
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Duplicate of this bug: 1335469
Comment on attachment 8843602 [details] [diff] [review]
use sender SSRC for receivers

Approval Request Comment
[Feature/Bug causing the regression]: webrtc 49 update (FF53)

[User impact if declined]: Some webrtc conference servers may fail (i.e.. jitsi, a large provider)

[Is this code covered by automated tests?]: No; we don't care about this value in our code or check it (nor does chrome).  

[Has the fix been verified in Nightly?]: yes

[Needs manual test from QE? If yes, steps to reproduce]: We verified the fix using jitsi (we were able to repro it easily, and could see the server not forwarding packets). QE could do the same tests we did, but I don't think that adds anything.

[List of other uplifts needed for the feature/fix]: None

[Is the change risky?]: No 

[Why is the change risky/not risky?]: very simple change (which also fixes a timing hole that could make tests orange).

[String changes made/needed]: none
Flags: needinfo?(rjesup)
Attachment #8843602 - Flags: approval-mozilla-beta?
Attachment #8843602 - Flags: approval-mozilla-aurora?
Comment on attachment 8843602 [details] [diff] [review]
use sender SSRC for receivers

Fix a webrtc regression issue and was verified locally. Aurora54+ & Beta53+.
Attachment #8843602 - Flags: approval-mozilla-beta?
Attachment #8843602 - Flags: approval-mozilla-beta+
Attachment #8843602 - Flags: approval-mozilla-aurora?
Attachment #8843602 - Flags: approval-mozilla-aurora+
Iulia, could you please spot-check this on 53?
Flags: needinfo?(iulia.cristescu)
I still manage to reproduce the issue on Windows 10 x64, Ubuntu 16.04 x64 and Mac OS X 10.11.6, using 53.0b5 build1 (20170320143328). The bug is also still encountered on 54.0a2 (2017-03-21) and 55.0a1 (2017-03-21), but the freeze lasts only for about 1 minute.
Flags: needinfo?(iulia.cristescu)
Nils - please re-check ASAP on Nightly (and beta if nightly is working); check if the NACKs are getting through as well.
Status: RESOLVED → REOPENED
Flags: needinfo?(rjesup)
Resolution: FIXED → ---
Lulia can you get us a MOZ log file with Nightly with the following value:
  MOZ_LOG=timestamp,signaling:5,jsep:5,rtplogger:5

It would be ideal if you can get us the logs from both sides of the same identical call. Note: the logging from above only works in Nightly.
Then I'll have a look at the RTP packets in Wireshark from the log. That should hopefully give us quickly an idea of what kind of freeze you actually see there.
Flags: needinfo?(iulia.cristescu)
I was able to repro short hangs in the video on Jitsi myself.

By looking at the new RTP logging I was able to find out that Firefox sends out NACK's for missing frames to the Jitsi server. These NACKs then appear to get forwarded to the sender. But the sender is using a completely different RTP sequence number range. So it is not surprising that it never sends the requested missing frame.

I contacted Jitsi and they already located and fixed the issue which caused the NACK's to be forwarded. The Jitsi server is suppose to terminate the NACKs and not forward them to the sender.
It is my understanding that they will also look into why their server is not responding to Firefox NACK's packets.
Flags: needinfo?(rjesup)
(In reply to Nils Ohlmeier [:drno] from comment #31)
> Lulia can you get us a MOZ log file with Nightly with the following value:
>   MOZ_LOG=timestamp,signaling:5,jsep:5,rtplogger:5
> 
> It would be ideal if you can get us the logs from both sides of the same
> identical call. Note: the logging from above only works in Nightly.
> Then I'll have a look at the RTP packets in Wireshark from the log. That
> should hopefully give us quickly an idea of what kind of freeze you actually
> see there.

I held calls between 2 Windows 10 x64 stations, using 55.0a1 (2017-03-24) and having 

cd c:\
set MOZ_LOG=timestamp,signaling:5,jsep:5,rtplogger:5
set MOZ_LOG_FILE=%TEMP%\log.txt
cd "Program Files\Nightly"
.\firefox.exe

previously set. 

Once I've reproduced the problem and exited Firefox, looked for the generated log file, but this was empty every time. Is there anything wrong in the above environment settings?
Flags: needinfo?(iulia.cristescu) → needinfo?(drno)
Sorry I forgot to mention that rtp logging only work with e10s off (because the sandboxed content process can't write to file any more).

I got confirmation from Jitsi that remaining NACK issue got fixed on their end. And I just verified that NACKs now work as expected again. The fix on Jitis side is currently on beta.meet.jit.si, so probably not yet deployed on the production environment. I can check with Jitsi when the fix will be available in production.

Closing the bug since this was not a Firefox issue, and it's working now again.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Flags: needinfo?(drno)
Resolution: --- → FIXED
Is this something we want to uplift to ESR52 as well?
Flags: needinfo?(rjesup)
I believe this is fallout from bug 1250356 - Webrtc 49 update, landed in FF 53
Blocks: 1250356
Flags: needinfo?(rjesup)
Or rather - the code changed in 53.  It may be incorrect in 52 as well (and the initial report indicates this).  If so, it will need a slightly different patch I suspect.  Let's keep it on the radar for 52ESR, given the XP population ending up there
Flags: needinfo?(rjesup)
If we want this on ESR52, we should probably get this nominated for approval soon.
possible fix for this issue in 52 -- 52 is very different in this area, so the fix would need to be verified before uplift
Attachment #8855625 - Flags: review?(drno)
Could you try the 52ESR Try run in comment 40?  Thanks!
Flags: needinfo?(iulia.cristescu)
Comment on attachment 8855625 [details] [diff] [review]
ESR52 patch for setting receive SSRC same as send SSRC

Review of attachment 8855625 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

::: media/webrtc/signaling/src/peerconnection/MediaPipelineFactory.cpp
@@ +826,5 @@
> +    if (aTrackPair.mSending) {
> +      auto ssrcs = &aTrackPair.mSending->GetSsrcs();
> +      if (!ssrcs->empty()) {
> +        conduit->SetLocalSSRC(ssrcs->front());
> +      }

I think we should have MOZ_MTLOG() and return NS_ERROR_FAILURE for the else case here to prevent silent failure when not setting the local SSRC.
Attachment #8855625 - Flags: review?(drno) → review+
Attachment #8855625 - Attachment is obsolete: true
Comment on attachment 8855939 [details] [diff] [review]
ESR52 patch for setting receive SSRC same as send SSRC

Review of attachment 8855939 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8855939 - Flags: review?(drno) → review+
NI to nils to test this on Monday; once verified I'll put it up for esr52
Flags: needinfo?(drno)
(In reply to Randell Jesup [:jesup] from comment #41)
> Could you try the 52ESR Try run in comment 40?  Thanks!

The issue is not reproducible anymore on the the 52ESR Try run in comment 40.
Flags: needinfo?(iulia.cristescu)
Also, verified 55.0a1 (2017-04-10), 54.0a2 (2017-04-10) and 53.0b9 (20170403072723) builds.  Nightly and Aurora builds are verified fixed, but the issue is still reproducible on the beta build.
I also tried the ESR try build for 5 min call. I had some very small video freezes, but they recovered within 1-3 seconds. After about 4min or so on side the remote video got greyed out with a message that the video got turned of due to bandwidth limitations. But that sounds like a Jitsi issue to me. So the ESR build seems to work fine.
Flags: needinfo?(drno)
Please request ESR52 approval on this when you get a chance.
Flags: needinfo?(rjesup)
Comment on attachment 8855939 [details] [diff] [review]
ESR52 patch for setting receive SSRC same as send SSRC

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: Failure of a major WebRTC provider

User impact if declined: ESR users unable to use Jitsi (and maybe some others)

Fix Landed on Version: 55

Risk to taking this patch (and alternatives if risky): Very low risk, just changes where it gets the default SRRC value.

String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Flags: needinfo?(rjesup)
Attachment #8855939 - Flags: approval-mozilla-esr52?
Comment on attachment 8855939 [details] [diff] [review]
ESR52 patch for setting receive SSRC same as send SSRC

Video freezing isn't a good experience, the fix has been verified in a few places already, ESR52.2+
Attachment #8855939 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
You need to log in before you can comment on or make changes to this bug.