Closed Bug 960296 Opened 6 years ago Closed 6 years ago

Video freeze during WebRTC call

Categories

(Core :: WebRTC: Audio/Video, defect)

x86
macOS
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla29
Tracking Status
firefox27 --- unaffected
firefox28 --- verified
firefox29 --- verified

People

(Reporter: drno, Assigned: bwc)

Details

Attachments

(2 files, 1 obsolete file)

While on a WebRTC call with mreavy the video in both directions kept freezing and not recovering.
I was on a local build from 12/11/2013 (67ae84fa6807a57be78e6a237d146297861f2da1[67ae84fa6807]) and gathered the attached log files.

nspr.log contains this line:
6057984[100330b20]: Couldn't send media on '2b4c6f46c597b17a/stream2/video'

Stun/Turn stdout contains this around the same time:
(generic/INFO) Buffers full
(turn/WARNING) TURN(relay(IP4:10.250.7.48:56697|IP4:199.223.236.92:3478)): Failed sending send indication
(turn/DEBUG) TURN(relay(IP4:10.250.7.48:56697|IP4:199.223.236.92:3478)): Send indication len=1131
Attached file nspr.log.bz2
Link to the stdout log file (because it is too big for anything I have access to):
https://dl.dropboxusercontent.com/u/17831554/bz-960296-stdout.bz2
Some observations from grepping through the log files:

- In the nspr log file the "SendPacket(...) succeeded" messages seem to cluster up (in the beginning you see ~2 of them grouped, right before the first "couldn't send media..." message you see 12 in one block), as if a send buffer got filled and then flushed

- The stdout also contains this line (not sure what if it is relevant):
7934:(ice/ERR) ICE(PC:2b4c6f46c597b17a): peer (PC:2b4c6f46c597b17a:default), stream(2b4c6f46c597b17a/stream2/video) tried to trickle ICE in inappropriate state 4
Attached patch Potential patch (obsolete) — Splinter Review
I believe this is due to failing to update the capacity of the buffer when the TCP buffers drain. This means that once the buffer is full it appears to stay full and we can never write again.

Attached is a very quick patch that seems to fix the problem in unit test (plus a test that exercises this). That said, I haven't double-checked the logic to make sure we are decrementing in the right place, so I'd like Byron to take a look.
Assignee: nobody → docfaraday
Looking at this now.
The logic around |pending| seems to be correct now. The decrement seems to be in the right place, and we never decrease it anywhere else in the code, since the only other place where we write is when we've been given a new buffer, and |pending| is zero.
How about we add an assert(!sock->pending) after the flush loop?
Assert that |pending| is zero after we've flushed all writes.
Attachment #8361139 - Attachment is obsolete: true
Comment on attachment 8361210 [details] [diff] [review]
Update buffered stun socket size when flushes happen

Review of attachment 8361210 [details] [diff] [review]:
-----------------------------------------------------------------

Added an assert, just sanity-checking that.
Attachment #8361210 - Flags: review?(ekr)
Comment on attachment 8361210 [details] [diff] [review]
Update buffered stun socket size when flushes happen

Review of attachment 8361210 [details] [diff] [review]:
-----------------------------------------------------------------

Let's not have me review my own code.
Attachment #8361210 - Flags: review?(ekr) → review?(adam)
Comment on attachment 8361210 [details] [diff] [review]
Update buffered stun socket size when flushes happen

Review of attachment 8361210 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8361210 - Flags: review?(adam) → review+
Keywords: checkin-needed
(In reply to Nils Ohlmeier [:drno] from comment #3)
> Some observations from grepping through the log files:
> 
> - In the nspr log file the "SendPacket(...) succeeded" messages seem to
> cluster up (in the beginning you see ~2 of them grouped, right before the
> first "couldn't send media..." message you see 12 in one block), as if a
> send buffer got filled and then flushed

Note that video frames are often more than one packet (and thus clustered).  And if there's a packet loss (or jitter buffer underflow) you can see iframes which are typically a bunch more packets, or retransmits of blocks of packets (you'll see those as using "mOther...." because of the split nature of the VideoConduits).
https://hg.mozilla.org/mozilla-central/rev/09dc812fa7db
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Comment on attachment 8361210 [details] [diff] [review]
Update buffered stun socket size when flushes happen

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Initial TURN TCP landing in 28

User impact if declined: Video will freeze randomly when using TURN TCP and not recover

Testing completed (on m-c, etc.): on m-c since last week

Risk to taking this patch (and alternatives if risky): virtually no risk.  Includes a test

String or IDL/UUID changes made by this patch: none
Attachment #8361210 - Flags: approval-mozilla-aurora?
Attachment #8361210 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Nils, can you please verify this is fixed in Firefox 28 and 29?
Flags: needinfo?(drno)
Keywords: verifyme
QA Contact: drno
By blocking UDP completely locally on my machine I forced a TURN TCP call.

Result: 
The STUN log file from FF 29 shows that it now is able to recover from occasional "Buffers full" situation, meaning the log messages show up, but then later data gets send successfully to/over the same TCP TURN connection.

BUT with FF 28 and 29 calls between two office machines still freeze after 10-15min. I'll try to take logs and packet dumps from both side to investigate if this is the same or another problem.
Flags: needinfo?(drno)
I traced another call in which the video froze after about 25-30mins.
After analyzing the traces it looks like the call froze around that time, because the TCP TURN connection gets really confused about missing packets and retransmission. From that point on both Firefox clients are still sending data to the TURN server, but non of data gets relayed anymore. So it is not an obvious problem in Firefox.
Throughout this call the video froze a couple of times for short periods of times (e.g. seconds), but with this fix in place it always recovered from this.
Status: RESOLVED → VERIFIED
Is this with apprtc? There is some other evidence that apprtc may only allocate TURN for around 30 minutes...
Yes this was with apprtc.
No other evidence. I can repeat or if let me know which other platform I should test with.
You need to log in before you can comment on or make changes to this bug.