960296 - Video freeze during WebRTC call

Reporter

Description

•

12 years ago

While on a WebRTC call with mreavy the video in both directions kept freezing and not recovering. I was on a local build from 12/11/2013 (67ae84fa6807a57be78e6a237d146297861f2da1[67ae84fa6807]) and gathered the attached log files. nspr.log contains this line: 6057984[100330b20]: Couldn't send media on '2b4c6f46c597b17a/stream2/video' Stun/Turn stdout contains this around the same time: (generic/INFO) Buffers full (turn/WARNING) TURN(relay(IP4:10.250.7.48:56697|IP4:199.223.236.92:3478)): Failed sending send indication (turn/DEBUG) TURN(relay(IP4:10.250.7.48:56697|IP4:199.223.236.92:3478)): Send indication len=1131

Nils Ohlmeier [:drno]

Reporter

Comment 1

•

12 years ago

Attached file nspr.log.bz2 — Details

Nils Ohlmeier [:drno]

Reporter

Comment 2

•

12 years ago

Link to the stdout log file (because it is too big for anything I have access to): https://dl.dropboxusercontent.com/u/17831554/bz-960296-stdout.bz2

Nils Ohlmeier [:drno]

Reporter

Comment 3

•

12 years ago

Some observations from grepping through the log files: - In the nspr log file the "SendPacket(...) succeeded" messages seem to cluster up (in the beginning you see ~2 of them grouped, right before the first "couldn't send media..." message you see 12 in one block), as if a send buffer got filled and then flushed - The stdout also contains this line (not sure what if it is relevant): 7934:(ice/ERR) ICE(PC:2b4c6f46c597b17a): peer (PC:2b4c6f46c597b17a:default), stream(2b4c6f46c597b17a/stream2/video) tried to trickle ICE in inappropriate state 4

Eric Rescorla (:ekr)

Comment 4

•

12 years ago

Attached patch Potential patch (obsolete) — Details — Splinter Review

I believe this is due to failing to update the capacity of the buffer when the TCP buffers drain. This means that once the buffer is full it appears to stay full and we can never write again. Attached is a very quick patch that seems to fix the problem in unit test (plus a test that exercises this). That said, I haven't double-checked the logic to make sure we are decrementing in the right place, so I'd like Byron to take a look.

Eric Rescorla (:ekr)

Updated

•

12 years ago

Assignee: nobody → docfaraday

Byron Campen [:bwc]

Assignee

Comment 5

•

12 years ago

Looking at this now.

Byron Campen [:bwc]

Assignee

Comment 6

•

12 years ago

The logic around |pending| seems to be correct now. The decrement seems to be in the right place, and we never decrease it anywhere else in the code, since the only other place where we write is when we've been given a new buffer, and |pending| is zero.

Byron Campen [:bwc]

Assignee

Comment 7

•

12 years ago

How about we add an assert(!sock->pending) after the flush loop?

Byron Campen [:bwc]

Assignee

Comment 8

•

12 years ago

Attached patch Update buffered stun socket size when flushes happen — Details — Splinter Review

Assert that |pending| is zero after we've flushed all writes.

Byron Campen [:bwc]

Assignee

Updated

•

12 years ago

Attachment #8361139 - Attachment is obsolete: true

Byron Campen [:bwc]

Assignee

Comment 9

•

12 years ago

Comment on attachment 8361210 [details] [diff] [review] Update buffered stun socket size when flushes happen Review of attachment 8361210 [details] [diff] [review]: ----------------------------------------------------------------- Added an assert, just sanity-checking that.

Attachment #8361210 - Flags: review?(ekr)

Eric Rescorla (:ekr)

Comment 10

•

12 years ago

Comment on attachment 8361210 [details] [diff] [review] Update buffered stun socket size when flushes happen Review of attachment 8361210 [details] [diff] [review]: ----------------------------------------------------------------- Let's not have me review my own code.

Attachment #8361210 - Flags: review?(ekr) → review?(adam)

Adam Roach [:abr]

Comment 11

•

12 years ago

Comment on attachment 8361210 [details] [diff] [review] Update buffered stun socket size when flushes happen Review of attachment 8361210 [details] [diff] [review]: ----------------------------------------------------------------- LGTM

Attachment #8361210 - Flags: review?(adam) → review+

Byron Campen [:bwc]

Assignee

Updated

•

12 years ago

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 12

•

12 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/09dc812fa7db

Keywords: checkin-needed

Randell Jesup [:jesup] (needinfo me)

Comment 13

•

12 years ago

(In reply to Nils Ohlmeier [:drno] from comment #3) > Some observations from grepping through the log files: > > - In the nspr log file the "SendPacket(...) succeeded" messages seem to > cluster up (in the beginning you see ~2 of them grouped, right before the > first "couldn't send media..." message you see 12 in one block), as if a > send buffer got filled and then flushed Note that video frames are often more than one packet (and thus clustered). And if there's a packet loss (or jitter buffer underflow) you can see iframes which are typically a bunch more packets, or retransmits of blocks of packets (you'll see those as using "mOther...." because of the split nature of the VideoConduits).

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 14

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/09dc812fa7db

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla29

Randell Jesup [:jesup] (needinfo me)

Updated

•

12 years ago

status-firefox27: --- → unaffected

status-firefox28: --- → affected

status-firefox29: --- → fixed

Randell Jesup [:jesup] (needinfo me)

Comment 15

•

12 years ago

Comment on attachment 8361210 [details] [diff] [review] Update buffered stun socket size when flushes happen [Approval Request Comment] Bug caused by (feature/regressing bug #): Initial TURN TCP landing in 28 User impact if declined: Video will freeze randomly when using TURN TCP and not recover Testing completed (on m-c, etc.): on m-c since last week Risk to taking this patch (and alternatives if risky): virtually no risk. Includes a test String or IDL/UUID changes made by this patch: none

Attachment #8361210 - Flags: approval-mozilla-aurora?

bhavana bajaj [:bajaj]

Updated

•

12 years ago

Attachment #8361210 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 16

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/ddb0773a7ce8

status-firefox28: affected → fixed

u279076

Comment 17

•

12 years ago

Nils, can you please verify this is fixed in Firefox 28 and 29?

Flags: needinfo?(drno)

Keywords: verifyme

QA Contact: drno

Nils Ohlmeier [:drno]

Reporter

Comment 18

•

12 years ago

By blocking UDP completely locally on my machine I forced a TURN TCP call. Result: The STUN log file from FF 29 shows that it now is able to recover from occasional "Buffers full" situation, meaning the log messages show up, but then later data gets send successfully to/over the same TCP TURN connection. BUT with FF 28 and 29 calls between two office machines still freeze after 10-15min. I'll try to take logs and packet dumps from both side to investigate if this is the same or another problem.

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Reporter

Comment 19

•

12 years ago

I traced another call in which the video froze after about 25-30mins. After analyzing the traces it looks like the call froze around that time, because the TCP TURN connection gets really confused about missing packets and retransmission. From that point on both Firefox clients are still sending data to the TURN server, but non of data gets relayed anymore. So it is not an obvious problem in Firefox. Throughout this call the video froze a couple of times for short periods of times (e.g. seconds), but with this fix in place it always recovered from this.

Status: RESOLVED → VERIFIED

Nils Ohlmeier [:drno]

Reporter

Updated

•

12 years ago

status-firefox28: fixed → verified

status-firefox29: fixed → verified

Keywords: verifyme

Eric Rescorla (:ekr)

Comment 20

•

12 years ago

Is this with apprtc? There is some other evidence that apprtc may only allocate TURN for around 30 minutes...

Nils Ohlmeier [:drno]

Reporter

Comment 21

•

11 years ago

Yes this was with apprtc. No other evidence. I can repeat or if let me know which other platform I should test with.

nspr.log.bz2 12 years ago Nils Ohlmeier [:drno] 270.80 KB, application/x-bzip2		Details
Potential patch 12 years ago Eric Rescorla (:ekr) 2.43 KB, patch		Details \| Diff \| Splinter Review
Update buffered stun socket size when flushes happen 12 years ago Byron Campen [:bwc] 2.96 KB, patch	abr : review+ bajaj : approval-mozilla-aurora+	Details \| Diff \| Splinter Review