Closed Bug 1023539 Opened 10 years ago Closed 10 years ago

In call audio failed (both ways), the video froze later (one way)

Categories

(Core :: WebRTC: Audio/Video, defect, P1)

33 Branch
defect

Tracking


RESOLVED FIXED
mozilla36
Tracking Status
firefox34 --- fixed
firefox35 --- fixed
firefox36 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: standard8, Assigned: pkerr)

References

Details

(Whiteboard: [webrtc-uplift])

Attachments

(4 files, 4 obsolete files)

Dan and I were on a call (MoCo offices - my home). We had been chatting for about 6 minutes when we lost audio at both ends - it seemed to happen at exactly the same time.

After a further few minutes, Dan's video froze (at both ends of the call).

We were both on Mac. I did not have a headset, Dan did.

Jesup has the about:webrtc information and the extra logs from the call.
Maire & I just had the situation where we were on a Loop call, and my audio stopped working completely. Here's my about:webrtc log.
It's unclear at this point if this is a Core::WebRTC bug or a Loop bug (possibly related to the SDK).

I've asked Pkerr to start investigating this and see if we can narrow down the possible culprits.

Anthony, Nils -- Anything you guys can do to help us reproduce this and get a reduced test case is appreciated.  I'm doing a needinfo to you both to get this on your radar and see if you have any suggestions for getting us more data.    Thanks.

NOTE: I've seen audio freeze in one direction and not the other.  I've also seen instances where the video freezes in one direction and never comes back, but the audio continues in both directions.  I don't believe I've seen instances of freezing with no recovery on non-Loop calls (e.g. talky.io).  If anyone has seen audio or video freezes with no recovery outside of Loop, that would be useful info -- especially if it's audio and not video or video and not audio.  Thanks.
Assignee: nobody → pkerr
Flags: needinfo?(drno)
Flags: needinfo?(anthony.s.hughes)
Whiteboard: [investigation]
The one about:webrtc attached to this uses server-reflexive on both sides. So it looks like we can't blame the relay. Does the SDK handle every single media packet in the case where no relay is in use?

From the comments here it sounds like Mark has a nasty home router :-)
Unfortunately we don't have automated end-to-end calls working yet, so I guess the best we can do is add manual testing to see if we can get it reproduced.
Flags: needinfo?(drno)
Whiteboard: [investigation] → [investigation][p=4]
I talked to the TokBox folks yesterday at their office about this.
If the faulty calls go P2P (not over their servers), they were claiming that their SDK does not really touch the media path. That would then leave us with a platform issue here.
Sorry, I just got back from PTO and am still catching up on mail. I'll coordinate some manual testing to try to confirm if this is a Loop issue or a Platform issue. We'll need to get this clearly reproduced before we can work on a reduced testcase and/or a regression window.
Flags: needinfo?(anthony.s.hughes)
QA Contact: anthony.s.hughes
FYI my understanding is that there have been hints that some degree of packet loss might trigger this.
Also, in one case dmose saw the "Dropped" frames count in about:webrtc going up a lot when video from him to mreavy froze. That implies captured frames were not being turned into packets and sent for some reason. Hitting Start Debug Log in about:webrtc at that point might be helpful (plus NSPR logs). Note the debug logs get saved separately from the NSPR logs; the location defaults to %TEMP% on Windows and /tmp on Linux/Mac, but can be overridden in about:config.
I'd love to get this into Fx 33.  (Once we identify the fix, this is the sort of bug that could/would be uplifted.)  This is right behind Bug 1002414 in his queue.  He has already started investigating and trying to repro this.
Priority: -- → P1
Target Milestone: --- → mozilla33
I worked with some people at Softvision to try to reproduce this but we've not yet succeeded. Softvision worked on this all night and didn't experience any call quality issues. I had them try both Loop and AppRTC on different networks with calls lasting longer than 10 minutes and no problems were encountered.
Thanks - I've never (yet, personally) seen it fail in this manner on apprtc/talky/etc.  I know mreavy/dmose/standard8/etc. have seen it multiple times on Loop.
(In reply to Nils Ohlmeier [:drno] from comment #4)
> I talked to the TokBox folks yesterday at their office about this.
> If the faulty calls go P2P (not over their servers), they were claiming that
> their SDK does not really touch the media path. That would then leave us
> with a platform issue here.

If you look at the attached log, the address that is treated as a peer-reflexive candidate is actually the same as the relayed candidate from the remote peer. Discussing this with bwc, he seems to feel that if the local peer sees the connectivity-check BIND request from the remote before being informed of the remote relay candidate via trickle, it marks the address as peer reflexive. This address, 70.42.47.104:20444/udp, was the one selected for the audio part of the call. (The Loop client never noticed that it matched the relay candidate.) So, unless I am reading the log wrong, the audio channel for this call did go through the Loop TURN server while the rest of the call ended up peer-to-peer.
70.42.47.104 is one of the loop turn servers

Could the issue be related to this one reported some time back? https://bugzilla.mozilla.org/show_bug.cgi?id=935806   (that issue was making connections be closed after 5 mins with some routers)
(In reply to ggb from comment #13)
> 70.42.47.104 is one of the loop turn servers
> 
> Could the issue be related to this one reported some time back?
> https://bugzilla.mozilla.org/show_bug.cgi?id=935806   (that issue was making
> connections be closed after 5 mins with some routers)

That fits with one of the scenarios on my suspects list.
Any way you can verify it? One of the core differences between Loop and other services is probably the use of multiple unidirectional PeerConnections, and this issue (NAT binding expiration because of missing keepalives) would manifest only in that case.
This is the about:webrtc from a call between mreavy and me, where the video froze for her after about 40min.
Looks like everything was relayed in this case.
Attachment #8461102 - Attachment mime type: text/x-log → text/plain
It looks like that call used rtcp-mux, so rtcp should have been keeping the pinholes open for recvonly streams, but the following makes it look like rtcp stopped being sent for video (note the very old timestamp on the "Remote:" label)? Do you guys have a relatively precise time at which video stopped?

outbound_rtp_video_1
Encoder: Avg. bitrate: 77.88 Mbps (565.98 SD) Avg. framerate: 29.81 fps (0.99 SD) Dropped frames: 552263
Local: 12:53:36 GMT-0700 (PDT) outboundrtp SSRC: 735382457 Sent: 634192 packets (682697.51 Kb)
Remote: 12:46:59 GMT-0700 (PDT) inboundrtp SSRC: 735382457 Received: 0 packets (0 Kb) Lost: 74691 Jitter: 10.447 RTT: 0 ms
(In reply to Byron Campen [:bwc] from comment #18)
> It looks like that call used rtcp-mux, so rtcp should have been keeping the
> pinholes open for recvonly streams, but the following makes it look like
> rtcp stopped being sent for video (note the very old timestamp on the
> "Remote:" label)? Do you guys have a relatively precise time at which video
> stopped?

RTCP traffic will keep the NATs open but may not keep open the TURN allocation used for the relayed-UDP path. The client that allocated the relay should also be doing a refresh.
Status: NEW → ASSIGNED
Target Milestone: mozilla33 → mozilla34
I am looking at the difference between the nr_turn_client_parse_send_indication() and nr_turn_client_parse_data_indication() operations in /media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c. For the Send indication path, a call is made to nr_turn_client_ensure_perm() to signal to the TURN logic to continue refreshing the allocation for this relay. For the Data indication path, the fact that a matching permission exists is verified, but no call is made to the refresh maintenance logic. With the Loop server setting up recvonly and sendonly connections, a relayed connection may not send any Send indications for the recvonly segment, so that path will time out.

I have made a modification to the nr_turn_client_parse_data_indication() function and am trying to force a relayed connection to test this hypothesis.
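
For illustration only, here is a minimal sketch of the asymmetry described above and of the kind of change being tested. This is not the actual WIP patch: the struct and helper names other than nr_turn_client_ensure_perm() and nr_turn_permission_find() are assumptions, and error handling is simplified.

/* Sketch only - not the actual patch. Type names and signatures are
 * simplified assumptions based on the functions named above. */

/* Send indication path: sending data ensures the permission, which keeps
 * the TURN allocation refresh logic running. */
static int send_path_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  return nr_turn_client_ensure_perm(ctx, remote_addr);
}

/* Data indication path before the change: the permission is only looked up,
 * so a recvonly stream that never sends anything lets the allocation expire. */
static int data_path_before_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  nr_turn_permission *perm;
  return nr_turn_permission_find(ctx, remote_addr, &perm);
}

/* Hypothesis under test: also run the refresh logic when data is received. */
static int data_path_after_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  return nr_turn_client_ensure_perm(ctx, remote_addr);
}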
WIP upload. DO NOT CHECKIN
Current WIP
Attachment #8467470 - Attachment is obsolete: true
(In reply to Paul Kerr [:pkerr] from comment #20)
> I am looking at the difference between the nr_turn_client_parse_send_indication()
> and nr_turn_client_parse_data_indication() operations in
> /media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c. For the Send
> indication path, a call is made to nr_turn_client_ensure_perm() to signal to
> the TURN logic to continue refreshing the allocation for this relay. For the
> Data indication path, the fact that a matching permission exists is verified,
> but no call is made to the refresh maintenance logic. With the Loop server
> setting up recvonly and sendonly connections, a relayed connection may not
> send any Send indications for the recvonly segment, so that path will time out.
> 
> I have made a modification to the nr_turn_client_parse_data_indication()
> function and am trying to force a relayed connection to test this hypothesis.

I do not believe that this is the correct fix.

ICE implementations are supposed to ensure traffic is periodically sent
in the outgoing direction, and this will cause the Send indication to
refresh the allocation. If we are doing RTCP-MUX this will happen on its
own. If not, we are supposed to send a keepalive. IIRC these timers get
set when we do finalize(), so perhaps that's not what's happening. In any
case, you should be addressing this at the ICE level, not the TURN level,
since you will have the same timeout problem with NATs.
See Also: → 1054988
FYI, it seems I can fairly reliably replicate the behavior described in bug 1034595 between my home office and the Paris office - if it helps anyone, I am happy to help troubleshoot.
update temporary fix
Attachment #8468539 - Attachment is obsolete: true
Comment on attachment 8517592 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review as a possible temporary fix until keep-alive is implemented at the ICE peer connection level.
Attachment #8517592 - Flags: review?(rjesup)
Comment on attachment 8517592 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review of attachment 8517592 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +63,5 @@
>                                                    times out after about 5. */
>  #define TURN_PERMISSION_LIFETIME_SECONDS 300   /* 5 minutes. From RFC 5766 2.3 */
>  
> +#define REFRESH_RESERVATION_ON_RECV 1 // Test for TURN reservation request on receive also
> +

File a bug for doing this the right way, and note it here in a comment.
Attachment #8517592 - Flags: review?(rjesup)
Attachment #8517592 - Flags: review?(docfaraday)
Attachment #8517592 - Flags: review+
Blocks: 1094997
Added comment referencing permanent fix bug number
Attachment #8517592 - Attachment is obsolete: true
Attachment #8517592 - Flags: review?(docfaraday)
Attachment #8518310 - Flags: review?(docfaraday)
Whiteboard: [investigation][p=4] → [webrtc-uplift]
Comment on attachment 8518310 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review of attachment 8518310 [details] [diff] [review]:
-----------------------------------------------------------------

One problem, but big enough that I'm not comfortable giving r+.

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +64,5 @@
>  #define TURN_PERMISSION_LIFETIME_SECONDS 300   /* 5 minutes. From RFC 5766 2.3 */
>  
> +// Set to enable a temporary fix that will run the TURN reservation keep-alive
> +// logic when data is received via a TURN relayed path: a DATA_INDICATION packet is received.
> +// This should be replace/removed when bug 1094997 is implemented.

Point this at bug 935806

@@ +787,5 @@
> +#if REFRESH_RESERVATION_ON_RECV
> +  if ((r=nr_turn_client_ensure_perm(ctx, remote_addr))) {
> +#else
> +  if ((r=nr_turn_permission_find(ctx, remote_addr, &perm))) {
> +#endif

Seems like this will create new permissions if we get data from a peer without one (and we'll allow the data through). We probably don't want that. Maybe what we want to do instead is move the nr_turn_client_ensure_perm call after this block (i.e., move the nr_transport_addr_copy back where it was before, and move the nr_turn_client_ensure_perm after that).
Attachment #8518310 - Flags: review?(docfaraday) → review-
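
For clarity, a minimal sketch of the ordering bwc suggests: keep nr_turn_permission_find() as the gate for incoming data, and only run the refresh afterwards. This is not the landed patch; the wrapper function, type names, and return-code handling are assumptions.

static int data_indication_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  nr_turn_permission *perm;
  int r;

  /* Gate first: data from a peer we never created a permission for is
   * rejected, so it neither gets through nor creates a new permission. */
  if ((r = nr_turn_permission_find(ctx, remote_addr, &perm)))
    return r;

#if REFRESH_RESERVATION_ON_RECV
  /* Only once the sender is known to be permitted, run the reservation
   * refresh logic so recvonly relayed paths do not time out. */
  if ((r = nr_turn_client_ensure_perm(ctx, remote_addr)))
    return r;
#endif

  return 0;
}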
Attachment #8518310 - Attachment is obsolete: true
Attachment #8518518 - Flags: review?(docfaraday)
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Review of attachment 8518518 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8518518 - Flags: review?(docfaraday) → review+
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Approval Request Comment
[Feature/regressing bug #]: N/A

[User impact if declined]: random loss of calls (or often either the audio or video of a call) after some number of minutes when using TURN and one-way connections (Loop/Hello uses all one-way connections, and 10-25% of calls need to use TURN).

[Describe test coverage new/current, TBPL]: Requires long tests/dogfooding with TURN server use.  We don't have TURN in tbpl, and can't run 15-minute call tests there, so this is manually tested.

[Risks and why]: Very low risk patch; just turns on refresh if we receive data, not just when we send it.  We have a more-complete (and slightly higher risk) patch in another bug that will land on inbound but not be uplifted.

[String/UUID change made/needed]: none
Attachment #8518518 - Flags: approval-mozilla-beta?
Attachment #8518518 - Flags: approval-mozilla-aurora?
Given that this is not easily reproducible through manual testing, I'm untracking this for QE verification. We can reconsider if someone can propose more robust testing, but sitting in calls for tens of minutes or longer at a time is probably not going to cut it.
Flags: qe-verify-
https://hg.mozilla.org/mozilla-central/rev/a9f7ff5b387e
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: mozilla34 → mozilla36
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Approving for Aurora for wider bake audience.
Attachment #8518518 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Beta+
Attachment #8518518 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Randell Jesup [:jesup] from comment #41)
> https://hg.mozilla.org/releases/mozilla-beta/rev/d73c4671a18f

Any reason this landed on Beta before Aurora?
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #42)
> (In reply to Randell Jesup [:jesup] from comment #41)
> > https://hg.mozilla.org/releases/mozilla-beta/rev/d73c4671a18f
> 
> Any reason this landed on Beta before Aurora?

Aurora has been closed since late last week to prepare for the 10th anniversary and won't reopen until sometime later today (hopefully), and the next-to-last beta goes to build (GTB) today.  Discussed with lsblakk and amandel before approval was granted.  lsblakk initially approved for Aurora without realizing it had been closed for the weekend.
Thanks for the explanation.
Hi Mark,
Could you please provide a repro video or repro steps, thanks!
Flags: needinfo?(standard8)
(In reply to Shine from comment #47)
> Hi Mark,
> Could you please provide a repro video or repro steps, thanks!

Please see comment 0 - basically have lots of long calls and see if you get a/v issues on any of them.

I believe this may be related to specific network setups - Paul may be able to advise more.
Flags: needinfo?(standard8) → needinfo?(pkerr)
First, test with the Loop client, or at least with a test client that uses the TokBox Loop server. That is necessary because the bug manifests itself for calls that are built up from two half-duplex connections (a send-only and a recv-only RTP connection), which is how TokBox sets up calls.

One of the clients in the call should be behind a NAT that blocks direct connections and therefore needs to use a TURN relay. You are looking for the condition that can be found in the attached logs, where a client has a transmit connection to the other client through a non-relayed path and a receive connection via the TURN relay. It is thought that the bug occurs when a client does not renew the TURN allocation because it never sends any data via that path.

Once such a connection is set up, keep an active call going for at least ten minutes; longer (up to 20 minutes) is better.
Flags: needinfo?(pkerr)