Closed Bug 1023539 Opened 10 years ago Closed 10 years ago

In call audio failed (both ways), the video froze later (one way)

Categories

(Core :: WebRTC: Audio/Video, defect, P1)

33 Branch
defect

Tracking


RESOLVED FIXED
mozilla36
Tracking Status
firefox34 --- fixed
firefox35 --- fixed
firefox36 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: standard8, Assigned: pkerr)

References

Details

(Whiteboard: [webrtc-uplift])

Attachments

(4 files, 4 obsolete files)

Dan and I were on a call (MoCo offices - my home). We had been chatting for about 6 minutes when we lost audio at both ends - it seemed to happen at exactly the same time.

After a further few minutes, Dan's video froze (at both ends of the call).

We were both on Mac. I did not have a headset, Dan did.

Jesup has the about:webrtc information and the extra logs from the call.
Maire & I just had the situation where we were on a Loop call, and my audio stopped working completely. Here's my about:webrtc log.
It's unclear at this point if this is a Core::WebRTC bug or a Loop bug (possibly related to the SDK).

I've asked Pkerr to start investigating this and see if we can narrow down the possible culprits.

Anthony, Nils -- Anything you guys can do to help us reproduce this and get a reduced test case is appreciated.  I'm doing a needinfo to you both to get this on your radar and see if you have any suggestions for getting us more data.    Thanks.

NOTE: I've seen audio freeze in one direction and not the other.  I've also seen instances where the video freezes in one direction and never comes back, but the audio continues in both directions.  I don't believe I've seen instances of freezing with no recovery on non-Loop calls (e.g. talky.io).  If anyone has seen audio or video freezes with no recovery outside of Loop, that would be useful info -- especially if it's audio and not video or video and not audio.  Thanks.
Assignee: nobody → pkerr
Flags: needinfo?(drno)
Flags: needinfo?(anthony.s.hughes)
Whiteboard: [investigation]
The one about:webrtc attached to this uses server-reflexive on both sides. So it looks like we can't blame the relay. Does the SDK handle every single media packet in the case where no relay is in use?

From the comments here it sounds like Mark has a nasty home router :-)
Unfortunately we don't have automated end-to-end calls working yet, so I guess the best we can do is add manual testing to see if we can get it reproduced.
Flags: needinfo?(drno)
Whiteboard: [investigation] → [investigation][p=4]
I talked to the TokBox folks yesterday at their office about this.
If the faulty calls go P2P (not over their servers), they were claiming that their SDK does not really touch the media path. That would then leave us with a platform issue here.
Sorry, I just got back from PTO and am still catching up on mail. I'll coordinate some manual testing to try to confirm if this is a Loop issue or a Platform issue. We'll need to get this clearly reproduced before we can work on a reduced testcase and/or a regression window.
Flags: needinfo?(anthony.s.hughes)
QA Contact: anthony.s.hughes
FYI my understanding is that there have been hints that some degree of packet loss might trigger this.
Also, in one case dmose saw the "Dropped" frames count in about:webrtc going up a lot when video from him to mreavy froze. That implies captured frames were not being turned into packets and sent for some reason. Hitting Start Debug Log in about:webrtc at that point might be helpful (plus NSPR logs). Note the debug logs get saved separately from the NSPR logs; the location defaults to %TEMP% on Windows and /tmp on Linux/Mac, but can be overridden in about:config.
I'd love to get this into Fx 33.  (Once we identify the fix, this is the sort of bug that could/would be uplifted.)  This is right behind Bug 1002414 in his queue.  He has already started investigating and trying to repro this.
Priority: -- → P1
Target Milestone: --- → mozilla33
I worked with some people at Softvision to try to reproduce this but we've not yet succeeded. Softvision worked on this all night and didn't experience any call quality issues. I had them try both Loop and AppRTC on different networks with calls lasting longer than 10 minutes and no problems were encountered.
Thanks - I've never (yet, personally) seen it fail in this manner on apprtc/talky/etc.  I know mreavy/dmose/standard8/etc. have seen it multiple times on Loop.
(In reply to Nils Ohlmeier [:drno] from comment #4)
> I talked to the TokBox folks yesterday at their office about this.
> If the faulty calls go P2P (not over their servers), they were claiming that
> their SDK does not really touch the media path. That would then leave us
> with a platform issue here.

If you look at the attached log, the address that is treated as a peer-reflexive candidate is actually the same as the relayed candidate from the remote peer. Discussing this with bwc, he seems to feel that if the local peer sees the connectivity-check BIND request from the remote before being informed of the remote relay candidate via trickle, it marks the address as peer reflexive. This address, 70.42.47.104:20444/udp, was the one selected for the audio part of the call. (The Loop client never noticed that it matched the relay candidate.) So, unless I am reading the log wrong, the audio channel for this call did go through the Loop TURN server while the rest of the call ended up peer-to-peer.
70.42.47.104 is one of the loop turn servers

Could the issue be related to this one reported some time back? https://bugzilla.mozilla.org/show_bug.cgi?id=935806   (that issue was making connections be closed after 5 mins with some routers)
(In reply to ggb from comment #13)
> 70.42.47.104 is one of the loop turn servers
> 
> Could the issue be related to this one reported some time back?
> https://bugzilla.mozilla.org/show_bug.cgi?id=935806   (that issue was making
> connections be closed after 5 mins with some routers)

That fits with one of the scenarios on my suspects list.
Any way you can verify it? One of the core differences between Loop and other services is probably the use of multiple unidirectional PeerConnections, and this issue (NAT binding expiration because of missing keepalives) would manifest only in that case.
This is the about:webrtc from a call between mreavy and me, where the video froze for her after about 40min.
Looks like everything was relayed in this case.
Attachment #8461102 - Attachment mime type: text/x-log → text/plain
It looks like that call used rtcp-mux, so rtcp should have been keeping the pinholes open for recvonly streams, but the following makes it look like rtcp stopped being sent for video (note the very old timestamp on the "Remote:" label)? Do you guys have a relatively precise time at which video stopped?

outbound_rtp_video_1
Encoder: Avg. bitrate: 77.88 Mbps (565.98 SD) Avg. framerate: 29.81 fps (0.99 SD) Dropped frames: 552263
Local: 12:53:36 GMT-0700 (PDT) outboundrtp SSRC: 735382457 Sent: 634192 packets (682697.51 Kb)
Remote: 12:46:59 GMT-0700 (PDT) inboundrtp SSRC: 735382457 Received: 0 packets (0 Kb) Lost: 74691 Jitter: 10.447 RTT: 0 ms
(In reply to Byron Campen [:bwc] from comment #18)
> It looks like that call used rtcp-mux, so rtcp should have been keeping the
> pinholes open for recvonly streams, but the following makes it look like
> rtcp stopped being sent for video (note the very old timestamp on the
> "Remote:" label)? Do you guys have a relatively precise time at which video
> stopped?

RTCP traffic will keep the NATs open but may not keep open the TURN allocation used for the relayed-UDP path. The client that allocated the relay should also be doing a refresh.
Status: NEW → ASSIGNED
Target Milestone: mozilla33 → mozilla34
I am looking at the difference between the nr_turn_client_parse_send_indication() and nr_turn_client_parse_data_indication() operations in /media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c. For the Send indication path, a call is made to nr_turn_client_ensure_perm() to signal to the TURN logic to continue refreshing the allocation for this relay. For the Data indication path, the fact that a matching permission exists is verified, but no call is made to the refresh maintenance logic. With the Loop server setting up recvonly and sendonly connections, a relayed connection may not send any Send indications for the recvonly segment, so that path will time out.

I have made a modification to the nr_turn_client_parse_data_indication() function and am trying to force a relayed connection to test this hypothesis.
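
For illustration only, here is a minimal sketch of the asymmetry described above and of the kind of change being tested. This is not the actual WIP patch: the struct and helper names other than nr_turn_client_ensure_perm() and nr_turn_permission_find() are assumptions, and error handling is simplified.

/* Sketch only - not the actual patch. Type names and signatures are
 * simplified assumptions based on the functions named above. */

/* Send indication path: sending data ensures the permission, which keeps
 * the TURN allocation refresh logic running. */
static int send_path_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  return nr_turn_client_ensure_perm(ctx, remote_addr);
}

/* Data indication path before the change: the permission is only looked up,
 * so a recvonly stream that never sends anything lets the allocation expire. */
static int data_path_before_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  nr_turn_permission *perm;
  return nr_turn_permission_find(ctx, remote_addr, &perm);
}

/* Hypothesis under test: also run the refresh logic when data is received. */
static int data_path_after_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  return nr_turn_client_ensure_perm(ctx, remote_addr);
}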
WIP upload. DO NOT CHECKIN
Current WIP
Attachment #8467470 - Attachment is obsolete: true
(In reply to Paul Kerr [:pkerr] from comment #20)
> I am looking at the difference between the nr_turn_client_parse_send_indication()
> and nr_turn_client_parse_data_indication() operations in
> /media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c. For the Send
> indication path, a call is made to nr_turn_client_ensure_perm() to signal to
> the TURN logic to continue refreshing the allocation for this relay. For the
> Data indication path, the fact that a matching permission exists is verified,
> but no call is made to the refresh maintenance logic. With the Loop server
> setting up recvonly and sendonly connections, a relayed connection may not
> send any Send indications for the recvonly segment, so that path will time out.
> 
> I have made a modification to the nr_turn_client_parse_data_indication()
> function and am trying to force a relayed connection to test this hypothesis.

I do not believe that this is the correct fix.

ICE implementations are supposed to ensure traffic is periodically sent
in the outgoing direction, and this will cause the Send indication to
refresh the allocation. If we are doing RTCP-MUX this will happen on its
own. If not, we are supposed to send a keepalive. IIRC these timers get
set when we do finalize(), so perhaps that's not what's happening. In any
case, you should be addressing this at the ICE level, not the TURN level,
since you will have the same timeout problem with NATs.
See Also: → 1054988
FYI, it seems I can fairly reliably replicate the behavior described in bug 1034595 between my home office and the Paris office - if it helps anyone, I am happy to help troubleshoot.
update temporary fix
Attachment #8468539 - Attachment is obsolete: true
Comment on attachment 8517592 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review as a possible temporary fix until keep-alive is implemented at the ICE peer connection level.
Attachment #8517592 - Flags: review?(rjesup)
Comment on attachment 8517592 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review of attachment 8517592 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +63,5 @@
>                                                    times out after about 5. */
>  #define TURN_PERMISSION_LIFETIME_SECONDS 300   /* 5 minutes. From RFC 5766 2.3 */
>  
> +#define REFRESH_RESERVATION_ON_RECV 1 // Test for TURN reservation request on receive also
> +

File a bug for doing this the right way, and note it here in a comment.
Attachment #8517592 - Flags: review?(rjesup)
Attachment #8517592 - Flags: review?(docfaraday)
Attachment #8517592 - Flags: review+
Blocks: 1094997
Added comment referencing permanent fix bug number
Attachment #8517592 - Attachment is obsolete: true
Attachment #8517592 - Flags: review?(docfaraday)
Attachment #8518310 - Flags: review?(docfaraday)
Whiteboard: [investigation][p=4] → [webrtc-uplift]
Comment on attachment 8518310 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review of attachment 8518310 [details] [diff] [review]:
-----------------------------------------------------------------

One problem, but big enough that I'm not comfortable giving r+.

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +64,5 @@
>  #define TURN_PERMISSION_LIFETIME_SECONDS 300   /* 5 minutes. From RFC 5766 2.3 */
>  
> +// Set to enable a temporary fix that will run the TURN reservation keep-alive
> +// logic when data is received via a TURN relayed path: a DATA_INDICATION packet is received.
> +// This should be replace/removed when bug 1094997 is implemented.

Point this at bug 935806

@@ +787,5 @@
> +#if REFRESH_RESERVATION_ON_RECV
> +  if ((r=nr_turn_client_ensure_perm(ctx, remote_addr))) {
> +#else
> +  if ((r=nr_turn_permission_find(ctx, remote_addr, &perm))) {
> +#endif

Seems like this will create new permissions if we get data from a peer without one (and we'll allow the data through). We probably don't want that. Maybe what we want to do instead is move the nr_turn_client_ensure_perm call after this block (i.e., move the nr_transport_addr_copy back where it was before, and move the nr_turn_client_ensure_perm after that).
Attachment #8518310 - Flags: review?(docfaraday) → review-
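
For clarity, a minimal sketch of the ordering bwc suggests: keep nr_turn_permission_find() as the gate for incoming data, and only run the refresh afterwards. This is not the landed patch; the wrapper function, type names, and return-code handling are assumptions.

static int data_indication_sketch(nr_turn_client_ctx *ctx, nr_transport_addr *remote_addr)
{
  nr_turn_permission *perm;
  int r;

  /* Gate first: data from a peer we never created a permission for is
   * rejected, so it neither gets through nor creates a new permission. */
  if ((r = nr_turn_permission_find(ctx, remote_addr, &perm)))
    return r;

#if REFRESH_RESERVATION_ON_RECV
  /* Only once the sender is known to be permitted, run the reservation
   * refresh logic so recvonly relayed paths do not time out. */
  if ((r = nr_turn_client_ensure_perm(ctx, remote_addr)))
    return r;
#endif

  return 0;
}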
Attachment #8518310 - Attachment is obsolete: true
Attachment #8518518 - Flags: review?(docfaraday)
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Review of attachment 8518518 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8518518 - Flags: review?(docfaraday) → review+
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Approval Request Comment
[Feature/regressing bug #]: N/A

[User impact if declined]: random loss of calls (or often either the audio or video of a call) after some number of minutes when using TURN and one-way connections (Loop/Hello uses all one-way connections, and 10-25% of calls need to use TURN).

[Describe test coverage new/current, TBPL]: Requires long tests/dogfooding with TURN server use.  We don't have TURN in tbpl, and can't run 15-minute call tests there, so this is manually tested.

[Risks and why]: Very low risk patch; just turns on refresh if we receive data, not just when we send it.  We have a more-complete (and slightly higher risk) patch in another bug that will land on inbound but not be uplifted.

[String/UUID change made/needed]: none
Attachment #8518518 - Flags: approval-mozilla-beta?
Attachment #8518518 - Flags: approval-mozilla-aurora?
Given that this is not easily reproducible through manual testing, I'm untracking this for QE verification. We can reconsider if someone can propose more robust testing, but sitting in calls for tens of minutes or longer at a time is probably not going to cut it.
Flags: qe-verify-
https://hg.mozilla.org/mozilla-central/rev/a9f7ff5b387e
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: mozilla34 → mozilla36
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Approving for Aurora for wider bake audience.
Attachment #8518518 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Beta+
Attachment #8518518 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Randell Jesup [:jesup] from comment #41)
> https://hg.mozilla.org/releases/mozilla-beta/rev/d73c4671a18f

Any reason this landed on Beta before Aurora?
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #42)
> (In reply to Randell Jesup [:jesup] from comment #41)
> > https://hg.mozilla.org/releases/mozilla-beta/rev/d73c4671a18f
> 
> Any reason this landed on Beta before Aurora?

Aurora has been closed since late last week to prepare for the 10th anniversary and won't reopen until sometime later today (hopefully), and the next-to-last beta goes to build (GTB) today.  Discussed with lsblakk and amandel before approval was granted.  lsblakk initially approved for Aurora without realizing it had been closed for the weekend.
Thanks for the explanation.
Hi Mark,
Could you please provide a repro video or repro steps, thanks!
Flags: needinfo?(standard8)
(In reply to Shine from comment #47)
> Hi Mark,
> Could you please provide a repro video or repro steps, thanks!

Please see comment 0 - basically have lots of long calls and see if you get a/v issues on any of them.

I believe this may be related to specific network setups - Paul may be able to advise more.
Flags: needinfo?(standard8) → needinfo?(pkerr)
First, test with the Loop client, or at least with a test client that uses the TokBox Loop server. That is necessary because the bug manifests itself for calls that are built up from two half-duplex connections (a send-only and a recv-only RTP connection), which is how TokBox sets up calls.

One of the clients in the call should be behind a NAT that blocks direct connections and therefore needs to use a TURN relay. You are looking for the condition that can be found in the attached logs, where a client has a transmit connection to the other client through a non-relayed path and a receive connection via the TURN relay. It is thought that the bug occurs when a client does not renew the TURN allocation because it never sends any data via that path.

Once such a connection is set up, keep an active call going for at least ten minutes; longer (up to 20 minutes) is better.
Flags: needinfo?(pkerr)