Bug 1023539 (Closed) - Opened 10 years ago, Closed 10 years ago
In call audio failed (both ways), the video froze later (one way)
Categories: Core :: WebRTC: Audio/Video, defect, P1
Status: RESOLVED FIXED
Target Milestone: mozilla36
People: Reporter: standard8, Assigned: pkerr
Whiteboard: [webrtc-uplift]
Attachments (4 files, 4 obsolete files)
14.52 KB, text/plain | Details
110.08 KB, text/plain | Details
164.71 KB, text/plain | Details
2.53 KB, patch | bwc: review+ | lsblakk: approval-mozilla-aurora+ | lmandel: approval-mozilla-beta+ | Details | Diff | Splinter Review
Dan and I were chatting (MoCo offices - my home); we had been chatting for about 6 minutes when we lost audio at both ends - it seemed to happen at exactly the same time. After a further few minutes Dan's video froze (at both ends of the call). We were both on Mac. I did not have a headset; Dan did. Jesup has the about:webrtc information and the extra logs from the call.
Reporter
Comment 1 • 10 years ago
Maire & I just had the situation where we were on a Loop call, and my audio stopped working completely. Here's my about:webrtc log.
Comment 2 • 10 years ago
It's unclear at this point if this is a Core::WebRTC bug or a Loop bug (possibly related to the SDK). I've asked pkerr to start investigating this and see if we can narrow down the possible culprits.

Anthony, Nils -- anything you guys can do to help us reproduce this and get a reduced test case is appreciated. I'm doing a needinfo to you both to get this on your radar and see if you have any suggestions for getting us more data. Thanks.

NOTE: I've seen audio freeze in one direction and not the other. I've also seen instances where the video freezes in one direction and never comes back, but the audio continues in both directions. I don't believe I've seen instances of freezing with no recovery on non-Loop calls (e.g. talky.io). If anyone has seen audio or video freezes with no recovery outside of Loop, that would be useful info -- especially if it's audio and not video, or video and not audio. Thanks.
Assignee: nobody → pkerr
Flags: needinfo?(drno)
Flags: needinfo?(anthony.s.hughes)
Whiteboard: [investigation]
Comment 3 • 10 years ago
The one about:webrtc log attached to this uses server-reflexive candidates on both sides, so it looks like we can't blame the relay. Does the SDK handle every single media packet in the case where no relay is in use? From the comments here it sounds like Mark has a nasty home router :-)

Unfortunately we don't have automated end-to-end calls working yet, so I guess the best we can do is add manual testing to see if we can get it reproduced.
Flags: needinfo?(drno)
Assignee
Updated • 10 years ago
Whiteboard: [investigation] → [investigation][p=4]
Comment 4 • 10 years ago
I talked to the TokBox folks yesterday at their office about this. They were claiming that if the faulty calls go P2P (not over their servers), their SDK does not really touch the media path. That would then leave us with a platform issue here.
Sorry, I just got back from PTO and am still catching up on mail. I'll coordinate some manual testing to try to confirm if this is a Loop issue or a Platform issue. We'll need to get this clearly reproduced before we can work on a reduced testcase and/or a regression window.
Flags: needinfo?(anthony.s.hughes)
Comment 6 • 10 years ago
FYI my understanding is that there have been hints that some degree of packet loss might trigger this.
Comment 7 • 10 years ago
Also that in one case dmose was seeing "Dropped" frames in about:webrtc going up a lot when video from him to mreavy froze. That implies captured frames were not being turned into packets and sent for some reason. Hitting Start Debug Log in about:webrtc at that point might be helpful (plus NSPR logs). Note the Debug logs get saved separately from NSPR; the location defaults (%TEMP% for win, /tmp for linux/mac) but can be overridden in about:config
Comment 8 • 10 years ago
I'd love to get this into Fx 33. (Once we identify the fix, this is the sort of bug that could/would be uplifted.) This is right behind bug 1002414 in pkerr's queue. He has already started investigating and is trying to reproduce this.
Priority: -- → P1
Target Milestone: --- → mozilla33
Comment 10 • 10 years ago
I worked with some people at Softvision to try to reproduce this but we've not yet succeeded. Softvision worked on this all night and didn't experience any call quality issues. I had them try both Loop and AppRTC on different networks with calls lasting longer than 10 minutes and no problems were encountered.
Comment 11 • 10 years ago
Thanks - I've never (yet, personally) seen it fail in this manner on apprtc/talky/etc. I know mreavy/dmose/standard8/etc. have seen it multiple times on Loop.
Assignee
Comment 12 • 10 years ago
(In reply to Nils Ohlmeier [:drno] from comment #4)
> I talked to the TokBox folks yesterday at their office about this.
> If the faulty calls go P2P (not over their servers), they were claiming
> their SDK does not really touch the media parts. So that would leave us then
> with a platform issue here.

If you look at the attached log, the address that is treated as a peer-reflexive candidate is actually the same as the relayed candidate from the remote peer. Discussing this with bwc, he seems to feel that if the local peer sees the connectivity-check BIND request from the remote before being informed of the remote relay candidate via trickle, it marks it as peer reflexive. This address, 70.42.47.104:20444/udp, was the one selected for the audio part of the call. (The Loop client never noticed that it matched the relay candidate.) So, unless I am reading the log wrong, the audio channel for this call did go through the Loop server while the rest of the channels ended up peer-to-peer.
Comment 13 • 10 years ago
70.42.47.104 is one of the Loop TURN servers.

Could the issue be related to this one reported some time back? https://bugzilla.mozilla.org/show_bug.cgi?id=935806 (that issue was causing connections to be closed after 5 minutes with some routers)
Assignee
Comment 14 • 10 years ago
(In reply to ggb from comment #13)
> 70.42.47.104 is one of the loop turn servers
>
> Could the issue be related to this one reported some time back?
> https://bugzilla.mozilla.org/show_bug.cgi?id=935806 (that issue was making
> connections be closed after 5 mins with some routers)

That fits with one of the scenarios on my suspects list.
Comment 15 • 10 years ago
Any way you can verify it? One of the core differences between loop and other services is probably the usage of multiple unidirectional PeerConnections and this issue (NAT expiration because of missing keepalives) would manifest only in that case.
Comment 16 • 10 years ago
This is the about:webrtc from a call between mreavy and me, where the video froze for her after about 40min. Looks like everything was relayed in this case.
Comment 17 • 10 years ago
From my call with Nils, where his video to me froze.
Updated • 10 years ago
Attachment #8461102 - Attachment mime type: text/x-log → text/plain
Comment 18 • 10 years ago
It looks like that call used rtcp-mux, so RTCP should have been keeping the pinholes open for recvonly streams, but the following makes it look like RTCP stopped being sent for video (note the very old timestamp on the "Remote:" label). Do you guys have a relatively precise time at which video stopped?

outbound_rtp_video_1
Encoder: Avg. bitrate: 77.88 Mbps (565.98 SD), Avg. framerate: 29.81 fps (0.99 SD), Dropped frames: 552263
Local: 12:53:36 GMT-0700 (PDT), outboundrtp, SSRC: 735382457, Sent: 634192 packets (682697.51 Kb)
Remote: 12:46:59 GMT-0700 (PDT), inboundrtp, SSRC: 735382457, Received: 0 packets (0 Kb), Lost: 74691, Jitter: 10.447, RTT: 0 ms
Assignee
Comment 19 • 10 years ago
(In reply to Byron Campen [:bwc] from comment #18)
> It looks like that call used rtcp-mux, so rtcp should have been keeping the
> pinholes open for recvonly streams, but the following makes it look like
> rtcp stopped being sent for video (note the very old timestamp on the
> "Remote:" label)? Do you guys have a relatively precise time at which video
> stopped?

RTCP traffic will keep the NATs open, but may not keep open the TURN allocation used for the relayed-UDP path. The client which allocated the relay should also be doing a refresh.
Assignee
Updated • 10 years ago
Status: NEW → ASSIGNED
Updated • 10 years ago
Target Milestone: mozilla33 → mozilla34
Assignee
Comment 20 • 10 years ago
I am looking at the difference between the nr_turn_client_parse_send_indication() and nr_turn_client_parse_data_indication() operations in /media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c. For the Send indication path, a call is made to nr_turn_client_ensure_perm() to signal to the TURN logic to continue to refresh the allocation for this relay. For the Data indication path, the fact that a matching permission exists is verified, but no call is made to the refresh maintenance logic. With the Loop server setting up recvonly and sendonly calls, a relayed connection may not send any Send indications for the recvonly segment, so this path will time out.

I have made a modification to the nr_turn_client_parse_data_indication() function and am trying to force a relayed connection to test this hypothesis.
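The asymmetry described above can be modeled with a small, self-contained sketch. This toy permission table is hypothetical (it is not the real nICEr code); only the roles of the two functions mirror nr_turn_client_ensure_perm() and nr_turn_permission_find():

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Toy model (NOT the real nICEr code): a TURN permission table where
 * "ensure" creates/refreshes a permission and "find" only checks it. */

#define MAX_PERMS 8
#define TURN_PERMISSION_LIFETIME_SECONDS 300 /* RFC 5766 2.3 */

typedef struct {
  char addr[64];
  time_t last_refresh;
} toy_perm;

typedef struct {
  toy_perm perms[MAX_PERMS];
  int nperms;
} toy_turn_ctx;

/* Send-indication path: create the permission if needed and refresh
 * its timer, keeping the relay reservation alive. Returns 0 on success. */
static int toy_ensure_perm(toy_turn_ctx *ctx, const char *addr, time_t now) {
  for (int i = 0; i < ctx->nperms; i++) {
    if (strcmp(ctx->perms[i].addr, addr) == 0) {
      ctx->perms[i].last_refresh = now;
      return 0;
    }
  }
  if (ctx->nperms == MAX_PERMS)
    return -1;
  strncpy(ctx->perms[ctx->nperms].addr, addr, sizeof ctx->perms[0].addr - 1);
  ctx->perms[ctx->nperms].addr[sizeof ctx->perms[0].addr - 1] = '\0';
  ctx->perms[ctx->nperms].last_refresh = now;
  ctx->nperms++;
  return 0;
}

/* Data-indication path (pre-fix): verify the permission is still live,
 * but never refresh it. Returns 0 if a live permission exists. */
static int toy_find_perm(const toy_turn_ctx *ctx, const char *addr, time_t now) {
  for (int i = 0; i < ctx->nperms; i++) {
    if (strcmp(ctx->perms[i].addr, addr) == 0 &&
        now - ctx->perms[i].last_refresh < TURN_PERMISSION_LIFETIME_SECONDS)
      return 0;
  }
  return -1;
}
```

On a sendrecv stream the Send indications keep calling the ensure path, so the permission never lapses; on a Loop-style recvonly stream only Data indications arrive, only the find path runs, and the permission expires after the 300-second lifetime.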
Assignee
Comment 21 • 10 years ago
WIP upload. DO NOT CHECKIN
Assignee
Comment 22 • 10 years ago
Current WIP
Assignee
Updated • 10 years ago
Attachment #8467470 - Attachment is obsolete: true
Comment 23 • 10 years ago
(In reply to Paul Kerr [:pkerr] from comment #20)
> I am looking at the difference in the nr_turn_client_parse_send_indication()
> and nr_turn_client_parse_data_indication() operations in
> /media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c. For the Send
> indication path, a call is made to nr_turn_client_ensure_perm() to signal to
> the TURN logic to continue to refresh the allocation for this relay. For the
> Data indication, the fact that a matching permission exists is verified but
> no call is made to the refresh maintenance logic. With the loop server
> setting up a recvonly and sendonly calls, a relayed connection may not use
> any Send indications for the recvonly segment. So, this path will time out.
>
> I have made a modification to the nr_turn_client_parse_data_indication()
> function and am trying to force relayed connection to test this hypothesis.

I do not believe that this is the correct fix. ICE implementations are supposed to ensure traffic is periodically sent in the outgoing direction, and this will cause the Send indication to refresh. If we are doing rtcp-mux this will happen on its own; if not, we are supposed to send a keepalive. IIRC these timers get set when we do finalize(), so perhaps that's not what's happening. In any case, you should be addressing this at the ICE level, not the TURN level, since you will have the same timeout problem with NATs.
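A minimal sketch of the keepalive bookkeeping described here (the 15-second interval is RFC 5245's default Tr; the names are hypothetical, not nICEr's):

```c
#include <assert.h>
#include <time.h>

/* Toy ICE keepalive timer: if nothing has been sent on a candidate
 * pair for Tr seconds, a keepalive is due. Over a TURN relay that
 * keepalive travels as a Send indication, which also refreshes the
 * permission - so even a recvonly stream keeps the allocation alive. */

#define ICE_KEEPALIVE_TR_SECONDS 15 /* RFC 5245 section 10 default */

typedef struct {
  time_t last_outgoing; /* last media packet or keepalive sent */
} toy_ice_pair;

static int toy_keepalive_due(const toy_ice_pair *p, time_t now) {
  return now - p->last_outgoing >= ICE_KEEPALIVE_TR_SECONDS;
}

static void toy_record_send(toy_ice_pair *p, time_t now) {
  p->last_outgoing = now; /* any outgoing traffic resets the timer */
}
```

This is why ekr argues the fix belongs at the ICE level: a periodic outgoing keepalive on every pair would solve both the TURN allocation timeout and the NAT binding timeout at once.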
Comment 26 • 10 years ago
FYI it seems I can fairly reliably replicate the behavior described in bug 1034595 between my home office and the Paris office - if it helps anyone, I am happy to help troubleshoot.
Assignee
Comment 27 • 10 years ago
update temporary fix
Assignee
Updated • 10 years ago
Attachment #8468539 - Attachment is obsolete: true
Assignee
Comment 28 • 10 years ago
Comment on attachment 8517592 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Requesting review as a possible temporary fix until keep-alive is implemented at the ICE peer connection level.
Attachment #8517592 - Flags: review?(rjesup)
Comment 29 • 10 years ago
Comment on attachment 8517592 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review of attachment 8517592 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +63,5 @@
>    times out after about 5. */
> #define TURN_PERMISSION_LIFETIME_SECONDS 300 /* 5 minutes. From RFC 5766 2.3 */
>
> +#define REFRESH_RESERVATION_ON_RECV 1 // Test for TURN reservation request on receive also
> +

File a bug for doing this the right way, and note it here in a comment.
Attachment #8517592 - Flags: review?(rjesup)
Attachment #8517592 - Flags: review?(docfaraday)
Attachment #8517592 - Flags: review+
Assignee
Comment 30 • 10 years ago
Added comment referencing permanent fix bug number
Assignee
Updated • 10 years ago
Attachment #8517592 - Attachment is obsolete: true
Attachment #8517592 - Flags: review?(docfaraday)
Assignee
Updated • 10 years ago
Attachment #8518310 - Flags: review?(docfaraday)
Updated • 10 years ago
status-firefox34: --- → affected
status-firefox35: --- → affected
status-firefox36: --- → affected
Whiteboard: [investigation][p=4] → [webrtc-uplift]
Comment 31 • 10 years ago
Comment on attachment 8518310 [details] [diff] [review]
run TURN relay reservation refresh logic when handling a Data Indication

Review of attachment 8518310 [details] [diff] [review]:
-----------------------------------------------------------------

One problem, but big enough that I'm not comfortable giving r+.

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +64,5 @@
> #define TURN_PERMISSION_LIFETIME_SECONDS 300 /* 5 minutes. From RFC 5766 2.3 */
>
> +// Set to enable a temporary fix that will run the TURN reservation keep-alive
> +// logic when data is received via a TURN relayed path: a DATA_INDICATION packet is received.
> +// This should be replace/removed when bug 1094997 is implemented.

Point this at bug 935806

@@ +787,5 @@
> +#if REFRESH_RESERVATION_ON_RECV
> +  if ((r=nr_turn_client_ensure_perm(ctx, remote_addr))) {
> +#else
> +  if ((r=nr_turn_permission_find(ctx, remote_addr, &perm))) {
> +#endif

Seems like this will create new permissions if we get data from a peer without one (and we'll allow the data through). We probably don't want that. Maybe what we want to do instead is move the nr_turn_client_ensure_perm after this block (ie; move the nr_transport_addr_copy back where it was before, and move the nr_turn_client_ensure_perm after that).
Attachment #8518310 - Flags: review?(docfaraday) → review-
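The hazard bwc flags in comment 31, and the reordering he suggests, can be sketched as follows. This is illustrative toy code; only nr_turn_client_ensure_perm() and nr_turn_permission_find() correspond to real nICEr names, and the toy table stands in for the real permission list:

```c
#include <assert.h>
#include <string.h>

/* Toy sketch of the review point: on a Data indication, check the
 * permission FIRST and drop traffic from unknown peers; only then run
 * the refresh logic. Refreshing unconditionally would mint a new
 * permission for any address that sends us data. */

#define ACL_MAX 8

typedef struct {
  char addrs[ACL_MAX][64];
  int n;
} toy_acl;

static int toy_acl_find(const toy_acl *acl, const char *addr) {
  for (int i = 0; i < acl->n; i++)
    if (strcmp(acl->addrs[i], addr) == 0)
      return i;
  return -1;
}

/* Used by the send path: creates the entry if it is missing.
 * (A real implementation would also refresh the entry's timer.) */
static int toy_acl_ensure(toy_acl *acl, const char *addr) {
  if (toy_acl_find(acl, addr) >= 0)
    return 0;
  if (acl->n == ACL_MAX)
    return -1;
  strncpy(acl->addrs[acl->n], addr, 63);
  acl->addrs[acl->n][63] = '\0';
  acl->n++;
  return 0;
}

/* Corrected Data-indication handling per the review: find, then ensure. */
static int toy_on_data_indication(toy_acl *acl, const char *addr) {
  if (toy_acl_find(acl, addr) < 0)
    return -1;                  /* no permission: drop, don't create one */
  return toy_acl_ensure(acl, addr); /* permission exists: keep it alive */
}
```

With this ordering, data from a peer we never opened a permission for is rejected without side effects, while data from a legitimate peer still triggers the refresh.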
Assignee
Comment 32 • 10 years ago
incorporate review feedback
Assignee
Updated • 10 years ago
Attachment #8518310 - Attachment is obsolete: true
Assignee
Updated • 10 years ago
Attachment #8518518 - Flags: review?(docfaraday)
Comment 33 • 10 years ago
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Review of attachment 8518518 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8518518 - Flags: review?(docfaraday) → review+
Assignee
Comment 34 • 10 years ago
Try run: https://tbpl.mozilla.org/?tree=Try&rev=9b78c2480928
Comment 36 • 10 years ago
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: Random loss of calls (or often either the audio or video of a call) after some number of minutes when using TURN and one-way connections (Loop/Hello uses all one-way connections, and 10-25% of calls need to use TURN).
[Describe test coverage new/current, TBPL]: Requires long tests/dogfooding with TURN server use. We don't have TURN in tbpl, and can't run 15-minute call tests there, so this is manually tested.
[Risks and why]: Very low risk patch; just turns on refresh if we receive data, not just when we send it. We have a more-complete (and slightly higher risk) patch in another bug that will land on inbound but not be uplifted.
[String/UUID change made/needed]: none
Attachment #8518518 - Flags: approval-mozilla-beta?
Attachment #8518518 - Flags: approval-mozilla-aurora?
Comment 37 • 10 years ago
Given that this is not easily reproducible through manual testing, I'm untracking this for QE verification. We can reconsider this for QE verification if someone can propose more robust testing, but sitting in calls for tens of minutes or longer at a time is probably not going to cut it.
Flags: qe-verify-
https://hg.mozilla.org/mozilla-central/rev/a9f7ff5b387e
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: mozilla34 → mozilla36
Comment 39 • 10 years ago
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Approving for Aurora to give it a wider bake audience.
Attachment #8518518 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 40 • 10 years ago
Comment on attachment 8518518 [details] [diff] [review]
Run TURN relay reservation refresh logic when handling a Data Indication packet

Beta+
Attachment #8518518 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 42 • 10 years ago
(In reply to Randell Jesup [:jesup] from comment #41)
> https://hg.mozilla.org/releases/mozilla-beta/rev/d73c4671a18f

Any reason this landed on Beta before Aurora?
Comment 43 • 10 years ago
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #42)
> (In reply to Randell Jesup [:jesup] from comment #41)
> > https://hg.mozilla.org/releases/mozilla-beta/rev/d73c4671a18f
>
> Any reason this landed on Beta before Aurora?

Aurora has been closed since late last week to prepare for the 10th anniversary, and won't open until sometime later today (hopefully), and the next-to-last beta is GTB today. Discussed with lsblakk and lmandel before approval was granted. lsblakk initially approved for Aurora without realizing it had been closed for the weekend.
Comment 44 • 10 years ago
Thanks for the explanation.
Comment 46 • 10 years ago
https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/d73c4671a18f
status-b2g-v2.1: --- → fixed
status-b2g-v2.2: --- → fixed
Comment 47 • 10 years ago
Hi Mark,
Could you please provide a repro video or repro steps? Thanks!
Flags: needinfo?(standard8)
Reporter
Comment 48 • 10 years ago
(In reply to Shine from comment #47)
> Hi Mark,
> Could you please provide a repro video or repro steps, thanks!

Please see comment 0 - basically have lots of long calls and see if you get a/v issues on any of them. I believe this may be related to specific network setups - Paul may be able to advise more.
Flags: needinfo?(standard8) → needinfo?(pkerr)
Assignee
Comment 49 • 10 years ago
First, test with the Loop client, or at least with a test client that uses the TokBox Loop server. That is necessary because the bug manifests itself for calls that are built up from two half-duplex connections - a send-only and a recv-only RTP connection - and this is the method used by TokBox. One of the clients in the call should be behind a NAT that will block direct connections and needs to use a TURN relay.

You are looking for the condition found in the attached logs: a client has a transmit connection to the other client through a non-relayed path and a receive connection via the TURN relay. It is thought that the bug is triggered by a client not renewing the TURN allocation because it does not send any data via that path. Once such a connection is set up, keep an active call up for at least ten minutes, but longer is better, up to 20 minutes.
Flags: needinfo?(pkerr)