1074517 - Second call with URL gets disconnected

Reporter

Description

•

10 years ago

- Generate a link clicker URL - Establish the call and enjoy... - Hang up the call from either end - Use the same URL again to call a second time - Call gets established just fine (audio and video working) Actual behavior: - After 3-5 seconds into the second call the call terminates for both sides Expected behavior: - Second call should run without getting terminated.

u279076

Comment 1

•

10 years ago

Nils, can you please see if this reproduces with the latest Aurora and Beta builds?

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Reporter

Comment 2

•

10 years ago

Reproduces on: - Beta - Aurora - Nightly as the once creating the link. My guess is at this point that it is a server side issue.

status-firefox33: --- → affected

status-firefox34: --- → affected

status-firefox35: --- → affected

Component: Client → Server

Flags: needinfo?(drno)

QA Contact: anthony.s.hughes

Mark Banner (:standard8)

Assignee

Comment 3

•

10 years ago

Nils, it would be useful to turn on some logging to see narrow down the issue. In FF desktop, there's a "loop.debug.websocket" preference, if you can set that to true, and monitor the browser console on the link generation side. When the call connects, there should be some extra info appear along the lines of "WS <something>". My suspicion is that the call is timing out, but getting that information might help. You can also monitor this on the link-clicker side, by opening up a web console, and entering localStorage.setItem("debug.websocket", true); You can clear it later if you want with: localStorage.removeItem("debug.websocket");

Mark Banner (:standard8)

Assignee

Updated

•

10 years ago

Flags: needinfo?(drno)

Richard Pappalardo [:rpapa][:rpappalardo]

Comment 4

•

10 years ago

I tried placing several calls from Nightly (calling Aurora). I had loop.debug.websocket set to try on link generation (Nightly) side. I made about 6 calls. The first and last calls failed (disconnected), but all the ones in-between succeeded. Log from failed call: http://pastebin.mozilla.org/6672346 Log from successful call: http://pastebin.mozilla.org/6672352

u279076

Comment 6

•

10 years ago

Wontfix for 33 since Loop has been disabled there.

status-firefox33: affected → wontfix

Alexis Metaireau (:alexis)

Comment 7

•

10 years ago

Changing the component here since we don't know if that's related to the server or the client.

Component: Server → General

QA Contact: anthony.s.hughes

Nils Ohlmeier [:drno]

Reporter

Comment 8

•

10 years ago

Attached file link_clicker_web_console.log — Details

Nils Ohlmeier [:drno]

Reporter

Comment 9

•

10 years ago

Attached file link_generator_browser_console.log — Details

Nils Ohlmeier [:drno]

Reporter

Comment 10

•

10 years ago

So both sides of the call as you can see in the logs say: "WS Receiving" "{"messageType":"progress","state":"half-connected"}" and then later "WS Receiving" "{"messageType":"progress","state":"terminated","reason":"timeout"}" But the call is up and running. So who sends this progress terminated message, even though the call is up and running (I see video on both ends, I hear a nice audio feedback loop)?

Flags: needinfo?(drno)

Mark Banner (:standard8)

Assignee

Comment 11

•

10 years ago

Given the logs, I suspect this is a duplicate/related to bug 1073410. The media hasn't quite managed to finish exchanging fully before the server timeout kicks in.

Depends on: 1073410

Rémy Hubscher (:natim)

Comment 12

•

10 years ago

Maybe we should add a new action to give more time to the connection process if needed. Abr do you think this could be relevant?

Flags: needinfo?(adam)

Alexis Metaireau (:alexis)

Comment 13

•

10 years ago

Remy, I fear we will always have this case (even if we have a new action to give more time): at some point in time, the connection timer will expire and the server will end the WS connection. Exactly when the server decides the timer is over, the client can still send it's media up, but it's too late since the server already closed the connection. I believe the following happens: > Client1 Server Client2 > | init conn. | init conn. | > |----------------->|<---------------------| > | | (server starts timer)| > | media Up | | > |----------------->| | > | | Time pass, timer | > | | expires | > | close conn | close conn. | > |<-----------------|--------------------->| > | | media-up | > | |<---------------------| I don't know how that's being handled on the client, or if that's the problem we're seeing?

Rémy Hubscher (:natim)

Comment 14

•

10 years ago

If you can see both video with good feedback, it means that one of the two clients didn't send the connected action fast enough so that the server get's it before the timeout. We could increase the connection timer.

Alexis Metaireau (:alexis)

Comment 15

•

10 years ago

So, "media-up" is sent when the two clients are connected, in case the client is connected and then gets disconnected from the websocket, I believe it shouldn't have any impact and the call should not be terminated, so maybe that's just for the client to handle the socket connection differently once the media had been setup? (then, yes, we would have an inconsistency between what the server knows and the actual reality, but well, that's a border case, I believe).

Rémy Hubscher (:natim)

Comment 16

•

10 years ago

If both webrtc party can know they are connected to each other, that could work.

Adam Roach [:abr]

Comment 17

•

10 years ago

(In reply to Alexis Metaireau (:alexis) from comment #15) > So, "media-up" is sent when the two clients are connected, in case the > client is connected and then gets disconnected from the websocket, I believe > it shouldn't have any impact and the call should not be terminated, so maybe > that's just for the client to handle the socket connection differently once > the media had been setup? (then, yes, we would have an inconsistency between > what the server knows and the actual reality, but well, that's a border > case, I believe). I'd like to see logging that indicates that this is the issue before we go about changing behavior. If this is really just a race between the media coming up on both sides and the clients sending the "media-up" events, then I think increasing the timers (again) would be okay. But I'm really rather hesitant to take either action unless we see logging that confirms that one of these two things is happening. Given the timeframes involved here (10 seconds is a pretty long time to establish media streams), I'm actually very much inclined to agree with Mark's guess: this feels like Bug 1073410. My preferred solution here is to add a self-view prior to contacting the server, using the technique I demonstrate here: https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1039224&attachment=8497795 (which obviously needs some clean up, but it demonstrates an effective approach for self-view before activating the call).

Flags: needinfo?(adam)

Alexis Metaireau (:alexis)

Comment 18

•

10 years ago

That's correct I have no logs to back up what I'm thinking is the issue, and I've actually not be able to reproduce the issue. I think this case could happen, though, and we would need to check what's the client behavior in this case, nevertheless. I'll let you debug on Bug 1073410, feel free to fill a server bug if you think server might be the problem here.

Mark Banner (:standard8)

Assignee

Comment 19

•

10 years ago

Once we get something for bug 1073410, we'll be able to remove the case where the user takes a "long" time to accept the prompt, which typically causes this. We can then more easily watch for the cases where its just the media taking a long time to connect once the call is actually accepted.

Nils Ohlmeier [:drno]

Reporter

Comment 20

•

10 years ago

Just to make it clear this bug to me looks/feels quite different then bug 1073410 (so they are not dupes IMHO). Here is why: - I follow exactly the same behavior for the first and second call. If this would a timing issue of clicking on the door hanger (and yes there is obviously delay as I drive two browsers) then the first call should fail, but it does not. - The second call gets established perfectly fine! Both sides can see each other. Both sides can hear each other (a nice audio feedback loop). about:webrtc reports working ICE connections with packets flowing. Not sure if you noticed, but in the log files I see that the link generator sends this pretty early on: WS Sending" Object { messageType: "action", event: "media-up" } Is it reporting here that the media flow is up and running? But then it receives the report from the link clicker side: WS Receiving" "{"messageType":"progress","state":"half-connected"} And the link clicker side for what ever reason never reports that media is up fully. FYI I actually had exactly this issue on a real call with Maire a couple of minutes ago. The first call attempt was not successful because of user error. Then she clicked on the link a second time. We got connected for a couple of seconds (with audio and video just fine). And then the call terminated for both of us.

Nils Ohlmeier [:drno]

Reporter

Comment 21

•

10 years ago

So here is comparison of the WS messages from a successful call: Link generator: "WS Sending" Object { messageType: "action", event: "media-up" } "WS Receiving" "{"messageType":"progress","state":"connected"}" websocket.js:252 "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: Link clicker: "WS Sending" Object { messageType: "hello", callId: "WS Receiving" "{"messageType":"hello","state":"init"}" websocket.js:277 "WS Receiving" "{"messageType":"progress","state":"alerting"}" websocket.js:277 "WS Receiving" "{"messageType":"progress","state":"connecting"}" "WS Sending" Object { messageType: "action", event: "media-up" } "WS Receiving" "{"messageType":"progress","state":"half-connected"}" "WS Receiving" "{"messageType":"progress","state":"connected"}" websocket.js:277 "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: Now for the failure case of this bug the WS messages look like this: Link generator: "WS Sending" Object { messageType: "action", event: "media-up" } "WS Receiving" "{"messageType":"progress","state":"half-connected"}" "WS Receiving" "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: Link clicker: "WS Sending" Object { messageType: "hello", callId: "WS Receiving" "{"messageType":"hello","state":"init"}" websocket.js:277 "WS Receiving" "{"messageType":"progress","state":"alerting"}" websocket.js:277 "WS Receiving" "{"messageType":"progress","state":"connecting"}" "WS Receiving" "{"messageType":"progress","state":"half-connected"}" "WS Receiving" "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: BTW really annoying that the "WS Receiving" has an extra new line in the log message for the "{"messageType":"progress","state":"terminated","reason":"timeout"}" messages. So the "emtpy" lines above are in fact where both clients received the state terminated messages. The noticeable difference in my opinion here is that the for the second call the link clicker never sends this: "WS Sending" Object { messageType: "action", event: "media-up" } My guess is that makes the server side believe the call never got fully established and therefore tells both clients to terminate the call?!

Mark Banner (:standard8)

Assignee

Comment 22

•

10 years ago

> Link generator: > "WS Sending" Object { messageType: "action", event: "media-up" } > "WS Receiving" "{"messageType":"progress","state":"connected"}" > websocket.js:252 > "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: This is a bit strange, and looks like you've missed a few lines from the log copying (both sides go through the "hello" messages, and both sides should get all progress updates. > The noticeable difference in my opinion here is that the for the second call > the link clicker never sends this: > "WS Sending" Object { messageType: "action", event: "media-up" } > My guess is that makes the server side believe the call never got fully > established and therefore tells both clients to terminate the call?! So for "connection", both sides have to send "media-up". "media-up" is sent once the local stream is published and the remote stream has been subscribed to (not quite shown normally, but close enough. If that doesn't happen within a 10 second period after the server changes the progress to "connecting" (e.g. the call was accepted), then it times out the connection. I've generally seen different call setup times call to call using the same machine, so it doesn't entirely surprise me that a second call might be different. This is generally why I'm thinking this is the same as bug 1073410. Although what might be useful, is to get timestamps on the "state":"connecting" through "state":"terminated", I believe there's some way of making the console do this, but I've never worked out the magic command.

Nils Ohlmeier [:drno]

Reporter

Comment 23

•

10 years ago

(In reply to Mark Banner (:standard8) from comment #22) > > Link generator: > > "WS Sending" Object { messageType: "action", event: "media-up" } > > "WS Receiving" "{"messageType":"progress","state":"connected"}" > > websocket.js:252 > > "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: > > This is a bit strange, and looks like you've missed a few lines from the log > copying (both sides go through the "hello" messages, and both sides should > get all progress updates. One explanation here is that I might cleared the logs before starting the call. The other could be that the browser console buffer seems to be limited. Is there a way to hold it more log messages? > > The noticeable difference in my opinion here is that the for the second call > > the link clicker never sends this: > > "WS Sending" Object { messageType: "action", event: "media-up" } > > My guess is that makes the server side believe the call never got fully > > established and therefore tells both clients to terminate the call?! > > So for "connection", both sides have to send "media-up". "media-up" is sent > once the local stream is published and the remote stream has been subscribed > to (not quite shown normally, but close enough. > > If that doesn't happen within a 10 second period after the server changes > the progress to "connecting" (e.g. the call was accepted), then it times out > the connection. Is it possible that this 10 seconds after "connecting" gets confused, because I'm placing the first call, hang up and then start a second call briefly after clicking the green feedback smileys? Could the server side get mix up the state of these two calls for the same URL? > I've generally seen different call setup times call to call using the same > machine, so it doesn't entirely surprise me that a second call might be > different. > > This is generally why I'm thinking this is the same as bug 1073410. Although > what might be useful, is to get timestamps on the "state":"connecting" > through "state":"terminated", I believe there's some way of making the > console do this, but I've never worked out the magic command. If you let me know how I'll be happy to give it a try.

Nils Ohlmeier [:drno]

Reporter

Comment 24

•

10 years ago

Some more notes from trying to debug this: - Richard and me were able to reproduce this problem 100% across two different machines and different FF versions (Aurora and Nightly) - When I tried to today to wait 30min in between the first and the second call it did not help (I thought maybe some timeouts would got reset by that time) - Today with Nightly after hitting this problem once I was not even able to make any call with a new URL any more fore more then 10s (In other words: even if I generated a new URL, the initial call on that new URL would get terminated before 10s!). One thing to potentially investigate is: check if it helps to close the tab from the first call and wait some time to make sure the PeerConnections are garbage collected....

James Bonacci [:jbonacci]

Updated

•

10 years ago

Flags: qe-verify+

:shell escalante

Updated

•

10 years ago

backlog: --- → Fx34+

Maire Reavy [:mreavy]

Comment 26

•

10 years ago

We're fixing bug 1073410 first, and then we'll see if that helps this bug.

Nils Ohlmeier [:drno]

Reporter

Comment 27

•

10 years ago

If the tab from the first call gets closed and another gets opened where you start the second call everything works fine. So only if you submit the feedback after the first call and then wait the 5 seconds to get bounced back to the original start call page and you start the second call from there this problem appears. I have the feeling that now it takes longer until the second call gets dis-connected (maybe some timeout value on the server side got increased?), but it still happens.

Mark Banner (:standard8)

Assignee

Comment 28

•

10 years ago

Ok, I've hit one case where I can reproduce this issue: 1) Generate a link 2) Click the link in a browser 3) At the generation end, select Accept 4) Wait (do not accept the camera/microphone prompt) 5) Let the call timeout. 6) Submit feedback if necessary (bug 1081959 should change it to "call failed") 7) Start the call again (start or retry) 8) Accept the call 9) Accept the camera/microphone prompt Actual Results => Call connects and times out a few seconds later Expected Results => Call connects and continues happily

Mark Banner (:standard8)

Assignee

Comment 29

•

10 years ago

(In reply to Nils Ohlmeier [:drno] from comment #27) > If the tab from the first call gets closed and another gets opened where you > start the second call everything works fine. > So only if you submit the feedback after the first call and then wait the 5 > seconds to get bounced back to the original start call page and you start > the second call from there this problem appears. Yes this sounds right as well. > I have the feeling that now it takes longer until the second call gets > dis-connected (maybe some timeout value on the server side got increased?), > but it still happens. Yeah, we increased the timeout. This is primarily an issue on the standalone side, I have a vague handle on what's going on.

Assignee: nobody → standard8

Iteration: --- → 36.1

No longer depends on: 1073410

Priority: -- → P1

Target Milestone: --- → mozilla36

Mark Banner (:standard8)

Assignee

Comment 30

•

10 years ago

Attached patch Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset. — Details — Splinter Review

Ok, I found the problem. There's a couple of flags that are used to work out when we have a published and a subscribed stream - these aren't getting reset when a new call is started, so on a "redial" the standalone client doesn't notify the media is up, so the server times out the call. Adding the resets to both end and start session, is really just bullet-proofing in case there's a situation I haven't thought of, as the code is quite complex. I checked the newer code based on the flux model for the outgoing direct calling, and it doesn't suffer from this problem, as the variables are already cleaned up.

Attachment #8507239 - Flags: review?(dmose)

Mark Banner (:standard8)

Assignee

Comment 31

•

10 years ago

Note: although some of this code is shared with desktop, it doesn't actually affect the desktop code - only standalone.

Whiteboard: [standalone only]

Dan Mosedale (:dmosedale, :dmose)

Comment 32

•

10 years ago

Comment on attachment 8507239 [details] [diff] [review] Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset. Review of attachment 8507239 [details] [diff] [review]: ----------------------------------------------------------------- r=dmose, with the points here addressed as you see fit. ::: browser/components/loop/content/shared/js/models.js @@ +189,5 @@ > this.session.disconnect(); > + this.set({ > + publishedStream: false, > + subscribedStream: false, > + outgoing: false The above line looks like it wants to be "ongoing: false" instead of "outgoing: false". ::: browser/components/loop/test/shared/models_test.js @@ +335,5 @@ > it("should set the ongoing attribute to false", function() { > model.endSession(); > > expect(model.get("ongoing")).eql(false); > }); The fact that the above test didn't start failing on the original change from "outgoing" to "ongoing" kinda suggests that we should be stubbing out session.disconnect.

Attachment #8507239 - Flags: review?(dmose) → review+

:shell escalante

Updated

•

10 years ago

Target Milestone: mozilla36 → ---

Mark Banner (:standard8)

Assignee

Comment 33

•

10 years ago

(In reply to Dan Mosedale (:dmose) - not reading bugmail; needinfo? for response from comment #32) > ::: browser/components/loop/test/shared/models_test.js > @@ +335,5 @@ > > it("should set the ongoing attribute to false", function() { > > model.endSession(); > > > > expect(model.get("ongoing")).eql(false); > > }); > > The fact that the above test didn't start failing on the original change > from "outgoing" to "ongoing" kinda suggests that we should be stubbing out > session.disconnect. Turns out the test hadn't set ongoing to true to begin with, and the default was false. Now fixed.

Mark Banner (:standard8)

Assignee

Comment 34

•

10 years ago

https://hg.mozilla.org/integration/fx-team/rev/fd211cd3ed68

Target Milestone: --- → mozilla36

Mark Banner (:standard8)

Assignee

Comment 35

•

10 years ago

Resetting the affected flags - although this affects shared code, it only affects the standalone client as this part of the shared code isn't run on desktop - new incoming calls have different conversation windows. The outgoing conversation windows currently have a different code route and are not affected by this.

status-firefox34: affected → ---

status-firefox35: affected → ---

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 36

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/fd211cd3ed68

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

u279076

Comment 37

•

10 years ago

[Tracking Requested - why for this release]: nominated to track since this is a Loop 34 blocker. Nils, can you please verify this is fixed in the latest mozilla-central build?

status-firefox34: --- → affected

status-firefox35: --- → affected

status-firefox36: --- → fixed

tracking-firefox34: --- → ?

tracking-firefox35: --- → ?

tracking-firefox36: --- → ?

Flags: needinfo?(drno)

QA Contact: anthony.s.hughes → drno

Nils Ohlmeier [:drno]

Reporter

Comment 38

•

10 years ago

Verified with 36.0a1 (2014-10-24) on Mac OSX that it no longer drops the second call for me.

status-firefox36: fixed → verified

Flags: needinfo?(drno)

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

10 years ago

tracking-firefox34: ? → +

tracking-firefox35: ? → +

tracking-firefox36: ? → +

Lukas Blakk [:lsblakk] use ?needinfo

Comment 39

•

10 years ago

Since this is verified on Nightly can we get some Aurora/Beta uplift nomination requests please?

Flags: needinfo?(standard8)

Maire Reavy [:mreavy]

Comment 40

•

10 years ago

Please see Comment 31 and Comment 35. This bug fix is for the standalone app (not the desktop code); desktop is not affected. That is why we haven't requested uplift. This bug fix will get pushed out with the next standalone app update (which the Ops folks handle for us). We are updating the standlone app on the order of once per week currently.

status-firefox34: affected → ---

status-firefox35: affected → ---

tracking-firefox34: + → ---

tracking-firefox35: + → ---

tracking-firefox36: + → ---

Flags: needinfo?(standard8)

Mark Banner (:standard8)

Assignee

Comment 41

•

10 years ago

(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #40) > This bug fix will get pushed out with the next standalone app update (which > the Ops folks handle for us). We are updating the standlone app on the > order of once per week currently. This was actually deployed last Friday.

u279076

Updated

•

10 years ago

Status: RESOLVED → VERIFIED

Randell Jesup [:jesup] (needinfo me)

Comment 42

•

10 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/3dc36b88d623

u279076

Updated

•

10 years ago

status-firefox35: --- → fixed

Randell Jesup [:jesup] (needinfo me)

Comment 43

•

10 years ago

Comment on attachment 8507239 [details] [diff] [review] Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset. Approval Request Comment Landed on aurora per IRC with lsblakk with a=loop-only

Attachment #8507239 - Flags: approval-mozilla-aurora?

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

10 years ago

Attachment #8507239 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

link_clicker_web_console.log 10 years ago Nils Ohlmeier [:drno] 6.03 KB, text/x-log		Details
link_generator_browser_console.log 10 years ago Nils Ohlmeier [:drno] 47.25 KB, text/x-log		Details
Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset. 10 years ago Mark Banner (:standard8) 4.89 KB, patch	dmosedale : review+ lsblakk : approval-mozilla-aurora+	Details \| Diff \| Splinter Review