Closed Bug 1074517 Opened 10 years ago Closed 10 years ago

Second call with URL gets disconnected

Categories

(Hello (Loop) :: General, defect, P1)

defect

Tracking

(firefox33 wontfix, firefox35 fixed, firefox36 verified)

VERIFIED FIXED
mozilla36
Iteration:
36.1
Tracking Status
firefox33 --- wontfix
firefox35 --- fixed
firefox36 --- verified
backlog Fx34+

People

(Reporter: drno, Assigned: standard8)

References

Details

(Whiteboard: [standalone only])

Attachments

(3 files)

- Generate a link clicker URL
- Establish the call and enjoy...
- Hang up the call from either end
- Use the same URL again to call a second time
- Call gets established just fine (audio and video working)

Actual behavior:
- After 3-5 seconds into the second call the call terminates for both sides

Expected behavior:
- Second call should run without getting terminated.
Nils, can you please see if this reproduces with the latest Aurora and Beta builds?
Flags: needinfo?(drno)
Reproduces on:
- Beta
- Aurora
- Nightly
as the once creating the link. My guess is at this point that it is a server side issue.
Component: Client → Server
Flags: needinfo?(drno)
QA Contact: anthony.s.hughes
Nils, it would be useful to turn on some logging to see narrow down the issue.

In FF desktop, there's a "loop.debug.websocket" preference, if you can set that to true, and monitor the browser console on the link generation side.

When the call connects, there should be some extra info appear along the lines of "WS <something>".

My suspicion is that the call is timing out, but getting that information might help.

You can also monitor this on the link-clicker side, by opening up a web console, and entering 

localStorage.setItem("debug.websocket", true);

You can clear it later if you want with:

localStorage.removeItem("debug.websocket");
Flags: needinfo?(drno)
I tried placing several calls from Nightly (calling Aurora).  I had loop.debug.websocket set to try on link generation (Nightly)  side.

I made about 6 calls.  The first and last calls failed (disconnected), but all the ones in-between succeeded.  

Log from failed call: http://pastebin.mozilla.org/6672346
Log from successful call: http://pastebin.mozilla.org/6672352
Wontfix for 33 since Loop has been disabled there.
Changing the component here since we don't know if that's related to the server or the client.
Component: Server → General
QA Contact: anthony.s.hughes
So both sides of the call as you can see in the logs say:

"WS Receiving" "{"messageType":"progress","state":"half-connected"}"

and then later

"WS Receiving" "{"messageType":"progress","state":"terminated","reason":"timeout"}"

But the call is up and running. So who sends this progress terminated message, even though the call is up and running (I see video on both ends, I hear a nice audio feedback loop)?
Flags: needinfo?(drno)
Given the logs, I suspect this is a duplicate/related to bug 1073410. The media hasn't quite managed to finish exchanging fully before the server timeout kicks in.
Depends on: 1073410
Maybe we should add a new action to give more time to the connection process if needed.

Abr do you think this could be relevant?
Flags: needinfo?(adam)
Remy, I fear we will always have this case (even if we have a new action to give more time): at some point in time, the connection timer will expire and the server will end the WS connection.

Exactly when the server decides the timer is over, the client can still send it's media up, but it's too late since the server already closed the connection.

I believe the following happens:

>     Client1             Server                Client2
>        |  init conn.      |    init conn.        |
>        |----------------->|<---------------------|
>        |                  | (server starts timer)|
>        |  media Up        |                      |
>        |----------------->|                      |
>        |                  | Time pass, timer     |
>        |                  | expires              |
>        |       close conn | close conn.          |
>        |<-----------------|--------------------->|
>        |                  |             media-up |
>        |                  |<---------------------|


I don't know how that's being handled on the client, or if that's the problem we're seeing?
If you can see both video with good feedback, it means that one of the two clients didn't send the connected action fast enough so that the server get's it before the timeout.

We could increase the connection timer.
So, "media-up" is sent when the two clients are connected, in case the client is connected and then gets disconnected from the websocket, I believe it shouldn't have any impact and the call should not be terminated, so maybe that's just for the client to handle the socket connection differently once the media had been setup? (then, yes, we would have an inconsistency between what the server knows and the actual reality, but well, that's a border case, I believe).
If both webrtc party can know they are connected to each other, that could work.
(In reply to Alexis Metaireau (:alexis) from comment #15)
> So, "media-up" is sent when the two clients are connected, in case the
> client is connected and then gets disconnected from the websocket, I believe
> it shouldn't have any impact and the call should not be terminated, so maybe
> that's just for the client to handle the socket connection differently once
> the media had been setup? (then, yes, we would have an inconsistency between
> what the server knows and the actual reality, but well, that's a border
> case, I believe).

I'd like to see logging that indicates that this is the issue before we go about changing behavior.

If this is really just a race between the media coming up on both sides and the clients sending the "media-up" events, then I think increasing the timers (again) would be okay.

But I'm really rather hesitant to take either action unless we see logging that confirms that one of these two things is happening.

Given the timeframes involved here (10 seconds is a pretty long time to establish media streams), I'm actually very much inclined to agree with Mark's guess: this feels like Bug 1073410. My preferred solution here is to add a self-view prior to contacting the server, using the technique I demonstrate here: https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1039224&attachment=8497795 (which obviously needs some clean up, but it demonstrates an effective approach for self-view before activating the call).
Flags: needinfo?(adam)
That's correct I have no logs to back up what I'm thinking is the issue, and I've actually not be able to reproduce the issue.

I think this case could happen, though, and we would need to check what's the client behavior in this case, nevertheless.

I'll let you debug on Bug 1073410, feel free to fill a server bug if you think server might be the problem here.
Once we get something for bug 1073410, we'll be able to remove the case where the user takes a "long" time to accept the prompt, which typically causes this. We can then more easily watch for the cases where its just the media taking a long time to connect once the call is actually accepted.
Just to make it clear this bug to me looks/feels quite different then bug 1073410 (so they are not dupes IMHO).

Here is why:
- I follow exactly the same behavior for the first and second call. If this would a timing issue of clicking on the door hanger (and yes there is obviously delay as I drive two browsers) then the first call should fail, but it does not.
- The second call gets established perfectly fine! Both sides can see each other. Both sides can hear each other (a nice audio feedback loop). about:webrtc reports working ICE connections with packets flowing.

Not sure if you noticed, but in the log files I see that the link generator sends this pretty early on:
  WS Sending" Object { messageType: "action", event: "media-up" }
Is it reporting here that the media flow is up and running?
But then it receives the report from the link clicker side:
  WS Receiving" "{"messageType":"progress","state":"half-connected"}
And the link clicker side for what ever reason never reports that media is up fully.

FYI I actually had exactly this issue on a real call with Maire a couple of minutes ago. The first call attempt was not successful because of user error. Then she clicked on the link a second time. We got connected for a couple of seconds (with audio and video just fine). And then the call terminated for both of us.
So here is comparison of the WS messages from a successful call:

Link generator:
"WS Sending" Object { messageType: "action", event: "media-up" }
"WS Receiving" "{"messageType":"progress","state":"connected"}" websocket.js:252
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:

Link clicker:
"WS Sending" Object { messageType: "hello", callId:
"WS Receiving" "{"messageType":"hello","state":"init"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"alerting"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"connecting"}"
"WS Sending" Object { messageType: "action", event: "media-up" }
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
"WS Receiving" "{"messageType":"progress","state":"connected"}" websocket.js:277
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:

Now for the failure case of this bug the WS messages look like this:

Link generator:
"WS Sending" Object { messageType: "action", event: "media-up" }
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
"WS Receiving"
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:

Link clicker:
"WS Sending" Object { messageType: "hello", callId:
"WS Receiving" "{"messageType":"hello","state":"init"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"alerting"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"connecting"}"
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
"WS Receiving"
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:

BTW really annoying that the "WS Receiving" has an extra new line in the log message for the "{"messageType":"progress","state":"terminated","reason":"timeout"}" messages. So the "emtpy" lines above are in fact where both clients received the state terminated messages.

The noticeable difference in my opinion here is that the for the second call the link clicker never sends this:
  "WS Sending" Object { messageType: "action", event: "media-up" }
My guess is that makes the server side believe the call never got fully established and therefore tells both clients to terminate the call?!
> Link generator:
> "WS Sending" Object { messageType: "action", event: "media-up" }
> "WS Receiving" "{"messageType":"progress","state":"connected"}"
> websocket.js:252
> "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:

This is a bit strange, and looks like you've missed a few lines from the log copying (both sides go through the "hello" messages, and both sides should get all progress updates.

> The noticeable difference in my opinion here is that the for the second call
> the link clicker never sends this:
>   "WS Sending" Object { messageType: "action", event: "media-up" }
> My guess is that makes the server side believe the call never got fully
> established and therefore tells both clients to terminate the call?!

So for "connection", both sides have to send "media-up". "media-up" is sent once the local stream is published and the remote stream has been subscribed to (not quite shown normally, but close enough.

If that doesn't happen within a 10 second period after the server changes the progress to "connecting" (e.g. the call was accepted), then it times out the connection.

I've generally seen different call setup times call to call using the same machine, so it doesn't entirely surprise me that a second call might be different.

This is generally why I'm thinking this is the same as bug 1073410. Although what might be useful, is to get timestamps on the "state":"connecting" through "state":"terminated", I believe there's some way of making the console do this, but I've never worked out the magic command.
(In reply to Mark Banner (:standard8) from comment #22)
> > Link generator:
> > "WS Sending" Object { messageType: "action", event: "media-up" }
> > "WS Receiving" "{"messageType":"progress","state":"connected"}"
> > websocket.js:252
> > "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
> 
> This is a bit strange, and looks like you've missed a few lines from the log
> copying (both sides go through the "hello" messages, and both sides should
> get all progress updates.

One explanation here is that I might cleared the logs before starting the call. The other could be that the browser console buffer seems to be limited. Is there a way to hold it more log messages?
 
> > The noticeable difference in my opinion here is that the for the second call
> > the link clicker never sends this:
> >   "WS Sending" Object { messageType: "action", event: "media-up" }
> > My guess is that makes the server side believe the call never got fully
> > established and therefore tells both clients to terminate the call?!
> 
> So for "connection", both sides have to send "media-up". "media-up" is sent
> once the local stream is published and the remote stream has been subscribed
> to (not quite shown normally, but close enough.
> 
> If that doesn't happen within a 10 second period after the server changes
> the progress to "connecting" (e.g. the call was accepted), then it times out
> the connection.

Is it possible that this 10 seconds after "connecting" gets confused, because I'm placing the first call, hang up and then start a second call briefly after clicking the green feedback smileys? Could the server side get mix up the state of these two calls for the same URL?
 
> I've generally seen different call setup times call to call using the same
> machine, so it doesn't entirely surprise me that a second call might be
> different.
> 
> This is generally why I'm thinking this is the same as bug 1073410. Although
> what might be useful, is to get timestamps on the "state":"connecting"
> through "state":"terminated", I believe there's some way of making the
> console do this, but I've never worked out the magic command.

If you let me know how I'll be happy to give it a try.
Some more notes from trying to debug this:

- Richard and me were able to reproduce this problem 100% across two different machines and different FF versions (Aurora and Nightly)
- When I tried to today to wait 30min in between the first and the second call it did not help (I thought maybe some timeouts would got reset by that time)
- Today with Nightly after hitting this problem once I was not even able to make any call with a new URL any more fore more then 10s (In other words: even if I generated a new URL, the initial call on that new URL would get terminated before 10s!).

One thing to potentially investigate is: check if it helps to close the tab from the first call and wait some time to make sure the PeerConnections are garbage collected....
Flags: qe-verify+
backlog: --- → Fx34+
We're fixing bug 1073410 first, and then we'll see if that helps this bug.
If the tab from the first call gets closed and another gets opened where you start the second call everything works fine.
So only if you submit the feedback after the first call and then wait the 5 seconds to get bounced back to the original start call page and you start the second call from there this problem appears.

I have the feeling that now it takes longer until the second call gets dis-connected (maybe some timeout value on the server side got increased?), but it still happens.
Ok, I've hit one case where I can reproduce this issue:

1) Generate a link
2) Click the link in a browser
3) At the generation end, select Accept
4) Wait (do not accept the camera/microphone prompt)
5) Let the call timeout.
6) Submit feedback if necessary (bug 1081959 should change it to "call failed")
7) Start the call again (start or retry)
8) Accept the call
9) Accept the camera/microphone prompt

Actual Results

=> Call connects and times out a few seconds later

Expected Results

=> Call connects and continues happily
(In reply to Nils Ohlmeier [:drno] from comment #27)
> If the tab from the first call gets closed and another gets opened where you
> start the second call everything works fine.
> So only if you submit the feedback after the first call and then wait the 5
> seconds to get bounced back to the original start call page and you start
> the second call from there this problem appears.

Yes this sounds right as well.

> I have the feeling that now it takes longer until the second call gets
> dis-connected (maybe some timeout value on the server side got increased?),
> but it still happens.

Yeah, we increased the timeout.

This is primarily an issue on the standalone side, I have a vague handle on what's going on.
Assignee: nobody → standard8
Iteration: --- → 36.1
No longer depends on: 1073410
Priority: -- → P1
Target Milestone: --- → mozilla36
Ok, I found the problem. There's a couple of flags that are used to work out when we have a published and a subscribed stream - these aren't getting reset when a new call is started, so on a "redial" the standalone client doesn't notify the media is up, so the server times out the call.

Adding the resets to both end and start session, is really just bullet-proofing in case there's a situation I haven't thought of, as the code is quite complex.

I checked the newer code based on the flux model for the outgoing direct calling, and it doesn't suffer from this problem, as the variables are already cleaned up.
Attachment #8507239 - Flags: review?(dmose)
Note: although some of this code is shared with desktop, it doesn't actually affect the desktop code - only standalone.
Whiteboard: [standalone only]
Comment on attachment 8507239 [details] [diff] [review]
Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset.

Review of attachment 8507239 [details] [diff] [review]:
-----------------------------------------------------------------

r=dmose, with the points here addressed as you see fit.

::: browser/components/loop/content/shared/js/models.js
@@ +189,5 @@
>        this.session.disconnect();
> +      this.set({
> +        publishedStream: false,
> +        subscribedStream: false,
> +        outgoing: false

The above line looks like it wants to be "ongoing: false" instead of "outgoing: false".

::: browser/components/loop/test/shared/models_test.js
@@ +335,5 @@
>          it("should set the ongoing attribute to false", function() {
>            model.endSession();
>  
>            expect(model.get("ongoing")).eql(false);
>          });

The fact that the above test didn't start failing on the original change from "outgoing" to "ongoing" kinda suggests that we should be stubbing out session.disconnect.
Attachment #8507239 - Flags: review?(dmose) → review+
Target Milestone: mozilla36 → ---
(In reply to Dan Mosedale (:dmose) - not reading bugmail; needinfo? for response from comment #32)
> ::: browser/components/loop/test/shared/models_test.js
> @@ +335,5 @@
> >          it("should set the ongoing attribute to false", function() {
> >            model.endSession();
> >  
> >            expect(model.get("ongoing")).eql(false);
> >          });
> 
> The fact that the above test didn't start failing on the original change
> from "outgoing" to "ongoing" kinda suggests that we should be stubbing out
> session.disconnect.

Turns out the test hadn't set ongoing to true to begin with, and the default was false.

Now fixed.
https://hg.mozilla.org/integration/fx-team/rev/fd211cd3ed68
Target Milestone: --- → mozilla36
Resetting the affected flags - although this affects shared code, it only affects the standalone client as this part of the shared code isn't run on desktop - new incoming calls have different conversation windows.

The outgoing conversation windows currently have a different code route and are not affected by this.
[Tracking Requested - why for this release]: nominated to track since this is a Loop 34 blocker.

Nils, can you please verify this is fixed in the latest mozilla-central build?
Flags: needinfo?(drno)
QA Contact: anthony.s.hughes → drno
Verified with 36.0a1 (2014-10-24) on Mac OSX that it no longer drops the second call for me.
Flags: needinfo?(drno)
Since this is verified on Nightly can we get some Aurora/Beta uplift nomination requests please?
Flags: needinfo?(standard8)
Please see Comment 31 and Comment 35.  This bug fix is for the standalone app (not the desktop code); desktop is not affected.  That is why we haven't requested uplift.  

This bug fix will get pushed out with the next standalone app update (which the Ops folks handle for us).  We are updating the standlone app on the order of once per week currently.
Flags: needinfo?(standard8)
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #40)
> This bug fix will get pushed out with the next standalone app update (which
> the Ops folks handle for us).  We are updating the standlone app on the
> order of once per week currently.

This was actually deployed last Friday.
Status: RESOLVED → VERIFIED
Comment on attachment 8507239 [details] [diff] [review]
Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset.

Approval Request Comment
Landed on aurora per IRC with lsblakk with a=loop-only
Attachment #8507239 - Flags: approval-mozilla-aurora?
Attachment #8507239 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: