Closed
Bug 1074517
Opened 10 years ago
Closed 10 years ago
Second call with URL gets disconnected
Categories
(Hello (Loop) :: General, defect, P1)
Hello (Loop)
General
Tracking
(firefox33 wontfix, firefox35 fixed, firefox36 verified)
backlog | Fx34+ |
People
(Reporter: drno, Assigned: standard8)
References
Details
(Whiteboard: [standalone only])
Attachments
(3 files)
6.03 KB,
text/x-log
|
Details | |
47.25 KB,
text/x-log
|
Details | |
4.89 KB,
patch
|
dmosedale
:
review+
lsblakk
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
- Generate a link clicker URL
- Establish the call and enjoy...
- Hang up the call from either end
- Use the same URL again to call a second time
- Call gets established just fine (audio and video working)
Actual behavior:
- After 3-5 seconds into the second call the call terminates for both sides
Expected behavior:
- Second call should run without getting terminated.
Nils, can you please see if this reproduces with the latest Aurora and Beta builds?
Flags: needinfo?(drno)
Reporter | ||
Comment 2•10 years ago
|
||
Reproduces on:
- Beta
- Aurora
- Nightly
as the once creating the link. My guess is at this point that it is a server side issue.
status-firefox33:
--- → affected
status-firefox34:
--- → affected
status-firefox35:
--- → affected
Component: Client → Server
Flags: needinfo?(drno)
QA Contact: anthony.s.hughes
Assignee | ||
Comment 3•10 years ago
|
||
Nils, it would be useful to turn on some logging to see narrow down the issue.
In FF desktop, there's a "loop.debug.websocket" preference, if you can set that to true, and monitor the browser console on the link generation side.
When the call connects, there should be some extra info appear along the lines of "WS <something>".
My suspicion is that the call is timing out, but getting that information might help.
You can also monitor this on the link-clicker side, by opening up a web console, and entering
localStorage.setItem("debug.websocket", true);
You can clear it later if you want with:
localStorage.removeItem("debug.websocket");
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(drno)
Comment 4•10 years ago
|
||
I tried placing several calls from Nightly (calling Aurora). I had loop.debug.websocket set to try on link generation (Nightly) side.
I made about 6 calls. The first and last calls failed (disconnected), but all the ones in-between succeeded.
Log from failed call: http://pastebin.mozilla.org/6672346
Log from successful call: http://pastebin.mozilla.org/6672352
Wontfix for 33 since Loop has been disabled there.
Comment 7•10 years ago
|
||
Changing the component here since we don't know if that's related to the server or the client.
Component: Server → General
QA Contact: anthony.s.hughes
Reporter | ||
Comment 8•10 years ago
|
||
Reporter | ||
Comment 9•10 years ago
|
||
Reporter | ||
Comment 10•10 years ago
|
||
So both sides of the call as you can see in the logs say:
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
and then later
"WS Receiving" "{"messageType":"progress","state":"terminated","reason":"timeout"}"
But the call is up and running. So who sends this progress terminated message, even though the call is up and running (I see video on both ends, I hear a nice audio feedback loop)?
Flags: needinfo?(drno)
Assignee | ||
Comment 11•10 years ago
|
||
Given the logs, I suspect this is a duplicate/related to bug 1073410. The media hasn't quite managed to finish exchanging fully before the server timeout kicks in.
Depends on: 1073410
Comment 12•10 years ago
|
||
Maybe we should add a new action to give more time to the connection process if needed.
Abr do you think this could be relevant?
Flags: needinfo?(adam)
Comment 13•10 years ago
|
||
Remy, I fear we will always have this case (even if we have a new action to give more time): at some point in time, the connection timer will expire and the server will end the WS connection.
Exactly when the server decides the timer is over, the client can still send it's media up, but it's too late since the server already closed the connection.
I believe the following happens:
> Client1 Server Client2
> | init conn. | init conn. |
> |----------------->|<---------------------|
> | | (server starts timer)|
> | media Up | |
> |----------------->| |
> | | Time pass, timer |
> | | expires |
> | close conn | close conn. |
> |<-----------------|--------------------->|
> | | media-up |
> | |<---------------------|
I don't know how that's being handled on the client, or if that's the problem we're seeing?
Comment 14•10 years ago
|
||
If you can see both video with good feedback, it means that one of the two clients didn't send the connected action fast enough so that the server get's it before the timeout.
We could increase the connection timer.
Comment 15•10 years ago
|
||
So, "media-up" is sent when the two clients are connected, in case the client is connected and then gets disconnected from the websocket, I believe it shouldn't have any impact and the call should not be terminated, so maybe that's just for the client to handle the socket connection differently once the media had been setup? (then, yes, we would have an inconsistency between what the server knows and the actual reality, but well, that's a border case, I believe).
Comment 16•10 years ago
|
||
If both webrtc party can know they are connected to each other, that could work.
Comment 17•10 years ago
|
||
(In reply to Alexis Metaireau (:alexis) from comment #15)
> So, "media-up" is sent when the two clients are connected, in case the
> client is connected and then gets disconnected from the websocket, I believe
> it shouldn't have any impact and the call should not be terminated, so maybe
> that's just for the client to handle the socket connection differently once
> the media had been setup? (then, yes, we would have an inconsistency between
> what the server knows and the actual reality, but well, that's a border
> case, I believe).
I'd like to see logging that indicates that this is the issue before we go about changing behavior.
If this is really just a race between the media coming up on both sides and the clients sending the "media-up" events, then I think increasing the timers (again) would be okay.
But I'm really rather hesitant to take either action unless we see logging that confirms that one of these two things is happening.
Given the timeframes involved here (10 seconds is a pretty long time to establish media streams), I'm actually very much inclined to agree with Mark's guess: this feels like Bug 1073410. My preferred solution here is to add a self-view prior to contacting the server, using the technique I demonstrate here: https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1039224&attachment=8497795 (which obviously needs some clean up, but it demonstrates an effective approach for self-view before activating the call).
Flags: needinfo?(adam)
Comment 18•10 years ago
|
||
That's correct I have no logs to back up what I'm thinking is the issue, and I've actually not be able to reproduce the issue.
I think this case could happen, though, and we would need to check what's the client behavior in this case, nevertheless.
I'll let you debug on Bug 1073410, feel free to fill a server bug if you think server might be the problem here.
Assignee | ||
Comment 19•10 years ago
|
||
Once we get something for bug 1073410, we'll be able to remove the case where the user takes a "long" time to accept the prompt, which typically causes this. We can then more easily watch for the cases where its just the media taking a long time to connect once the call is actually accepted.
Reporter | ||
Comment 20•10 years ago
|
||
Just to make it clear this bug to me looks/feels quite different then bug 1073410 (so they are not dupes IMHO).
Here is why:
- I follow exactly the same behavior for the first and second call. If this would a timing issue of clicking on the door hanger (and yes there is obviously delay as I drive two browsers) then the first call should fail, but it does not.
- The second call gets established perfectly fine! Both sides can see each other. Both sides can hear each other (a nice audio feedback loop). about:webrtc reports working ICE connections with packets flowing.
Not sure if you noticed, but in the log files I see that the link generator sends this pretty early on:
WS Sending" Object { messageType: "action", event: "media-up" }
Is it reporting here that the media flow is up and running?
But then it receives the report from the link clicker side:
WS Receiving" "{"messageType":"progress","state":"half-connected"}
And the link clicker side for what ever reason never reports that media is up fully.
FYI I actually had exactly this issue on a real call with Maire a couple of minutes ago. The first call attempt was not successful because of user error. Then she clicked on the link a second time. We got connected for a couple of seconds (with audio and video just fine). And then the call terminated for both of us.
Reporter | ||
Comment 21•10 years ago
|
||
So here is comparison of the WS messages from a successful call:
Link generator:
"WS Sending" Object { messageType: "action", event: "media-up" }
"WS Receiving" "{"messageType":"progress","state":"connected"}" websocket.js:252
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
Link clicker:
"WS Sending" Object { messageType: "hello", callId:
"WS Receiving" "{"messageType":"hello","state":"init"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"alerting"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"connecting"}"
"WS Sending" Object { messageType: "action", event: "media-up" }
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
"WS Receiving" "{"messageType":"progress","state":"connected"}" websocket.js:277
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
Now for the failure case of this bug the WS messages look like this:
Link generator:
"WS Sending" Object { messageType: "action", event: "media-up" }
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
"WS Receiving"
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
Link clicker:
"WS Sending" Object { messageType: "hello", callId:
"WS Receiving" "{"messageType":"hello","state":"init"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"alerting"}" websocket.js:277
"WS Receiving" "{"messageType":"progress","state":"connecting"}"
"WS Receiving" "{"messageType":"progress","state":"half-connected"}"
"WS Receiving"
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
BTW really annoying that the "WS Receiving" has an extra new line in the log message for the "{"messageType":"progress","state":"terminated","reason":"timeout"}" messages. So the "emtpy" lines above are in fact where both clients received the state terminated messages.
The noticeable difference in my opinion here is that the for the second call the link clicker never sends this:
"WS Sending" Object { messageType: "action", event: "media-up" }
My guess is that makes the server side believe the call never got fully established and therefore tells both clients to terminate the call?!
Assignee | ||
Comment 22•10 years ago
|
||
> Link generator:
> "WS Sending" Object { messageType: "action", event: "media-up" }
> "WS Receiving" "{"messageType":"progress","state":"connected"}"
> websocket.js:252
> "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
This is a bit strange, and looks like you've missed a few lines from the log copying (both sides go through the "hello" messages, and both sides should get all progress updates.
> The noticeable difference in my opinion here is that the for the second call
> the link clicker never sends this:
> "WS Sending" Object { messageType: "action", event: "media-up" }
> My guess is that makes the server side believe the call never got fully
> established and therefore tells both clients to terminate the call?!
So for "connection", both sides have to send "media-up". "media-up" is sent once the local stream is published and the remote stream has been subscribed to (not quite shown normally, but close enough.
If that doesn't happen within a 10 second period after the server changes the progress to "connecting" (e.g. the call was accepted), then it times out the connection.
I've generally seen different call setup times call to call using the same machine, so it doesn't entirely surprise me that a second call might be different.
This is generally why I'm thinking this is the same as bug 1073410. Although what might be useful, is to get timestamps on the "state":"connecting" through "state":"terminated", I believe there's some way of making the console do this, but I've never worked out the magic command.
Reporter | ||
Comment 23•10 years ago
|
||
(In reply to Mark Banner (:standard8) from comment #22)
> > Link generator:
> > "WS Sending" Object { messageType: "action", event: "media-up" }
> > "WS Receiving" "{"messageType":"progress","state":"connected"}"
> > websocket.js:252
> > "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code:
>
> This is a bit strange, and looks like you've missed a few lines from the log
> copying (both sides go through the "hello" messages, and both sides should
> get all progress updates.
One explanation here is that I might cleared the logs before starting the call. The other could be that the browser console buffer seems to be limited. Is there a way to hold it more log messages?
> > The noticeable difference in my opinion here is that the for the second call
> > the link clicker never sends this:
> > "WS Sending" Object { messageType: "action", event: "media-up" }
> > My guess is that makes the server side believe the call never got fully
> > established and therefore tells both clients to terminate the call?!
>
> So for "connection", both sides have to send "media-up". "media-up" is sent
> once the local stream is published and the remote stream has been subscribed
> to (not quite shown normally, but close enough.
>
> If that doesn't happen within a 10 second period after the server changes
> the progress to "connecting" (e.g. the call was accepted), then it times out
> the connection.
Is it possible that this 10 seconds after "connecting" gets confused, because I'm placing the first call, hang up and then start a second call briefly after clicking the green feedback smileys? Could the server side get mix up the state of these two calls for the same URL?
> I've generally seen different call setup times call to call using the same
> machine, so it doesn't entirely surprise me that a second call might be
> different.
>
> This is generally why I'm thinking this is the same as bug 1073410. Although
> what might be useful, is to get timestamps on the "state":"connecting"
> through "state":"terminated", I believe there's some way of making the
> console do this, but I've never worked out the magic command.
If you let me know how I'll be happy to give it a try.
Reporter | ||
Comment 24•10 years ago
|
||
Some more notes from trying to debug this:
- Richard and me were able to reproduce this problem 100% across two different machines and different FF versions (Aurora and Nightly)
- When I tried to today to wait 30min in between the first and the second call it did not help (I thought maybe some timeouts would got reset by that time)
- Today with Nightly after hitting this problem once I was not even able to make any call with a new URL any more fore more then 10s (In other words: even if I generated a new URL, the initial call on that new URL would get terminated before 10s!).
One thing to potentially investigate is: check if it helps to close the tab from the first call and wait some time to make sure the PeerConnections are garbage collected....
Updated•10 years ago
|
Flags: qe-verify+
Updated•10 years ago
|
backlog: --- → Fx34+
Comment 26•10 years ago
|
||
We're fixing bug 1073410 first, and then we'll see if that helps this bug.
Reporter | ||
Comment 27•10 years ago
|
||
If the tab from the first call gets closed and another gets opened where you start the second call everything works fine.
So only if you submit the feedback after the first call and then wait the 5 seconds to get bounced back to the original start call page and you start the second call from there this problem appears.
I have the feeling that now it takes longer until the second call gets dis-connected (maybe some timeout value on the server side got increased?), but it still happens.
Assignee | ||
Comment 28•10 years ago
|
||
Ok, I've hit one case where I can reproduce this issue:
1) Generate a link
2) Click the link in a browser
3) At the generation end, select Accept
4) Wait (do not accept the camera/microphone prompt)
5) Let the call timeout.
6) Submit feedback if necessary (bug 1081959 should change it to "call failed")
7) Start the call again (start or retry)
8) Accept the call
9) Accept the camera/microphone prompt
Actual Results
=> Call connects and times out a few seconds later
Expected Results
=> Call connects and continues happily
Assignee | ||
Comment 29•10 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #27)
> If the tab from the first call gets closed and another gets opened where you
> start the second call everything works fine.
> So only if you submit the feedback after the first call and then wait the 5
> seconds to get bounced back to the original start call page and you start
> the second call from there this problem appears.
Yes this sounds right as well.
> I have the feeling that now it takes longer until the second call gets
> dis-connected (maybe some timeout value on the server side got increased?),
> but it still happens.
Yeah, we increased the timeout.
This is primarily an issue on the standalone side, I have a vague handle on what's going on.
Assignee: nobody → standard8
Iteration: --- → 36.1
No longer depends on: 1073410
Priority: -- → P1
Target Milestone: --- → mozilla36
Assignee | ||
Comment 30•10 years ago
|
||
Ok, I found the problem. There's a couple of flags that are used to work out when we have a published and a subscribed stream - these aren't getting reset when a new call is started, so on a "redial" the standalone client doesn't notify the media is up, so the server times out the call.
Adding the resets to both end and start session, is really just bullet-proofing in case there's a situation I haven't thought of, as the code is quite complex.
I checked the newer code based on the flux model for the outgoing direct calling, and it doesn't suffer from this problem, as the variables are already cleaned up.
Attachment #8507239 -
Flags: review?(dmose)
Assignee | ||
Comment 31•10 years ago
|
||
Note: although some of this code is shared with desktop, it doesn't actually affect the desktop code - only standalone.
Whiteboard: [standalone only]
Comment 32•10 years ago
|
||
Comment on attachment 8507239 [details] [diff] [review]
Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset.
Review of attachment 8507239 [details] [diff] [review]:
-----------------------------------------------------------------
r=dmose, with the points here addressed as you see fit.
::: browser/components/loop/content/shared/js/models.js
@@ +189,5 @@
> this.session.disconnect();
> + this.set({
> + publishedStream: false,
> + subscribedStream: false,
> + outgoing: false
The above line looks like it wants to be "ongoing: false" instead of "outgoing: false".
::: browser/components/loop/test/shared/models_test.js
@@ +335,5 @@
> it("should set the ongoing attribute to false", function() {
> model.endSession();
>
> expect(model.get("ongoing")).eql(false);
> });
The fact that the above test didn't start failing on the original change from "outgoing" to "ongoing" kinda suggests that we should be stubbing out session.disconnect.
Attachment #8507239 -
Flags: review?(dmose) → review+
Updated•10 years ago
|
Target Milestone: mozilla36 → ---
Assignee | ||
Comment 33•10 years ago
|
||
(In reply to Dan Mosedale (:dmose) - not reading bugmail; needinfo? for response from comment #32)
> ::: browser/components/loop/test/shared/models_test.js
> @@ +335,5 @@
> > it("should set the ongoing attribute to false", function() {
> > model.endSession();
> >
> > expect(model.get("ongoing")).eql(false);
> > });
>
> The fact that the above test didn't start failing on the original change
> from "outgoing" to "ongoing" kinda suggests that we should be stubbing out
> session.disconnect.
Turns out the test hadn't set ongoing to true to begin with, and the default was false.
Now fixed.
Assignee | ||
Comment 34•10 years ago
|
||
Target Milestone: --- → mozilla36
Assignee | ||
Comment 35•10 years ago
|
||
Resetting the affected flags - although this affects shared code, it only affects the standalone client as this part of the shared code isn't run on desktop - new incoming calls have different conversation windows.
The outgoing conversation windows currently have a different code route and are not affected by this.
status-firefox34:
affected → ---
status-firefox35:
affected → ---
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 37•10 years ago
|
||
[Tracking Requested - why for this release]: nominated to track since this is a Loop 34 blocker.
Nils, can you please verify this is fixed in the latest mozilla-central build?
status-firefox34:
--- → affected
status-firefox35:
--- → affected
status-firefox36:
--- → fixed
tracking-firefox34:
--- → ?
tracking-firefox35:
--- → ?
tracking-firefox36:
--- → ?
Flags: needinfo?(drno)
QA Contact: anthony.s.hughes → drno
Reporter | ||
Comment 38•10 years ago
|
||
Verified with 36.0a1 (2014-10-24) on Mac OSX that it no longer drops the second call for me.
Flags: needinfo?(drno)
Updated•10 years ago
|
Comment 39•10 years ago
|
||
Since this is verified on Nightly can we get some Aurora/Beta uplift nomination requests please?
Flags: needinfo?(standard8)
Comment 40•10 years ago
|
||
Please see Comment 31 and Comment 35. This bug fix is for the standalone app (not the desktop code); desktop is not affected. That is why we haven't requested uplift.
This bug fix will get pushed out with the next standalone app update (which the Ops folks handle for us). We are updating the standlone app on the order of once per week currently.
status-firefox34:
affected → ---
status-firefox35:
affected → ---
tracking-firefox34:
+ → ---
tracking-firefox35:
+ → ---
tracking-firefox36:
+ → ---
Flags: needinfo?(standard8)
Assignee | ||
Comment 41•10 years ago
|
||
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #40)
> This bug fix will get pushed out with the next standalone app update (which
> the Ops folks handle for us). We are updating the standlone app on the
> order of once per week currently.
This was actually deployed last Friday.
Comment 42•10 years ago
|
||
status-firefox35:
--- → fixed
Comment 43•10 years ago
|
||
Comment on attachment 8507239 [details] [diff] [review]
Second call with a Loop url gets disconnected - ensure stream published/subscribed flags are properly reset.
Approval Request Comment
Landed on aurora per IRC with lsblakk with a=loop-only
Attachment #8507239 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
Attachment #8507239 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in
before you can comment on or make changes to this bug.
Description
•