Closed
Bug 1062266
Opened 10 years ago
Closed 10 years ago
Loop server misses timeout messages
Categories
(Hello (Loop) :: Server, defect)
Hello (Loop)
Server
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: standard8, Assigned: rhubscher)
References
Details
(Whiteboard: [qa+])
Attachments
(1 file)
STR (I've been using the standalone client whilst developing bug 1000237):
- Start a loop server
- Get call details for both ends of a call
- Connect the websockets
=> Server gives out progress with state "alerting".
- Wait for 60 seconds
=> The ringing timeout message is only occasionally received.
Restarting the server doesn't seem to reset anything, it just works rarely.
Using latest master - 4f16dc65cd72d86b932493f301a2519fabe5b40d
Might be related to bug 1059734.
Reporter | ||
Comment 1•10 years ago
|
||
This makes manual checking of some parts of bug 1000237 very difficult.
Blocks: 1000237
Reporter | ||
Comment 2•10 years ago
|
||
I've also seen this a couple of times:
- Get call details for both ends of a call
- Connect the websockets
=> Server responds to "hello" at both ends
=> However, no progress message with "alerting" is received
I've seen it once straight after a server restart, and once when restarting a dialing attempt on a previous call.
Updated•10 years ago
|
Whiteboard: [qa+]
Comment 3•10 years ago
|
||
I don't have any idea why you don't receive the messages all the time.
I tested manually and wasn't able to reproduce, do you have any idea how I can reproduce this locally?
Reporter | ||
Comment 4•10 years ago
|
||
This appears to be intermittent, possibly varying on setup. I've just been trying with the dev server, and got maybe 3 out of 30 runs fail.
Here's how I was testing:
- Get latest nightly build
- Set the following prefs:
loop.server;https://loop-dev.stage.mozaws.net
loop.debug.websocket;true
- Restart Firefox, open the JS console
- Get a panel url, put that in another instance of FF on the same machine (that's what I've been doing).
- Click start in the other FF whilst watching the JS console of the main one.
- Wait > 30 seconds
=> On the console, this is the "success case":
"WS Sending" Object { messageType: "hello", callId: "...", auth: "..." }
"WS Receiving" "{"messageType":"hello","state":"init"}"
"WS Receiving" "{"messageType":"progress","state":"alerting"}"
"WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: 1006, reason: "", eventPhase: 0, bubbles: false, cancelable: false, defaultPrevented: false, timeStamp: 1409769943987958, originalTarget: WebSocket }
=> Occasionally, it fails, and you simply get:
"WS Sending" Object { messageType: "hello", callId: "...", auth: "..." }
"WS Receiving" "{"messageType":"hello","state":"init"}"
"WS Receiving" "{"messageType":"progress","state":"alerting"}"
Reporter | ||
Comment 5•10 years ago
|
||
This seems very similar to bug 1062487, log below.
- "Timeout1" is the initial connection timeout where both clients should be connected within 10 seconds.
- "Timeout2" is the ringing timeout of 30 seconds.
Although the log has only bits, the ttl is first 30 seconds (ringing timeout?), then 10 seconds (connection timeout?).
In this case, it looks like Timeout1 fires just before the redis ttl expires. Hence, Timeout1 correctly doesn't do anything.
Timeout2 then fires after 30 seconds, finds the call state missing in the database (as its ttl has expired it), it appears the call data itself may also have expired (?), so it gets back a state of null, and hence doesn't send out the terminate message.
It seems concerning that there's two sets of timeouts here - the ones in the database, and the manual timeouts. It looks like they aren't working well together.
info: op=request.summary, code=200, path=/calls/sf4oN4Z3KLc, method=get, , agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:32Z, uid=undefined, callId=undefined, token=sf4oN4Z3KLc, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US,en;q=0.5, ip=127.0.0.1, errno=0
info: op=request.summary, code=200, path=/calls/sf4oN4Z3KLc, method=get, , agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:33Z, uid=undefined, callId=undefined, token=sf4oN4Z3KLc, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US,en;q=0.5, ip=127.0.0.1, errno=0
setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init
setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init
ttl 30
setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init
ttl 10
info: op=request.summary, code=200, path=/calls/sf4oN4Z3KLc, method=post, , agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:36Z, uid=undefined, callId=bccaa626b6b2bfcc3e11ce0858c82e5f, token=sf4oN4Z3KLc, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US,en;q=0.5, ip=127.0.0.1, errno=0
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
helloState init
setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init.caller
ttl 9.904
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
info: op=request.summary, code=200, path=/calls, method=get, version=1409824176321, agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:38Z, uid=952ccd77c40097bb6110722e6d709b5f12eb86544276d24ed7b29750e9caccd9, callId=undefined, token=undefined, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US, en, ip=127.0.0.1, errno=0
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
helloState init
setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init.callee
ttl 7.496
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
Timeout1
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
Timeout1
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
Timeout1 state alerting
Timeout1 state alerting
Timeout2
getCallState bccaa626b6b2bfcc3e11ce0858c82e5f
Falling back
get call callback null!
Timeout2 state null
Depends on: 1062487
Assignee | ||
Comment 6•10 years ago
|
||
Attachment #8487141 -
Flags: review?(tarek)
Assignee | ||
Updated•10 years ago
|
Attachment #8487141 -
Flags: feedback?(standard8)
Assignee | ||
Comment 8•10 years ago
|
||
Standard8 — I am not sure I've got everything you explained before.
What should happen is:
- Supervisory timer until the callee answer (10s)
- Ringing timeout until the callee picks up (30s)
- Connection timer until both gets media-up (10s)
When a timeout fires, it looks at the current state that should be:
- half-initiated for the supervisory timer
- alerting for the ringing timer
- !connected for the connection timer
If the state didn't change it raise a terminated:timeout.
My patch change the state ttl behavior to match the call ttl.
Updated•10 years ago
|
Attachment #8487141 -
Flags: review?(tarek) → review+
Assignee | ||
Comment 9•10 years ago
|
||
https://github.com/mozilla-services/loop-server/commit/6fca550fb6dbed8881c9989280be235bc7a674a9
I have merged it. Also will make a smoketest using loads to make sure we have the correct behavior.
Assignee: nobody → rhubscher
Status: NEW → ASSIGNED
Reporter | ||
Comment 10•10 years ago
|
||
Comment on attachment 8487141 [details] [review]
Link to Github PR — #201
As I mentioned on irc yesterday, I'm now not getting any timeouts for the alerting state - the server doesn't send anything.
The issue seemed to be a result of the "state" being null in the timeout that handled the alerting state.
I've re-checked today with the latest master.
Attachment #8487141 -
Flags: feedback?(standard8) → feedback-
Reporter | ||
Comment 11•10 years ago
|
||
Having remembered about there being multiple timeouts - several for the messages themselves, plus one for the call in the redis database, I took another look. Here's the workaround:
diff --git a/loop/config.js b/loop/config.js
index ec3b634..5d88b5d 100644
--- a/loop/config.js
+++ b/loop/config.js
@@ -280,7 +280,7 @@ var conf = convict({
callDuration: {
doc: "The duration we want to store the call info (in seconds)",
format: Number,
- default: 30
+ default: 60
},
maxHTTPSockets: {
doc: "The maximum of HTTP sockets to use when doing requests",
I think callDuration should be the sum of the timeouts supervisoryDuration + ringingDuration + connectionDuration + safety margin. I'll let the server team decide whether that is done via documentation or change of code.
The reason that unit tests don't see this is callDuration isn't set for them in config.json. However config/test.json overrides the timeouts mentioned above to 0.05s. Hence, there's no chance of this happening there.
The unit tests could probably benefit from sinon's fake timers - that way you could have the default values and still run the tests quickly (or probably even quicker than they are currently).
Remy: needinfo you to make sure you see this, I'm not planning on doing the fix for it.
Flags: needinfo?(rhubscher)
Assignee | ||
Comment 12•10 years ago
|
||
Yes I am working on it, I will hopefully have a patch ready in an hour.
Flags: needinfo?(rhubscher)
Assignee | ||
Comment 13•10 years ago
|
||
https://github.com/mozilla-services/loop-server/commit/f074d57735b415b7642d702364fa7a46feae9089
Standard8, can you validate that everything works for you know?
Thank you,
Rémy
Flags: needinfo?(standard8)
Assignee | ||
Updated•10 years ago
|
Attachment #8487141 -
Flags: feedback- → feedback?(standard8)
Reporter | ||
Comment 14•10 years ago
|
||
Comment on attachment 8487141 [details] [review]
Link to Github PR — #201
I've given it a short run and it seems to be working fine now. I'll keep an eye on it.
Attachment #8487141 -
Flags: feedback?(standard8) → feedback+
Flags: needinfo?(standard8)
Assignee | ||
Comment 15•10 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•