1062266 - Loop server misses timeout messages

Reporter

Description

•

10 years ago

STR (I've been using the standalone client whilst developing bug 1000237): - Start a loop server - Get call details for both ends of a call - Connect the websockets => Server gives out progress with state "alerting". - Wait for 60 seconds => The ringing timeout message is only occasionally received. Restarting the server doesn't seem to reset anything, it just works rarely. Using latest master - 4f16dc65cd72d86b932493f301a2519fabe5b40d Might be related to bug 1059734.

Mark Banner (:standard8)

Reporter

Comment 1

•

10 years ago

This makes manual checking of some parts of bug 1000237 very difficult.

Blocks: 1000237

Mark Banner (:standard8)

Reporter

Comment 2

•

10 years ago

I've also seen this a couple of times: - Get call details for both ends of a call - Connect the websockets => Server responds to "hello" at both ends => However, no progress message with "alerting" is received I've seen it once straight after a server restart, and once when restarting a dialing attempt on a previous call.

James Bonacci [:jbonacci]

Updated

•

10 years ago

Whiteboard: [qa+]

Alexis Metaireau (:alexis)

Comment 3

•

10 years ago

I don't have any idea why you don't receive the messages all the time. I tested manually and wasn't able to reproduce, do you have any idea how I can reproduce this locally?

Mark Banner (:standard8)

Reporter

Comment 4

•

10 years ago

This appears to be intermittent, possibly varying on setup. I've just been trying with the dev server, and got maybe 3 out of 30 runs fail. Here's how I was testing: - Get latest nightly build - Set the following prefs: loop.server;https://loop-dev.stage.mozaws.net loop.debug.websocket;true - Restart Firefox, open the JS console - Get a panel url, put that in another instance of FF on the same machine (that's what I've been doing). - Click start in the other FF whilst watching the JS console of the main one. - Wait > 30 seconds => On the console, this is the "success case": "WS Sending" Object { messageType: "hello", callId: "...", auth: "..." } "WS Receiving" "{"messageType":"hello","state":"init"}" "WS Receiving" "{"messageType":"progress","state":"alerting"}" "WS Close" close { target: WebSocket, isTrusted: true, wasClean: false, code: 1006, reason: "", eventPhase: 0, bubbles: false, cancelable: false, defaultPrevented: false, timeStamp: 1409769943987958, originalTarget: WebSocket } => Occasionally, it fails, and you simply get: "WS Sending" Object { messageType: "hello", callId: "...", auth: "..." } "WS Receiving" "{"messageType":"hello","state":"init"}" "WS Receiving" "{"messageType":"progress","state":"alerting"}"

Mark Banner (:standard8)

Reporter

Comment 5

•

10 years ago

This seems very similar to bug 1062487, log below. - "Timeout1" is the initial connection timeout where both clients should be connected within 10 seconds. - "Timeout2" is the ringing timeout of 30 seconds. Although the log has only bits, the ttl is first 30 seconds (ringing timeout?), then 10 seconds (connection timeout?). In this case, it looks like Timeout1 fires just before the redis ttl expires. Hence, Timeout1 correctly doesn't do anything. Timeout2 then fires after 30 seconds, finds the call state missing in the database (as its ttl has expired it), it appears the call data itself may also have expired (?), so it gets back a state of null, and hence doesn't send out the terminate message. It seems concerning that there's two sets of timeouts here - the ones in the database, and the manual timeouts. It looks like they aren't working well together. info: op=request.summary, code=200, path=/calls/sf4oN4Z3KLc, method=get, , agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:32Z, uid=undefined, callId=undefined, token=sf4oN4Z3KLc, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US,en;q=0.5, ip=127.0.0.1, errno=0 info: op=request.summary, code=200, path=/calls/sf4oN4Z3KLc, method=get, , agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:33Z, uid=undefined, callId=undefined, token=sf4oN4Z3KLc, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US,en;q=0.5, ip=127.0.0.1, errno=0 setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init ttl 30 setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init ttl 10 info: op=request.summary, code=200, path=/calls/sf4oN4Z3KLc, method=post, , agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:36Z, uid=undefined, callId=bccaa626b6b2bfcc3e11ce0858c82e5f, token=sf4oN4Z3KLc, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US,en;q=0.5, ip=127.0.0.1, errno=0 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f getCallState bccaa626b6b2bfcc3e11ce0858c82e5f helloState init setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init.caller ttl 9.904 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f getCallState bccaa626b6b2bfcc3e11ce0858c82e5f info: op=request.summary, code=200, path=/calls, method=get, version=1409824176321, agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:35.0) Gecko/20100101 Firefox/35.0, time=2014-09-04T09:49:38Z, uid=952ccd77c40097bb6110722e6d709b5f12eb86544276d24ed7b29750e9caccd9, callId=undefined, token=undefined, v=0.11.0-DEV, name=mozilla-loop-server, hostname=Mark-Banners-MacBook-Pro.local, lang=en-US, en, ip=127.0.0.1, errno=0 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f getCallState bccaa626b6b2bfcc3e11ce0858c82e5f helloState init setCallState bccaa626b6b2bfcc3e11ce0858c82e5f init.callee ttl 7.496 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f Timeout1 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f Timeout1 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f Timeout1 state alerting Timeout1 state alerting Timeout2 getCallState bccaa626b6b2bfcc3e11ce0858c82e5f Falling back get call callback null! Timeout2 state null

Depends on: 1062487

Rémy Hubscher (:natim)

Assignee

Comment 6

•

10 years ago

Attached file Link to Github PR — #201 — Details

Attachment #8487141 - Flags: review?(tarek)

Rémy Hubscher (:natim)

Assignee

Updated

•

10 years ago

Attachment #8487141 - Flags: feedback?(standard8)

Rémy Hubscher (:natim)

Assignee

Comment 8

•

10 years ago

Standard8 — I am not sure I've got everything you explained before. What should happen is: - Supervisory timer until the callee answer (10s) - Ringing timeout until the callee picks up (30s) - Connection timer until both gets media-up (10s) When a timeout fires, it looks at the current state that should be: - half-initiated for the supervisory timer - alerting for the ringing timer - !connected for the connection timer If the state didn't change it raise a terminated:timeout. My patch change the state ttl behavior to match the call ttl.

Tarek Ziadé (:tarek)

Updated

•

10 years ago

Attachment #8487141 - Flags: review?(tarek) → review+

Rémy Hubscher (:natim)

Assignee

Comment 9

•

10 years ago

https://github.com/mozilla-services/loop-server/commit/6fca550fb6dbed8881c9989280be235bc7a674a9 I have merged it. Also will make a smoketest using loads to make sure we have the correct behavior.

Assignee: nobody → rhubscher

Status: NEW → ASSIGNED

Mark Banner (:standard8)

Reporter

Comment 10

•

10 years ago

Comment on attachment 8487141 [details] [review] Link to Github PR — #201 As I mentioned on irc yesterday, I'm now not getting any timeouts for the alerting state - the server doesn't send anything. The issue seemed to be a result of the "state" being null in the timeout that handled the alerting state. I've re-checked today with the latest master.

Attachment #8487141 - Flags: feedback?(standard8) → feedback-

Mark Banner (:standard8)

Reporter

Comment 11

•

10 years ago

Having remembered about there being multiple timeouts - several for the messages themselves, plus one for the call in the redis database, I took another look. Here's the workaround: diff --git a/loop/config.js b/loop/config.js index ec3b634..5d88b5d 100644 --- a/loop/config.js +++ b/loop/config.js @@ -280,7 +280,7 @@ var conf = convict({ callDuration: { doc: "The duration we want to store the call info (in seconds)", format: Number, - default: 30 + default: 60 }, maxHTTPSockets: { doc: "The maximum of HTTP sockets to use when doing requests", I think callDuration should be the sum of the timeouts supervisoryDuration + ringingDuration + connectionDuration + safety margin. I'll let the server team decide whether that is done via documentation or change of code. The reason that unit tests don't see this is callDuration isn't set for them in config.json. However config/test.json overrides the timeouts mentioned above to 0.05s. Hence, there's no chance of this happening there. The unit tests could probably benefit from sinon's fake timers - that way you could have the default values and still run the tests quickly (or probably even quicker than they are currently). Remy: needinfo you to make sure you see this, I'm not planning on doing the fix for it.

Flags: needinfo?(rhubscher)

Rémy Hubscher (:natim)

Assignee

Comment 12

•

10 years ago

Yes I am working on it, I will hopefully have a patch ready in an hour.

Flags: needinfo?(rhubscher)

Rémy Hubscher (:natim)

Assignee

Comment 13

•

10 years ago

https://github.com/mozilla-services/loop-server/commit/f074d57735b415b7642d702364fa7a46feae9089 Standard8, can you validate that everything works for you know? Thank you, Rémy

Flags: needinfo?(standard8)

Rémy Hubscher (:natim)

Assignee

Updated

•

10 years ago

Attachment #8487141 - Flags: feedback- → feedback?(standard8)

Mark Banner (:standard8)

Reporter

Comment 14

•

10 years ago

Comment on attachment 8487141 [details] [review] Link to Github PR — #201 I've given it a short run and it seems to be working fine now. I'll keep an eye on it.

Attachment #8487141 - Flags: feedback?(standard8) → feedback+

Flags: needinfo?(standard8)

Rémy Hubscher (:natim)

Assignee

Comment 15

•

10 years ago

https://github.com/mozilla-services/loop-server/commit/6fca550fb6dbed8881c9989280be235bc7a674a9

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

James Bonacci [:jbonacci]

Comment 16

•

10 years ago

Release 0.12.2 is now out in Production.

Status: RESOLVED → VERIFIED