Please deploy release 0.13.0 to loop-server Stage



Cloud Services
Operations: Deployment Requests
3 years ago


(Reporter: alexis, Assigned: mostlygeek)


Firefox Tracking Flags

(Not tracked)




3 years ago
I just tagged 0.13, which brings rooms support.

This release ships with a number of configuration changes.

* the Changelog is at
* the (quite big) diff between 0.12.7 and 0.13 is at

Note this depends on 0.12.7 going to production (Bug 1091647 should be marked as VERIFIED) because we cannot have both 0.12.7 and 0.13.0 in staging at the same time.

Comment 1

3 years ago
I assigned this to Benson, as my understanding was that he might be the only ops person available when Bug 1091647 is VERIFIED, but feel free to reassign to whomever makes sense.


3 years ago
Depends on: 1094754
0.12.7 has been verified so we can move on and deploy 0.13 to stage.
0.13 built.  Next steps: configuration changes.
This has been deployed.
Last Resolved: 3 years ago
Resolution: --- → FIXED
Looks like the deployment has some problems - the loop/node process isn't running, but I'm not sure whether this lies with the deployment itself or is a larger issue.

[rpappalardo@ip-172-31-6-103 loop-server]$ cat /data/loop-server/VERSION.txt
aaa7ec2 Release 0.13.0
[rpappalardo@ip-172-31-6-103 loop-server]$ ps aux | grep node
loop      1165  7.0  0.0      0     0 ?        Zs   02:44   0:00 [node] <defunct>
513       1168  0.0  0.0 103248   832 pts/0    S+   02:44   0:00 grep node
[rpappalardo@ip-172-31-6-103 loop-server]$ ps aux | grep loop
loop      1169  7.0  0.0      0     0 ?        Zs   02:44   0:00 [node] <defunct>
513       1172  0.0  0.0 103248   832 pts/0    S+   02:44   0:00 grep loop
[rpappalardo@ip-172-31-6-103 loop-server]$

It should look like this (on i-9385ab76  - 0.12.7)
[rpappalardo@ip-172-31-29-95 ~]$ ps aux | grep loop
loop      3590  0.7  4.2 1084196 165816 ?      Ssl  Oct31  76:28 node loop/index.js
514      11492  0.0  0.0 103252   836 pts/0    S+   03:06   0:00 grep loop

2014-10-24 15:36:21 circus[1799] [INFO] Starting master on pid 1799
2014-10-24 15:36:21 circus[1799] [INFO] Arbiter now waiting for commands
2014-10-24 15:36:21 circus[1799] [INFO] loopserver started
2014-10-24 15:36:22 circus[1799] [INFO] hekad-loop started
2014-10-24 15:36:23 circus[1799] [INFO] hekad-loop stopped
2014-10-24 15:36:23 circus[1799] [INFO] hekad-loop started
2014-11-08 01:15:54 circus[1799] [INFO] loopserver stopped
2014-11-08 01:15:54 circus[1799] [INFO] loopserver started

last entries:
2014/10/24 15:36:23 Input started: LoopServer
2014/10/24 15:36:23 Input started: LoopWebserver
2014/10/24 15:36:23 Input started: LoopWebserverError
Flags: needinfo?(bobm)
There are quite a few errors getting logged to 

Redis connection unhandled error:

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: Redis connection to        379 failed - getaddrinfo ENOTFOUND
    at RedisClient.on_error (/data/loop-server/node_modules/redis/index.js:196:24)
    at Socket.<anonymous> (/data/loop-server/node_modules/redis/index.js:106:14)
    at Socket.emit (events.js:95:17)
    at net.js:833:16
    at process._tickCallback (node.js:419:13)

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: Redis connection to        379 failed - getaddrinfo ENOTFOUND
    at RedisClient.on_error (/data/loop-server/node_modules/redis/index.js:196:24)
    at Socket.<anonymous> (/data/loop-server/node_modules/redis/index.js:106:14)
    at Socket.emit (events.js:95:17)
    at net.js:833:16
    at process._tickCallback (node.js:419:13)
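For what it's worth, getaddrinfo ENOTFOUND means the redis hostname in the config didn't resolve at all (as opposed to a refused or timed-out connection). A quick resolution check from the app host might look like this - the hostname below is a placeholder, not the real stage redis endpoint:

```shell
# Placeholder host; substitute the redis endpoint from the loop-server config.
REDIS_HOST="loop-redis.invalid"

# getent consults the same resolver that getaddrinfo uses
if getent hosts "$REDIS_HOST"; then
  echo "resolves"
else
  echo "DNS lookup failed for $REDIS_HOST"
fi
```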


/data/loop-server/node_modules/convict/lib/convict.js:378
        throw new Error(errBuf);
              ^
Error: rooms.HKDFSalt: Should be an 16 bytes key encoded as hexadecimal: value was "overridden by secrets repo"
    at Object.rv.validate (/data/loop-server/node_modules/convict/lib/convict.js:378:15)
    at Object.<anonymous> (/data/loop-server/loop/config.js:476:6)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/data/loop-server/loop/index.js:7:12)
    at Module._compile (module.js:456:26)

/data/loop-server/node_modules/convict/lib/convict.js:378
        throw new Error(errBuf);
              ^
Error: rooms.HKDFSalt: Should be an 16 bytes key encoded as hexadecimal: value was "overridden by secrets repo"
    at Object.rv.validate (/data/loop-server/node_modules/convict/lib/convict.js:378:15)
    at Object.<anonymous> (/data/loop-server/loop/config.js:476:6)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/data/loop-server/loop/index.js:7:12)
    at Module._compile (module.js:456:26)
Actually, this first redis error was also getting logged in 0.12.7, though it appeared to stop just prior to load testing, so I didn't pursue it further.
NOTE: the 0.12.7 error log is only 220 lines - compared to the 0.13.0 log, which is quite verbose.

see comments from 0.12.7 stage ticket:

> events.js:72
>         throw er; // Unhandled 'error' event
>               ^
> Error: Redis connection to failed - connect ETIMEDOUT
>     at RedisClient.on_error (/data/loop-server/node_modules/redis/index.js:196:24)
>     at Socket.<anonymous> (/data/loop-server/node_modules/redis/index.js:106:14)
>     at Socket.emit (events.js:95:17)
>     at net.js:440:14
>     at process._tickCallback (node.js:419:13)
The HKDFSalt error means that we are missing a configuration variable.

bobm — to generate a key for that configuration, you can use:

    $ node
    > var crypto = require("crypto");
    > crypto.randomBytes(16).toString("hex")  // prints a 32-character hex string
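If you'd rather not open a node REPL, an equivalent value can be generated straight from the shell (a sketch; it assumes openssl is installed on the host, and the key is of course random on every run):

```shell
# 16 random bytes, hex-encoded -> the 32-character string convict expects
# for rooms.HKDFSalt
HKDF_SALT=$(openssl rand -hex 16)
echo "$HKDF_SALT"
```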
Resolution: FIXED → ---
Bob - 
I realize you guys are in a work week, but if you get a chance to re-deploy stage, please ping me. At the moment I'm blocked until I can re-verify stage. After we have a working stage, we'll still need to go through a thorough cycle of load testing on this release.



Here it comes!

+This release introduces support for rooms.
+- [security] Add rate limiting rules for calls (Bug 1070000)
+- [api-change] POST /registration accepts both simple push URLs or nothing (Bug 1072184, Bug 1094662)
+- [feature] Implement the rooms feature (Bug 1067971, Bug 1087041, Bug 1080640)
+- [feature] Handle DELETE on /accounts /session and /registration (Bug 1072184)
+- [bugfix] Websockets: add a reason when, on hello, the call state is set to terminate (Bug 1075138)
+- [bugfix] Make sure redis is writable on heartbeat (Bug 1070889)
+- [bugfix] Make sure GET /calls returns the callerId if present (Bug 1079950)
+- [config] Consolidate the call url configuration (Bug 1072293)
+- [config] Refactor logging configuration (Bug 1072271)
+- [config] Change the default webapp URLs (Bug 1088187)
+- [logs] Add calleeId to logs on POST /calls/:token (Bug 1074468)
+- [tooling] Add and callurls_per_user tools.

Bug 1094662 - A client registered with 0.12.X or earlier would not be notified via its simple push URL until it registers again

simplepush/loop: push endpoint is now injected via cf

simplepush/loop: add prod hiera

heka: bump default version to latest stable (0.8.0-1)


There are 4 loop-server app servers on STAGE
I'll be validating:  aaa7ec2 Release 0.13.0
Instance:  i-9b9b96d8 (loop-server app server (loopstage2-loop-server-app-20141107200110))

$ rpm -qa | grep loop
puppet-config-loop 20141108005904-1 x86_64 16442
loop-server-svcops 0.13.0-1 x86_64 27108381

aaa7ec2 Release 0.13.0

Verified processes: loop (node), stackdriver, circus, heka, nginx

/opt/rh - EMPTY
/media/ephemeral0/nginx/logs/default.access.log (not in use)
/media/ephemeral0/nginx/logs/default.error.log (not in use)


{"name":"mozilla-loop-server","description":"The Mozilla Loop (WebRTC App) server","version":"0.13.0","homepage":"","endpoint":"","fakeTokBox":false,"fxaOAuth":true,"i18n":{"defaultLang":"en-US"}}


curl -I
HTTP/1.1 200 OK
Date: Tue, 11 Nov 2014 00:48:06 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 273
Connection: keep-alive
ETag: W/"111-88258607"
Timestamp: 1415666886
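The version in that JSON body can be asserted mechanically. A sketch of the check - the stage endpoint URL is elided in this ticket, so an abbreviated copy of the response body captured above stands in for a live `curl`:

```shell
# Normally: curl -s https://<stage-endpoint>/ > /tmp/root.json
# The endpoint is elided here, so reuse (an abbreviated copy of) the
# response body pasted above.
cat > /tmp/root.json <<'EOF'
{"name":"mozilla-loop-server","version":"0.13.0","fakeTokBox":false,"fxaOAuth":true}
EOF

version=$(grep -o '"version":"[^"]*"' /tmp/root.json | cut -d'"' -f4)
echo "$version"   # 0.13.0
```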

'make test' on localhost with 0.13.0 tag shows all unit tests PASS
quick make test (loadtest):

Duration: 7.63 seconds
Hits: 17
Started: 2014-11-11 00:52:24.397957
Approximate Average RPS: 2
Average request time: 0.29s
Opened web sockets: 2
Bytes received via web sockets : 454

Success: 1
Errors: 0
Failures: 0

:mostlygeek -

The deploy looks good, with the following exception:
I thought bobm fixed this, but we're still seeing the same error that :natim clarified in Comment 8 (above).
If you can fix this and re-deploy, I'll quickly turn around and verify.

After that we should be able to start loadtests.


        throw new Error(errBuf);
Error: rooms.HKDFSalt: Should be an 16 bytes key encoded as hexadecimal: value was "overridden by secrets repo"
    at Object.rv.validate (/data/loop-server/node_modules/convict/lib/convict.js:378:15)
    at Object.<anonymous> (/data/loop-server/loop/config.js:476:6)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/data/loop-server/loop/index.js:7:12)
Flags: needinfo?(bobm) → needinfo?(bwong)

Comment 11

3 years ago
The secrets issue was fixed a couple of hours ago. It was us putting the HKDF secret at the wrong level of JSON.
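To make the "wrong level of JSON" failure mode concrete: convict validates the salt as rooms.HKDFSalt, so it has to be nested under the "rooms" key rather than sitting at the top level of settings.json. A hypothetical sketch (the path and salt value are illustrative, not the real stage secrets):

```shell
# WRONG (hypothetical):  { "HKDFSalt": "...", "rooms": { ... } }
# RIGHT (hypothetical):  the salt nested under "rooms"
cat > /tmp/settings-sketch.json <<'EOF'
{
  "rooms": {
    "HKDFSalt": "000102030405060708090a0b0c0d0e0f"
  }
}
EOF

# convict wants a 16-byte key encoded as hexadecimal, i.e. 32 hex chars
grep -o '[0-9a-f]\{32\}' /tmp/settings-sketch.json
```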
Flags: needinfo?(bwong)
Thanks, Benson! Because there's no timestamp, it slipped my mind to check whether the errors were new.

Remy -

Tried running load tests. bench works fine, but I'm seeing errors when trying to run megabench.
Would you be willing to take a look?

Users [20]
Hits None
Agents 5
Duration 600
Server URL

Tests over 0 
Successes 0
Failures 0
Errors 3
TCP Hits 0
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 0

3 occurrences:

  File "/home/ubuntu/loads/loads/runners/", line 167, in execute
  File "/home/ubuntu/loads/loads/runners/", line 247, in _execute
  File "/home/ubuntu/loads/loads/runners/", line 252, in _run_python_tests
  File "/home/ubuntu/loads/loads/runners/", line 89, in _resolve_name
    self.test = resolve_name(self.fqn)
  File "/home/ubuntu/loads/loads/", line 165, in resolve_name
    raise ImportError(parts[0])
Flags: needinfo?(rhubscher)
Yes we are on it with Alexis.
Flags: needinfo?(rhubscher)

Comment 14

3 years ago
So, I spent most of this morning finding out why this wasn't working. It turns out the configuration change for the load tests wasn't good enough and was causing imports to fail on the loads agents.

The fix is at

I also had to do these steps to get the loadtest working:

1. Cherry-pick the loads config change of Rémy from the 0.12.X branch into master:;
2. Remove the old fingerprint of the ssh connection using the command "ssh-keygen -f "/home/alexis/.ssh/known_hosts" -R" (adapt to your user);
3. connect on ssh and accept the new fingerprint;

Make sure you update the latest *master* in order to run the loadtests properly.
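Steps 2 and 3 above can be sketched as shell commands. The broker hostname below is a placeholder (the real address is elided in the comment), and the demo works on a scratch known_hosts file so nothing real gets modified:

```shell
# Placeholder for the loads broker address, which is elided above.
LOADS_HOST="loads-broker.example.com"

# Build a scratch known_hosts containing a stale entry for the host.
tmpdir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$tmpdir/hostkey"
KNOWN_HOSTS="$tmpdir/known_hosts"
printf '%s %s\n' "$LOADS_HOST" "$(cut -d' ' -f1,2 "$tmpdir/hostkey.pub")" > "$KNOWN_HOSTS"

# Step 2: remove the stale fingerprint (on a real box, use ~/.ssh/known_hosts).
ssh-keygen -f "$KNOWN_HOSTS" -R "$LOADS_HOST"

# Step 3 would then be an interactive `ssh "$LOADS_HOST"` to accept the
# new fingerprint (not run here).
```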
I've started a loadtest on staging, and it should be ready to be reviewed once you get up, rpapa!
Last Resolved: 3 years ago
Resolution: --- → FIXED
Hey, thanks for the hard work guys! I really appreciate it.  I'll try and complete loadtests today.
loadtests - 1st pass - 30 min tests

resource usage fairly consistent 
CUP Usage ~4%
Memory Usage ~22%

looks good - no Errors

run #1 - Users [40]/ Agents 5 / 30 min

Tests over 2551 
Successes 2551
Failures 0
Errors 0
TCP Hits 41674
Opened web sockets 5238
Total web sockets 5238
Bytes/websockets 1184310
Requests / second (RPS) 23

Custom metrics
delete-room 2014
register 12988
generate-call-url 2626
refresh-room-presence 3841
initiate-call 2621
num-participants-4 506
num-participants-5 547
num-participants-2 495
leave-room 4552
num-participants-1 533
join-room 7736
num-participants-3 515
list-pending-calls 2620
create-room 2596 

run #2 - Users [80]/ Agents 5 / 30 min

Tests over 2078 
Successes 2078
Failures 0
Errors 0
TCP Hits 35124
Opened web sockets 4565
Total web sockets 4565
Bytes/websockets 1005402
Requests / second (RPS) 19

Custom metrics
delete-room 1652
register 10909
generate-call-url 2312
refresh-room-presence 3146
initiate-call 2289
num-participants-4 436
num-participants-5 439
num-participants-2 431
leave-room 3757
num-participants-1 424
join-room 6372
num-participants-3 432
list-pending-calls 2285
create-room 2162 

run #3 - Users [120] / Agents 5/ 30 min 

Tests over 1796
Successes 1796
Failures 0
Errors 0
TCP Hits 30258
Opened web sockets 4000
Total web sockets 3332
Bytes/websockets 875432
Requests / second (RPS) 16

Custom metrics
delete-room 1404
register 9345
generate-call-url 2028
refresh-room-presence 2713
initiate-call 2007
num-participants-4 370
num-participants-5 365
num-participants-2 382
leave-room 3256
num-participants-1 390
join-room 5402
num-participants-3 356
list-pending-calls 2000
create-room 1863
I've just given the latest clients a run against the staging server - with loop.server set to

I've found the following issues:

- The call urls generated are the correct new ones, but the loop-client server isn't configured correctly, i.e. gives 404 not found.

The additional nginx configuration for this was meant to be from bug 1094754, however that doesn't seem to have worked/stuck.

- The room urls are not configured and are giving:


rather than
Mark -
mostlygeek fixed the config issues you found and redeployed loop-client:
This seems to have resolved the issues, but since you logged the issue here, I wasn't sure if there was also a loop-server config change needed.
Flags: needinfo?(standard8)
loadtests - 2nd pass - duration tests

Ran several more load tests today with inconsistent results. Since we're using the new loads tool, I tried pushing the number of users a bit higher (all tests using 5 agents). The initial 3 runs with [40,60,120] users were without error (though with a lot of 499s).

CPU and memory consumption remained largely unchanged throughout; however, even when scaling back, some runs with 60 users were error free and others were not.

Some examples of 500s and 503s from the logs:

Alexis - any idea why we're seeing such inconsistent results?


Here's a summary of the remaining runs from today:

run #4: 4 hour test w/ 120 users - lots of 500 errors at 39 minutes into the test.
        cancelled run to try scaling back

run #5: 1 hour test w/ 80 users - 4 x 503 errors

run #6: 1 hour test w/ 60 users - NO errors

run #7: 4 hour test w/ 60 users - NO errors 

Attempted another 4 hour test w/ 60, 80 users, but seeing 500 errors again a few mins into test:
Flags: needinfo?(alexis+bugs)
About the 503 errors: they don't look to be Redis errors, so they are likely TokBox errors.

About the HandshakeError, the problem comes from the SSL layer which is handled by the ELB.

At some loads it is actually supposed to fail.

The good thing is that we are not supposed to have this load on loop.

> - The room urls are not configured and are giving:

This is something we need to configure on the server config side.

Comment 21

3 years ago
All the 503s are due to tokbox failing to generate the tokens apparently.

The 500s are less obvious to me, especially because I don't see *anything* about them in sentry. They seem to occur only on GET /calls?version=200 and GET /websocket.

I'm not sure if sentry is configured properly here, or if we're missing a way to get logs out of the machine, or what, but without debugging information I can only make suppositions.
Flags: needinfo?(alexis+bugs)

Comment 22

3 years ago
Also, ops, please note that the participantTTL should be increased to 60 min for 0.13.0.

This is to avoid a code redeploy, and will be fixed and back to normal with 0.13.1 (it's on its way)


Comment 23

3 years ago
Yay, we finally managed to get inside the bastion host and then jump on the staging hosts to debug!

Trying to find out if Sentry was really plugged in, I wrote the following:

> CONFIG_FILES=/data/loop-server/config/settings.json NODE_ENV=stage node -p "var conf=require('./loop/config').conf; var raven=require('raven'); var ravenClient = new raven.Client(conf.get('sentryDSN')); ravenClient.captureError(new Error('server is able to communicate with sentry'));"

As a result, an error showed up in the sentry stage console, so the problem is not the connection between the stage host and sentry. I believe we should automate this and have a message sent to sentry each time we redeploy, to avoid confusion like this. I created bug 1098241 for that.

Also, it turns out that some errors go to the standard error output and are successfully caught by circus.

> Error: Redis connection to failed - connect ETIMEDOUT
>    at RedisClient.on_error (/data/loop-server/node_modules/redis/index.js:196:24)
>    at Socket.<anonymous> (/data/loop-server/node_modules/redis/index.js:106:14)
>    at Socket.emit (events.js:95:17)
>    at net.js:440:14
>    at process._tickCallback (node.js:419:13)

Ideally these errors should be caught by the app itself and should show up in sentry instead. Because the circus configuration doesn't include any timestamp info in its output, we cannot know what caused these, but I'm going to assume they are a result of the loadtest Richard ran yesterday.

We should configure circus so that it includes timestamp information. I created bug 1098243 to track that.
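For reference, circus's FileStream output class accepts a time_format option, so the fix tracked in bug 1098243 is probably a watcher-level config change along these lines - a sketch based on circus's documented options, not the actual stage puppet config (paths and watcher name are assumptions):

```ini
[watcher:loopserver]
cmd = node loop/index.js
stdout_stream.class = FileStream
stdout_stream.filename = /media/ephemeral0/circus/loop_server/loop_server.out.log
stdout_stream.time_format = %Y-%m-%d %H:%M:%S
stderr_stream.class = FileStream
stderr_stream.filename = /media/ephemeral0/circus/loop_server/loop_server.err.log
stderr_stream.time_format = %Y-%m-%d %H:%M:%S
```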

Comment 24

3 years ago
Also, in /media/ephemeral0/circus/loop_server/loop_server.out.log I see a lot of tokbox errors (387 of them)

> [Error: Failed to createSession. Error: The request failed: Error: ETIMEDOUT]

This is *not* related to our deploy anyway, so I'm going to ignore them for now.

Nevertheless, I believe they told us we should use their production environment for our staging. The correct URL to point at is the default one (*not* ""). Ideally their mantis server should be able to handle this kind of load, but apparently it can't, since we're getting timeouts on their side, so I guess it's time to change the staging configuration to use their prod settings. I don't think this should block 0.13.0, so I created bug 1098248 for that.

Comment 25

3 years ago
There is yet another configuration change to do here. Once loop-client has been deployed and is working with the new NGINX configuration, we should update staging to use the new rooms webapp url.

It means that the rooms.webAppUrl setting [0] should be set to "".

It seems that the puppet-config repo already contains this change [1]. I haven't been able to find out what's doing the transformation of this yaml file into the actual configuration file, so the problem may lie there.

(In reply to Richard Pappalardo [:rpapa][:rpappalardo] from comment #18)
> mostlygeek fixed the config issues you found and redeployed loop-client:
> This seems to have resolved the issues, but since you logged the issue here,
> I wasn't sure if there was also a loop-server config change needed.

Alexis has commented on the required config changes in comment 22 and comment 25.

Once bug 1094754 is fully fixed, and the config changes are done, we should be good to go from that perspective.
Flags: needinfo?(standard8)

Comment 27

3 years ago
My take on this was that the pub/sub times out from time to time, and that the pub/sub isn't re-connecting, and as such a bunch of errors are triggered on the websocket endpoint.

After some more investigation from the host itself (having access there was really helpful), it turns out that we were using all the CPU of the Redis instance during our loadtests. As such, it wasn't able to answer quickly, and that's when things start to go mad.

The thing that worries me the most is that we weren't notified about the CPU being at 100%, and as such we weren't able to do anything about it. We're currently using a cache.m1.small for stage and that's not good enough: we should replicate what we have in the prod settings.

Then, the next thing that needs to change is how we handle errors: we got multiple timeouts in the pub/sub that weren't caught. This should be fixed in 0.13.1, which contains a bugfix for that (see

This means we'll tag a 0.13.1 release with this deployment bugfix and the bugfixes maire and Standard8 are looking for about rooms support.

Comment 28

3 years ago
I just tagged 0.13.1 with the bugfixes we want. Please go ahead and deploy this new version.

The comparison is at

Configuration changes need to be done as specified in comment 22 and comment 25.

Comment 29

3 years ago
Actually, since 0.13.1 is tagged with the bugfix in it we don't need to change the participant TTL, so please ignore comment 22.

(In reply to Alexis Metaireau (:alexis) from comment #28)
> I just tagged 0.13.1 with the bugfixes we want. Please go ahead and deploy
> this new version.
> The comparison is at
> Configuration changes need to be done as specified in comment 22 and comment
> 25.

Tickets created:
Bug 1098462 - Please deploy release 0.13.1 to loop-server Stage
Bug 1098464 - Please deploy release 0.13.1 to loop-server Production 

Once we verify the new (larger) redis instance on 0.13.0 stage,
we're going to skip deployment of 0.13.0 to prod and move forward with 0.13.1
Resolution: FIXED → INVALID