Closed Bug 1067397 Opened 5 years ago Closed 5 years ago

Loop — Please deploy loop-server 0.12.2 to Stage

Categories

(Cloud Services :: Operations: Deployment Requests, task)

x86_64
Linux
task
Not set

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rhubscher, Assigned: dwilson)

References

Details

(Whiteboard: [qa+])

No description provided.
Blocks: 1067399
+ Added some people...

Richard P. will be my assistant for this release.
QA Contact: jbonacci
Whiteboard: [qa+]
Assignee: nobody → dwilson
This has been deployed to staging.
Let's back up and start again shall we?
Please provide full details of the release 
Tag
Build
Bugs fixed/addressed
Bugs not fixed/addressed
Over in AWS, I have verified the usual loop-server Stage stack:
CF Stack in US East: loop-server-stage
ELB: loop-server-st-ELB-1EELW8IRXEQSQ

Instance - m3.medium: i-f524b6d8
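
For reference, here is a minimal sketch of how this AWS check could be scripted with boto3 (the stack name and instance id are the ones above; the region and the script itself are illustrative, not part of our deployment tooling):

# Illustrative only: confirm the Stage CloudFormation stack and instance type.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

stack = cfn.describe_stacks(StackName="loop-server-stage")["Stacks"][0]
print(stack["StackName"], stack["StackStatus"])

inst = ec2.describe_instances(InstanceIds=["i-f524b6d8"])["Reservations"][0]["Instances"][0]
print(inst["InstanceId"], inst["InstanceType"], inst["State"]["Name"])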


Dean - what is the actual Production configuration/deployment for Loop-Server?
Is it built on 1+ m3.mediums or something else?

I would like to see Loop-Server Stage be a closer match (scaled down accordingly) to Production, if at all possible.

As we get closer to "going live", it would be best to have Stage be an accurate/scaled version of Production.
OK, for the QA watching this bug:
After checking the deployment via AWS, I usually reinstall loop-server on a localhost, load up redis, and run the unit tests. After that, I build the load tests.

Moving on to checking out the host...
Verified the host is running the following code:
puppet-config-loop 20140915171505-1 x86_64 15158
loop-server-svcops 0.12.0-1 x86_64 26825462


/data/loop-server/VERSION.txt
shows
922165c Release 0.12.0
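
A minimal sketch of that same check, scripted (package names and the VERSION.txt path are the ones above; assumes it is run on the instance itself):

# Illustrative only: confirm the installed RPMs and the deployed revision.
import subprocess

for pkg in ("puppet-config-loop", "loop-server-svcops"):
    print(subprocess.check_output(["rpm", "-q", pkg]).decode().strip())

with open("/data/loop-server/VERSION.txt") as f:
    print(f.read().strip())   # expect "922165c Release 0.12.0" for this deploy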

Verified all the processes: node, loop, stackdriver, circus, heka, nginx

Files:
/data/loop-server
especially /data/loop-server/config/settings.json
/data/hekad
/opt/aws
/opt/ec2
/opt/openresty
/opt/rh
/opt/stackdriver
/etc/circus.ini
/etc/circus.d
/etc/circus.d/loopserver.ini
/etc/circus.d/hekad-loop.ini
/var/log/*
/etc/puppet/*
especially the following:
/etc/puppet/yaml/app/loop_server.dev.yaml
/etc/puppet/yaml/app/loop_server.prod.yaml
/etc/puppet/yaml/app/loop_server.stage.yaml
/etc/puppet/yaml/app/loop_server.yaml

Logs:
/var/log/circus.log
/var/log/hekad/loop.stdout.log
/var/log/hekad/loop.stderr.log
/media/ephemeral0/circus/loop_server/loop_server.err.log
/media/ephemeral0/circus/loop_server/loop_server.out.log
/media/ephemeral0/nginx/logs/default.access.log (not in use)
/media/ephemeral0/nginx/logs/default.error.log (not in use)
/media/ephemeral0/nginx/logs/loop_server.access.log
/media/ephemeral0/nginx/logs/loop_server.error.log
Quick checks:

In the browser, and with curl, https://loop.stage.mozaws.net
returns
{"name":"mozilla-loop-server","description":"The Mozilla Loop (WebRTC App) server","version":"0.12.0","homepage":"https://github.com/mozilla-services/loop-server/","endpoint":"https://loop.stage.mozaws.net","fakeTokBox":false,"fxaOAuth":true,"i18n":{"defaultLang":"en-US"}}

Having 'fakeTokBox' = 'false' is right, since we want to talk to a live server not the mock server.

curl -I https://loop.stage.mozaws.net
returns
HTTP/1.1 200 OK

A quick 'make test' from the loadtests directory on the localhost was successful, so we are ready for load testing.
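
For anyone repeating the quick checks above, a rough Python equivalent of the curl calls (field values taken from the JSON response shown; this is a sketch, not part of the test suite):

# Illustrative only: same quick check as the curl commands above.
import requests

resp = requests.get("https://loop.stage.mozaws.net")
resp.raise_for_status()             # expect HTTP 200
info = resp.json()
assert info["name"] == "mozilla-loop-server"
assert info["fakeTokBox"] is False  # live TokBox server, not the mock
print(info["version"])              # 0.12.0 for this deploy
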
Our first load test window is today, 2pm - 5pm.

We are going to be stair-stepping the load tests for this release.
We cannot really change the number of agents, so we will be upping the number of concurrent users per test.

Two methods:
1. Run consecutive tests, increasing the Users value by 5 or 10 for each 30min load test (see the sketch right after this list)
2. Run consecutive tests automatically with the 'values array': 20:30:40 for example
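
Roughly, method 1 amounts to something like this (sketch only; run_load_test() is a hypothetical stand-in for however we actually kick off a loads run, not a real helper):

# Sketch of the stair-step plan; run_load_test() is a hypothetical placeholder
# for launching a 30-minute loads run at the given concurrency.
def run_load_test(users, duration_minutes=30):
    raise NotImplementedError("placeholder for the actual loads invocation")

for users in (20, 30, 40):   # step the concurrent users up between runs
    run_load_test(users, duration_minutes=30)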

At the same time, we will be watching (we meaning more than me) the following dashboards, if relevant:
Stackdriver
Graphite
Heka
Kibana

and, in addition, TOP from the host itself, since we want to watch CPU and memory usage on that m3.medium instance.

We will be trying to do three things with all this:
1. Determine a more accurate server-side RPS
2. Create enough traffic to actually stress the live third-party server
3. Knock over Loop-Server
First round of load testing is starting now...
Results:
https://loads.services.mozilla.com/run/867a35ac-c96f-4a50-b60a-0e3a5b4fa4e2
With this result, we saw some errors that need to be investigated:
Errors     15
 3 occurrences:

Invalid response
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 47, in test_all
    self._test_websockets(*params)
  File "loadtest.py", line 271, in _test_websockets
    caller_ws = self.create_ws(progress_url, callback=_handle_caller)
  File "loadtest.py", line 34, in create_ws
    ws = TestCase.create_ws(self, *args, **kw)
  File "/home/ubuntu/loads/loads/case.py", line 65, in create_ws
    test_case=self)
  File "/home/ubuntu/loads/loads/websockets.py", line 56, in create_ws
    extensions, klass, test_case)
  File "/home/ubuntu/loads/loads/websockets.py", line 82, in _create_ws
    socket.connect()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/ws4py/client/__init__.py", line 126, in connect
    raise HandshakeError("Invalid response")

12 occurrences:

HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url: /calls/cytk9Eu8iVo (Caused by <class 'httplib.BadStatusLine'>: '')
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 41, in setupCall
    call_data = self.initiate_call(token)
  File "loadtest.py", line 322, in initiate_call
    headers={'Content-Type': 'application/json'}
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 425, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 80, in request
    method, url, headers=headers, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 88, in send
    res = _Session.send(self, request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py", line 378, in send
    raise ConnectionError(e)



Also, TOP showed the node process consistently using 95% CPU
(roughly split evenly between %us and %st)
And, it showed the node process consistently using 44% Memory

Trolling the logs now...
OK, so this is interesting. The node process continues to strangle the CPU on the host.
So, not sure if that is true from the moment it is (re)started, or if this is a consequence of the load test.
It might be a consequence of the loadtest and the fact that we are creating a lot of setTimeout functions. We will investigate that tomorrow.
Nginx error log loop_server.error.log is empty

Nginx access log: loop_server.access.log
The usual 200s and 101s. The usual 499s at the end of the load test.
Which leaves the 404s and 307s - most/all of which appear to be bot activity.

Apps logs:
Nothing interesting in the loop_server.out.log

loop_server.err.log:
Has some traceback in it:
/data/loop-server/loop/middlewares.js:83
        line.error = res.headers["www-authenticate"];
                                ^
TypeError: Cannot read property 'www-authenticate' of undefined
    at ServerResponse.<anonymous> (/data/loop-server/loop/middlewares.js:83:33)
    at ServerResponse.emit (events.js:117:20)
    at ServerResponse.OutgoingMessage._finish (http.js:1016:8)
    at ServerResponse.OutgoingMessage.end (http.js:999:10)
    at ServerResponse.send (/data/loop-server/node_modules/express/lib/response.js:191:8)
    at ServerResponse.json (/data/loop-server/node_modules/express/lib/response.js:235:15)
    at sendError (/data/loop-server/loop/utils.js:28:20)
    at hawkSendError (/data/loop-server/loop/auth.js:77:5)
    at /data/loop-server/node_modules/express-hawkauth/lib/hawk.js:157:13
    at /data/loop-server/node_modules/express-hawkauth/node_modules/hawk/node_modules/hoek/lib/index.js:543:22
    at process._tickCallback (node.js:419:13)

The timestamp of the file lines up with the load test.

I am going to repeat this load test to see if I can duplicate the errors.
Round 2:
20 users
30 minutes

TOP shows the CPU still maxed out near 90%
Memory use is building slowly: from about 5% to 14.6% at 8min into the 30min test, and 43% at 15min in.

So, the memory use and release was consistent between the two runs, but we should still make sure this memory use is expected...

After the test ends, you can watch some node and hekad activity on TOP, then it all clears.
No node process eating memory.
The Node process running now has a different PID.
Interesting...


Results: https://loads.services.mozilla.com/run/93579de6-694a-4ff2-ba0d-d5e3bd166e5d
This time, no errors are showing in the dashboard.


Logs:
Nothing new to report
Well this is bad:
Round 3:
40 users
30 minutes

Lots of errors now... :-)

TOP shows no overall change in CPU and memory.

But, there are a lot of errors on the dashboard:
https://loads.services.mozilla.com/run/57d0687d-ce16-43b5-821c-0b583995e6d1

They *appear* to be load related, which means I am probably overloading the medium instance.
I will check the logs when this test ends.
OK, so doubling the concurrent users really caused massive failure.
https://loads.services.mozilla.com/run/57d0687d-ce16-43b5-821c-0b583995e6d1

Details:
Tests over     43166 
Successes     36574
Failures     6279
Errors     313
TCP Hits     153691
Opened web sockets     73337
Total web sockets     73337
Bytes/websockets     12921264
Requests / second (RPS)     83

Errors
3 occurrences: Invalid response...etc...
97 occurrences: HTTPSConnectionPool...etc...

My concern here is this value:
Failures     6279


I will check the logs for specific errors...
Well, the nginx error log is no longer empty!
loop_server.error.log:
2014/09/16 23:24:42 [error] 2214#0: *910467 connect() failed (111: Connection refused) while connecting to upstream, client: 10.84.118.219, server: , request: "POST /registration HTTP/1.1", upstream: "http://127.0.0.1:8000/registration", host: "loop.stage.mozaws.net"
and these
2014/09/16 23:24:36 [error] 2214#0: *910640 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.31.142.28, server: , request: "POST /registration HTTP/1.1", upstream: "http://127.0.0.1:8000/registration", host: "loop.stage.mozaws.net"
Not a surprise really...

loop_server.access.log
Well, if I filter out the 200s, the 101s, the 499s, the 404s, and the 307s, this is what I see:
1410909876.248 "54.218.210.15" "POST /registration HTTP/1.1" 502 178 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 38.355 38.355 "-"
1410909876.375 "54.218.210.15" "POST /registration HTTP/1.1" 502 178 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.000 0.000 "-"
...etc...

A 30min test generated around 6200 502s
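
That filtering is just counting status codes that aren't in the expected set; a rough Python equivalent, assuming the access-log layout shown in the sample lines above (status code right after the quoted request):

# Illustrative only: count access-log entries whose status is not one of the
# expected codes (200, 101, 499, 404, 307).
import re
from collections import Counter

expected = {"200", "101", "499", "404", "307"}
counts = Counter()

with open("/media/ephemeral0/nginx/logs/loop_server.access.log") as log:
    for line in log:
        m = re.search(r'" (\d{3}) ', line)   # first match is the status code
        if m and m.group(1) not in expected:
            counts[m.group(1)] += 1

print(counts)   # for this run, roughly 6200 entries with status 502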


loop_server.err.log
New, but same traceback as with the first load test:
/data/loop-server/loop/middlewares.js:83
        line.error = res.headers["www-authenticate"];
                                ^
TypeError: Cannot read property 'www-authenticate' of undefined
    at ServerResponse.<anonymous> (/data/loop-server/loop/middlewares.js:83:33)
    at ServerResponse.emit (events.js:117:20)
    at ServerResponse.OutgoingMessage._finish (http.js:1016:8)
    at ServerResponse.OutgoingMessage.end (http.js:999:10)
    at ServerResponse.send (/data/loop-server/node_modules/express/lib/response.js:191:8)
    at ServerResponse.json (/data/loop-server/node_modules/express/lib/response.js:235:15)
    at sendError (/data/loop-server/loop/utils.js:28:20)
    at hawkSendError (/data/loop-server/loop/auth.js:77:5)
    at /data/loop-server/node_modules/express-hawkauth/lib/hawk.js:157:13
    at /data/loop-server/node_modules/express-hawkauth/node_modules/hawk/node_modules/hoek/lib/index.js:543:22
    at process._tickCallback (node.js:419:13)



loop_server.out.log
nothing new to report
We need to debug all of the above, of course, but to start...
Dean - see notes WAY back in https://bugzilla.mozilla.org/show_bug.cgi?id=1067397#c6

If we need a better Stage environment, let's do that next.
One m3.medium is not allowing for much traffic (not even counting the CPU issue).
The current production loop-server stack is 2 m3.mediums
Current staging is 1 m3.medium.

I can change the number of instances in staging but wanted to check you're ok with this as it'll make comparing old vs new test results harder.

I should also note that I'm unsure of where the sizing of the live instances comes from. They could possibly be a basic default rather than determined from testing.
The traceback with "www-authenticate" being undefined is bug 1068602, should be addressed shortly.
About the following error:

>  12 occurrences:

> HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url: /calls/cytk9Eu8iVo (Caused by <class 'httplib.BadStatusLine'>: '')
>  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
>    testMethod()
>  File "loadtest.py", line 46, in test_all
>    params = self.setupCall()
>  File "loadtest.py", line 41, in setupCall
>    call_data = self.initiate_call(token)
>  File "loadtest.py", line 322, in initiate_call
>    headers={'Content-Type': 'application/json'}
>  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 425, in post
>    return self.request('POST', url, data=data, **kwargs)
>  File "/home/ubuntu/loads/loads/measure.py", line 80, in request
>    method, url, headers=headers, **kwargs)
>  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 383, in request
>    resp = self.send(prep, **send_kwargs)
>  File "/home/ubuntu/loads/loads/measure.py", line 88, in send
>    res = _Session.send(self, request, **kwargs)
>  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", 
line 486, in send
>    r = adapter.send(request, **kwargs)
>  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py", line 378, in send
>    raise ConnectionError(e)

When the server stops, it is unavailable during the restarting period. This could be solved by using the pre-fork model for load balancing the socket between two loop-server node processes.

Circus allows us to do that, and it is currently implemented in loop-server. To use it, we should use `--fd fd://{id}`. This is being tracked in bug 1068604.
James, thanks for all this hard work :)

Can you confirm that this CPU usage was generated by only 20 users on 5 agents? (!!) If that's the case this is really blocking because we cannot handle real traffic.

What kind of traffic were you sending to the servers before that (previous versions), and was it behaving differently?

As Remy pointed out, this is likely due to the fact we're keeping a bunch of timeouts in memory, but I would like to double check that's the reason.
To investigate why the CPU usage is so high, I created bug 1068667
:alexis - 
CPU was near max even with just 20 users. On the first run the CPU seemed to be held at max. For subsequent runs, the CPU was at max only until the load test ended.

I will need to run more tests though to be absolutely sure....


Dean - 
I am going to guess that 2 mediums was a basic/arbitrary choice. I don't think it's going to hold up when we go live. We should compare to other Production deployments like FxA-Auth or Tokenserver perhaps...
Going with new release tag 0.12.1:
https://github.com/mozilla-services/loop-server/releases/tag/0.12.1

Going with a bigger instance size also.
Tag 0.12.1 is built and deployed to staging. Instance size has been increased to c3.large.
OK. Verified the new instance deployed with 0.12.1 (it's actually "advertised" as 0.13.0-DEV)

CF Stack in US East: loop-server-stage
ELB: loop-server-st-ELB-1EELW8IRXEQSQ

Instance - c3.large: i-71aa345c

Verified actual instance CPU/Mem sizing with
/proc/cpuinfo and /proc/meminfo
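
(A trivial sketch of that sizing check, if anyone would rather script it than eyeball the files):

# Illustrative only: count vCPUs and read total memory from /proc.
with open("/proc/cpuinfo") as f:
    vcpus = sum(1 for line in f if line.startswith("processor"))
with open("/proc/meminfo") as f:
    memtotal = next(line for line in f if line.startswith("MemTotal"))
print(vcpus, "vCPUs;", memtotal.strip())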


RPM information for code/versions:
puppet-config-loop 20140915171505-1 x86_64 15158
loop-server-svcops 0.12.1-1 x86_64 26686075

VERSION.txt shows
053b77f Clear the timeouts when the connection closes

Once again verified the processes, files, and logs.

https://loop.stage.mozaws.net (in the browser) and
curl https://loop.stage.mozaws.net
show this:
{"name":"mozilla-loop-server","description":"The Mozilla Loop (WebRTC App) server","version":"0.13.0-DEV","homepage":"https://github.com/mozilla-services/loop-server/","endpoint":"https://loop.stage.mozaws.net","fakeTokBox":false,"fxaOAuth":true,"i18n":{"defaultLang":"en-US"}}

I think the 0.13.0-DEV is correct for this tagged release...

QA plan is to repeat testing from yesterday to watch errors, CPU usage, memory usage.
We have 3 hours ;-)

Run 1: 20 users, 5 agents, 30 minutes
Run 2: 40 users, 5 agents, 30 minutes
(with repeats as necessary for debugging)
Did two rounds of 30min tests:
20 Users
5 Agents
30 Minutes

One: https://loads.services.mozilla.com/run/d625272a-fb02-43f7-b22b-09240f809602
Two: https://loads.services.mozilla.com/run/8e05c794-a495-434e-878b-c619cd4d2a04

Results on the dashboard are good.
(Client-side) RPS is really good (that's m3.medium --> c3.large!)

Logs are pretty clean.

With TOP, memory usage was very low, so we can discount that now.
CPU went to 75% for the node process at the beginning of the load test and stayed there until the test finished.
After that, the CPU usage went back to normal.


Going to double the concurrent users now...
Let's see if the CPU usage holds or moves up towards 95%...
Run 3:
40 Users
5 Agents
30 Minutes

https://loads.services.mozilla.com/run/77304522-cc16-4677-9f53-e53b549b91e2

Between 20 - 25 minutes into this, I started to see some serious failures.
CPU held at 70 - 75% usage
Memory usage was low.
https://loads.services.mozilla.com/run/77304522-cc16-4677-9f53-e53b549b91e2


Results
Tests over     74321 
Successes     69730
Failures     4563
Errors     28
TCP Hits     284624
Opened web sockets     139565
Total web sockets     139565
Bytes/websockets     27188532
Requests / second (RPS)      156


Errors
2 occurrences:
No JSON object could be decoded
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 42, in setupCall
    calls = self.list_pending_calls()
  File "loadtest.py", line 333, in list_pending_calls
    data = self._get_json(resp)
  File "loadtest.py", line 284, in _get_json
    return resp.json()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")

95 occurrences:
Call Initialization failed: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>openresty/1.7.0.1</center>
</body>
</html>

  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 41, in setupCall
    call_data = self.initiate_call(token)
  File "loadtest.py", line 325, in initiate_call
    "Call Initialization failed: %s" % resp.content)
  File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
    raise self.failureException(msg)

3 occurrences:
Invalid response status: 502 Bad Gateway
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 47, in test_all
    self._test_websockets(*params)
  File "loadtest.py", line 271, in _test_websockets
    caller_ws = self.create_ws(progress_url, callback=_handle_caller)
  File "loadtest.py", line 34, in create_ws
    ws = TestCase.create_ws(self, *args, **kw)
  File "/home/ubuntu/loads/loads/case.py", line 65, in create_ws
    test_case=self)
  File "/home/ubuntu/loads/loads/websockets.py", line 56, in create_ws
    extensions, klass, test_case)
  File "/home/ubuntu/loads/loads/websockets.py", line 82, in _create_ws
    socket.connect()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/ws4py/client/__init__.py", line 132, in connect
    self.process_response_line(response_line)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/ws4py/client/__init__.py", line 182, in process_response_line
    raise HandshakeError("Invalid response status: %s %s" % (code, status))
Logs:

Nginx error log is again full of these:
2014/09/17 19:24:46 [error] 2270#0: *1465233 connect() failed (111: Connection refused) ...
2014/09/17 19:24:42 [error] 2271#0: *1465726 upstream prematurely closed connection while reading response header from upstream...
2014/09/17 19:24:42 [error] 2270#0: *1474743 recv() failed (104: Connection reset by peer)...

Nginx access log 
Again, filtering out the 200s, the 101s, the 499s, the 404s, and the 307s, this is what I see:
1410981882.275 "54.218.210.15" "POST /calls/5P0pF_6dTj8 HTTP/1.1" 502 178 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 3.558 3.558 "-"
1410981882.437 "54.218.210.15" "POST /registration HTTP/1.1" 502 178 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.000 0.000 "-"
...etc...


The loop_server.out.log is ok

The loop_server.err.log shows the following traceback:
/data/loop-server/loop/middlewares.js:83
        line.error = res.headers["www-authenticate"];
                                ^
TypeError: Cannot read property 'www-authenticate' of undefined
    at ServerResponse.<anonymous> (/data/loop-server/loop/middlewares.js:83:33)
    at ServerResponse.emit (events.js:117:20)
    at ServerResponse.OutgoingMessage._finish (http.js:1016:8)
    at ServerResponse.OutgoingMessage.end (http.js:999:10)
    at ServerResponse.send (/data/loop-server/node_modules/express/lib/response.js:191:8)
    at ServerResponse.json (/data/loop-server/node_modules/express/lib/response.js:235:15)
    at sendError (/data/loop-server/loop/utils.js:28:20)
    at hawkSendError (/data/loop-server/loop/auth.js:77:5)
    at /data/loop-server/node_modules/express-hawkauth/lib/hawk.js:157:13
    at /data/loop-server/node_modules/express-hawkauth/node_modules/hawk/node_modules/hoek/lib/index.js:543:22
    at process._tickCallback (node.js:419:13)
So we achieved two things:
1. A large instance allowed for better CPU/memory usage, so we should keep larges for Stage and Prod
(Stage 1, Prod 2)
Also check out the RPS on the dashboard. Much higher with large instance vs. medium
We should still investigate the 75% usage with large, 95% usage with medium (see the separate bug)

2. The additional concurrent users/agents are really killing the service (we should never see 502s), regardless of the instance size.

Also, I thought the traceback issue (in loop_server.err.log) was addressed, but maybe not...


I am going to stop here. We need to address the massive failures with higher traffic before we can continue.
I think these are related to bug 1068604, let's configure that and try again :)

CPU usage of 75% seems pretty high still. Needs investigation :)
OK, I will see if I can get another test window for tomorrow morning or afternoon.
I sent an email to our third-party. Will see if Thu afternoon is open for more testing...
We have released a fix that you can find here: https://github.com/mozilla-services/loop-server/releases/tag/0.12.1
Summary: Loop — Please deploy loop-server 0.12.0 to Stage → Loop — Please deploy loop-server 0.12.1 to Stage
I don't know what happened, but it seems some commits are missing from the release. Also, the version is not right in the release.

https://github.com/mozilla-services/loop-server/blob/0.12.1/loop/middlewares.js#L83

Should we deploy a 0.13-dev snapshot from loop-server master to stage?
That is correct; the 0.12.1 release is meant to handle just clearing the timeouts on connection close, to see if that was making any difference. I will tag 0.12.2 shortly with the bugfixes that landed in master yesterday.
(Also, we don't want to deploy something that's not tagged).
0.12.2 is tagged and being deployed on stage.

BTW, Remy just told me that we might not want to tag each version before going to stage. Instead, we could use a master snapshot, test it, and once it's ready tag a version and deploy it to prod.

Dean, what's the best way to handle things for you? (I personally don't care).
0.12.2 has been deployed to staging.
I don't think anything should go to staging that's not built from a tag. Tagging is quite a quick and cheap operation, and I think that tagged version numbers are worth it.
:deanw is correct. Only tagged releases to Stage. This matches what we would do for Production, and what we generally do for all our services.
https://github.com/mozilla-services/loop-server/releases/tag/0.12.2
+- [bugfix] Clear the setTimeout() on connection close (ws);
+- [bugfix] Check the headers using .get, not .headers;
+- [bugfix] Be sure to handle tokbox failures on time;
+- [config] Rename "host" into "ip".

I will look this over then start some load testing as soon as I hear back from our third party.
Dean, please note that there is a config change but that shouldn't impact you too much: "host" is now "ip".

The default values should work okay for you, though.
:alexis thanks for that info - I just noticed that myself.
So there is a bit more work to do here for this:
+- [config] Rename "host" into "ip".
Summary: Loop — Please deploy loop-server 0.12.1 to Stage → Loop — Please deploy loop-server 0.12.2 to Stage
Kicking this out.
Please rebuild 0.12.2 with the fix for Prod 11.2.
You can call this 0.12.3.
Depends on: 1069168
> Please rebuild 0.12.2 with the fix for Prod 11.2.

0.12.2 already has this fix in it.
Thanks very much for the update.
So, we just need the config change and we can still hit our afternoon window for load testing.
Searching for the config change now...




PS: This is the last time we morph a Stage ticket ;-)
The loop-server stack has been updated with the latest config change.
Depends on: 1047667
OK, to bring everybody up to speed, this is where we are:

We are running 0.12.2, got there from here:
Release info: 
https://github.com/mozilla-services/loop-server/releases
https://github.com/mozilla-services/loop-server/compare/0.11.0...0.11.1
https://github.com/mozilla-services/loop-server/compare/0.11.1...0.12.0
https://github.com/mozilla-services/loop-server/compare/0.12.0...0.12.1
https://github.com/mozilla-services/loop-server/compare/0.12.1...0.12.2
	
0.12.0: https://github.com/mozilla-services/loop-server/releases/tag/0.12.0
Changelog is here: https://github.com/mozilla-services/loop-server/commit/922165c8d09f06c83e3d708d6bb1ec7a8bb07b6b
	
0.12.1: https://github.com/mozilla-services/loop-server/releases/tag/0.12.1
Changelog is here: https://github.com/mozilla-services/loop-server/commit/053b77f119d266fba9d922c3eaf1678f367ecd59
	
0.12.2: https://github.com/mozilla-services/loop-server/releases/tag/0.12.2
https://github.com/mozilla-services/loop-server/commit/5dfd3eec2ef129f6435c53a229f683151b454df1
	
and for further reference:
0.11.2: https://github.com/mozilla-services/loop-server/releases/tag/0.11.2
https://github.com/mozilla-services/loop-server/commit/a71973c721e9b44bc24d35ed9d97a16f6913db0c


Opened just a few bugs and issues along the way:
https://bugzilla.mozilla.org/show_bug.cgi?id=1068604
https://bugzilla.mozilla.org/show_bug.cgi?id=1068602
with https://github.com/mozilla-services/loop-server/pull/208
https://bugzilla.mozilla.org/show_bug.cgi?id=1068667
with https://github.com/mozilla-services/loop-server/pull/209
https://github.com/mozilla-services/loop-server/pull/211	
https://bugzilla.mozilla.org/show_bug.cgi?id=1069168
with https://github.com/mozilla-services/loop-server/pull/212
https://bugzilla.mozilla.org/show_bug.cgi?id=1069193
https://github.com/mozilla-services/puppet-config/pull/877
OK, so I am putting a temporary hold on this deployment until I can hear from client and server Devs on whether or not:
1. bug 1047667 is a blocker for the current deployment
2. server-side elements of bug 1047667 need to go out with this release or can hold for the next

After talking with :mreavy, it would seem that the server-side changes needed for/from bug 1047667 are blocking client-side testing and release.

stealing a bit of IRC:
14:56 < mreavy> we will be blocked on Monday/Tuesday
14:56 < mreavy> it will block our QA from testing FxA and from landing on nightly
No longer depends on: 1047667
Unblocked. Moving forward.
CF Stack in US East:
loop-server-stage
ELB: loop-server-st-ELB-1EELW8IRXEQSQ
Instance - c3.large: i-ef64ba01

Same as previous deploys, for the most part.

Code version is now
puppet-config-loop 20140915171505-1 x86_64 15158
loop-server-svcops 0.12.2-1 x86_64 26689998

VERSION.txt reports the following:
5dfd3ee Update the changelog for 0.12.2

And from the browser:
https://loop.stage.mozaws.net
returns:
{"name":"mozilla-loop-server","description":"The Mozilla Loop (WebRTC App) server","version":"0.12.2","homepage":"https://github.com/mozilla-services/loop-server/","endpoint":"https://loop.stage.mozaws.net","fakeTokBox":false,"fxaOAuth":true,"i18n":{"defaultLang":"en-US","lang":"en-us"}}


Verified the processes, files, and logs on this one c3.large instance.

On to load testing...
30 min load test - 12 failures, 27 errors
https://loads.services.mozilla.com/run/37d8e9ed-1c0a-4735-949f-4f3412c6d498


Errors
12 occurrences:

Call-Url creation failed: {"code":401,"errno":110,"error":{"statusCode":401,"error":"Unauthorized","message":"Unknown credentials"}}
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 40, in setupCall
    token = self.generate_token()
  File "loadtest.py", line 312, in generate_token
    "Call-Url creation failed: %s" % resp.content)
  File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
    raise self.failureException(msg)

27 occurrences:

'calls'
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 42, in setupCall
    calls = self.list_pending_calls()
  File "loadtest.py", line 334, in list_pending_calls
    return data['calls']
I found only two 401s in the nginx access log:
1411081228.739 "54.218.210.15" "POST /call-url HTTP/1.1" 401 106 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.004 0.004 "-"

But the weird thing is that the instance stopped receiving traffic from the load test 5 - 10 minutes before it even finished running. The dashboard on the other hand shows a completed 30min load test...

/me shrugs
OK, I had :rpapa restart that, but it failed immediately with a high count of 503s.
The loop-server seems healthy, so I reset the loads cluster.

So, that did not address the problem and the cluster works for other services.

These 503s are very specific to loop-server. Even if you run 'make test' from the loadtests folder, you get this:
1 occurrence of: 
    AssertionError: Registration failed: {"code":503,"errno":201,"error":"Service Unavailable"}    Traceback: 
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 331, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 39, in setupCall
    self.register()
  File "loadtest.py", line 294, in register
    "Registration failed: %s" % resp.content)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 515, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 508, in _baseAssertEqual
    raise self.failureException(msg)


so at some point in the last 30 - 45 minutes, we got a run of 401s, followed by a flood of 503s...
The 503s were almost entirely to the /registration endpoint.

While trying to isolate the problem, Operations tested connectivity from the loop-server instance to all the resources listed in the settings.json file, all of which were successful:

apiUri
webAppUrl
content_uri
oauth_uri
profile_uri
fxaVerifier
statsd: host
storage: settings: host

After setting fakeTokBox to 'true', the problem was resolved, which seems to indicate that the problem was in talking to the TokBox endpoint.
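
A rough sketch of that kind of connectivity check, assuming the URIs can be pulled out of /data/loop-server/config/settings.json under the keys listed above (the real file's nesting may differ, so treat this as illustrative):

# Illustrative only: probe each outbound dependency found in settings.json.
import json
import socket
from urllib.parse import urlparse

with open("/data/loop-server/config/settings.json") as f:
    settings = json.load(f)

for key in ("apiUri", "webAppUrl", "content_uri", "oauth_uri",
            "profile_uri", "fxaVerifier"):
    url = settings.get(key)
    if not url:
        continue
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        socket.create_connection((parsed.hostname, port), timeout=5).close()
        print("OK   %s -> %s" % (key, url))
    except OSError as e:
        print("FAIL %s -> %s (%s)" % (key, url, e))
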
OK, well, wow...

So the problem was resolved using the mock server, but only for a short time. Running a 10min load test, the same errors come up at around the 5 minute mark (401s anyway) and the logs stop showing any traffic or requests...
https://loads.services.mozilla.com/run/eb4884a6-1ad4-43cf-aa8f-8a6abc2566cb

See the example of the 401s in Comment 59.


I have no idea what is going on there, but this seems plenty broken. We cannot run any load at all for any length of time before the traffic grinds to a halt (so to speak) and we get nothing but errors.

So, with live server, we are getting 401s and 503s and an apparent drop in traffic to 0.
With the mock server, we are getting 401s and an apparent drop in traffic to 0.
The 503 on the TokBox server may be related to the configuration of the TokBox timeout.

This value for usage of TokBox: https://github.com/mozilla-services/loop-server/blob/master/loop/config.js#L214

This value for test of tokbox: https://github.com/mozilla-services/loop-server/blob/master/loop/config.js#L298

Maybe we can push it to 3000 ms as :tarek proposed earlier.

I am going to investigate the 401.
The 401s are not related to TokBox: we're not using TokBox at all during registration.

The 503s could be normal, in case the TokBox staging server was down.

I would check for time skew between the client and the server for the authentication errors. It also seems we may need some more debugging information in the loop-server code about authentication failures.

I'm also investigating the 401s, let's see what happens!
Sorry, let me rephrase, I misread: 

- 503s are probably related to TokBox not working, since they happen on /calls (need to check that on the stage dashboard);
- I don't know what causes the 401s, but if the writes fail, we would get a 503 and subsequent queries would return 401s.
In the stage sentry, I see:

> Error: OOM command not allowed when used memory > 'maxmemory'.

So it seems that redis is full. I wonder what we do with redis to fill it. Maybe the loadtests are putting too much data there for its sizing.

Can you confirm this redis server had been recreated and was empty before the load test?
From an ops perspective, we don't recreate or flush the ElastiCache instance.

This has, however, shown that we need some monitoring on the ElastiCache memory usage.
So, after some investigation, here is what we found:

- Redis was full on staging, and so the database wasn't accepting writes anymore, as I was explaining in comment 67.

- After a check, Redis is *almost* full on prod as well, so we need to find out what's happening. It might be that the Redis instance is undersized or that we're leaking data.

This should not be considered a release blocker, since that's deployment specific and not related to the code itself. In production this should work okay, and since we will not have any increase in traffic, it should continue to do so.

However, we need to find out what's going on with Redis: are we storing data we shouldn't, or is the instance just undersized? I filed bug 1069995 about that.
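
In the meantime, a minimal sketch of the kind of memory check we could run against the Redis/ElastiCache endpoint (the hostname below is a placeholder, not the real endpoint):

# Illustrative only: see how close Redis is to its maxmemory limit.
import redis

r = redis.StrictRedis(host="loop-redis.example.internal", port=6379)
mem = r.info("memory")
used = mem["used_memory"]
limit = mem.get("maxmemory", 0)
if limit:
    print("redis memory: %d / %d bytes (%.1f%%)" % (used, limit, 100.0 * used / limit))
else:
    print("redis memory: %d bytes used, no maxmemory set" % used)
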
30 min load test - 1 error
https://loads.services.mozilla.com/run/5685041a-f65f-4c6d-904b-46ababe14735

1 Error
[Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
http://pastebin.mozilla.org/6553719
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
60 min load test - no errors
https://loads.services.mozilla.com/run/aabf7b00-fcb1-43ef-9bbb-defe44b33c4b

jbonacci: SSL cert error might have been an ELB-related anomaly
Status: RESOLVED → VERIFIED