Closed Bug 1082919 Opened 10 years ago Closed 10 years ago

Please deploy release 0.12.5 to loop-server Stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task, P1)

x86_64
Linux
task

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: alexis+bugs, Assigned: bobm)

References

Details

(Whiteboard: [qa+])

There is a bugfix release to get out pretty urgently (setting to P1/critical on purpose, as we need this to be out for beta).

There are no configuration changes. The changelog since the last tag (0.12.4) is here: https://github.com/mozilla-services/loop-server/compare/0.12.4...0.12.5

I assigned this to Bob, but whoever can do it is welcome to do so.

Thanks folks!
QA Contact: rpappalardo
Whiteboard: [qa+]
The 0.12.5 release has been built and uploaded to the Cloud Services S3 yum repository.
The 0.12.4 stack has been updated to 0.12.5.  Please QA.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
thanks!
Stealing this back so :rpapa is not so overloaded. :-)
He will be my backup.
I will do the basic verifications today.
Load testing is tomorrow with optional load on Friday...
QA Contact: rpappalardo → jbonacci
Also:
:rpapa says:
By the way, when you do the file checks on 0.12.5, see if loop_server.yaml is correct. I was checking it yesterday on 0.12.4 and it seemed like a config change to connectionDuration hadn't propagated.
This was going to be verified on 0.12.4, but since the ticket was closed, we may want to verify it on 0.12.5:

Bug 1068667 - Investigate why the CPU is so high during the load tests
First round of checks:

Verified we are still using the same stack and ELB.
Verified the deployment to one c3.large instance:
i-6db3b186
ec2-54-237-181-243

Version check is good:
puppet-config-loop 20141014182605-1 x86_64 15668
loop-server-svcops 0.12.5-1 x86_64 26719810

VERSION.txt shows
af89539 Tag 0.12.5

Verified processes and files

Some expected config changes:
/data/loop-server/config/settings.json
which does have a new entry:
"callDuration": 80,
and an updated entry value:
    "timers": {
        "connectionDuration": 30

/etc/puppet/yaml/app/loop_server.yaml
which has a new entry:
callDuration: 80
and an updated entry value:
    timers:
        connectionDuration: 30
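
For the record, a quick way to spot-check that both values propagated (just a sketch; the paths are the ones above, and grep is only one way to do it):

# settings.json should show "callDuration": 80 and "connectionDuration": 30
grep -n -e callDuration -e connectionDuration /data/loop-server/config/settings.json

# loop_server.yaml should show callDuration: 80 and connectionDuration: 30
grep -n -e callDuration -e connectionDuration /etc/puppet/yaml/app/loop_server.yaml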


curl https://loop.stage.mozaws.net
{"name":"mozilla-loop-server","description":"The Mozilla Loop (WebRTC App) server","version":"0.12.5","homepage":"https://github.com/mozilla-services/loop-server/","endpoint":"https://loop.stage.mozaws.net","fakeTokBox":false,"fxaOAuth":true,"i18n":{"defaultLang":"en-US"}}

curl -I https://loop.stage.mozaws.net
HTTP/1.1 200 OK
Date: Wed, 15 Oct 2014 18:02:06 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 273
Connection: keep-alive
ETag: W/"111-dd29fcc1"
Timestamp: 1413396126
OK, so here is an issue:
/media/ephemeral0/nginx/logs/loop_server.error.log

Contains errors/messages of the following type (which seem new to me):
" while reading PROXY protocol, client: 127.0.0.1, server: 0.0.0.0:80
2014/10/15 10:09:15 [error] 2252#0: *2384 broken header: "GET /nginx_status HTTP/1.1
User-Agent: stackdriver_agent/5.3.0-174.el6
Host: 127.0.0.1
Accept: */*

These are from about 8 hours ago or longer, so possibly only during deployment?
FYI the stackdriver process is running. Not sure if there are stackdriver-specific logs anywhere.


:alexis and :bobm or :deanw to review ...
Flags: needinfo?(dwilson)
Flags: needinfo?(bobm)
Flags: needinfo?(alexis+bugs)
That's stackdriver asking with a broken header? Weird.
Flags: needinfo?(alexis+bugs)
(In reply to James Bonacci [:jbonacci] from comment #8)
> OK, so here is an issue:
> /media/ephemeral0/nginx/logs/loop_server.error.log
> 
> Contains errors/messages of the following type (which seem new to me):
> " while reading PROXY protocol, client: 127.0.0.1, server: 0.0.0.0:80
> 2014/10/15 10:09:15 [error] 2252#0: *2384 broken header: "GET /nginx_status
> HTTP/1.1
> User-Agent: stackdriver_agent/5.3.0-174.el6
> Host: 127.0.0.1
> Accept: */*
> 
> These are from about 8 hours ago or longer, so possibly only during
> deployment?
> FYI the stackdriver process is running. Not sure if there are
> stackdriver-specific logs anywhere.

This only impacts sending Nginx status data to the StackDriver agent.  It is something we should fix, but it isn't breaking anything with the deployment.
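
If anyone wants to reproduce it on an instance, here is a rough sketch (assuming, as the error log suggests, that the 0.0.0.0:80 listener has proxy_protocol enabled for the ELB):

# A plain local request, like the one the StackDriver agent sends, gets logged as a "broken header":
curl -sv http://127.0.0.1/nginx_status

# The same request prefixed with a hand-written PROXY protocol v1 line should be accepted:
printf 'PROXY TCP4 127.0.0.1 127.0.0.1 12345 80\r\nGET /nginx_status HTTP/1.1\r\nHost: 127.0.0.1\r\nAccept: */*\r\n\r\n' | nc 127.0.0.1 80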
Flags: needinfo?(bobm)
Per :jbonacci
we should be able to hit these while load testing...
need to verify:
Bug 1068255 - [Rooms] Add appropriate API endpoints to Loop-Server load test
Flags: needinfo?(dwilson)
As it turns out, the Rooms support for load testing is not in this release. Removing that testing requirement.

I had to rebuild the loop-server repo on my localhost to use a 0.12.x branch rather than the Master branch (which defaults to Rooms support for the loadtest).
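
Roughly what that local rebuild looked like, as a sketch (the branch name and make targets here are assumptions based on this bug's comments; exact steps may differ):

git clone https://github.com/mozilla-services/loop-server.git
cd loop-server
git checkout 0.12.x      # a 0.12.x maintenance branch instead of master; exact branch name may differ
# then run the load-test targets referenced later in this bug (they may live in a load-test subdirectory):
make test
make megabench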

We finally have load tests running again...
Initial 30 minute run is here:
https://loads.services.mozilla.com/run/cd0abfc1-3e68-41bc-93f6-39f8083879b7

Tests over     58239 
Successes     58083
Failures     0
Errors     156
TCP Hits     232945
Opened web sockets     116166
Total web sockets     116166
Bytes/websockets     24947700
Requests / second (RPS)     126

We got a large number of ConnectionPool errors for a 30-minute test. The error type is expected, but the high error count is not. I will follow up with Ops.

Errors
100+ occurrences:
HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url: /calls?version=200 (Caused by <class 'socket.error'>: [Errno 104] Connection reset by peer)


I will try again with a lower user value...
The second 30-minute run is here:
https://loads.services.mozilla.com/run/9f6ce46b-f63a-4565-b114-650d644d0ddd

This is much better:
Tests over     70790 
Successes     70790
Failures     0
Errors     0
TCP Hits     283472
Opened web sockets     141630
Total web sockets     141630
Bytes/websockets     30608184
Requests / second (RPS)     153

Noting here the increase in tests completed and the increase in RPS.


Going back to 20 users and trying another load test...
The 60-minute run looks good: https://loads.services.mozilla.com/run/5894236b-b543-42e7-bd71-473bacbb6581
Moving on to a 4-hour run.

Looking into the logs as well...
So, about 3 hours into the 4-hour test, I decided to pull the plug. There are some rather strange errors showing up in the Loads dashboard (sorry in advance for the long comment):

Link:
https://loads.services.mozilla.com/run/a0a03a1b-9d3f-4134-85e2-0c637ae4473e


Tests over     267310 
Successes     267298
Failures     0
Errors     12
TCP Hits     1069689
Opened web sockets     534735
Total web sockets     534735
Bytes/websockets     112791584
Requests / second (RPS)     104

The error count is not too high from this perspective, but the types of errors are new.
I have never seen these before...

 3 occurrences:

Invalid response
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 47, in test_all
    self._test_websockets(*params)
  File "loadtest.py", line 271, in _test_websockets
    caller_ws = self.create_ws(progress_url, callback=_handle_caller)
  File "loadtest.py", line 34, in create_ws
    ws = TestCase.create_ws(self, *args, **kw)
  File "/home/ubuntu/loads/loads/case.py", line 65, in create_ws
    test_case=self)
  File "/home/ubuntu/loads/loads/websockets.py", line 56, in create_ws
    extensions, klass, test_case)
  File "/home/ubuntu/loads/loads/websockets.py", line 82, in _create_ws
    socket.connect()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/ws4py/client/__init__.py", line 126, in connect
    raise HandshakeError("Invalid response")

1 occurrence:

HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url: /registration (Caused by <class 'httplib.BadStatusLine'>: '')
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 39, in setupCall
    self.register()
  File "loadtest.py", line 292, in register
    data={'simple_push_url': 'http://httpbin.org/deny'})
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 425, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 80, in request
    method, url, headers=headers, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 88, in send
    res = _Session.send(self, request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py", line 378, in send
    raise ConnectionError(e)

3 occurrences:
'calls'
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 42, in setupCall
    calls = self.list_pending_calls()
  File "loadtest.py", line 334, in list_pending_calls
    return data['calls']

1 occurrence:
can't start new thread
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 47, in test_all
    self._test_websockets(*params)
  File "loadtest.py", line 271, in _test_websockets
    caller_ws = self.create_ws(progress_url, callback=_handle_caller)
  File "loadtest.py", line 34, in create_ws
    ws = TestCase.create_ws(self, *args, **kw)
  File "/home/ubuntu/loads/loads/case.py", line 65, in create_ws
    test_case=self)
  File "/home/ubuntu/loads/loads/websockets.py", line 56, in create_ws
    extensions, klass, test_case)
  File "/home/ubuntu/loads/loads/websockets.py", line 82, in _create_ws
    socket.connect()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/ws4py/client/__init__.py", line 110, in connect
    self.sock.connect((self.host, int(self.port)))
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/ssl.py", line 328, in connect
    socket.connect(self, addr)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/socket.py", line 332, in connect
    r = getaddrinfo(address[0], address[1], sock.family, sock.type, sock.proto)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/socket.py", line 621, in getaddrinfo
    return get_hub().resolver.getaddrinfo(host, port, family, socktype, proto, flags)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/resolver_thread.py", line 34, in getaddrinfo
    return self.pool.apply_e(self.expected_errors, _socket.getaddrinfo, args, kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/threadpool.py", line 222, in apply_e
    success, result = self.spawn(wrap_errors, expected_errors, function, args, kwargs).get()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/threadpool.py", line 159, in spawn
    self.adjust()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/threadpool.py", line 133, in adjust
    self._adjust_step()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/threadpool.py", line 113, in _adjust_step
    self._add_thread()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/gevent/threadpool.py", line 142, in _add_thread
    start_new_thread(self._worker, ())

4 occurrences:
[Errno 8] _ssl.c:504: EOF occurred in violation of protocol
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 39, in setupCall
    self.register()
  File "loadtest.py", line 292, in register
    data={'simple_push_url': 'http://httpbin.org/deny'})
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 425, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 80, in request
    method, url, headers=headers, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 88, in send
    res = _Session.send(self, request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py", line 385, in send
    raise SSLError(e)


Tagging Dev to start a first round of debugging while I check the logs...
Flags: needinfo?(rhubscher)
Flags: needinfo?(alexis+bugs)
/media/ephemeral0/nginx/logs/loop_server.error.log has some recent content but we have already discussed it:
...etc...
" while reading PROXY protocol, client: 127.0.0.1, server: 0.0.0.0:80
...etc...


/media/ephemeral0/nginx/logs/loop_server.access.log
Ignoring the 200s, 101s, 404s, and 307s (bot stuff) and the 499s (a load-test artifact)
leaves just a few stray 400s.


Worth a look all around.
I will wait till I hear back from the Devs before proceeding.

/media/ephemeral0/circus/loop_server/loop_server.err.log
is empty

/media/ephemeral0/circus/loop_server/loop_server.out.log
has a significant number of these:
[Error: Failed to createSession. Error: The request failed: Error: ETIMEDOUT]
So we have a few different things here:

- The "session failed" errors are, I think, due to TokBox not being able to create new sessions; that could be because their staging environment is over capacity (the requests timed out). In any case, we retry on timeout, and if the client doesn't show errors we have probably been able to get the sessions from them;

- The "can't start new thread" error is due to loads not being able to start a new thread, so that's not related to loop itself;

- We have SSL handshake errors here as well, and I'm not sure what's causing them;

- The 'calls' errors are probably a false negative here, caused by the other errors.

Anyway, that's 12 errors out of 267,310 (~0.005%), so it doesn't seem too problematic. But I would like to investigate the SSL handshake problems, because they could mean the ELB is not terminating SSL properly (though again, we see a very low number of them here).
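
For what it's worth, a quick way to eyeball the ELB's SSL termination (a sanity probe only, not a load-level check):

# Inspect the certificate and handshake the Stage ELB presents
openssl s_client -connect loop.stage.mozaws.net:443 -servername loop.stage.mozaws.net </dev/null 2>/dev/null | openssl x509 -noout -subject -dates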
Flags: needinfo?(alexis+bugs)
OK. Adding Dean and Bob to the needinfo list to help out with the possible ELB issue.

The "session failed" issue is a bit troubling to me, since 1) we have not seen it before and 2) TokBox once told me our load was not really impacting their environment at all (in terms of traffic).

But yeah, it's a pretty small number of errors, and the service/nginx logs are relatively clean...
Flags: needinfo?(rhubscher)
Flags: needinfo?(dwilson)
Flags: needinfo?(bobm)
An initial look at the ELB by :deanw turned up nothing of interest.
I am curious why the loads dashboard showed something while the loop-server logs and ELB logs show nothing.

Contacting our third party today to get another test window to try to repro this...
Flags: needinfo?(dwilson)
Flags: needinfo?(bobm)
The first fresh attempt at a load test produced nearly 100% of these:
Call-Url creation failed: {"code":401,"errno":110,"error":{"statusCode":401,"error":"Unauthorized","message":"Stale timestamp"}}

This turned out to be a problem with the loads cluster.
I had to reboot all 3 instances and then restart circus.

Running a short load now to check cluster health.
The 5-minute and 10-minute tests were clean. Trying a 30-minute, a 60-minute, and a 4-hour run.
OK. Here is our first 60min load test of the week:
https://loads.services.mozilla.com/run/cde942ef-42e0-41bd-8006-73436cf340a9

We got too many errors for a 60min test - some new, some we have seen before:

Tests over     97210 
Successes     97179
Failures     7
Errors     24
TCP Hits     389329
Opened web sockets     194558
Total web sockets     194558
Bytes/websockets     41884250
Requests / second (RPS)     107

Seen previously:
24 occurrences:
'calls'

New:
Call Initialization failed: {"code":404,"errno":105,"error":"Token not found."}
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 46, in test_all
    params = self.setupCall()
  File "loadtest.py", line 41, in setupCall
    call_data = self.initiate_call(token)
  File "loadtest.py", line 325, in initiate_call
    "Call Initialization failed: %s" % resp.content)
  File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
    raise self.failureException(msg)



I am going to try to reproduce this with another 60min load test...
This second 60min test failed immediately with these errors:
Call-Url creation failed: {"code":401,"errno":110,"error":{"statusCode":401,"error":"Unauthorized","message":"Unknown credentials"}}

(vs. the "Stale timestamp")

All the 401s look something like this:
1413832903.703 "54.245.44.231" "POST /call-url HTTP/1.1" 401 106 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.002 0.002 "-"

In addition, I see some 503s:
1413832904.401 "54.245.44.231" "POST /registration HTTP/1.1" 503 54 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 0.001 0.001 "-"


So, I am stopping here and bouncing the bug until we can figure out what is going on.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
So, now, after a break, I try again and get nothing but 503s for 'make test' and 'make megabench'.
Hitting loop-server Stage in the browser does return 200.
Hitting other Stage envs with the loads cluster is also fine, so it's not the cluster.
Related, from Sentry, some traceback:

Error
OOM command not allowed when used memory > 'maxmemory'.
    /data/loop-server/node_modules/redis/lib/parser/hiredis.js ? in HiredisReplyParser.execute
        };
        HiredisReplyParser.prototype.execute = function (data) {
            var reply;
            this.reader.feed(data);
            while (true) {
                try {
                    reply = this.reader.get();
                } catch (err) {
                    this.emit("error", err);
                    break;
                }
                if (reply === undefined) {
                    break;
    /data/loop-server/node_modules/redis/index.js ? in RedisClient.on_data
        RedisClient.prototype.on_data = function (data) {
            if (exports.debug_mode) {
                console.log("net read " + this.host + ":" + this.port + " id " + this.connection_id + ": " + data.toString());
            }
            try {
                this.reply_parser.execute(data);
            } catch (err) {
                // This is an unexpected parser problem, an exception that came from the parser code itself.
                // Parser should emit "error" events if it notices things are out of whack.
                // Callbacks that throw exceptions will land in return_reply(), below.
                // TODO - it might be nice to have a different "error" event for different types of errors
                this.emit("error", err);
            }
    /data/loop-server/node_modules/redis/index.js ? in Socket.<anonymous>
            var self = this;
            this.stream.on("connect", function () {
                self.on_connect();
            });
            this.stream.on("data", function (buffer_from_socket) {
                self.on_data(buffer_from_socket);
            });
            this.stream.on("error", function (msg) {
                self.on_error(msg.message);
            });
            this.stream.on("close", function () {
    events.js ? in Socket.emit at line 95
    _stream_readable.js ? in Socket.<anonymous> at line 748
    events.js ? in Socket.emit at line 92
    _stream_readable.js ? in emitReadable_ at line 410
    _stream_readable.js ? in emitReadable at line 406
    _stream_readable.js ? in readableAddChunk at line 168
    _stream_readable.js ? in Socket.Readable.push at line 130
    net.js ? in TCP.onread at line 528
This is because the database is full; that's what "OOM command not allowed when used memory > 'maxmemory'." means.

We need to find a way, for staging, to clean the database each time we redeploy, because there is no valuable data there.
All the errors you reported are due to the server database being full; doing a flushdb and rerunning the tests should help.
I have made a note to have Ops flush the Stage Redis DB with each deploy.
That should probably become a documented or automated step for Ops...
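
As a minimal sketch of what that step could look like (the host below is a placeholder; the real Stage Redis endpoint comes from the loop-server config / Ops):

# Flush the Stage loop-server Redis DB right after a deploy (placeholder host)
redis-cli -h <stage-redis-host> -p 6379 flushdb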

Anyway...
Load tests are running.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
OK, the 30-minute sanity load test is clean: https://loads.services.mozilla.com/run/afcb0948-8875-4939-b420-ee5eda772a7e

Moving on to two 60-minute runs and one 4-hour run.
First 60min test: https://loads.services.mozilla.com/run/2985afd4-1f96-42ee-bc37-f1b61afc19fb

Shows only 2 errors of the following type:
 2 occurrences:

Invalid response
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 47, in test_all
    self._test_websockets(*params)
  File "loadtest.py", line 270, in _test_websockets
    callee_ws = self.create_ws(progress_url, callback=_handle_callee)
  File "loadtest.py", line 34, in create_ws
    ws = TestCase.create_ws(self, *args, **kw)
  File "/home/ubuntu/loads/loads/case.py", line 65, in create_ws
    test_case=self)
  File "/home/ubuntu/loads/loads/websockets.py", line 56, in create_ws
    extensions, klass, test_case)
  File "/home/ubuntu/loads/loads/websockets.py", line 82, in _create_ws
    socket.connect()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/ws4py/client/__init__.py", line 126, in connect
    raise HandshakeError("Invalid response")


Going to run another 60min test...
The second 60min test is clean:
https://loads.services.mozilla.com/run/6e8a76bf-3dc8-4dba-a72d-b595c0bf7882

Logs are clean.

I am going to quit while I am ahead...
Status: RESOLVED → VERIFIED
This is also going to Stage: bug 1085498
Code is already verified in Stage, so keeping the bug status as is...
Depends on: 1085498
OK, something was changed because this was restarted:
node loop/index.js
and this was touched:
/data/loop-server/config/settings.json

:-)


And that file has the new API key info for Default, Aurora, Nightly:
    "tokBox": {
        "apiUrl": "BLAH",
        "credentials": {
            "aurora": {
                "apiKey": "BLAH",
                "apiSecret": "BLAH"
            },
            "default": {
                "apiKey": "BLAH",
                "apiSecret": "BLAH",
                "apiUrl": "BLAH"
            },
            "nightly": {
                "apiKey": "BLAH",
                "apiSecret": "BLAH"
            }
        }
    },

And that information matches what was specified in the other two bugs...
Can you confirm that the BLAH values for aurora and nightly are the same?
Flags: needinfo?(jbonacci)
They are indeed the same on Stage and Prod.
Flags: needinfo?(jbonacci)