Closed Bug 994961 Opened 6 years ago Closed 5 years ago

Start load testing the staging environment

Categories

(Hello (Loop) :: Server, defect, P3)

x86_64
Linux
defect

Tracking

(Not tracked)

VERIFIED FIXED
mozilla32
Blocking Flags:
backlog mlp+

People

(Reporter: alexis+bugs, Assigned: jbonacci)

References

Details

(Whiteboard: [est:?][p=.25][s=mlpnightly3][qa!])

We need to send some load to our staging environment in order to see how many concurrent connections it is able to handle. Creating the bug mainly for James to track it :)

This is awaiting some contractual work with our partners (we don't want to send too much load to their servers without notifying them beforehand), and is blocked on actually having a Stage environment deployed.
Whiteboard: [est:?]
James, can you put an estimate on the time it will take to do this? Thanks!
Flags: needinfo?(jbonacci)
backlog: --- → mlp+
On a repeating schedule, load test time varies...
1. For 1-per-day deployments to Stage, load tests should scale up as follows: 1 hour, 2 hours, 4 hours, 8 hours.
At minimum, it should be two hours.
Once Stage stabilizes, then an overnight load test seems reasonable.

2. For 1-per-week deployments to Stage, load tests follow a similar pattern to the one above, with emphasis on debug work and longer-term load testing.
Day of deployment: 1 hour, 2 hours, 4 hours, 8 hours.
During the current Stage deployment: longer load tests as needed for debugging, functional checks, errors, performance, and so on.

Hope that helps... a rough scripted sketch of that ramp-up follows below.
Flags: needinfo?(jbonacci)
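Here is one way the ramp-up above could be scripted around the loads runner used later in this bug. The --duration override and the ini path are assumptions, so check them against the actual loads CLI:

# Hypothetical ramp-up wrapper for the 1h/2h/4h/8h schedule above.
for hours in 1 2 4 8; do
    ./venv/bin/loads-runner \
        --config=./config/bench.ini \
        --server-url=https://loop.stage.mozaws.net \
        --duration=$((hours * 3600)) \
        loadtest.TestLoop.test_all
done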
To answer mreavy: my understanding is that we want to mock OpenTok calls for the load test.
We will work on it this week, in order to be able to run the load test the week after (if that works for James).
Depends on: 996488
:natim - so far, next week is looking good. After that we have the Fx29 launch...
The code is ready, we are now waiting for mostlygeek to deploy the staging environment so jbonacci can start load testing.
Status: NEW → ASSIGNED
Maire to ask Alexis about 987086, 994360, 994961 -- are they really needed for the first nightly of MLP? 978086 is the real question; the other two are unlikely to block and could come a week later.
Priority: -- → P2
Target Milestone: --- → mozilla32
Priority: P2 → P3
Marking bugs with no estimate as p=.25. If you own one, please change it to p=# using the estimation scheme described here: https://wiki.mozilla.org/Media/WebRTC#Bugzilla_Mark-up_.28under_final_review.29
Whiteboard: [est:?] → [est:?][p=.25]
Whiteboard: [est:?][p=.25] → [est:?][p=.25][s=mlpnightly3]
Depends on: 1006698
Stage environment is up. Once I verify it, we can start some preliminary (read "functional") load tests to work out the bugs, logging, and reporting issues that we find.
Thanks for the heads up; don't hesitate to ping me once it's verified so we can start playing with it.
OK. The Stage environment is ready for a light load test. This is what is currently set up:

- https://loop.stage.mozaws.net, 3x m3.medium instances (similar to a prod deploy)
- Elasticache redis - an m1.small instance
- https://loop-delayed-response.stage.mozaws.net, 1x m1.small w/ a random latency between 100 and 500 ms

I would hit it w/ a light load test first. Then crank it up. 

It'll be interesting to see: 

- https CPU overhead for loop-server calling https://loop-delayed-response.stage.mozaws.net
 > SSL is offloaded to ELBs, so you should be able to hammer this pretty hard
 > the server has 1024 file handles, so what happens with > 1024 concurrent connections? (probe sketch below)
- what it takes for the whole cluster to fall over
 > what happens when we have 1K, 2K, 4K concurrent users

Let me know if you want me to scale anything up / tweak settings.
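On the "> 1024 concurrent connections" question above, a quick way to poke at that limit from a test client is to hold open idle TLS connections just past it. This is a minimal, illustrative probe, not part of the loads suite; the host and count are simply the values discussed above:

# Illustrative probe: hold >1024 idle TLS connections open and watch what
# the server/ELB does. Raise this client's own `ulimit -n` first, or the
# probe will trip over its own file-descriptor limit.
import socket
import ssl
import time

HOST = 'loop.stage.mozaws.net'
N = 1500  # just past the server's 1024 file handles

conns = []
for i in range(N):
    try:
        raw = socket.create_connection((HOST, 443), timeout=10)
        conns.append(ssl.wrap_socket(raw))  # no cert verification; fine for a probe
    except (socket.error, ssl.SSLError) as e:
        print('connection %d failed: %s' % (i, e))
        break

print('holding %d connections open' % len(conns))
time.sleep(60)  # observe server metrics/logs during this window
for conn in conns:
    conn.close()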
:mostlygeek thanks for all this work.
I will try to get to the load test today to check your configs/changes/etc.
I will be out tomorrow, so if not today, then certainly Monday!
Temporarily blocked by bug 1014224
Depends on: 1014274
Master is now updated so that you can load test it.

Mostlygeek, can you update stage so that James can load test it?
Flags: needinfo?(bwong)
I opened a Stage deploy ticket here: bug 1019053
Now I am blocked by bug 1019929
Depends on: 1019929
Flags: needinfo?(bwong)
We are on our way!
Quick testing of Stage was successful from my Mac laptop, and also from 3 Linux AWS instances.

Keeping this bug open until I can get at least 30min load tests running. Then we can evaluate closing this bug or using it for tracking the first official round of load tests against Stage...


Example:
$ make test
./venv/bin/loads-runner --config=./config/test.ini --server-url=https://loop.stage.mozaws.net:443 loadtest.TestLoop.test_all
[============================================================================================] 100%
Duration: 1.59 seconds
Hits: 5
Started: 2014-06-04 20:35:19.486795
Approximate Average RPS: 3
Average request time: 0.27s
Opened web sockets: 0
Bytes received via web sockets : 0

Success: 1
Errors: 0
Failures: 0


Slowest URL: https://loop.stage.mozaws.net:443/registration
Average Request Time: 0.888493

Stats by URLs:
- https://loop.stage.mozaws.net:443/registration                                                                                          
Average request time: 0.888493	Hits success rate: 1.0

- https://loop.stage.mozaws.net:443/calls/GCbSXurMWshF3ayVDwyH051g03fbotPAThB_3mXbPLm4ywHRehzIgkMPWGLz8o8QLpJeQuW9XNW1VVUW2xwdXOWrH9DKn3LKyoX567kod7sVwn9g3oVC98qfhoqwyt-cQq4XhYER_93Z5KLUbd7inbmWLZRiQuf_lzXCzWzhhpBmTNrrOYRApctri_szuCJyKbLyP4OimiuudK2JDGPTFVhUhj8
Average request time: 0.151973	Hits success rate: 1.0

- https://loop.stage.mozaws.net:443/calls?version=200
Average request time: 0.108467	Hits success rate: 1.0

- https://loop.stage.mozaws.net:443/call-url
Average request time: 0.106362	Hits success rate: 1.0

- https://loop.stage.mozaws.net:443/calls/id/3af69d76b65a32fdb63fae3a2dc65ce6
Average request time: 0.093307	Hits success rate: 1.0
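As an aside, the shape of loadtest.TestLoop can be pieced together from the tracebacks quoted later in this bug. Here is a rough reconstruction, not the actual file: the endpoint flow, the hawk-session-token header, and the 'callId'/'call_url' field names come from those tracebacks, while the request payloads, the requests-hawk signing, and the 'apiKey' field are assumptions:

from loads.case import TestCase
from requests_hawk import HawkAuth


class TestLoop(TestCase):

    def test_all(self):
        self.register()
        token = self.generate_token()
        call_id, session, caller_token, api_key = self.initiate_call(token)
        # ...the rest of the call flow is exercised here.

    def register(self):
        # Payload assumed; the header name is from the tracebacks below.
        resp = self.session.post(self.server_url + '/registration',
                                 data={'simple_push_url': 'https://example.com/push'})
        self.hawk_auth = HawkAuth(
            hawk_session=resp.headers['hawk-session-token'],
            server_url=self.server_url)

    def generate_token(self):
        resp = self.session.post(self.server_url + '/call-url',
                                 data={'callerId': 'loadtest'},
                                 auth=self.hawk_auth)
        call_url = resp.json()['call_url']
        return call_url.rstrip('/').rsplit('/', 1)[-1]

    def initiate_call(self, token):
        res = self.session.post(self.server_url + '/calls/%s' % token).json()
        return (res['callId'], res['sessionId'], res['sessionToken'],
                res['apiKey'])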
With some very light testing, I am seeing 404s in the nginx access.log file. Here are some typical examples:
1401925447.398 "-" "PUT /push HTTP/1.1" 404 28 "-" "-" 0.001 0.001
1401925447.423 "54.188.254.148" "GET /calls/id/f095c71c33713d938bb156d2b06740ef HTTP/1.1" 404 60 "-" "python-requests/2.3.0 CPython/2.6.6 Linux/2.6.32-431.11.2.el6.x86_64" 0.003 0.003

Are these expected in Stage?
In the nginx access.log I also see a significant number of 204s:

1401924889.127 "24.7.94.153" "DELETE /calls/id/632e434a9351982c43b81045bd733c0d HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.2.0" 0.003 0.003
1401925191.052 "54.189.190.206" "DELETE /calls/id/b84099e88d92ff0a2ce65d201c91580c HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.7.6 Linux/3.13.0-24-generic" 0.039 0.039
1401925191.063 "54.188.254.148" "DELETE /calls/id/78f44bc968b0c78b9103ff60562a92a0 HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.6.6 Linux/2.6.32-431.11.2.el6.x86_64" 0.018 0.018
1401925191.089 "54.189.190.206" "DELETE /calls/id/f5cf86ac7f25672b59ff4e07803f48d9 HTTP/1.1" 204 0 "-" "python-requests/2.3.0 CPython/2.7.6 Linux/3.13.0-24-generic" 0.022 0.022

(varying Python versions)

I assume these are correct given the type of load we are sending, but just wanted to make sure...

If correct, do we really want to be logging these along with the standard 200s?
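If the answer is no, nginx can log conditionally via the if= parameter on access_log (available since nginx 1.7.0). A sketch, not the deployed Stage config; 'main' stands in for whatever log format Stage actually uses:

# Goes in the http{} block.
map $status $loggable {
    ~^[23]  0;   # drop routine 2xx/3xx entries
    default 1;   # keep 4xx/5xx
}
access_log /var/log/nginx/access.log main if=$loggable;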
The longer, more client-intensive 'make bench' works as well. Here is sample console output from a RHEL Linux AWS instance:
make bench  SERVER_URL=https://loop.stage.mozaws.net
./venv/bin/loads-runner --config=./config/bench.ini --server-url=https://loop.stage.mozaws.net loadtest.TestLoop.test_all
[================================================================================================] 100%
Duration: 300.41 seconds
Hits: 29905
Started: 2014-06-04 23:39:00.242098
Approximate Average RPS: 99
Average request time: 0.20s
Opened web sockets: 0
Bytes received via web sockets : 0

Success: 5625
Errors: 0
Failures: 0


Slowest URL: https://loop.stage.mozaws.net/calls/BLAH1
Average Request Time: 0.638324

Stats by URLs (10 slowests):
- https://loop.stage.mozaws.net/calls/BLAH1
Average request time: 0.638324	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH2
Average request time: 0.631144	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH3
Average request time: 0.577574	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH4
Average request time: 0.574913	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH5
Average request time: 0.57401	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH6
Average request time: 0.537536	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH7
Average request time: 0.533552	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH8
Average request time: 0.522361	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH9
Average request time: 0.516951	Hits success rate: 1.0

- https://loop.stage.mozaws.net/calls/BLAH10
Average request time: 0.512179	Hits success rate: 1.0
Moving on to a 30min standard load test today.
Will continue with more testing tomorrow...
OK. First attempt at a 'make megabench' (that is, a full-on load test) was a failure. The test ended almost immediately:

Testing loadtest.TestLoop.test_all

Test was launched by     jbonacci
Run Id     940659b4-26a7-4bab-b8f4-b86d4e4c69b2
Duration     9 sec
Started     2014-06-05 00:19:04 UTC
Ended     2014-06-05 00:19:13 UTC
State     Ended

Users     [1]
Hits     None
Agents     1
Duration     300
Server URL     https://loop.stage.mozaws.net

Tests over     0 
Successes     0
Failures     0
Errors     1
TCP Hits     0
Opened web sockets     0
Total web sockets     0
Bytes/websockets     0
Requests / second (RPS)     0



and got the following error (as seen on the loads cluster dashboard):
 1 occurrences:

loadtest
  File "/home/ubuntu/loads/loads/runners/local.py", line 167, in execute
    self._execute()
  File "/home/ubuntu/loads/loads/runners/local.py", line 247, in _execute
    self._run_python_tests()
  File "/home/ubuntu/loads/loads/runners/local.py", line 252, in _run_python_tests
    self._resolve_name()
  File "/home/ubuntu/loads/loads/runners/local.py", line 89, in _resolve_name
    self.test = resolve_name(self.fqn)
  File "/home/ubuntu/loads/loads/util.py", line 162, in resolve_name
    raise ImportError(parts[0])


Nothing of significance in the logs on the server...


I will not file a new bug yet, until we can debug this error, but consider this a blocker...
This is likely a missing loads-runner CLI argument: the include_file directive not loading the right file, or something along those lines.
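For the record, the kind of fix involved is on the config side, along these lines. A sketch only: the real file name, section, and values may differ, and only the include_file directive is known from this bug:

# Hypothetical config/megabench.ini
[loads]
include_file = loadtest.py
agents = 5
users = 20
duration = 300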
great job alexis thanks \o/
Yes. Nice catch! Redeploying now - will test shortly...
My first 5min load test was successful.
The second 5min load test had one error:

Errors
1 occurrences:

No JSON object could be decoded
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 50, in initiate_call
    res = self.session.post(self.server_url + '/calls/%s' % token).json()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")



Running a 10min load test now...
10min load test showed more errors. 
I need the JSON errors investigated.
I am not too surprised by the HTTPSConnectionPool error, we have seen this on other AWS stacks during load testing...

I will write up some new bugs after I look at the server logs...

Errors
2 occurrences:

No JSON object could be decoded
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 50, in initiate_call
    res = self.session.post(self.server_url + '/calls/%s' % token).json()
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")

1 occurrences:

HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url: /call-url (Caused by <class 'httplib.BadStatusLine'>: '')
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 18, in test_all
    token = self.generate_token()
  File "loadtest.py", line 43, in generate_token
    auth=self.hawk_auth
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 425, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 79, in request
    method, url, headers=headers, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/loads/loads/measure.py", line 87, in send
    res = _Session.send(self, request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py", line 378, in send
    raise ConnectionError(e)
I think we want to keep all the access logs in the same location. 204 and 200 are normal behavior here.

Interesting to see SSL starting to fail ;)
On the server side, the /var/log/nginx/access.log is a mix of 200s, 204s, and 404s, as I previously reported.
Nothing else of interest.

The /var/log/loop_out.log shows the following (recent):
...etc...
Server listening on http://127.0.0.1:8000
apiKey undefined
Calls to TokBox are now mocked.
Server listening on http://127.0.0.1:8000
{ [Error: connect ETIMEDOUT] code: 'ETIMEDOUT', errno: 'ETIMEDOUT', syscall: 'connect' }
{ [Error: socket hang up] code: 'ECONNRESET' }
{ [Error: socket hang up] code: 'ECONNRESET' }
{ [Error: socket hang up] code: 'ECONNRESET' }

That's all I can find right now.
For the next test, I lowered the number of users from 20 to 10, but kept the agents at 5.
This is still a 10 minute load test.

The error count is up from 3 to 6:
3 occurrences: No JSON object could be decoded
1 occurrences: HTTPSConnectionPool(host='loop.stage.mozaws.net', port=443): Max retries exceeded with url

and something new this round:
 2 occurrences:

string indices must be integers
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 19, in test_all
    call_id, session, caller_token, api_key = self.initiate_call(token)
  File "loadtest.py", line 51, in initiate_call
    return (res['callId'], res['sessionId'], res['sessionToken'],
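Both decode-side failures seen so far (the bare ValueError earlier and this 'string indices' one, which happens when the decoded JSON is a string rather than an object) would be easier to triage if initiate_call surfaced the raw response. A sketch of that tweak, building on the reconstruction earlier in this bug:

def initiate_call(self, token):
    res = self.session.post(self.server_url + '/calls/%s' % token)
    try:
        data = res.json()
    except ValueError:
        # e.g. an empty body or an HTML error page from the ELB
        raise AssertionError('non-JSON response for /calls/%s: HTTP %d, body %r'
                             % (token, res.status_code, res.text[:200]))
    if not isinstance(data, dict):
        # e.g. a bare JSON string (error message), which is what produces
        # "string indices must be integers"
        raise AssertionError('unexpected JSON payload: %r' % (data,))
    return (data['callId'], data['sessionId'], data['sessionToken'],
            data['apiKey'])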
Several load tests in a row were successful, then this happened:
https://loads.services.mozilla.com/run/b6b41a0e-2af6-460b-b6bf-2e08f985f687

If you can't see that, here are the details:
Test was launched by     jbonacci
Run Id     b6b41a0e-2af6-460b-b6bf-2e08f985f687
Duration     1 h 6 min and 53 sec.
Started     2014-06-05 23:22:37 UTC
Ended     2014-06-06 00:29:30 UTC
State     Ended

Users     [20]
Hits     None
Agents     10
Duration     3600
Server URL     https://loop.stage.mozaws.net


Tests over     283179 
Successes     192230
Failures     0
Errors     90936
TCP Hits     1111685
Opened web sockets     0
Total web sockets     0
Bytes/websockets     0
Requests / second (RPS)     276


addFailure     13

The dashboard is showing the following errors:
 18 occurrences:

No JSON object could be decoded
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 18, in test_all
    token = self.generate_token()
  File "loadtest.py", line 45, in generate_token
    call_url = resp.json()['call_url']
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")

82 occurrences:

'hawk-session-token'
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 17, in test_all
    self.register()
  File "loadtest.py", line 36, in register
    resp.headers['hawk-session-token'])
  File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]

The 'hawk-session-token' errors are new for the load test.
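That KeyError means /registration came back without the hawk-session-token header at all, most likely a non-200 under load. A guard in register() would turn these into self-describing failures; a sketch, with the same payload assumption as the reconstruction earlier:

def register(self):
    resp = self.session.post(self.server_url + '/registration',
                             data={'simple_push_url': 'https://example.com/push'})
    if resp.status_code != 200 or 'hawk-session-token' not in resp.headers:
        # Fail with the status and body rather than a bare KeyError.
        raise AssertionError('registration failed: HTTP %d, body %r'
                             % (resp.status_code, resp.text[:200]))
    self.hawk_auth = HawkAuth(
        hawk_session=resp.headers['hawk-session-token'],
        server_url=self.server_url)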
Moving all error info over to this bug: 1021119
This bug is getting too long and we have proved that we can load test Stage.
Closing this out in favor of more specific load test bugs, like bug 1021119
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
QA Contact: jbonacci
Whiteboard: [est:?][p=.25][s=mlpnightly3] → [est:?][p=.25][s=mlpnightly3][qa!]