Closed
Bug 1038772
Opened 11 years ago
Closed 11 years ago
Deploy Release 0.4.3 to msisdn-gateway Stage
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: jbonacci, Assigned: dwilson)
References
Details
(Whiteboard: [qa+])
New release 0.4.1 so new bug!
Comment 2 • 11 years ago
Configuration changes:
- move the nexmoCredential config under the new smsGateways config
- remove the moVerifier configuration
- leave everything else as default for now
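For illustration only, the shape of that settings change might look something like this (key names and values are a sketch based on the bullets above, not the actual msisdn-gateway config):

Before:
  "nexmoCredential": { "apiKey": "...", "apiSecret": "..." },
  "moVerifier": "..."

After:
  "smsGateways": {
    "nexmo": { "apiKey": "...", "apiSecret": "..." }
  }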
Comment 3 (Reporter) • 11 years ago
Once this is in Stage, I will test it - most likely Friday, 7/18.
Thursday has been blocked out for Loop-Server/live-server testing.
Status: NEW → ASSIGNED
Comment 4 (Reporter) • 11 years ago
Here is our release (via :natim):
https://github.com/mozilla-services/msisdn-gateway/releases/tag/0.4.1
Comment 5 (Reporter) • 11 years ago
Also, we will need the config flipped to talk to the live server. Thanks!
Comment 7 (Reporter) • 11 years ago
:natim, :alexis, or :tarek: please let QA and Ops know which release we are going with: 0.3.2 or 0.4.1.
Comment 8 • 11 years ago
It is going to be 0.4.1
Comment 9 • 11 years ago
No. We were going with 0.3.2 to production to fix the bug.
Then we will stage/QA the 0.4.x branch. When it is ready, we will promote it to prod and maintain 0.4.x until 0.5.x replaces it.
Updated • 11 years ago
Assignee: nobody → dwilson
Comment 10 • 11 years ago
I would rather deploy the latest master snapshot.
Comment 11 (Reporter) • 11 years ago
OK. I will figure this out on Monday.
Comment 12 (Reporter) • 11 years ago
OK. See the dependent bug 1037604
Updated (Reporter) • 11 years ago
Summary: Deploy Release 0.4.1 to msisdn-gateway Stage → Deploy Release 0.4.2 to msisdn-gateway Stage
Comment 13 • 11 years ago
0.4.2-0snap201408110138git80ea3b deployed to stage w/ updated puppet-config settings. When this passes QA, we'll turn git hash 80ea3b into 0.4.2 and release that to prod.
BTW: stage deploy is a single m3.medium. Don't hit it too hard. :)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 14 (Reporter) • 11 years ago
Interesting: a query of installed apps shows this:
msisdn-gateway-svcops 0.4.2-0snap201408110138git80ea3b x86_64 52573829
puppet-config-msisdn 20140811185304-1 x86_64 12723
Yet the browser check shows this:
https://msisdn.stage.mozaws.net
{"name":"mozilla-msisdn-gateway","description":"The Mozilla MSISDN Gateway","version":"0.5.0-DEV","homepage":"https://github.com/mozilla-services/msisdn-gateway/","endpoint":"https://msisdn.stage.mozaws.net"}
What's up?
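For reference, that version JSON is just the response to a plain GET on the service root, so it can also be checked from the command line:

curl -s https://msisdn.stage.mozaws.net/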
Comment 15 • 11 years ago
I fixed the loadtests on master. Make sure to pull the latest master in order to load-test it.
Comment 16 (Reporter) • 11 years ago
Apparently we are sticking with 0.4.2, but the version string info is incorrect.
Noting that here.
Going to reinstall the master branch of msisdn-gateway to pick up the latest fixes for the load test
(which I thought I already had; let's see...).
:mostlygeek: if we go to Prod with this, what will it look like?
0.4.2 or 0.5.0-DEV?
FYI, that fix is here:
https://github.com/mozilla-services/msisdn-gateway/commit/1372eb02acedaa0fd1a5ec7ac525bfbf3e2838d4
Comment 17 • 11 years ago
It will be 0.4.2; the 0.5.0-DEV string shows up because it is a master snapshot deploy.
Comment 18 (Reporter) • 11 years ago
Got it. Thanks.
Comment 19 (Reporter) • 11 years ago
:natim's load test looked good.
Running my own now: 30 min, then 60 min.
Then maybe drive up the users/agents.
Comment 20 (Reporter) • 11 years ago
My first 1000hit/30min test:
https://loads.services.mozilla.com/run/69f24cd3-8386-465c-97a6-c5ea3bcac909
Results look good so far, no errors.
Since this is one m3.medium, I can't hit it too hard, but will increase a bit and continue...
Comment 21 (Reporter) • 11 years ago
FYI: Increased the number of hits and number of users.
Comment 22 (Reporter) • 11 years ago
The load test started out OK, but began hitting errors/failures after about 90 minutes.
Debugging...
Failures: 1703
Errors: 2
100 occurrences of:
Start MT Flow failed: (504)
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "loadtest.py", line 43, in test_all
resp.status_code))
File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
assertion_func(first, second, msg=msg)
File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
raise self.failureException(msg)
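For context, the "Start MT Flow failed: (504)" message comes from an assertEqual on the response status code (loadtest.py line 43 in the traceback above). A minimal sketch of that kind of check, not the actual loadtest.py, looks like:

import unittest

import requests


class TestMSISDN(unittest.TestCase):
    server_url = "https://msisdn.stage.mozaws.net"  # stage endpoint from this bug

    def test_all(self):
        # Start the MT flow; a 504 here means nothing behind the proxy layer
        # answered before its ~60-second timeout.
        resp = requests.post(self.server_url + "/sms/mt/verify",
                             data={"msisdn": "33610753979"})
        self.assertEqual(200, resp.status_code,
                         "Start MT Flow failed: (%s)" % resp.status_code)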
Comment 23 (Reporter) • 11 years ago
Finally had to stop this test after just under 2 hours as errors/failures started to pile up.
Here is what it looks like:
Link: https://loads.services.mozilla.com/run/face7e6b-2b63-4f6f-a181-fbec1860c6a0
Pastebin: https://jbonacci.pastebin.mozilla.org/5935032
It's possible, I guess, that this was just too much load for one m3.medium.
Debugging in the logs now...
Comment 24 (Reporter) • 11 years ago
OK, well this is not good. A single 'make test' results in the following:
1 occurrence of:
AssertionError: Start MOMT Flow failed: (504) Traceback:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 331, in run
testMethod()
File "loadtest.py", line 49, in test_all
resp.status_code))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 515, in assertEqual
assertion_func(first, second, msg=msg)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 508, in _baseAssertEqual
raise self.failureException(msg)
Slowest URL: https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=33610753979&text=%2Fsms%2Fmomt%2Fverify+a3b7a0bf23df5fb0a3e8f81db59f8e6e96bb6c6e5b21c279e84fe6ebc723bad9 Average Request Time: 59.267674
Stats by URLs:
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=33610753979&text=%2Fsms%2Fmomt%2Fverify+a3b7a0bf23df5fb0a3e8f81db59f8e6e96bb6c6e5b21c279e84fe6ebc723bad9 Average request time: 59.267674 Hits success rate: 0.0
- https://msisdn.stage.mozaws.net/discover Average request time: 0.859158 Hits success rate: 1.0
- https://msisdn.stage.mozaws.net/register Average request time: 0.095902 Hits success rate: 1.0
Custom metrics:
- register : 1
- momt-flow : 1
make: *** [test] Error 1
Comment 25 (Reporter) • 11 years ago
The instance seems healthy.
The procs are all running.
So, in the logs:
/media/ephemeral0/nginx/logs/msisdn-gateway.error.log
Shows some "Connection refused) while connecting to upstream" errors
and some of these errors:
[error] 2198#0: *362 upstream prematurely closed connection while reading response header from upstream
/media/ephemeral0/nginx/logs/msisdn-gateway.access.log
Shows ELB-HealthChecker messages and 200s.
Has a significant number of 204s, 400s, 499s
Some 404s, some 304s
200s: 2237920
400s: 209051
499s: 9920
404s: 6769
304s: 19081
/media/ephemeral0/msisdn-gateway/msisdn-gateway_err.log
Has some tracebacks, but those are probably older than this load test:
the load test started 2014-08-11 22:39:05 UTC and ended 2014-08-12 00:33:07 UTC,
while the last update to this file was 2014-08-11 19:51, i.e., before the load test started.
/media/ephemeral0/msisdn-gateway/msisdn-gateway_out.log
Has the usual msisdn "text" entries in it.
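As a side note, status-code counts like the ones above can be tallied with a few lines of Python; this sketch assumes the access-log format quoted in the next comment (timestamp, quoted client IP, quoted request line, then the status code):

from collections import Counter

counts = Counter()
with open("/media/ephemeral0/nginx/logs/msisdn-gateway.access.log") as log:
    for line in log:
        parts = line.split('"')
        try:
            # parts[4] is the ' <status> <bytes> ' chunk right after the request line
            counts[parts[4].split()[0]] += 1
        except IndexError:
            continue

for status, hits in counts.most_common():
    print(status, hits)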
Comment 26 (Reporter) • 11 years ago
When I run 'make test' while watching the nginx access log, I can see my requests:
1407807864.975 "24.7.94.153" "POST /discover HTTP/1.1" 200 123 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 0.003 0.003 "-"
1407807865.078 "24.7.94.153" "POST /register HTTP/1.1" 200 89 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 0.004 0.004 "-"
Just those two, then a delay, and then this:
1407807924.902 "24.7.94.153" "POST /sms/mt/verify HTTP/1.1" 499 0 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 59.728 - "-"
Then, back on my localhost running 'make test', I get this:
AssertionError: Start MT Flow failed: (504)
I see nothing added to the app logs.
Comment 27 (Reporter) • 11 years ago
For some reason, I can no longer SSH onto omxen.dev.mozaws.net and look around.
So I can't verify that it is running correctly.
I am blocked.
Comment 28 • 11 years ago
Something happened to omxen.dev.mozaws.net. I fixed it.
Comment 29 (Reporter) • 11 years ago
:mostlygeek can you be a bit more specific? ;-)
1. What was most likely wrong?
2. What did you do to fix it?
3. Is this an anomaly or something to watch out for that may be caused by excessive load traffic?
OK, so I will restart my testing tomorrow morning with a longer, but lighter load test.
Thanks, man!
Comment 30 • 11 years ago
When I couldn't SSH in, I tried to trigger a reboot of the box via the AWS web console. That didn't work, so I stopped/started the instance. That seems to have brought the box back.
I run omxen in tmux using: while true; do bin/omzen; done
nginx is reverse proxying that to port 80.
So I'm not entirely sure what happened to the box. It could be that the load test killed it: it ate up too much memory and sort of crashed the whole box. Though that's unlikely, it's not impossible. If it happens again after a load test, we'll have to come up with something a little more robust. :)
Comment 31 • 11 years ago
James, if you want to handle more load on MSISDN, we need to load-balance both omxen and MSISDN using circus. Hopefully we won't need to handle that many users for MSISDN.
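A minimal circus.ini sketch of what that could look like (watcher names, commands, paths, and process counts are illustrative, not the actual deployment config):

[watcher:msisdn-gateway]
cmd = node index.js
working_dir = /path/to/msisdn-gateway
numprocesses = 4

[watcher:omxen]
cmd = bin/omxen
working_dir = /path/to/omxen
numprocesses = 2

With more than one process per service, something in front (nginx, or circus sockets) would still need to spread requests across them.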
Comment 32 • 11 years ago
I restarted a long loadtest without problems: https://loads.services.mozilla.com/run/c8ab01f9-8dfc-4cd5-baf0-7e00dfcee649
Comment 33 (Reporter) • 11 years ago
Yep, looks clean. Also, notice that with the same options set, yours completed in less than 2 hours; mine spun out of control at 1.5 hours and I killed it just before the 2-hour mark.
I will start up another 1.5-hour test with fewer users before signing off on this.
Comment 34 (Reporter) • 11 years ago
After all that testing, I had to drop back into the archived access log to find my examples of 404s and 499s:
404:
1407787936.979 "63.245.219.53" "GET /favicon.ico HTTP/1.1" 404 12 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0" 0.000 0.000 "-"
499:
1407806111.511 "24.7.94.153" "GET /sms/momt/nexmo_callback?msisdn=BLAH HTTP/1.1" 499 0 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 59.269 - "-"
In addition, here is one of the 504s I started seeing when the load test began to fail badly:
1407802622.516 "54.218.210.15" "GET /sms/momt/nexmo_callback?msisdn=BLAH HTTP/1.1" 504 188 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 60.001 60.001 "-"
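The 60.001-second timings on those 504s match a 60-second upstream timeout at the nginx layer (it gives up waiting on the node process and returns 504), while the 499s at roughly 59.3 seconds are nginx recording that its own client, the ELB or the test client, closed the connection just before that. In nginx that upstream limit is typically governed by directives like the following (60s is also the default; this is a guess at the relevant settings, not the actual stage config):

proxy_connect_timeout 60s;
proxy_read_timeout    60s;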
Comment 35 (Reporter) • 11 years ago
OK, my last test looks clean:
https://loads.services.mozilla.com/run/d07d307e-8714-48be-b602-410064939cf0
Let's move on to Production.
Status: RESOLVED → VERIFIED
Updated (Reporter) • 11 years ago
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Updated (Reporter) • 11 years ago
Summary: Deploy Release 0.4.2 to msisdn-gateway Stage → Deploy Release 0.4.3 to msisdn-gateway Stage
Comment 36 (Reporter) • 11 years ago
Quick verification shows 1 m3.medium instance: ec2-54-91-192-127
App versions:
msisdn-gateway-svcops 0.4.3-1 x86_64 54250861
puppet-config-msisdn 20140812210007-1 x86_64 13723
Moving on to a quick load test...
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 37 (Reporter) • 11 years ago
OK. The quick (1000 hit) load test was good:
https://loads.services.mozilla.com/run/fe145f46-131b-4c32-91ad-c529fa73a391
Logs look clean.
Status: RESOLVED → VERIFIED