Closed Bug 1038772 Opened 11 years ago Closed 11 years ago

Deploy Release 0.4.3 to msisdn-gateway Stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Assigned: dwilson)

References

Details

(Whiteboard: [qa+])

New release 0.4.1 so new bug!
To pick up fixes for bug 1037604
Whiteboard: [qa+]
Blocks: 1036736
No longer blocks: 1036736
Configuration changes:
- move the nexmoCredential config under the new smsGateways config
- remove the moVerifier configuration
- leave everything else as default for now
A sketch of the resulting layout is shown below.
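For illustration only, here is a minimal sketch of what that nesting could look like, assuming a JSON config file; the credential field names (apiKey/apiSecret) and placeholder values are assumptions, not the actual stage settings:

{
  "smsGateways": {
    "nexmoCredential": {
      "apiKey": "<redacted>",
      "apiSecret": "<redacted>"
    }
  }
}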
Once this is in Stage, I will test it - most likely Friday, 7/18. Thursday has been blocked out for Loop-Server/live-server testing.
Status: NEW → ASSIGNED
Also, will need the config flipped to talk to the live server. Thanks!
Ignore Comment 5, wrong bug...
:natim :alexis or :tarek Please let QA and OPs know what release we are going with: 0.3.2 or 0.4.1.
It is going to be 0.4.1
No. We are going with 0.3.2 to production to fix the bug. Then we will stage/QA the 0.4.x branch. When it is ready, we will promote it to prod and maintain 0.4.x until 0.5.x replaces it.
Blocks: 1037604
Assignee: nobody → dwilson
I would rather deploy the latest master snapshot.
OK. I will figure this out on Monday.
OK. See the dependent bug 1037604
No longer blocks: 1037604
Depends on: 1037604
Summary: Deploy Release 0.4.1 to msisdn-gateway Stage → Deploy Release 0.4.2 to msisdn-gateway Stage
0.4.2-0snap201408110138git80ea3b deployed to stage w/ updated puppet-config settings. When this passes QA, we'll tag git hash 80ea3b as 0.4.2 and release that to prod. BTW: the stage deploy is a single m3.medium. Don't hit it too hard. :)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Interesting, so a query of installed apps shows this:

msisdn-gateway-svcops 0.4.2-0snap201408110138git80ea3b x86_64 52573829
puppet-config-msisdn 20140811185304-1 x86_64 12723

Yet the browser check at https://msisdn.stage.mozaws.net shows this:

{"name":"mozilla-msisdn-gateway","description":"The Mozilla MSISDN Gateway","version":"0.5.0-DEV","homepage":"https://github.com/mozilla-services/msisdn-gateway/","endpoint":"https://msisdn.stage.mozaws.net"}

What's up?
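If it helps, the version check above can be scripted with python-requests (the same client the load test uses); this is just a sketch against the endpoint and JSON fields shown in the response above:

# Sanity-check the deployed version string (sketch; assumes the root
# endpoint returns the JSON document shown above).
import requests

resp = requests.get("https://msisdn.stage.mozaws.net/")
resp.raise_for_status()
info = resp.json()
print(info["version"])  # expected 0.4.2, but currently reports 0.5.0-DEV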
I fixed the load tests on master. Make sure to pull the latest master in order to load test it.
Apparently we are sticking with 0.4.2, but the version string info is incorrect. Noting that here.

Going to reinstall the master branch of msisdn-gateway to pick up the latest fixes for the load test (which I thought I already had; let's see...).

:mostlygeek, if we go to prod with this, what will the version look like: 0.4.2 or 0.5.0-DEV?

FYI, that fix is here: https://github.com/mozilla-services/msisdn-gateway/commit/1372eb02acedaa0fd1a5ec7ac525bfbf3e2838d4
It will be 0.4.2; the 0.5.0-DEV string shows up because it is a master snapshot deploy.
got it. thanks.
:natim's load test looked good. Running my own now: 30 min, then 60 min. Then maybe drive up the users/agents.
My first 1000hit/30min test: https://loads.services.mozilla.com/run/69f24cd3-8386-465c-97a6-c5ea3bcac909 Results look good so far, no errors. Since this is one m3.medium, I can't hit it too hard, but will increase a bit and continue...
FYI: Increased the number of hits and number of users.
Blocks: 1052186
The load test started out OK, but began getting errors/failures after about 90 min. Debugging...

Failures: 1703
Errors: 2

100 occurrences: Start MT Flow failed: (504)
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "loadtest.py", line 43, in test_all
    resp.status_code))
  File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
    raise self.failureException(msg)
Finally had to stop this test after just under 2 hours as errors/failures started to pile up. Here is what it looks like:

Link: https://loads.services.mozilla.com/run/face7e6b-2b63-4f6f-a181-fbec1860c6a0
Pastebin: https://jbonacci.pastebin.mozilla.org/5935032

It's possible, I guess, that this was just too much load for one m3.medium. Debugging in the logs now...
OK, well this is not good. A single 'make test' results in the following:

1 occurrences of: AssertionError: Start MOMT Flow failed: (504)
Traceback:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 331, in run
    testMethod()
  File "loadtest.py", line 49, in test_all
    resp.status_code))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 515, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 508, in _baseAssertEqual
    raise self.failureException(msg)

Slowest URL: https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=33610753979&text=%2Fsms%2Fmomt%2Fverify+a3b7a0bf23df5fb0a3e8f81db59f8e6e96bb6c6e5b21c279e84fe6ebc723bad9
Average Request Time: 59.267674

Stats by URLs:
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=33610753979&text=%2Fsms%2Fmomt%2Fverify+a3b7a0bf23df5fb0a3e8f81db59f8e6e96bb6c6e5b21c279e84fe6ebc723bad9
  Average request time: 59.267674
  Hits success rate: 0.0
- https://msisdn.stage.mozaws.net/discover
  Average request time: 0.859158
  Hits success rate: 1.0
- https://msisdn.stage.mozaws.net/register
  Average request time: 0.095902
  Hits success rate: 1.0

Custom metrics:
- register: 1
- momt-flow: 1

make: *** [test] Error 1
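For context, the failing check is the status-code assertion in the load test's test_all. The real loadtest.py runs under the loads framework, but a plain-unittest approximation of the pattern looks roughly like this (class name, request bodies, and the single endpoint used here are illustrative, not the actual test):

import unittest
import requests

SERVER_URL = "https://msisdn.stage.mozaws.net"

class TestMSISDN(unittest.TestCase):
    def test_all(self):
        # Register, then assert on the status code; the real test goes on to
        # drive the MT and MOMT verification flows with this same pattern,
        # which is where "Start MOMT Flow failed: (504)" comes from.
        resp = requests.post(SERVER_URL + "/register")
        self.assertEqual(resp.status_code, 200,
                         "Register failed: (%s)" % resp.status_code)

if __name__ == "__main__":
    unittest.main()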
The instance seems healthy. The procs are all running. So, in the logs:

/media/ephemeral0/nginx/logs/msisdn-gateway.error.log
Shows some "Connection refused) while connecting to upstream" errors and some of these errors:
[error] 2198#0: *362 upstream prematurely closed connection while reading response header from upstream

/media/ephemeral0/nginx/logs/msisdn-gateway.access.log
Shows ELB-HealthChecker messages and 200s. Has a significant number of 204s, 400s, and 499s, plus some 404s and 304s:
200s: 2237920
400s: 209051
499s: 9920
404s: 6769
304s: 19081

/media/ephemeral0/msisdn-gateway/msisdn-gateway_err.log
Has some traceback, but that might be older than this current load test:
Loadtest: Started 2014-08-11 22:39:05 UTC, Ended 2014-08-12 00:33:07 UTC
Last update to this file was 2014-08-11 19:51

/media/ephemeral0/msisdn-gateway/msisdn-gateway_out.log
Has the usual msisdn "text" entries in it.
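For reference, status-code totals like the ones above can be pulled from the nginx access log with a short script; this is a sketch that assumes the log format shown elsewhere in this bug, where the status code immediately follows the quoted request line:

# Tally HTTP status codes from the nginx access log (sketch; path and
# log format taken from the comments in this bug).
import re
from collections import Counter

LOG = "/media/ephemeral0/nginx/logs/msisdn-gateway.access.log"
STATUS = re.compile(r'" (\d{3}) ')  # first 3-digit field after a quoted request line

counts = Counter()
with open(LOG) as fh:
    for line in fh:
        match = STATUS.search(line)
        if match:
            counts[match.group(1)] += 1

for status, total in sorted(counts.items()):
    print("%ss: %d" % (status, total))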
When I run 'make test' while tracking the nginx access log, I can see my requests:

1407807864.975 "24.7.94.153" "POST /discover HTTP/1.1" 200 123 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 0.003 0.003 "-"
1407807865.078 "24.7.94.153" "POST /register HTTP/1.1" 200 89 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 0.004 0.004 "-"

Just those two. Then a delay, then this:

1407807924.902 "24.7.94.153" "POST /sms/mt/verify HTTP/1.1" 499 0 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 59.728 - "-"

Then, again, from my localhost running the 'make test', I get this:
AssertionError: Start MT Flow failed: (504)

I see nothing added to the app logs.
For some reason, I can no longer SSH onto omxen.dev.mozaws.net and look around, so I can't verify that it is running correctly. I am blocked.
Something happened to omxen.dev.mozaws.net. I fixed it.
:mostlygeek can you be a bit more specific? ;-)
1. What was most likely wrong?
2. What did you do to fix it?
3. Is this an anomaly, or something to watch out for that may be caused by excessive load traffic?

OK, so I will restart my testing tomorrow morning with a longer but lighter load test. Thanks, man!
When I couldn't SSH in, I tried to trigger a reboot on the box via the AWS web console. That didn't work, so I stopped/started the instance. That seemed to have brought the box back.

I run omxen in tmux using:
while true; do bin/omxen; done
nginx is reverse proxying that to port 80.

So I'm not entirely sure what happened to the box. It could be that the load test killed it: it ate up too much memory and sort of crashed the whole box. Though that's unlikely, it's not impossible. If it happens again after a load test, we'll have to come up with something a little more robust. :)
James, if you want to handle more load on MSISDN, we need to load-balance both omxen and MSISDN using circus. Hopefully we won't need to handle that many users for MSISDN.
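For reference, supervising both services under circus (instead of the tmux while-loop) could look roughly like the sketch below. The watcher names, commands, working directories, and process counts are assumptions rather than the actual deployment config, and spreading traffic across multiple processes would additionally need circus sockets or an nginx upstream pool:

[watcher:msisdn-gateway]
cmd = node index.js
working_dir = /path/to/msisdn-gateway
numprocesses = 2

[watcher:omxen]
cmd = bin/omxen
working_dir = /path/to/omxen
numprocesses = 2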
Yep. Looks clean. Also, I notice that with the same options set, yours completed in less than 2 hours; mine spun out of control at 1.5 hours and I killed it just before the 2-hour mark. I will start up another 1.5-hour test with fewer users before signing off on this.
After all that testing, I had to drop back into the archived access log to find my examples of 404s and 499s:

404:
1407787936.979 "63.245.219.53" "GET /favicon.ico HTTP/1.1" 404 12 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0" 0.000 0.000 "-"

499:
1407806111.511 "24.7.94.153" "GET /sms/momt/nexmo_callback?msisdn=BLAH HTTP/1.1" 499 0 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 59.269 - "-"

In addition, here is one of the 504s I started seeing when the load test started to fail badly:

1407802622.516 "54.218.210.15" "GET /sms/momt/nexmo_callback?msisdn=BLAH HTTP/1.1" 504 188 "-" "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 60.001 60.001 "-"

(The 499s are nginx noting that the client gave up and closed the connection just before the 60-second mark; the 504s are nginx itself timing out on the upstream at 60 seconds.)
OK, my last test looks clean: https://loads.services.mozilla.com/run/d07d307e-8714-48be-b602-410064939cf0 Let's move on to Production.
Status: RESOLVED → VERIFIED
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Summary: Deploy Release 0.4.2 to msisdn-gateway Stage → Deploy Release 0.4.3 to msisdn-gateway Stage
Quick verification shows 1 m3.medium instance: ec2-54-91-192-127

App versions:
msisdn-gateway-svcops 0.4.3-1 x86_64 54250861
puppet-config-msisdn 20140812210007-1 x86_64 13723

Moving on to a quick load test...
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
OK. The quick (1000 hit) load test was good: https://loads.services.mozilla.com/run/fe145f46-131b-4c32-91ad-c529fa73a391 Logs look clean.
Status: RESOLVED → VERIFIED