Closed Bug 1020899 Opened 11 years ago Closed 11 years ago

Deploy the msisdn server to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task, P1)

x86
macOS
task

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: tarek, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

Please puppetize/deploy msisdn on AWS. The repo is at https://github.com/mozilla-services/msisdn-gateway and there is already a deployed AWS dev instance. Rémy is the main contact. Thanks!
I believe stage urls are <projectname>.stage.mozaws.net
ok
Yes, :alexis that is fairly standard now:
Dev:
http://loop.dev.mozaws.net
http://msisdn.dev.mozaws.net
Stage:
https://loop.stage.mozaws.net
http://msisdn.stage.mozaws.net <---- we could do this.
Others:
Content Server: https://accounts.stage.mozaws.net
Auth Server: https://api-accounts.stage.mozaws.net
TokenServer: https://token.stage.mozaws.net
Verifier: https://verifier.stage.mozaws.net
Sync: https://<NODE>.stage.mozaws.net
This makes sense to me for Prod:
prod url: https://msisdn.services.mozilla.com
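
The naming convention described in this comment can be sketched as a small helper. This is only an illustration: `service_url` and its parameters are hypothetical names, and the default scheme is an assumption, since the comment lists both http and https endpoints.

```python
def service_url(project, env, scheme="https"):
    """Build a URL following the <projectname>.<env>.mozaws.net pattern.

    Only "dev" and "stage" follow this pattern in the thread; the prod
    hostname (msisdn.services.mozilla.com) is different and not handled.
    """
    if env not in ("dev", "stage"):
        raise ValueError("only dev and stage follow this pattern: %r" % env)
    return "%s://%s.%s.mozaws.net" % (scheme, project, env)


print(service_url("msisdn", "stage"))
print(service_url("loop", "dev", scheme="http"))
```
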
Whiteboard: [qa+]
whatever you guys think is the best pick
Well loadtests are ready so feel free to deploy asap.
Added :bobm and :mostlygeek - not sure who will be doing the actual deploy.
Priority: -- → P1
deadline is before the end of june
Well, hopefully we can just do it next week ;-)
I have a PR https://github.com/mozilla-services/msisdn-gateway/pull/84 that I would like merged before deploying the staging version. Also can somebody provide me with the relevant config options for the staging environment?
Flags: needinfo?(tarek)
Two PRs on the cloudops side need to be merged before we can deploy to stage:
- https://github.com/mozilla-services/puppet-config/pull/596
- https://github.com/mozilla-services/svcops/pull/166
I commented on the various PRs.
Flags: needinfo?(tarek)
:tarek thanks for your help w/ the config.js PR. I'm not sure about the SMS configuration for stage. Does somebody have SMS gateway information for me?
:mostlygeek for stage we should use the omxen mock to be able to loadtest it. https://github.com/mozilla-services/omxen/
So, do we need omxen to be part of this Stage stack? Much like we have https://loop-delayed-response.stage.mozaws.net for Loop-Server Stage?
That is one option, but I think you can directly use the one that is already deployed. Can you put an omxen.dev.mozaws.net domain name on it, please? The current server is at http://ec2-54-203-73-122.us-west-2.compute.amazonaws.com/
Flags: needinfo?(bwong)
:natim dns set up.
Flags: needinfo?(bwong)
Ok, thank you, I will check tomorrow.
:natim can you get me access to ec2-54-203-73-122.us-west-2.compute.amazonaws.com ? Are you using a specific .pem file for the credentials? Also, how was it deployed? OPs did it? or awsbox/awsboxen? or you created your own AWS instance? Finally, I can ping ec2-54-203-73-122.us-west-2.compute.amazonaws.com but I can not ping omxen.dev.mozaws.net I was expecting it to redirect but instead I get a "server not found"
Status: NEW → ASSIGNED
> Also, how was it deployed? OPs did it? or awsbox/awsboxen? or you created your own AWS instance?
It's a custom instance.
> I can not ping omxen.dev.mozaws.net
The host is still unknown. It seems it has not propagated yet. Once it has propagated we'll probably need an nginx config change.
Yes same here: > Host omxen.dev.mozaws.net not found: 3(NXDOMAIN)
Assignee: nobody → bwong
omxen.dev.mozaws.net should be working now.
[bwong@ip-10-101-151-161 ~]$ dig omxen.dev.mozaws.net
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1 <<>> omxen.dev.mozaws.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17941
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;omxen.dev.mozaws.net. IN A
;; ANSWER SECTION:
omxen.dev.mozaws.net. 45 IN CNAME ec2-54-203-73-122.us-west-2.compute.amazonaws.com.
ec2-54-203-73-122.us-west-2.compute.amazonaws.com. 172785 IN A 54.203.73.122
;; Query time: 0 msec
;; SERVER: 172.16.0.23#53(172.16.0.23)
;; WHEN: Wed Jun 18 19:07:56 2014
;; MSG SIZE rcvd: 117
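
The NXDOMAIN-then-resolved sequence in the last few comments can also be checked from a script. A minimal sketch using the standard library follows; `resolve` is a hypothetical helper, and the `lookup` parameter exists only so the function can be exercised without network access.

```python
import socket


def resolve(hostname, lookup=socket.gethostbyname_ex):
    """Return (canonical_name, addresses), or None if the name does not resolve."""
    try:
        canonical, _aliases, addresses = lookup(hostname)
    except (socket.gaierror, socket.herror):
        # Roughly what `host` reports as NXDOMAIN earlier in the thread.
        return None
    return canonical, addresses
```

Calling `resolve("omxen.dev.mozaws.net")` would return `None` while the record is still missing and a `(name, [addresses])` tuple once the CNAME is in place.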
Yes tested here, it works thanks.
ping omxen.dev.mozaws.net does now resolve to ec2-54-203-73-122.us-west-2.compute.amazonaws.com
In the browser, http://omxen.dev.mozaws.net/ returns "OMXEN SMS GATEWAY"
curl http://omxen.dev.mozaws.net/
OMXEN SMS GATEWAY
curl -I http://omxen.dev.mozaws.net/
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Date: Wed, 18 Jun 2014 21:32:43 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 0
Connection: keep-alive
In the browser, https://omxen.dev.mozaws.net/ returns (eventually) "The connection to omxen.dev.mozaws.net was interrupted while the page was loading."
curl -I https://omxen.dev.mozaws.net/
curl: (35) Server aborted the SSL handshake
We should probably address this...
For the ping it works here:
> $ ping omxen.dev.mozaws.net
> PING ec2-54-203-73-122.us-west-2.compute.amazonaws.com (54.203.73.122) 56(84) bytes of data.
There is no HTTPS for omxen yet. Do you want me to set it up on the server?
Flags: needinfo?(bwong)
:natim well, we should probably address it one way or another. We are all used to hitting https servers now, so it is possible that people might type in or use https://omxen.dev.mozaws.net/ by accident...
:natim ping works now. You have to allow ICMP traffic in your security group.
Flags: needinfo?(bwong)
Also for omxen:
- it should be in us-east-1 (where all of our stage stacks are), unless you want to test east => west AWS latencies
- if you want SSL, use an ELB + the (*.stage.mozaws.net) wildcard. This is only for SSL termination without moving certs/keys around.
- I can map omxen.stage.mozaws.net DNS to the ELB
Fernando, you can point at https://msisdn-dev.stage.mozaws.net/ for now in dev. Future endpoints will be https://msisdn.services.mozilla.com for production and https://msisdn.stage.mozaws.net (not deployed yet).
Blocks: 1030140
OK it's in stage now.
- app version 0.3.0-0snap201406261346git1c01a2, github commit: 1c01a2
- 1 x m3.medium (to verify it works/configured correctly)
- puppet-config version: 20140627181006
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
How is it configured? Using omxen or nexmo?
Flags: needinfo?(bwong)
I have configured the Nexmo number +12182967993 for stage. It should work from the US for you to test.
I'd like to run a loadtest to make sure everything works. Can you tell me which omxen URL is currently in use?
Flags: needinfo?(bwong)
:natim It's configured with the nexmo credentials you emailed me. You can ssh into the server via the bastion host. In us-east-1 search for msisdn-gateway-stage in the EC2 instance list. The app is configured at /data/msisdn_gateway
So far, so good on the verification of deployment. There is a bug in the log file names/locations. There seem to be some duplicates:
/var/log/hekad/msisdn_gateway.stderr.log
/var/log/hekad/msisdn_gateway.stdout.log
/var/log/msisdn-gateway_err.log
/var/log/msisdn-gateway_out.log
According to :mostlygeek, these two:
/var/log/msisdn-gateway_err.log
/var/log/msisdn-gateway_out.log
should actually be here:
/media/ephemeral0/msisdn_gateway/
:natim: loadtests/loadtest.py: omxen_url = "http://ec2-54-203-73-122.us-west-2.compute.amazonaws.com"
I made a PR to fix the logging locations for circus: https://github.com/mozilla-services/puppet-config/pull/639
:natim The server starts with a lot of these warnings:
Bad locale=[en_US] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/en_US/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/en_US/messages.json')
Bad locale=[en_ZA] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/en_ZA/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/en_ZA/messages.json')
Bad locale=[eo] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/eo/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/eo/messages.json')
Bad locale=[es] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/es/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/es/messages.json')
Bad locale=[es_AR] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/es_AR/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/es_AR/messages.json')
Bad locale=[es_CL] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/es_CL/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/es_CL/messages.json')
Bad locale=[es_MX] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/es_MX/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/es_MX/messages.json')
Bad locale=[et] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/et/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/et/messages.json')
Bad locale=[eu] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/eu/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/eu/messages.json')
Bad locale=[fa] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/fa/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/fa/messages.json')
Bad locale=[ff] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/ff/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/ff/messages.json')
Bad locale=[fi] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/fi/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/fi/messages.json')
Bad locale=[fr] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/fr/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/fr/messages.json')
Bad locale=[fy] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/fy/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/fy/messages.json')
Bad locale=[fy_NL] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/fy_NL/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/fy_NL/messages.json')
Bad locale=[ga] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/ga/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/ga/messages.json')
Bad locale=[ga_IE] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/ga_IE/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/ga_IE/messages.json')
Bad locale=[gd] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/gd/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/gd/messages.json')
Bad locale=[gl] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/gl/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/gl/messages.json')
Bad locale=[gu] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/gu/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/gu/messages.json')
Bad locale=[gu_IN] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/gu_IN/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/gu_IN/messages.json')
Bad locale=[he] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/he/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/he/messages.json')
Bad locale=[hi_IN] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/hi_IN/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/hi_IN/messages.json')
Bad locale=[hr] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/hr/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/hr/messages.json')
Bad locale=[ht] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/ht/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/ht/messages.json')
Bad locale=[hu] missing .key-value-json files in [/data/msisdn-gateway/app/i18n/hu/messages.json]. See locale/README (Error: Cannot find module '/data/msisdn-gateway/app/i18n/hu/messages.json')
How do I get rid of them? I couldn't figure out how to generate the language files and bake them into the RPM for deployment.
Flags: needinfo?(rhubscher)
This is a configuration variable in the msisdn-gateway server: https://github.com/mozilla-services/msisdn-gateway/blob/master/msisdn-gateway/config.js#L258 For now we only have EN and FR translations, which live here: https://github.com/mozilla-services/msisdn-gateway-l10n The system is the same as for the fxa-content-server-l10n repository.
Flags: needinfo?(rhubscher)
:natim could you give me some instructions on how to fix it?
- what do I run to generate /data/msisdn-gateway/app/i18n/en_US/messages.json?
- how do I pull the sources in from mozilla-services/msisdn-gateway-l10n and convert those into the .json files?
I can update the config so only "en_US" and "fr" are defined.
Flags: needinfo?(rhubscher)
Here are the steps to compile the messages:
- copy the locale directory from https://github.com/mozilla-services/msisdn-gateway-l10n/
- then run *make compile-message* or *./node_modules/.bin/compile-json locale app/i18n*
It is OK to configure with en_US and fr for now.
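
Before restarting the server, it may help to confirm that every configured locale actually has a compiled file, since each startup warning in this bug corresponds to a missing `<locale>/messages.json`. The sketch below is only an illustration: `missing_locales` is a hypothetical helper, and the directory layout is inferred from the paths in the warnings.

```python
import os


def missing_locales(i18n_dir, locales):
    """Return the locales that have no compiled messages.json under i18n_dir."""
    return [
        locale
        for locale in locales
        if not os.path.isfile(os.path.join(i18n_dir, locale, "messages.json"))
    ]
```

For example, `missing_locales("/data/msisdn-gateway/app/i18n", ["en_US", "fr"])` should return an empty list once the two agreed locales have been compiled.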
Flags: needinfo?(rhubscher)
Depends on: 1032270
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> [rhubscher@ip-10-80-127-202 ~]$ ssh ec2-54-197-86-149.compute-1.amazonaws.com
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Flags: needinfo?(bwong)
OK. I verified access to the new instance: ec2-54-197-86-149. But it is unclear to me whether the necessary fixes are in; see https://bugzilla.mozilla.org/show_bug.cgi?id=1020899#c41 and https://bugzilla.mozilla.org/show_bug.cgi?id=1020899#c42 :mostlygeek what's our status? :natim not sure why you don't have access. Did you hop on the Stage Bastion host first?
Yes I don't know either.
- Waiting on 1032270 to be finished to get the RPM built w/ l10n files baked in.
- natim: if that server is part of the stage cluster I built you need to ssh through our bastion host first: ssh bastion.shared.us-east-1.dev.mozaws.net -p 2222
Flags: needinfo?(bwong)
OK stage has been updated:
- Accounts created by default for natim (rhubscher), alexis and tarek
- Languages/l10n/i18n files are now baked into the RPM
- configs updated for i18n appropriately
- stage now uses http://omxen.dev.mozaws.net
Once :jbonacci verifies w/ tests I'll push it to prod and point https://msisdn.services.mozilla.com at it
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
:mostlygeek that will be tomorrow at the earliest. I need several hours for this very first test of msisdn-gateway in Stage. Adjust your schedule accordingly ;-) Will start looking at this after the QA meeting.
Took a bit of hunting around on the single instance, but I assume the important information is stored here: /data/msisdn-gateway/config/production.json
OK, besides the above config file, I verified the following:
AWS CF stack, ELB, and a single m3.medium instance: ec2-54-204-72-36
Versions installed:
msisdn-gateway-svcops 0.3.0-0snap201407021023gite74cc9 x86_64 49842124
puppet-config-msisdn 20140702184605-1 x86_64 9669
Checked out the processes and the files.
Checked out all the new logs:
/media/ephemeral0/msisdn-gateway/msisdn-gateway_err.log
/media/ephemeral0/msisdn-gateway/msisdn-gateway_out.log
/media/ephemeral0/nginx/logs/default.access.log (not in use)
/media/ephemeral0/nginx/logs/default.error.log (not in use)
/media/ephemeral0/nginx/logs/msisdn-gateway.access.log
/media/ephemeral0/nginx/logs/msisdn-gateway.error.log
/var/log/circus.log
/var/log/hekad/msisdn_gateway.stderr.log
/var/log/hekad/msisdn_gateway.stdout.log
curl https://msisdn.stage.mozaws.net returns
{"name":"mozilla-msisdn-gateway","description":"The Mozilla MSISDN Gateway","version":"0.3.0-DEV","homepage":"https://github.com/mozilla-services/msisdn-gateway/","endpoint":"http://msisdn.stage.mozaws.net"}
curl -I https://msisdn.stage.mozaws.net returns
HTTP/1.1 200 OK
Content-length: 207
Content-Type: application/json; charset=utf-8
Date: Thu, 03 Jul 2014 00:38:55 GMT
ETag: W/"cf-2959630074"
Timestamp: 1404347935320
Connection: keep-alive
OK. Some progress here. But I need help debugging. Assuming the deploy is correct and the configuration file is correct (see https://bugzilla.mozilla.org/show_bug.cgi?id=1020899#c49), I tried this:
make test SERVER_URL=https://msisdn.stage.mozaws.net
and got the following error:
IndexError: list index out of range
Traceback:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 331, in run
  testMethod()
File "loadtest.py", line 43, in test_all
  message = self.read_message()
File "loadtest.py", line 144, in read_message
  return messages[0]["text"]
A bit more detail: https://jbonacci.pastebin.mozilla.org/5510856
I will hold here until we can figure this out...
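
The IndexError comes from indexing `messages[0]` when the message list is empty. A more defensive reader could poll until something arrives or a timeout expires. The sketch below is not the loadtest's actual `read_message`: `fetch_messages` is a hypothetical callable standing in for the HTTP call to omxen, and the 2-second default mirrors the `MAX_OMXEN_TIMEOUT` constant quoted from loadtests/loadtest.py in this bug.

```python
import time

MAX_OMXEN_TIMEOUT = 2  # seconds, same value as the loadtest constant


def read_message(fetch_messages, timeout=MAX_OMXEN_TIMEOUT, interval=0.1,
                 clock=time.time, sleep=time.sleep):
    """Poll fetch_messages() until it returns a non-empty list or time runs out.

    Returns the text of the first message, or None if nothing arrived,
    instead of raising IndexError on an empty list.
    """
    deadline = clock() + timeout
    while True:
        messages = fetch_messages()
        if messages:
            return messages[0]["text"]
        if clock() >= deadline:
            return None
        sleep(interval)
```

Returning `None` lets the caller report "no SMS received from the mock" explicitly, which is easier to diagnose than a bare IndexError.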
I can confirm that I now have access to the stage virtual machine. Also, the production.json configuration file is missing the `protocol: "https"` setting that Hmac needs in order to work. I can reproduce James's error, so I will investigate today to see why omxen doesn't grab our messages. Thank you very much guys, we are almost there.
Ok James, I have fixed this. Because of a configuration change, we were actually configuring Leonix for French numbers with no credentials. I made a patch for that.
We probably want to add
stdout_stream.time_format = [%Y/%m/%d | %H:%M:%S]
and
stderr_stream.time_format = [%Y/%m/%d | %H:%M:%S]
to our circus.ini files.
Also, you can drop this config, which doesn't exist anymore:
stdout_stream.refresh_time = 0.5
stderr_stream.refresh_time = 0.5
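
To see what the proposed `time_format` produces, the same strftime pattern can be tried directly in Python. This only illustrates the timestamp itself; how circus joins the timestamp and the log line is an assumption here, and `stamp` is a hypothetical helper.

```python
from datetime import datetime

TIME_FORMAT = "[%Y/%m/%d | %H:%M:%S]"  # pattern proposed for circus.ini


def stamp(line, when):
    """Prefix a log line with a timestamp rendered using TIME_FORMAT."""
    return "%s %s" % (when.strftime(TIME_FORMAT), line)


print(stamp("server started", datetime(2014, 7, 3, 7, 42, 28)))
# prints: [2014/07/03 | 07:42:28] server started
```
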
Flags: needinfo?(bwong)
Here is my first loadtest attempt:
OMXEN_URL=http://omxen.dev.mozaws.net ./venv/bin/loads-runner --config=./config/bench.ini --server-url=https://msisdn.stage.mozaws.net loadtest.TestMSISDN.test_all
USING http://omxen.dev.mozaws.net OMXEN endpoint
[==============================================================================================] 100%
Duration: 60.05 seconds
Hits: 5275
Started: 2014-07-03 07:42:28.304685
Approximate Average RPS: 87
Average request time: 0.23s
Opened web sockets: 0
Bytes received via web sockets : 0
Success: 786
Errors: 0
Failures: 0
Slowest URL: http://54.203.73.122:80/receive?to=33340639441 Average Request Time: 1.497538
Stats by URLs (10 slowests):
- http://54.203.73.122:80/receive?to=33340639441
  Average request time: 1.497538
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33926114841
  Average request time: 1.29362
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33862351510
  Average request time: 1.249569
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33050170256
  Average request time: 1.238933
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33168810187
  Average request time: 1.235262
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33674270298
  Average request time: 1.224954
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33376007350
  Average request time: 1.219644
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33709582369
  Average request time: 1.217724
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33531442687
  Average request time: 1.214007
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=33538209298
  Average request time: 1.210075
  Hits success rate: 1.0
Custom metrics:
- mt-flow : 382
- ask-for-certificate : 511
- try-wrong-code : 282
- try-right-code : 513
- momt-flow : 416
Flags: needinfo?(bwong)
Investigating why the loadtests don't work on the loads cluster.
Ok I found it, the date wasn't accurate on loads-master and loads-slave. It is now working: https://loads.services.mozilla.com/run/856a1599-9a20-4b9c-b0fb-dc23139915ca
Ok I fixed the server and ran a 30-minute loadtest with no errors: https://loads.services.mozilla.com/run/0d1a9d4e-f9ab-421d-9ec0-a6bc177fdeeb We need to make sure the configuration is right. I am releasing 0.5.0 so we can set it live in production.
OK new version of stage deployed
- added "protocol" configuration
- tweaked other config options to match :natim's changes on the old stage server
- deployed 0.3.0-1 (official release) of app
Verified the new instance is up
Version:
msisdn-gateway-svcops 0.3.0-1 x86_64 49828972
puppet-config-msisdn 20140703171505-1 x86_64 9919
Everything else looks good, so I am focusing on load testing...
This is working now:
make test SERVER_URL=https://msisdn.stage.mozaws.net
`\o/`
:natim, where can I get the complete list of custom metrics?
mt-flow
ask-for-certificate
try-wrong-code
try-right-code
momt-flow
That is all I have seen so far (and not all of them all the time)...
I am glad to see this works as well:
make bench SERVER_URL=https://msisdn.stage.mozaws.net
./venv/bin/loads-runner --config=./config/bench.ini --server-url=https://msisdn.stage.mozaws.net loadtest.TestMSISDN.test_all
[==============================================================================================] 100%
Duration: 300.04 seconds
Hits: 45578
Started: 2014-07-03 19:41:11.607557
Approximate Average RPS: 151
Average request time: 0.12s
Opened web sockets: 0
Bytes received via web sockets : 0
Success: 6839
Errors: 0
Failures: 0
Slowest URL: https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=BLAH Average Request Time: 8.436277
Stats by URLs (10 slowests):
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=BLAH&text=%2Fsms%2Fmomt%2Fverify+BLAH
  Average request time: 8.436277
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=BLAH
  Average request time: 0.504357
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=BLAH
  Average request time: 0.501866
  Hits success rate: 1.0
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=BLAH&text=%2Fsms%2Fmomt%2Fverify+BLAH
  Average request time: 0.485666
  Hits success rate: 1.0
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=BLAH&text=%2Fsms%2Fmomt%2Fverify+BLAH
  Average request time: 0.439516
  Hits success rate: 1.0
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=BLAH&text=%2Fsms%2Fmomt%2Fverify+BLAH
  Average request time: 0.43656
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=BLAH
  Average request time: 0.413988
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=BLAH
  Average request time: 0.410544
  Hits success rate: 1.0
- https://msisdn.stage.mozaws.net/sms/momt/nexmo_callback?msisdn=BLAH&text=%2Fsms%2Fmomt%2Fverify+BLAH
  Average request time: 0.404122
  Hits success rate: 1.0
- http://54.203.73.122:80/receive?to=BLAH
  Average request time: 0.385496
  Hits success rate: 1.0
Custom metrics:
- mt-flow : 3369
- ask-for-certificate : 4499
- momt-flow : 3483
- try-right-code : 4499
- try-wrong-code : 2345
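
As a sanity check on the two local runs reported so far, the "Approximate Average RPS" figures are consistent with hits divided by duration, rounded down. How loads itself computes the figure is an assumption here; `approx_rps` is a hypothetical helper.

```python
def approx_rps(hits, duration_seconds):
    """Hits per second, truncated to an integer."""
    return int(hits / duration_seconds)


print(approx_rps(5275, 60.05))    # first local run, reported as 87
print(approx_rps(45578, 300.04))  # `make bench` run, reported as 151
```
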
OK, the first 10min run against the Loads Cluster was a success:
Link: https://loads.services.mozilla.com/run/ab42cba8-ef46-4f4e-b510-dde144e15f9a
Tests over 31470
Successes 31470
Failures 0
Errors 0
TCP Hits 209556
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 341
Custom metrics
mt-flow 15434
ask-for-certificate 20472
momt-flow 16115
try-right-code 20485
try-wrong-code 11023
Comparing the metrics breakdown against what is defined in the loadtest: https://github.com/mozilla-services/msisdn-gateway/blob/master/loadtests/loadtest.py#L19-L22
PERCENTAGE_OF_MT_FLOW = 50 # Remaining are MOMT flows
PERCENTAGE_OF_WRONG_CODES = 34 # Remaining are valid ones.
PERCENTAGE_OF_SHORT_CODES = 50 # Remaining are right ones.
MAX_OMXEN_TIMEOUT = 2 # Seconds to poll from omxen.
Not really easy to do a side-by-side, so asking Dev to look at those metrics and see if that is about what is expected for a 10min test. Looks about right to me. Moving on to a 30min, then a 60min, followed by a log review.
And the 30min: https://loads.services.mozilla.com/run/d9a11da3-66ff-4048-a5e1-0ea09a683eac
Results
Tests over 91318
Successes 91318
Failures 0
Errors 0
TCP Hits 607780
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 335
Custom metrics
mt-flow 45458
ask-for-certificate 59581
momt-flow 45945
try-right-code 59593
try-wrong-code 31774
Not sure of the cause, but early into the 60min load test, I am seeing all kinds of errors/failures:
Tests over 83340
Successes 54171
Failures 29165
Errors 4
TCP Hits 390165
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 375
Custom metrics
omxen-message-collision 7
And also:
4 occurrences:
No JSON object could be decoded
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
  testMethod()
File "loadtest.py", line 30, in test_all
  self.register()
File "loadtest.py", line 85, in register
  sessionToken = resp.json()['msisdnSessionToken']
File "/home/ubuntu/loads/local/lib/python2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py", line 741, in json
  return json.loads(self.text, **kwargs)
File "/usr/lib/python2.7/json/__init__.py", line 328, in loads
  return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
  obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
  raise ValueError("No JSON object could be decoded")
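
One way to do the side-by-side for the 10min run is to turn the counters into percentages. This is a rough sketch that assumes mt-flow/momt-flow and try-right-code/try-wrong-code are each mutually exclusive per test, which the thread does not confirm; `share` is a hypothetical helper.

```python
# Custom metrics reported for the 10min run.
metrics = {
    "mt-flow": 15434,
    "momt-flow": 16115,
    "try-right-code": 20485,
    "try-wrong-code": 11023,
}


def share(part, other):
    """Percentage that `part` represents of `part + other`."""
    return 100.0 * part / (part + other)


mt_share = share(metrics["mt-flow"], metrics["momt-flow"])
wrong_share = share(metrics["try-wrong-code"], metrics["try-right-code"])

print("MT flows:    %.1f%% (PERCENTAGE_OF_MT_FLOW = 50)" % mt_share)
print("Wrong codes: %.1f%% (PERCENTAGE_OF_WRONG_CODES = 34)" % wrong_share)
```

Under that assumption the observed split is about 49% MT flows and about 35% wrong codes, close to the configured 50 and 34.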
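
The "No JSON object could be decoded" error hides what the server actually returned. A sketch of a more informative check follows, assuming only that the response object exposes `status_code`, `text` and `json()` as requests responses do; `session_token` is a hypothetical helper and not part of loadtest.py.

```python
def session_token(resp):
    """Extract msisdnSessionToken, reporting status and body if it is not JSON."""
    try:
        body = resp.json()
    except ValueError:
        # Surface the HTTP status and the start of the body for debugging.
        raise AssertionError(
            "non-JSON response (HTTP %s): %r" % (resp.status_code, resp.text[:200])
        )
    return body["msisdnSessionToken"]
```

With this in place, a failure in the loads report would include the status code and the first part of the body rather than only the decoder traceback.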
And this, of course: addFailure 29165 Killing the test to investigate...
Trolling the Logs:
Nothing useful in /media/ephemeral0/msisdn-gateway
msisdn-gateway_err.log is 0
msisdn-gateway_out.log has all the mocked call data
/media/ephemeral0/nginx/logs
default.access.log and default.error.log are 0
msisdn-gateway.error.log is 0
msisdn-gateway.access.log has the usual 200s and heartbeat messages plus a high number of 400s for the following requests: /sms/verify_code
Example:
1404416475.120 "24.7.94.153" "POST /sms/verify_code HTTP/1.1" 400 46 "-" "python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 0.006 0.006 "-"
about 73704 of these, if I am counting correctly
4652 of these are from this IP: 24.7.94.153
36599 of these are from this IP: 54.218.210.15
32453 of these are from this IP: 54.245.44.231
and a few 401s
/var/log/circus.log is normal
/var/log/hekad/msisdn_gateway.stdout.log is 0
/var/log/hekad/msisdn_gateway.stderr.log has the usual heka messaging
That's all I can find. Not sure why 30min load tests were more or less clean and 24min into a 60, I see all the errors and addFailures. Dev to debug...
Looks like we are missing logs in the msisdn app when we get "collisions" at least. What do you think Remy?
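
The per-IP counts of 400s can be reproduced with a short script. This is a sketch only: the field order (timestamp, client IP, request line, status) is inferred from the single example line quoted in this comment, and `count_status_by_ip` is a hypothetical helper.

```python
import shlex
from collections import Counter


def count_status_by_ip(lines, path="/sms/verify_code", status="400"):
    """Count requests per client IP for a given path and HTTP status."""
    counts = Counter()
    for line in lines:
        try:
            fields = shlex.split(line)  # honours the double-quoted fields
        except ValueError:
            continue  # skip lines with unbalanced quotes
        if len(fields) < 4:
            continue
        client_ip, request, code = fields[1], fields[2], fields[3]
        request_parts = request.split()
        if code == status and len(request_parts) >= 2 and request_parts[1] == path:
            counts[client_ip] += 1
    return counts


SAMPLE = [
    '1404416475.120 "24.7.94.153" "POST /sms/verify_code HTTP/1.1" 400 46 "-" '
    '"python-requests/2.3.0 CPython/2.7.5 Darwin/13.3.0" 0.006 0.006 "-"',
]
print(count_status_by_ip(SAMPLE))
```

Passing an open file object for msisdn-gateway.access.log instead of `SAMPLE` would give the full breakdown.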
Flags: needinfo?(rhubscher)
Hello,
About the metrics, all of them are defined here: https://github.com/mozilla-services/msisdn-gateway/blob/master/loadtests/loadtest.py
mt-flow
momt-flow
try-wrong-code
try-right-code
ask-for-certificate
omxen-message-collision
The addFailure is added automatically by loads. Also, we don't have all the debug information because the loads dashboard doesn't explain failures.
The random configuration helps to try all kinds of scenarios when running test_all. It is normal to get 400s in the wrong-code scenario.
Collisions are not handled by the msisdn app but by the loadtest (in case we tried a right code but got a 400 invalid code).
To me this looks like you've got the wrong omxen configured in your loadtest file.
Flags: needinfo?(rhubscher)
jbonacci, could you try with the latest version of loadtests.py?
note to everyone: loads.services.mozilla.com is password protected - let me know if you want access
:natim Hmmmmm, I followed your link in Comment 73. It's interesting that you got the load test to run and I did not. Hopefully, this means the latest fix is in and working as you expected. So, yea, I will try the latest loadtests.py. Looking at the repo, I am assuming you mean the latest based on this commit: https://github.com/mozilla-services/msisdn-gateway/commit/d1d2f4fa3c17273db287d8c76246fe7231550f31 I will give it a try on Monday!
Yes that's it :)
It seems that the circus configuration has not been updated.
- Removing refresh_time and adding time_format
- stdout_stream.refresh_time = 0.5
+ stdout_stream.time_format = [%Y/%m/%d | %H:%M:%S]
- stderr_stream.refresh_time = 0.5
+ stderr_stream.time_format = [%Y/%m/%d | %H:%M:%S]
:natim - is the above (Comment 77) in a new commit? Or, in other words, is the load test ready to be used?
Comment 77 is not directly related to the loadtest; it concerns the circus configuration. It may still help if we need to investigate loadtest errors.
(In reply to Rémy Hubscher (:natim) from comment #77)
> It seems that the circus configuration has not been updated.
>
> - Removing refresh_time and adding time_format
>
> - stdout_stream.refresh_time = 0.5
> + stdout_stream.time_format = [%Y/%m/%d | %H:%M:%S]
> - stderr_stream.refresh_time = 0.5
> + stderr_stream.time_format = [%Y/%m/%d | %H:%M:%S]
Won't do (for now). We have a standard circus module that we deploy with. If these are not affecting operation we'll change them later, as multiple projects depend on the core configuration. For example, we'd need to change the heka configurations as well to match.
That's fine.
:natim - I think this bug is getting too long and deviating from its original intent. ;-) I will run load tests today and open new bugs as needed related directly to
1. the load test
2. the msisdn-gateway code
:mostlygeek - I think we hammered enough on Stage to show that the deploy was good. Let's move on to Production.
Status: RESOLVED → VERIFIED
Blocks: 1035268
BLEH! bug 1035459 I should have checked before marking this one Verified.