Closed Bug 774918 Opened 13 years ago Closed 13 years ago

token: Deploy tokenserver 0.10-2 to dev/stage token servers

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Assigned: bobm)

Details

(Whiteboard: [qa+])

Attachments

(1 file)

For some reason we did not have a deploy bug for 0.8.1, 0.8.2, 0.8.3... This bug supercedes https://bugzilla.mozilla.org/show_bug.cgi?id=773461 Please deploy tokenserver0.8.4 to Dev and Stage token* Dev: token{1..3}.reg.mtv1.dev.svc.mozilla.com Stage: token{1..3}.reg.scl2.stage.svc.mozilla.com
A little bit late creating this but now we can put test and debug information in here as we find it, especially for the Stage deploy.
Assignee: nobody → bobm
Status: NEW → ASSIGNED
QA Contact: jbonacci
Whiteboard: [qa+]
Dev currently has this installed: python26-tokenlib-0.1.0-1.noarch python26-tokenserver-0.8.4-1.noarch Stage currently has this installed: python26-tokenlib-0.1.0-1.noarch python26-tokenserver-0.8.4-1.noarch Now we are on to debugging 500s/503s in Stage token and stoken.
Here are my original issues: The current TokenServer load test allows an option for enabling testing with new users created against stoken. Example: ./workers.sh -n 4 (tests token and stoken) vs. ./workers.sh 4 (tests token only) When I run a standard ts load test - ./workers.sh 4, for example - I do not see any errors in the zeus logs. If I switch to running the same load test with the "-n" option, then I see significant spikes of 500s and 503s on the token* and stoken* servers in Stage. A sample token graph in Pencil: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/token?start=07%2F14%2F2012&duration=12+hours&width=900&height=400 A sample stoken graph for the same timeframe: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/stoken?start=07%2F14%2F2012&duration=12+hours&width=900&height=400 (or similar for the 24 hours comprising any of the weekend) And, the zeus logs: On zlb{1,2}.pub.scl2.stage.svc.mozilla.com Check out /var/log/zeus stage-token.services.mozilla.com.access_2012-07-13-{0..23} stage-token.services.mozilla.com.access_2012-07-14-{0..23} stage-token.services.mozilla.com.access_2012-07-15-{0..23} Look for 500s and 503s. On zlb{1,2}.scl2.stage.svc.mozilla.com Check out /var/log/zeus stoken.sreg.scl2.stage.svc.mozilla.com.access_2012-07-13-{0..23} stoken.sreg.scl2.stage.svc.mozilla.com.access_2012-07-14-{0..23} stoken.sreg.scl2.stage.svc.mozilla.com.access_2012-07-15-{0..23} Look for 500s and 503s.
Looks like database pool contention in stoken. This error traceback appears repeatedly in the gunicorn output log on stoken1: TimeoutError: QueuePool limit of size 20 overflow 5 reached, connection timed out, timeout 30 Full traceback: http://rfkelly.pastebin.mozilla.org/1704998 I recall getting some of these in AITC because database connections and result cursors were not cleaned up correctly, resulting in connections not being returned to the pool. Maybe something similar is going on here? Cheers, Ryan
On 7/16/12 5:52 AM, Ryan Kelly wrote: > On 16/07/12 13:05, James Bonacci wrote: >> Team, >> >> The current TokenServer load test allows an option for enabling >> testing with new users created against stoken. >> >> Example: >> ./workers.sh -n 4 (tests token and stoken) >> vs. >> ./workers.sh 4 (tests token only) >> >> >> When I run a standard ts load test - ./workers.sh 4, for example - I >> do not see any errors in the zeus logs. >> >> If I switch to running the same load test with the "-n" option, then >> I see significant spikes of 500s and 503s on the token* and stoken* >> servers in Stage. > > > Looks like database pool contention in stoken. This error traceback > appears repeatedly in the gunicorn output log on stoken1: > TimeoutError: QueuePool limit of size 20 overflow 5 reached, > connection timed out, timeout 30 > > > Full traceback: > > http://rfkelly.pastebin.mozilla.org/1704998 > > I recall getting some of these in AITC because database connections > and result cursors were not cleaned up correctly, resulting in > connections not being returned to the pool. Maybe something similar > is going on here? I am pretty sure there's an issue in the database side. I'd control the DB logs first, and if nothing wrong appears there, raise the pool size -- IIRC it was higher If it was a non-returning connector, you would have 100% 503s rate after a bit of time. > > > Cheers, > > Ryan
And I have attached, ad nauseum, the entire #tokenserver IRC chatter from 7/10/2012 through 7/17/2012 to catch all the work we have done thus far on deploying and testing TokenServer in Stage. And, this saves me trying to do a 7-day summary... ;-)
So the next episode was: * the user_nodes table was 65 rows * it took ages for each query to return * bobm wiped it * jbonnacci did a new test
65 rows => 65 millions rows
I see these on stoken: IntegrityError: (IntegrityError) (1062, u"Duplicate entry 'user758410@mockmyid.com-aitc-1.0' for key 'userlookup_idx'") 'insert into user_nodes\n (service, email, node)\nvalues\n (%s, %s, %s)\n' (u'aitc-1.0', u'user758410@mockmyid.com', 'https://stage-aitc5.services.mozilla.com') that happens on stoken on node allocation I suspect the indexes where not wiped correctly because that query is sent *only* if the previous query says there's no corresponding (service, email) line
10-hour overnight load test from client4 to Token Stage completed with some 500s and 503s reappearing Graphs: Token: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/token?start=7%2F17%2F2012&duration=24+hours&width=1000&height=400 Stoken: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/stoken?start=7%2F17%2F2012&duration=24+hours&width=1000&height=400 Cycle Results for 10 hours: * Cycle result: **FAILURE**, 12394 success, 32 failure, 0 errors. * Cycle result: **FAILURE**, 12371 success, 39 failure, 0 errors. * Cycle result: **FAILURE**, 12352 success, 31 failure, 0 errors. * Cycle result: **FAILURE**, 12364 success, 39 failure, 0 errors. * Cycle result: **FAILURE**, 12377 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12422 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12296 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12347 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12422 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12428 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12407 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12400 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12221 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12309 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12418 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12466 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12453 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12441 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12410 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12422 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12432 success, 4 failure, 0 errors. * Cycle result: **FAILURE**, 12442 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12426 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12409 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12332 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12309 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12445 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12409 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12406 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12359 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12336 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12448 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12439 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12376 success, 9 failure, 0 errors. * Cycle result: **FAILURE**, 12321 success, 12 failure, 0 errors. * Cycle result: **FAILURE**, 12361 success, 21 failure, 0 errors. * Cycle result: **FAILURE**, 12317 success, 13 failure, 0 errors. * Cycle result: **FAILURE**, 12456 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12341 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12392 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12404 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12285 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12290 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12393 success, 2 failure, 0 errors. * Cycle result: **FAILURE**, 12414 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12445 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12404 success, 16 failure, 0 errors. * Cycle result: **FAILURE**, 12407 success, 28 failure, 0 errors. * Cycle result: **FAILURE**, 12224 success, 26 failure, 0 errors. * Cycle result: **FAILURE**, 12366 success, 25 failure, 0 errors. * Cycle result: **FAILURE**, 12432 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12595 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12475 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12412 success, 31 failure, 0 errors. * Cycle result: **FAILURE**, 12397 success, 39 failure, 0 errors. * Cycle result: **FAILURE**, 12516 success, 29 failure, 0 errors. * Cycle result: **FAILURE**, 12438 success, 37 failure, 0 errors. * Cycle result: **FAILURE**, 12527 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12511 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12515 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12511 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12474 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12469 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12473 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12494 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12508 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12484 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12505 success, 1 failure, 0 errors. * Cycle result: **FAILURE**, 12458 success, 64 failure, 0 errors. * Cycle result: **FAILURE**, 12469 success, 59 failure, 0 errors. * Cycle result: **FAILURE**, 12477 success, 66 failure, 0 errors. * Cycle result: **FAILURE**, 12472 success, 71 failure, 0 errors. * Cycle result: **FAILURE**, 12407 success, 1 failure, 0 errors. (nothing else of interest in the load test STDOUT)
So, not sure about the failure count above (need to check logs). But by comparison, the 5xx spikes are quite small compared to previous runs in the last 48 hours: token: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/token?start=48+hours+ago&duration=&width=1000&height=400 stoken: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/stoken?start=48+hours+ago&duration=&width=1000&height=400
From what I can see token is healthy and stoken is doing IntegrityError when allocating users to a node. The two only ways it can happen is a race condition of asking for a node or a broken index. I will make the request more robust wrt race condition, but we need to check the index
OK, stoken code was changed and tagged per :tarek. Deployment ticket for stoken to Dev and Stage is here: https://bugzilla.mozilla.org/show_bug.cgi?id=775279 AITC functional tests against token in Dev and Stage are generating 503s. We are now investigating some errors in the token* application.log files: http://jbonacci.pastebin.mozilla.org/1707857
Functional tests passed for ts/st. Running a series of short load tests across the clientX boxes right now. After that, a big, long load test...
There were a number of issues that came up today with ts load testing. See this ticket for full details: https://bugzilla.mozilla.org/show_bug.cgi?id=775279
Changing this to deploy rpm-0.9-1, at the request of the Services Development team.
Summary: Deploy tokenserver 0.8.4 to dev/stage token servers → Deploy tokenserver 0.10 to dev/stage token servers
Okay, last change. Now deploying rpm-0.10-1. Also requires change to puppetized production.ini file. The [metlog] section of this file should be replaced w/ the following: [metlog] backend = mozsvc.metrics.MetlogPlugin sender_class = metlog.senders.UdpSender sender_host = 127.0.0.1 sender_port = 5566 The rest of the production.ini file should remain as before.
Summary: Deploy tokenserver 0.10 to dev/stage token servers → token: Deploy tokenserver 0.10 to dev/stage token servers
This has been deployed to the tokenserver dev and staging servers. It required the addition of "enable = true" to the metlog section of the production.ini in order to work.
Verified the deployment to Dev and Stage: Dev: token{1..3}.reg.mtv1.dev.svc.mozilla.com RPMs: python26-tokenserver-0.10-1.noarch python26-tokenlib-0.1.0-1.noarch Processes: supervise gunicorn-token /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application (13 of the above process) Stage token{1..3}.reg.scl2.stage.svc.mozilla.com RPMs: python26-tokenlib-0.1.0-1.noarch python26-tokenserver-0.10-1.noarch Processes: supervise gunicorn-token supervise token-couchbase-collector /usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application (13 of the above process) But, we found issues with functional and load testing in Stage...this is being investigated...
OK, so during functional and load tests, we found more or less complete failures with tokenserver giving a 500, zeus transforming it into a 503. According to :rfkelly, it looks like the databases were not updated for new table scheme to support ToS. The ToS stuff needs an additional column. See http://jbonacci.pastebin.mozilla.org/1715152.
In addition to the database migration issue, a bug in the metlog sender code caused a new metlog-py and tokenserver release. We now need to deploy tokenserver rpm-0.10-2. Note that the rpm-0.10-2 tag is not on the git master branch. If you have any problems w/ the build telling you "rpm-0.10-2" doesn't exist, try "git fetch" in the build directory to make sure all of the branches are pulled down.
What we were used to do is to dump the current tables and recreate them (the creation process will be automatic) + repopulate the db. We clearly will need a way to handle database migrations in the future, so I created bug777650
Summary: token: Deploy tokenserver 0.10 to dev/stage token servers → token: Deploy tokenserver 0.10-2 to dev/stage token servers
It seemed that we missed to tag a new release for stoken as well, hence the problem :bobm had to write to the db: the deployed stoken version wasn't using the right wimms (was using the old db schema). I went ahead and did the appropriate changes to stokenserver and then created a rpm-0.4-1 tag there.
Schema modifications were completed using pt-online-schema change: # 2012-07-27T15:40:13 /usr/bin/pt-online-schema-change started # 2012-07-27T15:40:13 USE `aitc_users` # 2012-07-27T15:40:13 Alter table user_nodes using temporary table __tmp_user_nodes # 2012-07-27T15:40:13 Checking if table user_nodes can be altered # 2012-07-27T15:40:13 SHOW TRIGGERS FROM `aitc_users` LIKE 'user_nodes' # 2012-07-27T15:40:13 Table user_nodes can be altered # 2012-07-27T15:40:13 Chunk column uid, index PRIMARY # 2012-07-27T15:40:13 Chunked table user_nodes into 3754 chunks # 2012-07-27T15:40:13 Starting online schema change # 2012-07-27T15:40:13 CREATE TABLE `aitc_users`.`__tmp_user_nodes` LIKE `aitc_users`.`user_nodes` # 2012-07-27T15:40:13 ALTER TABLE `aitc_users`.`__tmp_user_nodes` ADD accepted_conditions tinyint(1) DEFAULT NULL AFTER service,DROP KEY nodelookup_idx,ADD KEY `nodelookup_idx` (`node`,`service`,`accepted_conditions`) # 2012-07-27T15:40:14 Shared columns: email, node, service, uid # 2012-07-27T15:40:14 Calling OSCCaptureSync::capture() # 2012-07-27T15:40:14 CREATE TRIGGER mk_osc_del AFTER DELETE ON `aitc_users`.`user_nodes` FOR EACH ROW DELETE IGNORE FROM `aitc_users`.`__tmp_user_nodes` WHERE `aitc_users`.`__tmp_user_nodes`.uid = OLD.uid # 2012-07-27T15:40:14 CREATE TRIGGER mk_osc_upd AFTER UPDATE ON `aitc_users`.`user_nodes` FOR EACH ROW REPLACE INTO `aitc_users`.`__tmp_user_nodes` (email, node, service, uid) VALUES (NEW.email, NEW.node, NEW.service, NEW.uid) # 2012-07-27T15:40:14 CREATE TRIGGER mk_osc_ins AFTER INSERT ON `aitc_users`.`user_nodes` FOR EACH ROW REPLACE INTO `aitc_users`.`__tmp_user_nodes` (email, node, service, uid) VALUES(NEW.email, NEW.node, NEW.service, NEW.uid) # 2012-07-27T15:40:14 Calling CopyRowsInsertSelect::copy() Copying rows: 23% 01:36 remain Copying rows: 44% 01:15 remain Copying rows: 58% 01:04 remain Copying rows: 65% 01:01 remain Copying rows: 70% 01:02 remain Copying rows: 74% 01:00 remain Copying rows: 78% 00:57 remain Copying rows: 82% 00:50 remain Copying rows: 88% 00:35 remain Copying rows: 94% 00:17 remain # 2012-07-27T15:45:21 Calling OSCCaptureSync::sync() # 2012-07-27T15:45:21 Renaming tables # 2012-07-27T15:45:21 RENAME TABLE `aitc_users`.`user_nodes` TO `aitc_users`.`__old_user_nodes`, `aitc_users`.`__tmp_user_nodes` TO `aitc_users`.`user_nodes` # 2012-07-27T15:45:21 Original table user_nodes renamed to __old_user_nodes # 2012-07-27T15:45:21 Calling CopyRowsInsertSelect::cleanup() # 2012-07-27T15:45:21 Calling OSCCaptureSync::cleanup() # 2012-07-27T15:45:21 DROP TRIGGER IF EXISTS `aitc_users`.`mk_osc_del` # 2012-07-27T15:45:21 DROP TRIGGER IF EXISTS `aitc_users`.`mk_osc_ins` # 2012-07-27T15:45:21 DROP TRIGGER IF EXISTS `aitc_users`.`mk_osc_upd` # 2012-07-27T15:45:21 /usr/bin/pt-online-schema-change ended, exit status 0
Verified the following in Dev: token{1..3}.reg.mtv1.dev python26-tokenserver-0.10-2.noarch logstash-metlog-0.8.4-1.x86_64 logstash-1.1.0-1.noarch python26-tokenlib-0.1.0-1.noarch Also verified token processes and logs. Still waiting on Stage...
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Verified the following in Stage: token{1..3}.reg.scl2.stage python26-tokenserver-0.10-2.noarch logstash-1.1.0-1.noarch logstash-metlog-0.8.4-1.x86_64 python26-tokenlib-0.1.0-1.noarch Also verified the token and logstash processes and the log files.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: