Closed
Bug 774918
Opened 13 years ago
Closed 13 years ago
token: Deploy tokenserver 0.10-2 to dev/stage token servers
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Cloud Services
Operations: Deployment Requests - DEPRECATED
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: jbonacci, Assigned: bobm)
Details
(Whiteboard: [qa+])
Attachments
(1 file)
|
190.13 KB,
text/plain
|
Details |
For some reason we did not have a deploy bug for 0.8.1, 0.8.2, 0.8.3...
This bug supercedes https://bugzilla.mozilla.org/show_bug.cgi?id=773461
Please deploy tokenserver0.8.4 to Dev and Stage token*
Dev:
token{1..3}.reg.mtv1.dev.svc.mozilla.com
Stage:
token{1..3}.reg.scl2.stage.svc.mozilla.com
| Reporter | ||
Comment 1•13 years ago
|
||
A little bit late creating this but now we can put test and debug information in here as we find it, especially for the Stage deploy.
Assignee: nobody → bobm
Status: NEW → ASSIGNED
QA Contact: jbonacci
Whiteboard: [qa+]
| Reporter | ||
Comment 2•13 years ago
|
||
Dev currently has this installed:
python26-tokenlib-0.1.0-1.noarch
python26-tokenserver-0.8.4-1.noarch
Stage currently has this installed:
python26-tokenlib-0.1.0-1.noarch
python26-tokenserver-0.8.4-1.noarch
Now we are on to debugging 500s/503s in Stage token and stoken.
| Reporter | ||
Comment 3•13 years ago
|
||
Here are my original issues:
The current TokenServer load test allows an option for enabling testing with new users created against stoken.
Example:
./workers.sh -n 4 (tests token and stoken)
vs.
./workers.sh 4 (tests token only)
When I run a standard ts load test - ./workers.sh 4, for example - I do not see any errors in the zeus logs.
If I switch to running the same load test with the "-n" option, then I see significant spikes of 500s and 503s on the token* and stoken* servers in Stage.
A sample token graph in Pencil:
http://pencil.scl2.stage.svc.mozilla.com/dash/stage/token?start=07%2F14%2F2012&duration=12+hours&width=900&height=400
A sample stoken graph for the same timeframe:
http://pencil.scl2.stage.svc.mozilla.com/dash/stage/stoken?start=07%2F14%2F2012&duration=12+hours&width=900&height=400
(or similar for the 24 hours comprising any of the weekend)
And, the zeus logs:
On zlb{1,2}.pub.scl2.stage.svc.mozilla.com
Check out /var/log/zeus
stage-token.services.mozilla.com.access_2012-07-13-{0..23}
stage-token.services.mozilla.com.access_2012-07-14-{0..23}
stage-token.services.mozilla.com.access_2012-07-15-{0..23}
Look for 500s and 503s.
On zlb{1,2}.scl2.stage.svc.mozilla.com
Check out /var/log/zeus
stoken.sreg.scl2.stage.svc.mozilla.com.access_2012-07-13-{0..23}
stoken.sreg.scl2.stage.svc.mozilla.com.access_2012-07-14-{0..23}
stoken.sreg.scl2.stage.svc.mozilla.com.access_2012-07-15-{0..23}
Look for 500s and 503s.
| Reporter | ||
Comment 4•13 years ago
|
||
Looks like database pool contention in stoken. This error traceback
appears repeatedly in the gunicorn output log on stoken1:
TimeoutError: QueuePool limit of size 20 overflow 5 reached,
connection timed out, timeout 30
Full traceback:
http://rfkelly.pastebin.mozilla.org/1704998
I recall getting some of these in AITC because database connections and
result cursors were not cleaned up correctly, resulting in connections
not being returned to the pool. Maybe something similar is going on here?
Cheers,
Ryan
| Reporter | ||
Comment 5•13 years ago
|
||
On 7/16/12 5:52 AM, Ryan Kelly wrote:
> On 16/07/12 13:05, James Bonacci wrote:
>> Team,
>>
>> The current TokenServer load test allows an option for enabling
>> testing with new users created against stoken.
>>
>> Example:
>> ./workers.sh -n 4 (tests token and stoken)
>> vs.
>> ./workers.sh 4 (tests token only)
>>
>>
>> When I run a standard ts load test - ./workers.sh 4, for example - I
>> do not see any errors in the zeus logs.
>>
>> If I switch to running the same load test with the "-n" option, then
>> I see significant spikes of 500s and 503s on the token* and stoken*
>> servers in Stage.
>
>
> Looks like database pool contention in stoken. This error traceback
> appears repeatedly in the gunicorn output log on stoken1:
> TimeoutError: QueuePool limit of size 20 overflow 5 reached,
> connection timed out, timeout 30
>
>
> Full traceback:
>
> http://rfkelly.pastebin.mozilla.org/1704998
>
> I recall getting some of these in AITC because database connections
> and result cursors were not cleaned up correctly, resulting in
> connections not being returned to the pool. Maybe something similar
> is going on here?
I am pretty sure there's an issue in the database side. I'd control the
DB logs first, and if nothing wrong appears there, raise the pool size
-- IIRC it was higher
If it was a non-returning connector, you would have 100% 503s rate after
a bit of time.
>
>
> Cheers,
>
> Ryan
| Reporter | ||
Comment 6•13 years ago
|
||
And I have attached, ad nauseum, the entire #tokenserver IRC chatter from 7/10/2012 through 7/17/2012 to catch all the work we have done thus far on deploying and testing TokenServer in Stage.
And, this saves me trying to do a 7-day summary...
;-)
| Reporter | ||
Comment 7•13 years ago
|
||
Comment 8•13 years ago
|
||
So the next episode was:
* the user_nodes table was 65 rows
* it took ages for each query to return
* bobm wiped it
* jbonnacci did a new test
Comment 9•13 years ago
|
||
65 rows => 65 millions rows
Comment 10•13 years ago
|
||
I see these on stoken:
IntegrityError: (IntegrityError) (1062, u"Duplicate entry 'user758410@mockmyid.com-aitc-1.0' for key 'userlookup_idx'") 'insert into user_nodes\n (service, email, node)\nvalues\n (%s, %s, %s)\n' (u'aitc-1.0', u'user758410@mockmyid.com', 'https://stage-aitc5.services.mozilla.com')
that happens on stoken on node allocation
I suspect the indexes where not wiped correctly because that query is sent *only* if the previous query says there's no corresponding (service, email) line
| Reporter | ||
Comment 11•13 years ago
|
||
10-hour overnight load test from client4 to Token Stage completed with some 500s and 503s reappearing
Graphs:
Token:
http://pencil.scl2.stage.svc.mozilla.com/dash/stage/token?start=7%2F17%2F2012&duration=24+hours&width=1000&height=400
Stoken:
http://pencil.scl2.stage.svc.mozilla.com/dash/stage/stoken?start=7%2F17%2F2012&duration=24+hours&width=1000&height=400
Cycle Results for 10 hours:
* Cycle result: **FAILURE**, 12394 success, 32 failure, 0 errors.
* Cycle result: **FAILURE**, 12371 success, 39 failure, 0 errors.
* Cycle result: **FAILURE**, 12352 success, 31 failure, 0 errors.
* Cycle result: **FAILURE**, 12364 success, 39 failure, 0 errors.
* Cycle result: **FAILURE**, 12377 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12422 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12296 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12347 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12422 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12428 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12407 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12400 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12221 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12309 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12418 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12466 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12453 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12441 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12410 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12422 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12432 success, 4 failure, 0 errors.
* Cycle result: **FAILURE**, 12442 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12426 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12409 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12332 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12309 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12445 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12409 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12406 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12359 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12336 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12448 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12439 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12376 success, 9 failure, 0 errors.
* Cycle result: **FAILURE**, 12321 success, 12 failure, 0 errors.
* Cycle result: **FAILURE**, 12361 success, 21 failure, 0 errors.
* Cycle result: **FAILURE**, 12317 success, 13 failure, 0 errors.
* Cycle result: **FAILURE**, 12456 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12341 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12392 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12404 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12285 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12290 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12393 success, 2 failure, 0 errors.
* Cycle result: **FAILURE**, 12414 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12445 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12404 success, 16 failure, 0 errors.
* Cycle result: **FAILURE**, 12407 success, 28 failure, 0 errors.
* Cycle result: **FAILURE**, 12224 success, 26 failure, 0 errors.
* Cycle result: **FAILURE**, 12366 success, 25 failure, 0 errors.
* Cycle result: **FAILURE**, 12432 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12595 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12475 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12412 success, 31 failure, 0 errors.
* Cycle result: **FAILURE**, 12397 success, 39 failure, 0 errors.
* Cycle result: **FAILURE**, 12516 success, 29 failure, 0 errors.
* Cycle result: **FAILURE**, 12438 success, 37 failure, 0 errors.
* Cycle result: **FAILURE**, 12527 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12511 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12515 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12511 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12474 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12469 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12473 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12494 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12508 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12484 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12505 success, 1 failure, 0 errors.
* Cycle result: **FAILURE**, 12458 success, 64 failure, 0 errors.
* Cycle result: **FAILURE**, 12469 success, 59 failure, 0 errors.
* Cycle result: **FAILURE**, 12477 success, 66 failure, 0 errors.
* Cycle result: **FAILURE**, 12472 success, 71 failure, 0 errors.
* Cycle result: **FAILURE**, 12407 success, 1 failure, 0 errors.
(nothing else of interest in the load test STDOUT)
| Reporter | ||
Comment 12•13 years ago
|
||
So, not sure about the failure count above (need to check logs).
But by comparison, the 5xx spikes are quite small compared to previous runs in the last 48 hours:
token: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/token?start=48+hours+ago&duration=&width=1000&height=400
stoken: http://pencil.scl2.stage.svc.mozilla.com/dash/stage/stoken?start=48+hours+ago&duration=&width=1000&height=400
Comment 13•13 years ago
|
||
From what I can see token is healthy and stoken is doing IntegrityError when allocating users to a node. The two only ways it can happen is a race condition of asking for a node or a broken index.
I will make the request more robust wrt race condition, but we need to check the index
| Reporter | ||
Comment 14•13 years ago
|
||
OK, stoken code was changed and tagged per :tarek.
Deployment ticket for stoken to Dev and Stage is here:
https://bugzilla.mozilla.org/show_bug.cgi?id=775279
AITC functional tests against token in Dev and Stage are generating 503s.
We are now investigating some errors in the token* application.log files:
http://jbonacci.pastebin.mozilla.org/1707857
| Reporter | ||
Comment 15•13 years ago
|
||
Functional tests passed for ts/st.
Running a series of short load tests across the clientX boxes right now.
After that, a big, long load test...
| Reporter | ||
Comment 16•13 years ago
|
||
There were a number of issues that came up today with ts load testing.
See this ticket for full details:
https://bugzilla.mozilla.org/show_bug.cgi?id=775279
| Assignee | ||
Comment 17•13 years ago
|
||
Changing this to deploy rpm-0.9-1, at the request of the Services Development team.
Updated•13 years ago
|
Summary: Deploy tokenserver 0.8.4 to dev/stage token servers → Deploy tokenserver 0.10 to dev/stage token servers
Comment 18•13 years ago
|
||
Okay, last change. Now deploying rpm-0.10-1. Also requires change to puppetized production.ini file. The [metlog] section of this file should be replaced w/ the following:
[metlog]
backend = mozsvc.metrics.MetlogPlugin
sender_class = metlog.senders.UdpSender
sender_host = 127.0.0.1
sender_port = 5566
The rest of the production.ini file should remain as before.
Summary: Deploy tokenserver 0.10 to dev/stage token servers → token: Deploy tokenserver 0.10 to dev/stage token servers
| Assignee | ||
Comment 19•13 years ago
|
||
This has been deployed to the tokenserver dev and staging servers. It required the addition of "enable = true" to the metlog section of the production.ini in order to work.
| Reporter | ||
Comment 20•13 years ago
|
||
Verified the deployment to Dev and Stage:
Dev:
token{1..3}.reg.mtv1.dev.svc.mozilla.com
RPMs:
python26-tokenserver-0.10-1.noarch
python26-tokenlib-0.1.0-1.noarch
Processes:
supervise gunicorn-token
/usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
(13 of the above process)
Stage
token{1..3}.reg.scl2.stage.svc.mozilla.com
RPMs:
python26-tokenlib-0.1.0-1.noarch
python26-tokenserver-0.10-1.noarch
Processes:
supervise gunicorn-token
supervise token-couchbase-collector
/usr/bin/python /usr/bin/gunicorn -k gevent -w 12 -b 127.0.0.1:8000 tokenserver.run:application
(13 of the above process)
But, we found issues with functional and load testing in Stage...this is being investigated...
| Reporter | ||
Comment 21•13 years ago
|
||
OK, so during functional and load tests, we found more or less complete failures with tokenserver giving a 500, zeus transforming it into a 503.
According to :rfkelly, it looks like the databases were not updated for new table scheme to support ToS. The ToS stuff needs an additional column.
See http://jbonacci.pastebin.mozilla.org/1715152.
Comment 22•13 years ago
|
||
In addition to the database migration issue, a bug in the metlog sender code caused a new metlog-py and tokenserver release. We now need to deploy tokenserver rpm-0.10-2.
Note that the rpm-0.10-2 tag is not on the git master branch. If you have any problems w/ the build telling you "rpm-0.10-2" doesn't exist, try "git fetch" in the build directory to make sure all of the branches are pulled down.
Comment 23•13 years ago
|
||
What we were used to do is to dump the current tables and recreate them (the creation process will be automatic) + repopulate the db.
We clearly will need a way to handle database migrations in the future, so I created bug777650
| Reporter | ||
Updated•13 years ago
|
Summary: token: Deploy tokenserver 0.10 to dev/stage token servers → token: Deploy tokenserver 0.10-2 to dev/stage token servers
Comment 24•13 years ago
|
||
It seemed that we missed to tag a new release for stoken as well, hence the problem :bobm had to write to the db: the deployed stoken version wasn't using the right wimms (was using the old db schema).
I went ahead and did the appropriate changes to stokenserver and then created a rpm-0.4-1 tag there.
| Assignee | ||
Comment 25•13 years ago
|
||
Schema modifications were completed using pt-online-schema change:
# 2012-07-27T15:40:13 /usr/bin/pt-online-schema-change started
# 2012-07-27T15:40:13 USE `aitc_users`
# 2012-07-27T15:40:13 Alter table user_nodes using temporary table __tmp_user_nodes
# 2012-07-27T15:40:13 Checking if table user_nodes can be altered
# 2012-07-27T15:40:13 SHOW TRIGGERS FROM `aitc_users` LIKE 'user_nodes'
# 2012-07-27T15:40:13 Table user_nodes can be altered
# 2012-07-27T15:40:13 Chunk column uid, index PRIMARY
# 2012-07-27T15:40:13 Chunked table user_nodes into 3754 chunks
# 2012-07-27T15:40:13 Starting online schema change
# 2012-07-27T15:40:13 CREATE TABLE `aitc_users`.`__tmp_user_nodes` LIKE `aitc_users`.`user_nodes`
# 2012-07-27T15:40:13 ALTER TABLE `aitc_users`.`__tmp_user_nodes` ADD accepted_conditions tinyint(1) DEFAULT NULL AFTER service,DROP KEY nodelookup_idx,ADD KEY `nodelookup_idx` (`node`,`service`,`accepted_conditions`)
# 2012-07-27T15:40:14 Shared columns: email, node, service, uid
# 2012-07-27T15:40:14 Calling OSCCaptureSync::capture()
# 2012-07-27T15:40:14 CREATE TRIGGER mk_osc_del AFTER DELETE ON `aitc_users`.`user_nodes` FOR EACH ROW DELETE IGNORE FROM `aitc_users`.`__tmp_user_nodes` WHERE `aitc_users`.`__tmp_user_nodes`.uid = OLD.uid
# 2012-07-27T15:40:14 CREATE TRIGGER mk_osc_upd AFTER UPDATE ON `aitc_users`.`user_nodes` FOR EACH ROW REPLACE INTO `aitc_users`.`__tmp_user_nodes` (email, node, service, uid) VALUES (NEW.email, NEW.node, NEW.service, NEW.uid)
# 2012-07-27T15:40:14 CREATE TRIGGER mk_osc_ins AFTER INSERT ON `aitc_users`.`user_nodes` FOR EACH ROW REPLACE INTO `aitc_users`.`__tmp_user_nodes` (email, node, service, uid) VALUES(NEW.email, NEW.node, NEW.service, NEW.uid)
# 2012-07-27T15:40:14 Calling CopyRowsInsertSelect::copy()
Copying rows: 23% 01:36 remain
Copying rows: 44% 01:15 remain
Copying rows: 58% 01:04 remain
Copying rows: 65% 01:01 remain
Copying rows: 70% 01:02 remain
Copying rows: 74% 01:00 remain
Copying rows: 78% 00:57 remain
Copying rows: 82% 00:50 remain
Copying rows: 88% 00:35 remain
Copying rows: 94% 00:17 remain
# 2012-07-27T15:45:21 Calling OSCCaptureSync::sync()
# 2012-07-27T15:45:21 Renaming tables
# 2012-07-27T15:45:21 RENAME TABLE `aitc_users`.`user_nodes` TO `aitc_users`.`__old_user_nodes`, `aitc_users`.`__tmp_user_nodes` TO `aitc_users`.`user_nodes`
# 2012-07-27T15:45:21 Original table user_nodes renamed to __old_user_nodes
# 2012-07-27T15:45:21 Calling CopyRowsInsertSelect::cleanup()
# 2012-07-27T15:45:21 Calling OSCCaptureSync::cleanup()
# 2012-07-27T15:45:21 DROP TRIGGER IF EXISTS `aitc_users`.`mk_osc_del`
# 2012-07-27T15:45:21 DROP TRIGGER IF EXISTS `aitc_users`.`mk_osc_ins`
# 2012-07-27T15:45:21 DROP TRIGGER IF EXISTS `aitc_users`.`mk_osc_upd`
# 2012-07-27T15:45:21 /usr/bin/pt-online-schema-change ended, exit status 0
| Reporter | ||
Comment 26•13 years ago
|
||
Verified the following in Dev:
token{1..3}.reg.mtv1.dev
python26-tokenserver-0.10-2.noarch
logstash-metlog-0.8.4-1.x86_64
logstash-1.1.0-1.noarch
python26-tokenlib-0.1.0-1.noarch
Also verified token processes and logs.
Still waiting on Stage...
| Reporter | ||
Updated•13 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
| Reporter | ||
Comment 27•13 years ago
|
||
Verified the following in Stage:
token{1..3}.reg.scl2.stage
python26-tokenserver-0.10-2.noarch
logstash-1.1.0-1.noarch
logstash-metlog-0.8.4-1.x86_64
python26-tokenlib-0.1.0-1.noarch
Also verified the token and logstash processes
and the log files.
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•