Closed
Bug 1091313
Opened 11 years ago
Closed 11 years ago
Please deploy tokenserver 1.2.11 to stage
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Unassigned)
References
Details
(Whiteboard: [qa+])
Attachments
(2 files)
This version of tokenserver includes updated versions of many of our dependencies, as well as the following changes:
Bug 1009361 - add script for allocating user to a specific node
Bug 1008081 - remove unused "node" column from users table
Bug 1009970 - decrement node allocation count during cleanup
GH #72 - ignore non-"delete" SQS events, rather than logging an error
Bug 1043753 - don't select nodes marked "backoff" for assignment
Bug 1015526 - incrementally release node storage capacity
Please deploy to stage, do the DB migration, then run the usual loadtest suite:
* Build the rpm and deploy to all webheads
* From one of the webheads, run the migration tool:
./bin/alembic upgrade head
Bob, note that this includes the stage-release-of-node-capacity feature that we talked about while I was in MV. You should consider adding a cron-based component to it, to ensure capacity gets released in a more controlled fashion.
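For illustration only, such a cron-based component could be an /etc/cron.d entry along these lines; the release_node_capacity script name and paths are hypothetical placeholders, not actual tokenserver commands:
# Hypothetical sketch only: release another slice of held-back node capacity
# every 15 minutes and keep a log of what was released.
*/15 * * * * root /data/tokenserver/local/bin/release_node_capacity >> /var/log/tokenserver/release_capacity.log 2>&1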
Comment 1•11 years ago (Reporter)
As part of verification, we need to check whether this fixes the account-deletion errors noted in Bug 1090412 Comment 5.
Comment 2•11 years ago
1.2.11 deployed. I think it needs some messages from SQS before we can verify it is working correctly.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 3•11 years ago
OK. I will hold off on verifying this until at least Friday.
Comment 4•11 years ago (Reporter)
> I think it needs some messages from SQS before we can verify it is working correctly.
In theory this is as simple as creating, then deleting, an account on the stage FxA server. Assuming they're all plumbed together correctly after the move of course...
Comment 5•11 years ago
A little more detail would be helpful. So, this is TS+Verifier in Stage. The current tests (unit, integration, load) do not talk to FxA, correct? But FxA talks to TS+Verifier, so once we have FxA in the new Dev IAM (not sure when that happens), we could generate some SQS traffic?
If that is true, then perhaps the verification of this bug will have to wait until at least next week (or until I can verify that FxA Stage is in the new Dev IAM).
Comment 6•11 years ago (Reporter)
> The current tests do not talk to FxA correct?
Correct.
> But FxA talks to TS+Verifier, so once we have FxA in new Dev IAM (not sure when that happens),
> we could generate some SQS?
Right, in theory the stage FxA should send deletion events to the stage tokenserver via SQS.
So we can do loadtesting etc in the meantime, but final verification will probably have to wait on stage FxA being up and running.
Comment 7•11 years ago
> Right, in theory the stage FxA should send deletion events to the stage tokenserver via SQS.
>
> So we can do loadtesting etc in the meantime, but final verification will probably have to wait on stage FxA being up and running.
Uh, stage FxA moved to the new IAM two weeks ago, and is publishing delete events to SNS. Are you connected to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage?
Comment 8•11 years ago
Yes, the tokenserver stack is connected and listening to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage.
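As an illustrative check (not an official runbook step), the queue can be inspected directly with the AWS CLI; the queue URL below is inferred from the ARN above and may not be exact:
# Illustrative only: how many account-change events are sitting in the stage queue?
aws sqs get-queue-attributes \
    --queue-url https://sqs.us-east-1.amazonaws.com/927034868273/ts-account-change-stage \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
# Pull a few messages (they reappear after the visibility timeout) to confirm
# that "delete" events are arriving from stage FxA:
aws sqs receive-message \
    --queue-url https://sqs.us-east-1.amazonaws.com/927034868273/ts-account-change-stage \
    --max-number-of-messages 5 --wait-time-seconds 20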
BTW: I forgot to do the MySQL migration piece.
I tried to run it and this happened. I'm guessing that since it was a fresh RDS instance, tokenserver 1.2.9 created the latest DB schema, so we didn't need to do the schema upgrade.
INFO [alembic.migration] Context impl MySQLImpl.
INFO [alembic.migration] Will assume non-transactional DDL.
INFO [alembic.migration] Running upgrade None -> 17d209a72e2f, add replaced_at_idx
Traceback (most recent call last):
File "./local/bin/alembic", line 12, in <module>
load_entry_point('alembic==0.6.7', 'console_scripts', 'alembic')()
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 306, in main
CommandLine(prog=prog).main(argv=argv)
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 300, in main
self.run_cmd(cfg, options)
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 286, in run_cmd
**dict((k, getattr(options, k)) for k in kwarg)
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/command.py", line 129, in upgrade
script.run_env()
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/script.py", line 208, in run_env
util.load_python_file(self.dir, 'env.py')
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/util.py", line 230, in load_python_file
module = load_module_py(module_id, path)
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/compat.py", line 63, in load_module_py
mod = imp.load_source(module_id, path, fp)
File "tokenserver/assignment/sqlnode/migrations/env.py", line 75, in <module>
run_migrations_online()
File "tokenserver/assignment/sqlnode/migrations/env.py", line 68, in run_migrations_online
context.run_migrations()
File "<string>", line 7, in run_migrations
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/environment.py", line 696, in run_migrations
self.get_context().run_migrations(**kw)
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/migration.py", line 266, in run_migrations
change(**kw)
File "tokenserver/assignment/sqlnode/migrations/versions/17d209a72e2f_add_replaced_at_idx.py", line 20, in upgrade
op.create_index('replaced_at_idx', 'users', ['service', 'replaced_at'])
File "<string>", line 7, in create_index
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/operations.py", line 797, in create_index
self._index(name, table_name, columns, schema=schema, **kw)
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/ddl/impl.py", line 169, in create_index
self._exec(schema.CreateIndex(index))
File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/ddl/impl.py", line 81, in _exec
conn.execute(construct, *multiparams, **params)
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute
return meth(self, multiparams, params)
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/sql/ddl.py", line 69, in _execute_on_connection
return connection._execute_ddl(self, multiparams, params)
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 783, in _execute_ddl
compiled
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
context)
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _handle_dbapi_exception
exc_info
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
context)
File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute
cursor.execute(statement, parameters)
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/cursors.py", line 132, in execute
result = self._query(query)
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/cursors.py", line 271, in _query
conn.query(q)
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 726, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 861, in _read_query_result
result.read()
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 1064, in read
first_packet = self.connection._read_packet()
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 826, in _read_packet
packet.check_error()
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 370, in check_error
raise_mysql_exception(self._data)
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/err.py", line 116, in raise_mysql_exception
_check_mysql_exception(errinfo)
File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/err.py", line 112, in _check_mysql_exception
raise InternalError(errno, errorvalue)
sqlalchemy.exc.InternalError: (InternalError) (1061, u"Duplicate key name 'replaced_at_idx'") 'CREATE INDEX replaced_at_idx ON users (service, replaced_at)' ()
Comment 9•11 years ago (Reporter)
> Yes the tokenserver stack is connected and
> listening to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage.
OK, awesome, sounds like we can do a full verification any time convenient for James.
> BTW: I forgot to do the mysql migration piece.
>
> I tried to run it and this happened. I'm guessing that since it was a fresh
> RDS tokenserver 1.2.9 created the latest DB schema and we didn't need to do the schema upgrade.
Ah yes, that's almost certainly what happened. Instead you will want to do `alembic stamp` to mark the schema version in the db:
./local/bin/alembic stamp head
Obviously the proper migration will need to be applied in prod.
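As a sanity check (illustrative only; the endpoint, credentials, and database name are placeholders), you can confirm that the index from the 17d209a72e2f migration already exists and that the stamp was recorded:
# Illustrative only: the users table should already show replaced_at_idx.
mysql -h <rds-endpoint> -u <user> -p -e "SHOW CREATE TABLE users\G" <database>
# After "alembic stamp head", alembic_version should hold the current revision:
mysql -h <rds-endpoint> -u <user> -p -e "SELECT version_num FROM alembic_version;" <database>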
Comment 10•11 years ago
OK. I ran:
./local/bin/alembic stamp head
(using ./local/bin rather than ./bin, since the virtualenv is built in ./local).
Otherwise it worked. :)
Comment 11•11 years ago (Reporter)
I believe that loadtesting and verification of this are still pending; marking ni? :jbonacci so the status is a little more obvious.
Flags: needinfo?(jbonacci)
Comment 12•11 years ago (Reporter)
Sounds like :jbonacci is on PTO; switching this to Karl.
Flags: needinfo?(jbonacci) → needinfo?(kthiessen)
QA Contact: kthiessen
Comment 13•11 years ago
I am indeed acting as :jbonacci's backup -- do we have some idea of urgency/deadline for this?
Updated•11 years ago
Flags: needinfo?(kthiessen)
Comment 14•11 years ago (Reporter)
No particular urgency, I just don't want it to get forgotten. :-)
Comment 15•11 years ago (Reporter)
During initial loadtest, we got a lot of 503s that seem to be due to:
OperationalError: (OperationalError) (2003, \"Can't connect to MySQL server on u'ts-rds-cloudops-3.czvvrkdqhklf.us-east-1.rds.amazonaws.com' ((1040, u'Too many connections'))
I vaguely recall scaling down the stage tokenserver RDS instance; perhaps we need to scale it back up for loadtesting? :mostlygeek, any suggestions?
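For reference, one rough way to see how close the instance is to its connection ceiling (illustrative commands; credentials omitted, and on RDS the default max_connections is typically derived from the instance class's memory, which is part of why scaling up helps):
# Illustrative only: compare the configured connection limit with current usage.
mysql -h ts-rds-cloudops-3.czvvrkdqhklf.us-east-1.rds.amazonaws.com -u <user> -p \
      -e "SHOW VARIABLES LIKE 'max_connections'; SHOW STATUS LIKE 'Threads_connected';"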
Flags: needinfo?(bwong)
Comment 16•11 years ago
Yes, the DB was scaled down to a db.m3.medium. If we're doing a lot of load testing, I can scale it up to match the PROD DB sizing.
Flags: needinfo?(bwong)
Comment 17•11 years ago (Reporter)
> I can scale it up to match the PROD db sizing.
OK, makes sense, please do so for the purposes of testing this release.
Comment 18•11 years ago
Ben, any word on resizing this db so we can get this tested and deployed?
Flags: needinfo?(bwong)
Comment 19•11 years ago
This is a good opportunity to make and load test some DB changes I've been planning:
What is in prod currently:
- m1.large, multi-AZ with 100GB @ 1000 PIOPS
What I want to change it to:
- r3.large, multi-AZ with 500GB GP2 SSD
This would be less expensive and should hopefully also give us the performance we need. We've never gone past 100 write IOPS/sec, so we have enough IOPS capacity. What we should keep an eye on is the IO latencies with GP2: if they average under 10ms, I think we are OK to discuss this further for production.
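For illustration (the real change would go through Ops tooling rather than an ad-hoc CLI call), the resize described above maps to roughly the following; the instance identifier is the stage one referenced elsewhere in this bug:
# Illustrative sketch only: move the stage tokenserver RDS instance to r3.large
# with a 500GB gp2 volume (gp2 baseline = 3 IOPS/GB, i.e. 1500 IOPS here).
aws rds modify-db-instance \
    --db-instance-identifier ts-rds-cloudops-3 \
    --db-instance-class db.r3.large \
    --storage-type gp2 \
    --allocated-storage 500 \
    --apply-immediately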
I'll update this bug when the RDS has finished upgrading and the application cluster sizing matches production.
Also reopening this bug so we can use it to test things further. :)
Status: RESOLVED → REOPENED
Flags: needinfo?(bwong)
Resolution: FIXED → ---
Comment 20•11 years ago
The application cluster is 3x c3.large and will auto-scale up when CPU is > 60%.
The RDS server is now: r3.large, multi-AZ, 500GB GP2.
Please commence load testing!
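For anyone reproducing this setup, the CPU-based scale-up could be wired roughly as sketched below; the auto-scaling group and policy names are hypothetical, not the actual stack's resource names:
# Illustrative sketch only: add one instance when average CPU exceeds 60%.
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name tokenserver-stage-asg \
    --policy-name tokenserver-stage-scale-up \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 1
aws cloudwatch put-metric-alarm \
    --alarm-name tokenserver-stage-cpu-high \
    --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
    --dimensions Name=AutoScalingGroupName,Value=tokenserver-stage-asg \
    --period 300 --evaluation-periods 2 \
    --threshold 60 --comparison-operator GreaterThanThreshold \
    --alarm-actions <scale-up-policy-arn>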
Comment 21•11 years ago
Let's hold off on load testing this until bug 1110520 is done.
Comment 22•11 years ago
OK, tokenserver on stage has been upgraded to use Verifier 0.3.0-1. It should be good to QA now.
Comment 23•11 years ago
:mostlygeek scaled down the application cluster over the weekend, and has just scaled it back up for testing.
I will be commencing a 40user/5agent/1hour loadtest as soon as the cluster is done scaling up -- about 11:30 Pacific.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 24•11 years ago
https://loads.services.mozilla.com/run/be65a847-1bc6-4db0-831a-e84e6317948a has run successfully. Thanks to :rfkelly for pointing out to me that the megabench config had acquired a new parameter.
Peak for this test was 186 RPS; I'm now going to try a 60 user/1 hour test to see if I can drive that just a bit higher.
Comment 25•11 years ago
Also, according to https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstances:id=ts-rds-cloudops-3;sf=all;v=mm, that last test got us 700-800 write IOPS.
Comment 26•11 years ago
Attached RDS graphs of load test information:
- looks really good!
- write latency hovers around 2ms!
- hit around 750 IOPS ... half of the 1500 IOPS baseline from the 500GB GP2 volume (3 IOPS/GB x 500GB), spot on!
Looks great!
Comment 27•11 years ago
These test results have been looked over by :kthiessen, :rfkelly, :mostlygeek, and :bobm. We pronounce it good -- now on to the prod deploy (which will happen after the new year, as Ops is a bit short-staffed over the holidays).
Production deploy ticket is bug 1114773.
Updated•11 years ago
Status: RESOLVED → VERIFIED
Comment 28•11 years ago
Results from the longer load test: other than an initial spike, the GP2 IO latency held solid throughout.
Attached is a screenshot showing load test stats on RDS.