Closed Bug 1091313 Opened 11 years ago Closed 11 years ago

Please deploy tokenserver 1.2.11 to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

Details

(Whiteboard: [qa+])

Attachments

(2 files)

This version of tokenserver includes updated versions of many of our dependencies, as well as the following changes:

* Bug 1009361 - add script for allocating user to a specific node
* Bug 1008081 - remove unused "node" column from users table
* Bug 1009970 - decrement node allocation count during cleanup
* GH #72 - ignore non-"delete" SQS events, rather than logging an error
* Bug 1043753 - don't select nodes marked "backoff" for assignment
* Bug 1015526 - incrementally release node storage capacity

Please deploy to stage, do the db migration, then the usual loadtest suite:

* Build the rpm and deploy to all webheads
* From one of the webheads, run the migration tool: ./bin/alembic upgrade head

Bob, note that this includes the stage-release-of-node-capacity feature that we talked about while I was in MV. You should consider whether you want to add any cron-based component to that to ensure capacity gets released in a more controlled fashion.
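An optional spot-check idea, not part of the request above: after `./bin/alembic upgrade head` runs, the Bug 1008081 change can be confirmed by checking that the users table no longer has a "node" column. A minimal sketch, assuming SQLAlchemy is available on the webhead and using a placeholder database URL:

    # Hedged sketch only: verify the Bug 1008081 schema change after the alembic
    # upgrade. DB_URI is a placeholder, not the real stage connection string.
    from sqlalchemy import create_engine, inspect

    DB_URI = "mysql+pymysql://user:password@ts-rds-host/tokenserver"  # placeholder

    engine = create_engine(DB_URI)
    columns = [col["name"] for col in inspect(engine).get_columns("users")]

    # The unused "node" column should be gone once the migration has been applied.
    assert "node" not in columns, "migration not applied: 'node' column still present"
    print("users columns:", columns)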
As part of verification, we need to check whether this fixes the account-deletion errors noted in Bug 1090412 Comment 5.
1.2.11 deployed. I think it needs some messages from SQS before we can verify it is working correctly.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
OK. I will hold off on verifying this until at least Friday.
> I think it needs some messages from SQS before we can verify it is working correctly.

In theory this is as simple as creating, then deleting, an account on the stage FxA server. Assuming they're all plumbed together correctly after the move of course...
A little more detail would be helpful. So, this is TS+Verifier in Stage. The current tests (unit, integration, load) do not talk to FxA, correct? But FxA talks to TS+Verifier, so once we have FxA in the new Dev IAM (not sure when that happens), we could generate some SQS traffic? If that is true, then perhaps the verification of this bug would have to wait until at least next week (or until I can verify that FxA Stage is in the new Dev IAM).
> The current tests do not talk to FxA correct?

Correct.

> But FxA talks to TS+Verifier, so once we have FxA in new Dev IAM (not sure when that happens),
> we could generate some SQS?

Right, in theory the stage FxA should send deletion events to the stage tokenserver via SQS. So we can do loadtesting etc in the meantime, but final verification will probably have to wait on stage FxA being up and running.
> Right, in theory the stage FxA should send deletion events to the stage tokenserver via SQS.
>
> So we can do loadtesting etc in the meantime, but final verification will probably have to wait on stage FxA being up and running.

Uh, stage fxa moved to the new IAM two weeks ago, and is publishing delete events to SNS. Are you connected to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage?
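A hedged aside: one quick way to see whether delete events are actually landing on that queue is a throwaway read like the sketch below. The queue URL is derived from the ARN above and is an assumption; boto3 is used purely for illustration and is not necessarily what the tokenserver consumer itself uses. Reading this way temporarily hides messages from the real consumer until the visibility timeout expires, so don't delete anything.

    # Hedged sketch: peek at the stage account-change queue for FxA delete events.
    import json
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/927034868273/ts-account-change-stage"  # assumed from the ARN

    sqs = boto3.client("sqs", region_name="us-east-1")
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)

    for msg in resp.get("Messages", []):
        # SNS-delivered events usually wrap the payload in a JSON envelope with a "Message" field.
        body = json.loads(msg["Body"])
        print(body.get("Message", body))
        # Intentionally not calling delete_message, so the real consumer can still process these.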
Yes the tokenserver stack is connected and listening to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage.

BTW: I forgot to do the mysql migration piece. I tried to run it and this happened. I'm guessing that since it was a fresh RDS, tokenserver 1.2.9 created the latest DB schema and we didn't need to do the schema upgrade.

    INFO  [alembic.migration] Context impl MySQLImpl.
    INFO  [alembic.migration] Will assume non-transactional DDL.
    INFO  [alembic.migration] Running upgrade None -> 17d209a72e2f, add replaced_at_idx
    Traceback (most recent call last):
      File "./local/bin/alembic", line 12, in <module>
        load_entry_point('alembic==0.6.7', 'console_scripts', 'alembic')()
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 306, in main
        CommandLine(prog=prog).main(argv=argv)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 300, in main
        self.run_cmd(cfg, options)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 286, in run_cmd
        **dict((k, getattr(options, k)) for k in kwarg)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/command.py", line 129, in upgrade
        script.run_env()
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/script.py", line 208, in run_env
        util.load_python_file(self.dir, 'env.py')
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/util.py", line 230, in load_python_file
        module = load_module_py(module_id, path)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/compat.py", line 63, in load_module_py
        mod = imp.load_source(module_id, path, fp)
      File "tokenserver/assignment/sqlnode/migrations/env.py", line 75, in <module>
        run_migrations_online()
      File "tokenserver/assignment/sqlnode/migrations/env.py", line 68, in run_migrations_online
        context.run_migrations()
      File "<string>", line 7, in run_migrations
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/environment.py", line 696, in run_migrations
        self.get_context().run_migrations(**kw)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/migration.py", line 266, in run_migrations
        change(**kw)
      File "tokenserver/assignment/sqlnode/migrations/versions/17d209a72e2f_add_replaced_at_idx.py", line 20, in upgrade
        op.create_index('replaced_at_idx', 'users', ['service', 'replaced_at'])
      File "<string>", line 7, in create_index
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/operations.py", line 797, in create_index
        self._index(name, table_name, columns, schema=schema, **kw)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/ddl/impl.py", line 169, in create_index
        self._exec(schema.CreateIndex(index))
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/ddl/impl.py", line 81, in _exec
        conn.execute(construct, *multiparams, **params)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute
        return meth(self, multiparams, params)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/sql/ddl.py", line 69, in _execute_on_connection
        return connection._execute_ddl(self, multiparams, params)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 783, in _execute_ddl
        compiled
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
        context)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _handle_dbapi_exception
        exc_info
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
        reraise(type(exception), exception, tb=exc_tb)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
        context)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute
        cursor.execute(statement, parameters)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/cursors.py", line 132, in execute
        result = self._query(query)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/cursors.py", line 271, in _query
        conn.query(q)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 726, in query
        self._affected_rows = self._read_query_result(unbuffered=unbuffered)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 861, in _read_query_result
        result.read()
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 1064, in read
        first_packet = self.connection._read_packet()
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 826, in _read_packet
        packet.check_error()
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 370, in check_error
        raise_mysql_exception(self._data)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/err.py", line 116, in raise_mysql_exception
        _check_mysql_exception(errinfo)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/err.py", line 112, in _check_mysql_exception
        raise InternalError(errno, errorvalue)
    sqlalchemy.exc.InternalError: (InternalError) (1061, u"Duplicate key name 'replaced_at_idx'") 'CREATE INDEX replaced_at_idx ON users (service, replaced_at)' ()
> Yes the tokenserver stack is connected and
> listening to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage.

OK, awesome, sounds like we can do a full verification any time convenient for James.

> BTW: I forgot to do the mysql migration piece.
>
> I tried to run it and this happened. I'm guessing that since it was a fresh
> RDS tokenserver 1.2.9 created the latest DB schema and we didn't need to do the schema upgrade.

Ah yes, that's almost certainly what happened. Instead you will want to do `alembic stamp` to mark the schema version in the db:

    ./local/bin/alembic stamp head

Obviously the proper migration will need to be applied in prod.
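A small hedged follow-up of my own: the result of the stamp can be read back from the alembic_version table that alembic maintains. A sketch, with a placeholder DB URL and SQLAlchemy-0.9-style engine.execute (matching the version in the traceback above):

    # Hedged sketch: confirm what "alembic stamp head" recorded.
    from sqlalchemy import create_engine

    DB_URI = "mysql+pymysql://user:password@ts-rds-host/tokenserver"  # placeholder

    engine = create_engine(DB_URI)
    version = engine.execute("SELECT version_num FROM alembic_version").scalar()
    print("current alembic revision:", version)  # expected to be 17d209a72e2f if that is head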
OK. I ran:

    ./local/bin/alembic stamp head

since the virtualenv is built in ./local. Otherwise it worked. :)
I believe that loadtesting and verification on this is still pending; marking ni? :jbonacci so the status is a little more obvious
Flags: needinfo?(jbonacci)
Sounds like :jbonacci is PTO, switching this to Karl
Flags: needinfo?(jbonacci) → needinfo?(kthiessen)
QA Contact: kthiessen
I am indeed acting as :jbonacci's backup -- do we have some idea of urgency/deadline for this?
Flags: needinfo?(kthiessen)
No particular urgency, I just don't want it to get forgotten :-)
During the initial loadtest, we got a lot of 503s that seem to be due to:

    OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on u'ts-rds-cloudops-3.czvvrkdqhklf.us-east-1.rds.amazonaws.com' ((1040, u'Too many connections'))

I vaguely recall scaling down the stage tokenserver RDS instance; perhaps we need to scale it back up for loadtesting? :mostlygeek, any suggestion?
Flags: needinfo?(bwong)
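As an aside on those 1040 "Too many connections" errors: the usage can be cross-checked against the instance's configured ceiling with a quick query like the sketch below. The host comes from the error above; user, password and database name are placeholders.

    # Hedged diagnostic sketch: compare current MySQL connection usage against the limit.
    import pymysql

    conn = pymysql.connect(
        host="ts-rds-cloudops-3.czvvrkdqhklf.us-east-1.rds.amazonaws.com",
        user="tokenserver",    # placeholder
        password="********",   # placeholder
        db="tokenserver",      # placeholder
    )
    cur = conn.cursor()

    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    print(cur.fetchone())   # the ceiling; on RDS this defaults to a memory-based formula

    cur.execute("SHOW STATUS LIKE 'Threads_connected'")
    print(cur.fetchone())   # current connections across all webheads

    conn.close()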
Yes, the DB was scaled down to a db.m3.medium. If we're doing a lot of load testing, I can scale it up to match the PROD db sizing.
Flags: needinfo?(bwong)
> I can scale it up to match the PROD db sizing.

OK, makes sense, please do so for the purposes of testing this release.
Ben, any word on resizing this db so we can get this tested and deployed?
Flags: needinfo?(bwong)
This is a good opportunity to make and load test some DB changes I've been planning.

What is in prod currently:
- m1.large, multi-az with 100GB @ 1000 PIOPS

What I want to change it to:
- r3.large, multi-az with 500GB GP2 SSD

This would be less expensive and also hopefully give us the performance we need. We've never gone past 100 write IOPS/sec, so we have enough IOPS capacity. What we should keep an eye on is the IO latencies with GP2. If they average under 10ms, I think we are OK to discuss this further for production.

I'll update this bug when the RDS has finished upgrading and the application cluster sizing matches production. Also reopening this bug so we can use it to test things further. :)
Status: RESOLVED → REOPENED
Flags: needinfo?(bwong)
Resolution: FIXED → ---
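A hedged aside on watching the GP2 latencies Ben mentions above: something like the CloudWatch query below could be run during or after the loadtest. The DB instance identifier is inferred from the stage endpoint hostname and may not be exact; boto3 is used purely for illustration.

    # Hedged sketch: pull average RDS WriteLatency over the last hour to check the "under 10ms" target.
    from datetime import datetime, timedelta
    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="WriteLatency",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "ts-rds-cloudops-3"}],  # assumed identifier
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )

    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        # WriteLatency is reported in seconds; convert to ms for the 10ms threshold.
        print(point["Timestamp"], "%.2f ms" % (point["Average"] * 1000))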
Application cluster is 3x c3.large, will auto-scale up when CPU is > 60%.

RDS server is now: r3.large, multi-az, 500GB GP2.

Please commence load testing!
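Purely as an illustration of the "scale up when CPU > 60%" rule described above (not the actual Ops configuration), a CloudWatch alarm wired to a simple scaling policy might look roughly like this; the group, policy, and alarm names are hypothetical.

    # Hedged illustration only; names below are hypothetical, not the real stage config.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="tokenserver-stage",   # hypothetical
        PolicyName="tokenserver-stage-scale-up",    # hypothetical
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,
    )

    cloudwatch.put_metric_alarm(
        AlarmName="tokenserver-stage-cpu-high",     # hypothetical
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "tokenserver-stage"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )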
Let's hold off on load testing this until bug 1110520 is done.
Depends on: 1110520
OK, tokenserver on stage has been upgraded to use Verifier 0.3.0-1. It should be good to QA now.
No longer depends on: 1110520
:mostlygeek scaled down the application cluster over the weekend, and has just scaled it back up for testing. I will be commencing a 40user/5agent/1hour loadtest as soon as the cluster is done scaling up -- about 11:30 Pacific.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
https://loads.services.mozilla.com/run/be65a847-1bc6-4db0-831a-e84e6317948a has run successfully. Thanks to :rfkelly for pointing out to me that the megabench config had acquired a new parameter. Peak for this test was 186 RPS; I'm now going to try a 60 user/1 hour test to see if I can drive that just a bit higher.
Attached RDS graphs of load test information:

- looks really good!
- latency for write hovers around 2ms!
- hit around 750 IOPS ... 1/2 of the 1500 IOPS from the 500GB GP2 volume, spot on!

Looks great!
These test results have been looked over by :kthiessen, :rfkelly, :mostlygeek, and :bobm. We pronounce it good -- now on to the prod deploy (which will happen after the new year, as Ops is a bit short-staffed over the holidays.) Production deploy ticket is bug 1114773.
Status: RESOLVED → VERIFIED
Results from the longer load test. Other than an initial spike in IO latency, the GP2 IO latency held solid for the load test. Attached is a screenshot showing load test stats on RDS.