Closed Bug 1091313 Opened 11 years ago Closed 11 years ago

Please deploy tokenserver 1.2.11 to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

Details

(Whiteboard: [qa+])

Attachments

(2 files)

This version of tokenserver includes updated versions of many of our dependencies, as well as the following changes:

* Bug 1009361 - add script for allocating user to a specific node
* Bug 1008081 - remove unused "node" column from users table
* Bug 1009970 - decrement node allocation count during cleanup
* GH #72 - ignore non-"delete" SQS events, rather than logging an error
* Bug 1043753 - don't select nodes marked "backoff" for assignment
* Bug 1015526 - incrementally release node storage capacity

Please deploy to stage, do the db migration, then the usual loadtest suite:

* Build the rpm and deploy to all webheads
* From one of the webheads, run the migration tool: ./bin/alembic upgrade head

Bob, note that this includes the stage-release-of-node-capacity feature that we talked about while I was in MV. You should consider whether you want to add any cron-based component to that to ensure capacity gets released in a more controlled fashion.
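An optional spot-check idea, not part of the request above: after `./bin/alembic upgrade head` runs, the Bug 1008081 change can be confirmed by checking that the users table no longer has a "node" column. A minimal sketch, assuming SQLAlchemy is available on the webhead and using a placeholder database URL:

    # Hedged sketch only: verify the Bug 1008081 schema change after the alembic
    # upgrade. DB_URI is a placeholder, not the real stage connection string.
    from sqlalchemy import create_engine, inspect

    DB_URI = "mysql+pymysql://user:password@ts-rds-host/tokenserver"  # placeholder

    engine = create_engine(DB_URI)
    columns = [col["name"] for col in inspect(engine).get_columns("users")]

    # The unused "node" column should be gone once the migration has been applied.
    assert "node" not in columns, "migration not applied: 'node' column still present"
    print("users columns:", columns)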
As part of verification, we need to check whether this fixes the account-deletion errors noted in Bug 1090412 Comment 5.
1.2.11 deployed. I think it needs some messages from SQS before we can verify it is working correctly.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
OK. I will hold off on verifying this until at least Friday.
> I think it needs some messages from SQS before we can verify it is working correctly.

In theory this is as simple as creating, then deleting, an account on the stage FxA server. Assuming they're all plumbed together correctly after the move of course...
A little more detail would be helpful. So, this is TS+Verifier in Stage. The current tests (unit, integration, load) do not talk to FxA, correct? But FxA talks to TS+Verifier, so once we have FxA in the new Dev IAM (not sure when that happens), we could generate some SQS traffic? If that is true, then perhaps the verification of this bug would have to wait until at least next week (or until I can verify that FxA Stage is in the new Dev IAM).
> The current tests do not talk to FxA correct?

Correct.

> But FxA talks to TS+Verifier, so once we have FxA in new Dev IAM (not sure when that happens),
> we could generate some SQS?

Right, in theory the stage FxA should send deletion events to the stage tokenserver via SQS. So we can do loadtesting etc in the meantime, but final verification will probably have to wait on stage FxA being up and running.
> Right, in theory the stage FxA should send deletion events to the stage tokenserver via SQS.
>
> So we can do loadtesting etc in the meantime, but final verification will probably have to wait on stage FxA being up and running.

Uh, stage fxa moved to the new IAM two weeks ago, and is publishing delete events to SNS. Are you connected to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage?
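A hedged aside: one quick way to see whether delete events are actually landing on that queue is a throwaway read like the sketch below. The queue URL is derived from the ARN above and is an assumption; boto3 is used purely for illustration and is not necessarily what the tokenserver consumer itself uses. Reading this way temporarily hides messages from the real consumer until the visibility timeout expires, so don't delete anything.

    # Hedged sketch: peek at the stage account-change queue for FxA delete events.
    import json
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/927034868273/ts-account-change-stage"  # assumed from the ARN

    sqs = boto3.client("sqs", region_name="us-east-1")
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)

    for msg in resp.get("Messages", []):
        # SNS-delivered events usually wrap the payload in a JSON envelope with a "Message" field.
        body = json.loads(msg["Body"])
        print(body.get("Message", body))
        # Intentionally not calling delete_message, so the real consumer can still process these.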
Yes the tokenserver stack is connected and listening to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage.

BTW: I forgot to do the mysql migration piece. I tried to run it and this happened. I'm guessing that since it was a fresh RDS, tokenserver 1.2.9 created the latest DB schema and we didn't need to do the schema upgrade.

    INFO  [alembic.migration] Context impl MySQLImpl.
    INFO  [alembic.migration] Will assume non-transactional DDL.
    INFO  [alembic.migration] Running upgrade None -> 17d209a72e2f, add replaced_at_idx
    Traceback (most recent call last):
      File "./local/bin/alembic", line 12, in <module>
        load_entry_point('alembic==0.6.7', 'console_scripts', 'alembic')()
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 306, in main
        CommandLine(prog=prog).main(argv=argv)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 300, in main
        self.run_cmd(cfg, options)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/config.py", line 286, in run_cmd
        **dict((k, getattr(options, k)) for k in kwarg)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/command.py", line 129, in upgrade
        script.run_env()
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/script.py", line 208, in run_env
        util.load_python_file(self.dir, 'env.py')
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/util.py", line 230, in load_python_file
        module = load_module_py(module_id, path)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/compat.py", line 63, in load_module_py
        mod = imp.load_source(module_id, path, fp)
      File "tokenserver/assignment/sqlnode/migrations/env.py", line 75, in <module>
        run_migrations_online()
      File "tokenserver/assignment/sqlnode/migrations/env.py", line 68, in run_migrations_online
        context.run_migrations()
      File "<string>", line 7, in run_migrations
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/environment.py", line 696, in run_migrations
        self.get_context().run_migrations(**kw)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/migration.py", line 266, in run_migrations
        change(**kw)
      File "tokenserver/assignment/sqlnode/migrations/versions/17d209a72e2f_add_replaced_at_idx.py", line 20, in upgrade
        op.create_index('replaced_at_idx', 'users', ['service', 'replaced_at'])
      File "<string>", line 7, in create_index
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/operations.py", line 797, in create_index
        self._index(name, table_name, columns, schema=schema, **kw)
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/ddl/impl.py", line 169, in create_index
        self._exec(schema.CreateIndex(index))
      File "/data/tokenserver/local/lib/python2.6/site-packages/alembic/ddl/impl.py", line 81, in _exec
        conn.execute(construct, *multiparams, **params)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute
        return meth(self, multiparams, params)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/sql/ddl.py", line 69, in _execute_on_connection
        return connection._execute_ddl(self, multiparams, params)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 783, in _execute_ddl
        compiled
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
        context)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _handle_dbapi_exception
        exc_info
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
        reraise(type(exception), exception, tb=exc_tb)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
        context)
      File "/data/tokenserver/local/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute
        cursor.execute(statement, parameters)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/cursors.py", line 132, in execute
        result = self._query(query)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/cursors.py", line 271, in _query
        conn.query(q)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 726, in query
        self._affected_rows = self._read_query_result(unbuffered=unbuffered)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 861, in _read_query_result
        result.read()
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 1064, in read
        first_packet = self.connection._read_packet()
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 826, in _read_packet
        packet.check_error()
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/connections.py", line 370, in check_error
        raise_mysql_exception(self._data)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/err.py", line 116, in raise_mysql_exception
        _check_mysql_exception(errinfo)
      File "/data/tokenserver/local/lib/python2.6/site-packages/pymysql/err.py", line 112, in _check_mysql_exception
        raise InternalError(errno, errorvalue)
    sqlalchemy.exc.InternalError: (InternalError) (1061, u"Duplicate key name 'replaced_at_idx'") 'CREATE INDEX replaced_at_idx ON users (service, replaced_at)' ()
> Yes the tokenserver stack is connected and
> listening to arn:aws:sqs:us-east-1:927034868273:ts-account-change-stage.

OK, awesome, sounds like we can do a full verification any time convenient for James.

> BTW: I forgot to do the mysql migration piece.
>
> I tried to run it and this happened. I'm guessing that since it was a fresh
> RDS tokenserver 1.2.9 created the latest DB schema and we didn't need to do the schema upgrade.

Ah yes, that's almost certainly what happened. Instead you will want to do `alembic stamp` to mark the schema version in the db:

    ./local/bin/alembic stamp head

Obviously the proper migration will need to be applied in prod.
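A small hedged follow-up of my own: the result of the stamp can be read back from the alembic_version table that alembic maintains. A sketch, with a placeholder DB URL and SQLAlchemy-0.9-style engine.execute (matching the version in the traceback above):

    # Hedged sketch: confirm what "alembic stamp head" recorded.
    from sqlalchemy import create_engine

    DB_URI = "mysql+pymysql://user:password@ts-rds-host/tokenserver"  # placeholder

    engine = create_engine(DB_URI)
    version = engine.execute("SELECT version_num FROM alembic_version").scalar()
    print("current alembic revision:", version)  # expected to be 17d209a72e2f if that is head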
OK. I ran:

    ./local/bin/alembic stamp head

since the virtualenv is built in ./local. Otherwise it worked. :)
I believe that loadtesting and verification on this is still pending; marking ni? :jbonacci so the status is a little more obvious
Flags: needinfo?(jbonacci)
Sounds like :jbonacci is PTO, switching this to Karl
Flags: needinfo?(jbonacci) → needinfo?(kthiessen)
QA Contact: kthiessen
I am indeed acting as :jbonacci's backup -- do we have some idea of urgency/deadline for this?
Flags: needinfo?(kthiessen)
No particular urgency, I just don't want it to get forgotten :-)
During the initial loadtest, we got a lot of 503s that seem to be due to:

    OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on u'ts-rds-cloudops-3.czvvrkdqhklf.us-east-1.rds.amazonaws.com' ((1040, u'Too many connections'))

I vaguely recall scaling down the stage tokenserver RDS instance; perhaps we need to scale it back up for loadtesting? :mostlygeek, any suggestion?
Flags: needinfo?(bwong)
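As an aside on those 1040 "Too many connections" errors: the usage can be cross-checked against the instance's configured ceiling with a quick query like the sketch below. The host comes from the error above; user, password and database name are placeholders.

    # Hedged diagnostic sketch: compare current MySQL connection usage against the limit.
    import pymysql

    conn = pymysql.connect(
        host="ts-rds-cloudops-3.czvvrkdqhklf.us-east-1.rds.amazonaws.com",
        user="tokenserver",    # placeholder
        password="********",   # placeholder
        db="tokenserver",      # placeholder
    )
    cur = conn.cursor()

    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    print(cur.fetchone())   # the ceiling; on RDS this defaults to a memory-based formula

    cur.execute("SHOW STATUS LIKE 'Threads_connected'")
    print(cur.fetchone())   # current connections across all webheads

    conn.close()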
Yes, the DB was scaled down to a db.m3.medium. If we're doing a lot of load testing, I can scale it up to match the PROD db sizing.
Flags: needinfo?(bwong)
> I can scale it up to match the PROD db sizing.

OK, makes sense, please do so for the purposes of testing this release.
Ben, any word on resizing this db so we can get this tested and deployed?
Flags: needinfo?(bwong)
This is a good opportunity to make and load test some DB changes I've been planning.

What is in prod currently:
- m1.large, multi-az with 100GB @ 1000 PIOPS

What I want to change it to:
- r3.large, multi-az with 500GB GP2 SSD

This would be less expensive and also hopefully give us the performance we need. We've never gone past 100 write IOPS/sec, so we have enough IOPS capacity. What we should keep an eye on is the IO latencies with GP2. If they average under 10ms, I think we are OK to discuss this further for production.

I'll update this bug when the RDS has finished upgrading and the application cluster sizing matches production. Also reopening this bug so we can use it to test things further. :)
Status: RESOLVED → REOPENED
Flags: needinfo?(bwong)
Resolution: FIXED → ---
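A hedged aside on watching the GP2 latencies Ben mentions above: something like the CloudWatch query below could be run during or after the loadtest. The DB instance identifier is inferred from the stage endpoint hostname and may not be exact; boto3 is used purely for illustration.

    # Hedged sketch: pull average RDS WriteLatency over the last hour to check the "under 10ms" target.
    from datetime import datetime, timedelta
    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")

    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="WriteLatency",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "ts-rds-cloudops-3"}],  # assumed identifier
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )

    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        # WriteLatency is reported in seconds; convert to ms for the 10ms threshold.
        print(point["Timestamp"], "%.2f ms" % (point["Average"] * 1000))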
Application cluster is 3x c3.large, will auto-scale up when CPU is > 60%.

RDS server is now: r3.large, multi-az, 500GB GP2.

Please commence load testing!
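Purely as an illustration of the "scale up when CPU > 60%" rule described above (not the actual Ops configuration), a CloudWatch alarm wired to a simple scaling policy might look roughly like this; the group, policy, and alarm names are hypothetical.

    # Hedged illustration only; names below are hypothetical, not the real stage config.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="tokenserver-stage",   # hypothetical
        PolicyName="tokenserver-stage-scale-up",    # hypothetical
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,
    )

    cloudwatch.put_metric_alarm(
        AlarmName="tokenserver-stage-cpu-high",     # hypothetical
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "tokenserver-stage"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )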
Let's hold off on load testing this until bug 1110520 is done.
Depends on: 1110520
OK, tokenserver on stage has been upgraded to use Verifier 0.3.0-1. It should be good to QA now.
No longer depends on: 1110520
:mostlygeek scaled down the application cluster over the weekend, and has just scaled it back up for testing. I will be commencing a 40user/5agent/1hour loadtest as soon as the cluster is done scaling up -- about 11:30 Pacific.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
https://loads.services.mozilla.com/run/be65a847-1bc6-4db0-831a-e84e6317948a has run successfully. Thanks to :rfkelly for pointing out to me that the megabench config had acquired a new parameter. Peak for this test was 186 RPS; I'm now going to try a 60 user/1 hour test to see if I can drive that just a bit higher.
Attached RDS graphs of load test information:

- looks really good!
- latency for write hovers around 2ms!
- hit around 750 IOPS ... 1/2 of the 1500 IOPS from the 500GB GP2 volume, spot on!

Looks great!
These test results have been looked over by :kthiessen, :rfkelly, :mostlygeek, and :bobm. We pronounce it good -- now on to the prod deploy (which will happen after the new year, as Ops is a bit short-staffed over the holidays.) Production deploy ticket is bug 1114773.
Status: RESOLVED → VERIFIED
Results from the longer load test. Other than an initial spike in IO latency, the GP2 IO latency held solid for the load test. Attached is a screenshot showing load test stats on RDS.