Closed Bug 1277304 Opened 10 years ago Closed 9 years ago

Switch SCL3 treeherder stage to Heroku

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(1 file, 1 obsolete file)

Now that the SCL3 prod DB has been copied to a new Heroku RDS instance (bug 1176486), we're almost ready to proceed.

Other bugs left:
* Fixing the astral plane character hack to work with UCS2 Python (bug 1277300)
* Adding the certs/keys to the SSL endpoint addon (bug 1277268)
* Copying the SCL3 stage Hawk credentials over to the new RDS instance (bug 1277269)

Then steps to perform here:
* [DONE] Temporarily set SITE_URL to the herokuapp.com domain, so ingestion will work until DNS is switched over (since I'm going to start ingestion early, given that replication broke).
* [DONE] Deploy master to the treeherder-heroku app, and let the massive migration of bug 1276867 and friends run.
* [DONE] Increase webhead/worker counts.
* Create a branch that has the worker and gunicorn bin/ scripts disabled, for use in turning off the processes on SCL3 stage.
* Re-run the config comparison script in bug 1176487 to check everything is set up.
* <Wait for dep bugs to be fixed>
* File a bug for updating the stage DNS (the target CNAME can be found following https://devcenter.heroku.com/articles/custom-domains#view-existing-domains), asking them to prompt us when they're available to perform the switch.
* On the day:
  - Deploy the disable-workers/gunicorn branch to SCL3 stage.
  - Update the heroku treeherder-stage app's SITE_URL, TREEHERDER_ALLOWED_HOSTS, TREEHERDER_REQUEST_HOST to use treeherder.allizom.org rather than the herokuapp.com domain.
  - Get webops to make the DNS switch (the TTLs have already been lowered in bug 1273008).
  - Verify certs using https://devcenter.heroku.com/articles/ssl-endpoint#testing-ssl
Depends on: 1277726
Currently blocked on New Relic support in bug 1277726. That + the all hands coming up means we won't be doing the _stage_ SCL3 move prior to the all hands (maybe during depending on what we decide there, otherwise after).
Depends on: 1279169
Depends on: 1279264
Depends on: 1282839
Attachment #8766104 - Attachment is obsolete: true
Attachment #8766104 - Flags: review?(cdawson)
Attachment #8766104 - Flags: feedback?(wlachance)
Comment on attachment 8766107 [treeherder] mozilla:scl3-only-migration-changes > mozilla:master

Bah, closing the PR (to prevent it from being accidentally merged) meant GitHub stopped accepting changes to the branch, and GitHub wouldn't let me reopen the PR, so I've had to create a new one.
Attachment #8766107 - Flags: review?(cdawson)
Attachment #8766107 - Flags: feedback?(wlachance)
Depends on: 1283111
Remaining pre-stage-migration steps:
* Wait for blockers to be fixed:
  - Bug 1277726 - New Relic Python agent runtime instrumentation error
  - Bug 1283111 - Determine which Nagios alerts need adjusting during the Treeherder Heroku migration
  - Bug 1279169 - Check the new stage RDS instance's DB schema is still consistent with SCL3's
    -> Though this one wants to be done as close to migration as possible.
* Ensure that the prod DB migration bug is at least in progress, so the prod Treeherder switch can happen soon after stage. See:
  - Bug 1283170 - Setup Treeherder replication to AWS (prod SCL3 -> prod RDS instance)
* Pick a day to make the switch.
* File a webops bug for updating the stage DNS, mentioning the rough day/time, but asking them to coordinate on the day before performing the switch for real.
* Email tools-treeherder & the tools list about the stage downtimes (and stage being changed to the prod DB content), saying to follow tools-treeherder for further updates.

On the day:
* Deploy the same revision as on SCL3 stage on Heroku stage, & also rebase the `scl3-only-migration-changes` branch onto this revision.
* Check there have been no changes to Hawk credentials on SCL3 stage since the last sync on 2016-06-28 (see last modified times on: https://treeherder.allizom.org/admin/credentials/credentials/).
* Check no new environment variables have been set on SCL3 stage that aren't set on Heroku.
* Re-run the config comparison script in bug 1176487 to check everything is set up.
* Check Heroku stage looks healthy on New Relic (no new exceptions / transaction times ok).
* Request that webops make any Nagios changes if needed (depends on the result of bug 1283111).
* Announce the imminent migration in #treeherder & on the tools-treeherder list (we won't be closing trees since it's only stage).
* Deploy the `scl3-only-migration-changes` branch to SCL3 stage only (see PR attachment above).
* Reduce the Heroku stage dyno count for the celerybeat/read_pulse_jobs worker types to zero (to stop ingestion whilst SITE_URL is being updated).
* Update the heroku treeherder-stage app's SITE_URL and TREEHERDER_ALLOWED_HOSTS to use treeherder.allizom.org rather than treeherder-stage.herokuapp.com.
* Get webops to make the DNS switch (the TTLs have already been lowered in bug 1273008; the target CNAME can be found following https://devcenter.heroku.com/articles/custom-domains#view-existing-domains).
* Once the DNS has propagated, verify certs using https://devcenter.heroku.com/articles/ssl-endpoint#testing-ssl
* Change the celerybeat/read_pulse_jobs dyno worker count back to 1.
* Check Heroku stage looks healthy on New Relic & by visiting https://treeherder.allizom.org.
* Announce stage migration complete in #treeherder & on the tools-treeherder list.
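The stop-ingestion / repoint / resume part of the list above could be scripted roughly like this. It's only a sketch: it assumes the Heroku CLI is installed, that the process types are literally named `celerybeat` and `read_pulse_jobs` (taken from the wording above, not checked against the Procfile), and that SITE_URL takes an `https://` URL. The helper names are made up for illustration. It only builds the command lines; nothing is executed unless they are handed to `subprocess.run`.

```python
import shlex

APP = "treeherder-stage"  # Heroku app name as used in the steps above
NEW_HOST = "treeherder.allizom.org"
# Assumption: process type names match the worker types named above.
INGESTION_PROCESSES = ["celerybeat", "read_pulse_jobs"]


def scale_command(count, app=APP):
    """Build a `heroku ps:scale` command for the ingestion process types."""
    targets = [f"{name}={count}" for name in INGESTION_PROCESSES]
    return ["heroku", "ps:scale", *targets, "--app", app]


def config_command(host=NEW_HOST, app=APP):
    """Build a `heroku config:set` command pointing the app at the final hostname."""
    settings = {
        "SITE_URL": f"https://{host}",  # assumed URL form
        "TREEHERDER_ALLOWED_HOSTS": host,
    }
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return ["heroku", "config:set", *pairs, "--app", app]


def migration_commands():
    """Commands in the order listed above: stop ingestion, repoint, resume.

    The DNS switch and cert check happen (by hand) between the last two.
    """
    return [scale_command(0), config_command(), scale_command(1)]


if __name__ == "__main__":
    for command in migration_commands():
        print(shlex.join(command))
```

Printing the commands first and running them by hand keeps a human in the loop for the DNS step, which isn't something this can automate.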
I've also unset ELASTICSEARCH_URL on Heroku stage/prod, since we'll want to switch from SCL3 to Heroku as close to like-to-like as possible, rather than also turning on the Elasticsearch functionality at the same time.
Comment on attachment 8766107 [treeherder] mozilla:scl3-only-migration-changes > mozilla:master

Looks good to me. I mean, there's no dancing cat .gif, so it's not PERFECT. But very nearly so... :)
Attachment #8766107 - Flags: review?(cdawson) → review+
Comment on attachment 8766107 [treeherder] mozilla:scl3-only-migration-changes > mozilla:master

This looks reasonable to me.
Attachment #8766107 - Flags: feedback?(wlachance) → feedback+
No longer depends on: 1277726
Depends on: 1286702
Using the new scripts in bug 1176484, I've compared SCL3 with Heroku again, and fixed a few deviations...

Added to Heroku stage:
* PULSE_EXCHANGE_NAMESPACE='treeherder-stage'
* PULSE_URI='amqp://treeherder-stage:REDACTED@pulse.mozilla.org:5671?ssl=true'
* ORANGEFACTOR_HAWK_KEY='REDACTED' (not set on SCL3 stage, but useful for initial testing)

Added to Heroku prod:
* PULSE_EXCHANGE_NAMESPACE='treeherder'
* PULSE_URI='amqp://treeherder:REDACTED@pulse.mozilla.org:5671?ssl=true'
* PULSE_DATA_INGESTION_SOURCES='[{"exchange": "exchange/taskcluster-treeherder/v1/jobs","destinations": ["tc-treeherder"],"projects": ["#"]}]'
* PULSE_DATA_INGESTION_CONFIG="amqp://treeherder-prod:REDACTED@pulse.mozilla.org:5671/?ssl=true"
* ORANGEFACTOR_HAWK_KEY='REDACTED'

Remaining differences:
* SKIP_PREDEPLOY is set on Heroku prod (will unset on migration day, once DB replication has stopped).
* Need to set DATABASE_URL to the correct value on Heroku prod once the DB is set up, we have credentials, and mysql is set to read-only.
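For anyone wanting to repeat this kind of comparison, the core of it can be sketched as below. This is not the script from bug 1176484 (which isn't reproduced here), just a minimal illustration assuming each environment has already been dumped to a dict of variable name -> value; `diff_env` is a hypothetical name. It reports names only, so credentials never end up in the output.

```python
def diff_env(reference, candidate, ignore=()):
    """Compare two environments given as dicts of variable name -> value.

    Returns (missing, extra, changed) as sorted lists of variable names:
    missing = set in reference but not in candidate,
    extra   = set in candidate but not in reference,
    changed = set in both with different values.
    """
    ref_keys = set(reference) - set(ignore)
    cand_keys = set(candidate) - set(ignore)
    missing = sorted(ref_keys - cand_keys)
    extra = sorted(cand_keys - ref_keys)
    changed = sorted(k for k in ref_keys & cand_keys if reference[k] != candidate[k])
    return missing, extra, changed


if __name__ == "__main__":
    # Illustrative values only, not the real configuration.
    scl3 = {
        "PULSE_EXCHANGE_NAMESPACE": "treeherder-stage",
        "SITE_URL": "https://treeherder.allizom.org",
    }
    heroku = {
        "SITE_URL": "https://treeherder-stage.herokuapp.com",
        "SKIP_PREDEPLOY": "1",
    }
    missing, extra, changed = diff_env(scl3, heroku)
    print("missing on Heroku:", missing)
    print("only on Heroku:", extra)
    print("different values:", changed)
```

The `ignore` argument is there for variables that are expected to differ between the two platforms (DATABASE_URL being the obvious candidate).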
I've emailed Hawk credential owners warning about the need to ensure their Treeherder API requests retry during the maintenance window: https://groups.google.com/forum/#!topic/mozilla.tools.treeherder/Xn_zjQJ3-sA
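As a rough illustration of what "retry" could look like on the submitter side: a minimal backoff wrapper, assuming the client raises `ConnectionError`/`TimeoutError` when Treeherder is unreachable. `call_with_retry` and its parameters are hypothetical and not part of any Treeherder client; a real submitter would also want to retry on 5xx responses from whatever HTTP library it uses.

```python
import random
import time


def call_with_retry(request, attempts=5, base_delay=1.0, max_delay=60.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `request()` until it succeeds or `attempts` tries have been made.

    Between tries, waits base_delay * 2**n seconds (capped at max_delay),
    scaled by a random factor in [0.5, 1.0] so clients don't retry in lockstep.
    The final failure is re-raised so the caller still sees the error.
    """
    for attempt in range(attempts):
        try:
            return request()
        except retry_on:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

With the defaults this gives up after five tries and well under a minute of waiting in total, so `attempts`/`max_delay` would need raising to ride out a longer maintenance window.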
Buildapi ingestion was hitting R14 Heroku platform errors last week (exceeding memory usage of the dyno, causing it to be killed), and pushlog this week. I've bumped both to the P2 dyno type to fix short term, but will file a bug to see about lowering memory usage (appears to be a regression from bug 1280913 / bug 1281056).
The Heroku stage logs contain errors like: "Warning: Invalid utf8 character string: '9C554F'" ...however they're also on SCL3, so doesn't block the migration. (See bug 1287502)
Depends on: 1289156
Depends on: 1291307
No longer depends on: 1282839
Depends on: 1283170
Depends on: 1307319
The Pulse ingestion issues are now fixed, so we can proceed. (They were only appearing in some environments, causing confusion as to parity between them, but we now know why - Bugzilla push ingestion is still using the API.)

I've checked that:
* the Hawk credentials are in sync between SCL3 stage and Heroku stage
* the environment variables are in sync between SCL3 stage and Heroku stage
* New Relic looks fine for Heroku stage
* the SSL certs are served correctly from the Heroku SSL endpoint addon
* no Nagios alerts need silencing before we switch DNS
* Heroku stage is running an up to date revision

And have just:
* Pushed the scl3-only-migration-changes branch to the stage branch, and deployed to SCL3 stage
* Checked the SCL3 rabbitmq queues are empty
* Run `sudo mysql -e 'FLUSH TABLES; SET GLOBAL read_only = 1;'` on SCL3 stage DB2 (the master)
  (the above is not needed for stage, since replication already stopped months ago, but is good practice for the prod run)
* Requested the stage DNS change in bug 1307319

From now on please don't use Chief deploy for anything but prod. Heroku stage is currently auto-deploying from master, but I'm happy to change that if needed.
And we're done for stage :-)

$ nslookup treeherder.allizom.org
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
treeherder.allizom.org  canonical name = mie-37426.herokussl.com.
mie-37426.herokussl.com canonical name = elb085634-599315446.us-east-1.elb.amazonaws.com.
Name:    elb085634-599315446.us-east-1.elb.amazonaws.com
Address: 23.23.131.39
Name:    elb085634-599315446.us-east-1.elb.amazonaws.com
Address: 184.72.249.68
Name:    elb085634-599315446.us-east-1.elb.amazonaws.com
Address: 107.21.239.199
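If this check needs repeating for prod, the CNAME chain can be pulled out of the nslookup text mechanically. A small sketch, assuming output in the same shape as above (lines containing "canonical name ="); the function names are hypothetical, and the sample is a trimmed copy of the lookup in this comment.

```python
SAMPLE = """\
Non-authoritative answer:
treeherder.allizom.org canonical name = mie-37426.herokussl.com.
mie-37426.herokussl.com canonical name = elb085634-599315446.us-east-1.elb.amazonaws.com.
Name: elb085634-599315446.us-east-1.elb.amazonaws.com
Address: 23.23.131.39
"""


def cname_chain(nslookup_output):
    """Return the CNAME targets, in order, found in nslookup's text output."""
    chain = []
    for line in nslookup_output.splitlines():
        if "canonical name =" in line:
            target = line.split("canonical name =", 1)[1].strip()
            chain.append(target.rstrip("."))  # drop the trailing root dot
    return chain


def points_at_heroku_ssl(nslookup_output):
    """True if the first CNAME hop is a *.herokussl.com endpoint."""
    chain = cname_chain(nslookup_output)
    return bool(chain) and chain[0].endswith(".herokussl.com")


if __name__ == "__main__":
    print(cname_chain(SAMPLE))
    print(points_at_heroku_ssl(SAMPLE))
```

This only confirms where DNS points; the cert check from the SSL endpoint docs is still a separate step.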
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
There have been a few hundred out of memory errors (RSS+swap peaked at 720MB) on the P1 web dynos overnight:
https://dashboard.heroku.com/apps/treeherder-stage/metrics/web?starting=24-hours-ago

We were using 3xP1 dynos (512MB RAM), with gunicorn concurrency set to 3. I've switched this to 3xP2 dynos (1GB RAM), with concurrency set to 4, and we can see how it goes.

I did notice however that on SCL3 (i.e. the bin/run_gunicorn script) we had gunicorn --max-requests set to 150, whereas in the Procfile it's set to 2000 requests. Perhaps we've had a leak on the webheads for some time, but we've just been hiding it on SCL3. Anyway, we can deal with this later, along with the other "reduce memory usage/leaks and switch to smaller dynos to save $$$" bugs.
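Back-of-envelope numbers for the dyno change, as a sketch. It assumes the 720MB peak was per dyno and that memory splits evenly across gunicorn workers (ignoring the master process and anything shared between workers), so treat the result as a rough guide only. The function names are made up.

```python
def per_worker_budget_mb(dyno_ram_mb, workers):
    """RAM available to each gunicorn worker if the dyno is split evenly."""
    return dyno_ram_mb / workers


def estimated_usage_mb(observed_peak_mb, observed_workers, new_workers):
    """Scale an observed dyno-wide peak to a different worker count."""
    return observed_peak_mb / observed_workers * new_workers


def fits(observed_peak_mb, observed_workers, dyno_ram_mb, new_workers):
    """True if the scaled peak stays within the dyno's RAM."""
    estimate = estimated_usage_mb(observed_peak_mb, observed_workers, new_workers)
    return estimate <= dyno_ram_mb


if __name__ == "__main__":
    peak_mb, old_workers = 720, 3  # figures from the comment above
    for label, ram_mb, workers in [("P1", 512, 3), ("P2", 1024, 4)]:
        budget = per_worker_budget_mb(ram_mb, workers)
        estimate = estimated_usage_mb(peak_mb, old_workers, workers)
        verdict = "ok" if fits(peak_mb, old_workers, ram_mb, workers) else "over"
        print(f"{label}: {budget:.0f}MB/worker budget, ~{estimate:.0f}MB estimated peak ({verdict})")
```

On those assumptions the old peak works out to about 240MB per worker, so four workers on a P2 would peak at roughly 960MB against 1GB: it fits, but without much headroom if there really is a leak.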
Blocks: 1307741