Closed Bug 1307741 Opened 8 years ago Closed 8 years ago

Switch SCL3 treeherder prod to Heroku

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Stage was migrated in bug 1277304 on Monday, this is for prod.

We're performing the switch today at 0830 EDT.
Final prep (30 minutes before):

* Check the same revision is deployed on SCL3 prod and Heroku prod.
* Rebase the `scl3-only-migration-changes` branch onto the `production` branch.
* Re-check the prod Heroku web/worker dyno counts match Heroku stage (other than celerybeat/Pulse listeners, which should be zero); a quick CLI cross-check is sketched after this list.
* Re-run ./export-envs.sh to check the environment variables match.
* Re-run ./export-heroku-app-info.sh to check rest of Heroku config matches Heroku stage.
* Check New Relic error analytics / transaction times.
* Disable the New Relic availability check for SCL3 prod.
* Check replication lag on both the SCL3 slave and the RDS instance:
    ssh th-prod-db2
    sudo mysql -e 'SHOW MASTER STATUS\G'
    ssh th-prod-db1
    sudo mysql -e 'SHOW SLAVE STATUS\G'
    mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com -u th_admin -p --ssl-mode=REQUIRED --ssl-verify-server-cert --ssl-ca=ca-bundle.pem
    SHOW SLAVE STATUS\G
* Pre-warm the AWS ELB used by the Heroku SSL Endpoint addon (so it scales up):
    nslookup tokyo-43605.herokussl.com
    Then run multiple instances of the following (a parallel runner is sketched after this list):
    while true; do curl --fail --max-time 5 -sSo /dev/null https://treeherder.mozilla.org --resolve 'treeherder.mozilla.org:443:IP_ADDRESS'; done
* Notify the on-duty sheriff, #treeherder and #developers that the migration is starting soon.
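
For the dyno-count and env-var checks above, a rough cross-check using the Heroku CLI (a sketch only; the export-*.sh scripts remain the canonical tools, and note that `heroku ps:scale` with no arguments just prints the current formation):

    # Compare dyno formation between stage and prod (expect only the
    # celerybeat/Pulse listener counts to differ at this point):
    diff <(heroku ps:scale -a treeherder-stage | tr ' ' '\n' | sort) \
         <(heroku ps:scale -a treeherder-prod | tr ' ' '\n' | sort)
    # Compare config vars; per-environment secrets will legitimately differ,
    # so eyeball the diff rather than expecting it to be empty:
    diff <(heroku config -a treeherder-stage | sort) \
         <(heroku config -a treeherder-prod | sort)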

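To run several instances of the pre-warm loop in parallel from a single shell, something like the following works (a sketch; substitute each of the IP addresses returned by the nslookup):

    # One background curl loop per ELB IP address:
    for ip in IP_ADDRESS_1 IP_ADDRESS_2 IP_ADDRESS_3; do
      ( while true; do
          curl --fail --max-time 5 -sSo /dev/null https://treeherder.mozilla.org \
               --resolve "treeherder.mozilla.org:443:${ip}"
        done ) &
    done
    # Stop them all afterwards with: kill $(jobs -p)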

Migration plan:

* Check fubar is online and free to assist.
* Ask fubar to silence these nagios DB checks:
    'check_mysql_config_diffs'
    'check_mysql_readonly'
* Ask fubar to merge the Terraform PR and fetch the changes locally, but not apply them yet.
    (This makes the RDS instance public and changes the parameter group away from the replication-specific one.)
    https://github.com/mozilla-platform-ops/devservices-aws/pull/4
* Notify the on-duty sheriff, #treeherder and #developers that the migration is about to start.
* Deploy the `scl3-only-migration-changes` branch to SCL3 prod, to stop ingestion/make API read-only.
* Whilst that's running, close all trees apart from try:
    https://api.pub.build.mozilla.org/treestatus
* Once the SCL3 prod deploy is complete, check there are no stuck processes on SCL3:
    ssh th-admin
    multi treeherder "ps ax -o ppid,pid,stime,command | egrep '^\s*1\s+.*[c]elery'"
    multi treeherder "ps ax -o ppid,pid,stime,command | egrep '^\s*1\ s+.*[g]unicorn'"
* Wait until the SCL3 prod rabbitmq queues are empty (a CLI polling sketch follows this list):
    http://treeherder-rabbitmq2.private.scl3.mozilla.com:15672/#/queues
* Fetch the last push ID ingested for the try repo on SCL3 (since we'll need to set it on Heroku prod's memcached later):
    for i in {1..3}; do echo 'get :1:try:last_push_id' | nc treeherder${i}.webapp.scl3.mozilla.com 11211; done
* Set the SCL3 master DB to read only (there shouldn't be any more writes, but just in case):
    ssh th-prod-db2 (NB: must be on the master)
    sudo mysql -e 'FLUSH TABLES; SET GLOBAL read_only = 1;'
    sudo mysql -e 'SHOW VARIABLES LIKE "read_only";'
* Check replication lag on both the SCL3 slave and the RDS instance:
    (whilst still on the master)
    sudo mysql -e 'SHOW MASTER STATUS\G'
    ssh th-prod-db1 (NB: must be on the slave)
    sudo mysql -e 'SHOW SLAVE STATUS\G'
    mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com -u th_admin -p --ssl-mode=REQUIRED --ssl-verify-server-cert --ssl-ca=ca-bundle.pem
    SHOW SLAVE STATUS\G
* Once `Seconds_Behind_Master` is zero, stop replication:
    (whilst still connected to the RDS instance)
    CALL mysql.rds_stop_replication;
* Ask fubar to run Terraform apply for the earlier PR (to make the RDS instance public and change the parameter group)
    Either this needs to use the 'apply immediately' option, or the instance will need a manual reboot afterwards.
* Whilst the RDS instance is rebooting, set last_push_id for the try repo in Heroku prod's memcached:
    (we have to do this since try is still open, and >10 pushes [the json-pushes default count] may have landed since ingestion stopped)
    thp run ./manage.py shell
    from django.core.cache import cache
    cache.set("try:last_push_id", <VALUE_FROM_EARLIER>)
    cache.get("try:last_push_id")
* Check the prod RDS instance has finished rebooting and is available:
    https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstances:
* Check https://treeherder-prod.herokuapp.com/ can now access the DB.
* Ask fubar to update DNS:
    CNAME treeherder.mozilla.org -> tokyo-43605.herokussl.com
    NB: Keep 60s TTL.
* Scale up celerybeat + pulse listening dynos:
    thp ps:scale worker_beat=1 worker_read_pulse_resultsets=1 worker_read_pulse_jobs=1
* Check DNS changes propagated (see the polling loop sketched after this list):
    nslookup treeherder.mozilla.org
* Check HTTP 200 and Heroku headers present for:
    curl -I https://treeherder.mozilla.org/
    curl -I https://treeherder.mozilla.org/api/
* Try https://treeherder.mozilla.org/ in browser.
* Check Heroku metrics:
    https://dashboard.heroku.com/apps/treeherder-prod/metrics/web?starting=24-hours-ago
    thp apps:errors --json
* Check New Relic error analytics / transaction times:
    https://rpm.newrelic.com/accounts/677903/applications/14179757
    https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors
* Reopen all trees:
    https://api.pub.build.mozilla.org/treestatus
* Notify the on-duty sheriff, #treeherder and #developers that the migration is complete.
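
For the "queues are empty" step, the counts can also be polled from the command line via the standard RabbitMQ management API (a sketch; USER:PASS and jq are assumptions):

    # Prints the total message count across all queues; 0 means drained.
    # Wrap in `watch -n 10 '...'` to poll.
    curl -s -u USER:PASS \
      'http://treeherder-rabbitmq2.private.scl3.mozilla.com:15672/api/queues' \
      | jq '[.[].messages // 0] | add'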

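For the DNS propagation and header checks, a small polling loop (a sketch; "Via: 1.1 vegur" is the header Heroku's HTTP router adds):

    # Loop until treeherder.mozilla.org is being served by Heroku:
    until curl -sI https://treeherder.mozilla.org/ | grep -qi '^via: 1\.1 vegur'; do
      nslookup treeherder.mozilla.org   # watch for the CNAME to flip
      sleep 30
    done
    echo 'Heroku is now serving treeherder.mozilla.org'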

Shortly after:

* Reply to the newsgroup post saying the migration is complete.
* Reset the RDS replication warning symbol:
    (reconnect to the prod RDS instance now that it has rebooted)
    CALL mysql.rds_reset_external_master;
* Set require SSL on the prod RDS instance (a quick verification is sketched after this list):
    GRANT USAGE ON *.* TO 'th_admin'@'%' REQUIRE SSL;
* Unset SKIP_PREDEPLOY (since future deploys are now safe to run the migrations):
    thp config:unset SKIP_PREDEPLOY
* Set the New Relic availability check URL for the Heroku prod app, once enough time has passed to overcome their aggressive DNS caching.
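
To verify the REQUIRE SSL grant took effect, a deliberately non-SSL connection should now be refused (a sketch; --ssl-mode=DISABLED needs a MySQL 5.7+ client):

    # Expect "Access denied" once REQUIRE SSL is in place:
    mysql -h treeherder-prod.REDACTED.us-east-1.rds.amazonaws.com \
          -u th_admin -p --ssl-mode=DISABLED -e 'SELECT 1;'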


Rollback plan:

* Undo SCL3 master DB read-only:
    ssh th-prod-db2 (NB: must be on the master)
    sudo mysql -e 'SET GLOBAL read_only = 0;'
* Deploy branch `production` to SCL3 prod.
* Ask fubar to revert DNS changes:
    CNAME treeherder.mozilla.org -> treeherder.vips.scl3.mozilla.com
* Check DNS changes propagated:
    nslookup treeherder.mozilla.org
* Check HTTP 200 and Zeus headers present for:
    curl -I https://treeherder.mozilla.org/
    curl -I https://treeherder.mozilla.org/api/
* Try https://treeherder.mozilla.org/ in browser.
* Check New Relic error analytics / transaction times.
* If >4 hours since ingestion stopped, point ingestion at the daily builds-4hr archive to backfill.
* Retrigger any Taskcluster jobs that completed whilst ingestion stopped.
* Reopen all trees:
    https://api.pub.build.mozilla.org/treestatus
* Notify the on-duty sheriff, #treeherder and #developers that the migration was aborted.
All complete, and looking good so far :-)

Trees were closed for 25 minutes (other than try, which remained open throughout).

Timeline (in UTC+1):
13:37 - SCL3 prod deploy of `scl3-only-migration-changes` started.
13:41 - Push/job ingestion paused on SCL3 prod & maintenance banner shown.
13:41 - Non-try trees closed.
13:48 - Replication stopped on RDS instance.
13:50 - Terraform apply started.
13:52 - Terraform apply finished.
13:53 - RDS instance manually rebooted, since Terraform "apply immediately" didn't take effect for some reason.
13:54 - RDS instance back up.
13:56 - DNS updated in inventory.
14:02 - DNS update finished applying/has propagated.
14:06 - Trees reopened.

A couple of additions to the plan above:
* I had to scale the builds-4hr dyno to a P-M temporarily, to work around bug 1307782.
* I've adjusted the gunicorn --max-requests value for Heroku to match SCL3 (bug 1307785), since it appears we have a pre-existing leak (in all environments) that was being papered over more effectively by the --max-requests value in `bin/run_gunicorn`.
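
(For context: --max-requests makes gunicorn recycle each worker process after it has served that many requests, which bounds how far a slow leak can grow before the worker is replaced. A purely illustrative invocation, where `wsgi:application` is a placeholder rather than our actual entry point:

    gunicorn wsgi:application --workers 4 --max-requests 1000 --max-requests-jitter 50

The jitter offsets each worker's restart so they don't all recycle at once.)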
Cameron/Will/James/Kendall/Jake, I don't suppose you could:

* Make sure you have the Heroku CLI installed (https://devcenter.heroku.com/articles/heroku-command-line)
* Make sure it's up to date (`heroku update`) and logged in
* Try out some commands (eg `heroku help`, `heroku ps -a treeherder-prod`, `heroku config -a treeherder-stage`)
* (Optionally) Add aliases for each environment (the environment variable can be used instead of `-a`), eg:
    alias ths="HEROKU_APP=treeherder-stage heroku"
    alias thp="HEROKU_APP=treeherder-prod heroku"
    Use like: `thp ps`
* (If using MySQLWorkbench) Add the treeherder stage/prod instances to the saved servers list.
   **NB: Make sure "SSL required" is set.**
* (If using the CLI mysql client) Make sure you don't leak credentials when connecting - see: 
    https://github.com/mozilla-platform-ops/devservices-aws/blob/master/treeherder/README#L7-L23
    Use eg `ths config:get DATABASE_URL` to find the credentials.
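
For the last point, one way to connect without the password landing in `ps` output or your shell history (a sketch; it assumes DATABASE_URL has the form mysql://user:pass@host/db and that the password needs no URL-decoding):

    url=$(ths config:get DATABASE_URL)
    stripped=${url#mysql://}         # user:pass@host/db
    userpass=${stripped%%@*}         # user:pass
    hostdb=${stripped#*@}            # host/db
    # MYSQL_PWD keeps the password off the command line:
    MYSQL_PWD=${userpass#*:} mysql -h "${hostdb%%/*}" -u "${userpass%%:*}" \
        --ssl-mode=REQUIRED "${hostdb#*/}"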

For now, I'd like us to not deploy anything new to Heroku production for a day or two if possible, to make it easier to tell code regressions from infra regressions.

More docs will be coming in bug 1165259 within the next 24-48 hours; for now, some useful links...

Papertrail:
https://papertrailapp.com/systems/treeherder-stage/events
https://papertrailapp.com/systems/treeherder-prod/events

New Relic (as before; but see also Plugins section, not just APM):
https://rpm.newrelic.com/accounts/677903/applications

Heroku metrics (for dyno RAM/CPU stats):
https://dashboard.heroku.com/apps/treeherder-stage/metrics/web
https://dashboard.heroku.com/apps/treeherder-prod/metrics/web

Heroku environment variables (if not using CLI):
https://dashboard.heroku.com/apps/treeherder-stage/settings
https://dashboard.heroku.com/apps/treeherder-prod/settings

CloudAMQP (rabbitmq queue sizes etc):
https://addons-sso.heroku.com/apps/treeherder-stage/addons/cloudamqp
https://addons-sso.heroku.com/apps/treeherder-prod/addons/cloudamqp

AWS RDS instances:
https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstances:

If interested in further reading:
https://devcenter.heroku.com/articles/how-heroku-works
Hi,
We got alerted: <•nagios-scl3> treeherder.mozilla.org (54.225.148.109) is DOWN :PING CRITICAL - Packet loss = 100%

Is it OK to remove nagios monitoring for treeherder.mozilla.org?
(In reply to Ed Morley [:emorley] from comment #4)
> Cameron/Will/James/Kendall/Jake, I don't suppose you could:

working. \o/

(In reply to Vinh Hua [:vinh] from comment #5)
> Hi 
> We got alerted <•nagios-scl3> treeherder.mozilla.org (54.225.148.109) is
> DOWN :PING CRITICAL - Packet loss = 100%
> 
> Is it OK to remove nagios monitoring for treeherder.mozilla.org?

It looks like nagios was updated to only check the HTTPS certificate, which is OK. Ideally we should check the HTTP return code, but I think we're tracking that elsewhere.
(In reply to Vinh Hua [:vinh] from comment #5)
> We got alerted <•nagios-scl3> treeherder.mozilla.org (54.225.148.109) is
> DOWN :PING CRITICAL - Packet loss = 100%
> 
> Is it OK to remove nagios monitoring for treeherder.mozilla.org?

I'm puzzled as to why this alerted. We've kept the same domain post-migration, and I can't see anything with a 4XX or 5XX HTTP status code around that time.

Did the alert fail permanently, or just as a blip?
What was the error given & what time?
Also, what exact protocol/host/path is used and/or does it check for a substring in the response?

Thanks!
It's a ping check, and you can't ping the heroku ssl endpoints. It *should* be an http check, though.
Ah, a literal ping check.
It wasn't listed in bug 1283111 comment 1, so I wasn't aware it needed changing.

On my post-migration checklist is to file a bug for adjusting the Nagios alerts (mainly removals before decom); in the meantime, could you change this to an HTTP check? :-)
Thu 22:30:16 PDT [5545] treeherder.mozilla.org (54.225.148.109) is DOWN :PING CRITICAL - Packet loss = 100%
(In reply to Ed Morley [:emorley] from comment #9)
> in the meantime, could you change this to an
> HTTP check? :-)
Flags: needinfo?(achavez)
This situation is really frustrating:
* The nagios alerts are not shown in a public channel
* I have no way of knowing what alerts are set up, other than asking people to paste a list, and that list can (as unfortunately in this case) have omissions
* I have no way of confirming whether people have downtimed alerts, and if so for how long
* I'm not able to downtime them myself
* We've had several false alarms (and I feel bad for the noise to MOC), even after attempts to do the right thing (bug 1283111, and here)
 
Longer term, I really wish we could make it easier for people outside of MOC to understand and interact with the alerting system. We really do want to be nice to MOC and avoid false alarms; it's just exceptionally hard at present.

Are there any future plans that might help address this?
(In reply to Ashlee Chavez [:ashlee] from comment #10)
> Thu 22:30:16 PDT [5545] treeherder.mozilla.org (54.225.148.109) is DOWN
> :PING CRITICAL - Packet loss = 100%

(In reply to Ed Morley [:emorley] from comment #11)
> (In reply to Ed Morley [:emorley] from comment #9)
> > in the meantime, could you change this to an
> > HTTP check? :-)
Flags: needinfo?(ludovic)
Will do.
Flags: needinfo?(ludovic)
Should stuff like
'treeherder2.stage.webapp.scl3.mozilla.com' => {
            parents => 'esx-cluster1.ops.scl3.mozilla.com',
            hostgroups => [
                'generic-preprod',
                'virtual',
            ]
        },
be decommissioned? If so, can you file a bug in MOC: Service Request with the things that need to be decommissioned from SCL3?

done
Index: manifests/hosts/scl3.pp
===================================================================
--- manifests/hosts/scl3.pp	(revision 122429)
+++ manifests/hosts/scl3.pp	(working copy)
@@ -6818,7 +6818,7 @@
         'treeherder.mozilla.org' => {
             parents => 'treeherder.vips.scl3.mozilla.com',
             hostgroups => [
-                'https-websites'
+                'http-websites'
             ]
         },
         'login1.corpdmz.scl3.mozilla.com' => {
[ludo@Oulanl nagios]$ svn commit -m "switching check type 1307741"
Sending        manifests/hosts/scl3.pp
Transmitting file data .done
Committing transaction...
Committed revision 122430.
Flags: needinfo?(achavez) → needinfo?(emorley)
The http check still does a ping check and fails. Shall I remove it altogether and monitor this from Pingdom?
By HTTP check I meant "make an HTTP request and check the response is HTTP 200", I'd copied the phrasing from Kendall (comment 8). I'm not sure what the pros/cons are of Pingdom in comparison.
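
Concretely, by that I mean something like the stock nagios check_http plugin (a sketch; exact flags vary by plugin version):

    # Makes an HTTPS request and fails unless the status line matches:
    ./check_http -H treeherder.mozilla.org -S -u / -e 'HTTP/1.1 200'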

I think I've reached the point where I say let's just delete this check entirely. In another bug (coming soon), we'll begin the process of migrating management of Heroku Treeherder to MOC (we're monitoring it for now, bug 1283111 comment 8), and it may be easier to start with a clean slate.

(In reply to Ludovic Hirlimann [:Usul] from comment #15)
> Should stuff like
> ...
> be decommissioned? If so, can you file a bug in MOC: Service Request with
> the things that need to be decommissioned from SCL3?

It was on the list for today (I wanted to wait a day or two after the migration before asking for the VMs to be put in the powered-down holding pattern); it will be filed as a dep bug off bug 1308354 :-)
Flags: needinfo?(emorley)
We already have a treeherder check in Pingdom :)

So I'm just going to remove the one in nagios.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED