Closed Bug 964325 Opened 10 years ago Closed 10 years ago

Migrate affiliates.mozilla.org to the generic cluster

Categories

(Infrastructure & Operations Graveyard :: WebOps: Engagement, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jd, Assigned: bburton)

References

()

Details

(Keywords: spring-cleaning, Whiteboard: [MIGRATION: 2014-03-08 07:00AM PST])

This is a tracker bug for moving affiliates.m.o from the engagement cluster to the generic cluster. This work is being done as the engagement cluster is no longer highly utilized and the hardware is end-of-life.

I have CCd a few folks who may be able to work with Webops on this move. The code and configuration will be trivial to move over. There should be no new net-flows required.

The point for discussion will need to be around the database. There are three options.

The simplest option would be to take some downtime. This is the usual approach: take the site down, migrate the database, bring the site up in the new location.

The next option is one we can explore if the app can be placed in read-only mode. This is the same as the usual method, but the site stays up in the old location read-only instead of being hard down. So: set the site read-only, migrate the database, switch the site to the new location read-write.

The final option also requires the site to be able to be placed in read-only mode. This option requires a bit more coordination and scheduling with the DBAs. First we set up a slave database in the new location which replicates from the current master. Then we place the site in the old location in read-only while simultaneously switching the site over to the new location. If all goes well there will be little appreciable read-only time for the end users. This option actually requires a bunch of additional background database steps and so cannot guarantee zero read-only or zero downtime, but is as close as things can get.

My preference would be for the second option as it has no hard downtime and is quite safe in terms of data integrity.

I have two questions that I need answered to move forward with this project. First, can the site be placed in read-only mode? Second, when would be a good time to schedule this work? I am asking mainly about the day of week and time of day when the site has the lowest traffic or is least critical. Keep in mind, of course, that all of these options are set in an ideal world of dreams, and cold hard reality might force some downtime during the move.

Thanks in advance for your help with this.
Blocks: 964338
- Can the site be placed in read-only mode?

Not as currently implemented. There's a possibility to implement this, but we're in the middle of starting up a rewrite of the main part of the site, so I'm reluctant to write a fairly complex read-only feature that will be used once. In this case, my preference is downtime if it's acceptable to the product owner (Chelsea!) depending on how long we expect it to take.

Do we have any estimates for how long downtime will be in the case of option 1?

- When is a good time to schedule this work?

According to Google Analytics, midnight Pacific seems to be our lowest-traffic time on the main website, but I don't have easy access to data about the other major part of the site: the referral links. Since they directly return 302s, we'd have to analyze the server logs for those URLs (example: //affiliates.mozilla.org/link/banner/52190). Is it possible to get a dump of that data?
The database dump is 114M so it takes about 2 seconds to dump, say <30 seconds to transfer, <30 to load in new location. Then there is a required DNS change. If we lower the TTL to 30 seconds and flip the switch at the same time we start the database dump it can propagate concurrently.

So with a test run and having all the commands staged in advance, I think we would see less than 5 minutes of downtime. For notices to end users I would say 15 minutes. That way we have a bit of leeway in case things go awry and need some troubleshooting.

This all assumes that there are no code changes or the like to muddy the waters and as I said, a valid test in advance. This test is basically the site operational with a snapshot of the database on another IP. Then it can be vetted and QAd and then all the actual migration requires is a refresh of the database and a DNS switch.
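The estimate above can be sketched as a small calculation. This is only an illustration: the step names are labels chosen here, the durations are the upper bounds quoted above, and it assumes the DNS change propagates concurrently with the database steps as described.

```python
# Upper-bound step durations in seconds, taken from the estimate above.
# The step names are illustrative labels, not real commands.
STEPS = {"dump": 2, "transfer": 30, "load": 30}
DNS_TTL = 30  # seconds; assumed to expire while the database steps run


def estimated_downtime(steps, dns_ttl):
    """Database steps run serially; DNS propagation overlaps with them."""
    return max(sum(steps.values()), dns_ttl)


total = estimated_downtime(STEPS, DNS_TTL)
print(f"best case: {total} s, announced window: {15 * 60} s")
```

Under these assumptions the best case is about a minute, which leaves most of the announced 15 minute window as troubleshooting slack.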
Chelsea: Thoughts on 15 minutes of downtime? I'm sure we can whip together a quick "Sorry, Affiliates is down right now!" message to show in its place.
Flags: needinfo?(cnovak)
FYI, with some stats Chelsea provided, I estimate around 400 clicks on Affiliate banners would be lost in a 15 minute downtime window, based on the number of clicks from the past week. Based on the average clicks per week over the past few months, the number goes a bit lower.
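For reference, this kind of estimate can be reproduced with a simple proportional calculation. The sketch below assumes clicks arrive at a uniform rate across the week, which real traffic will not do, and the weekly total is a hypothetical figure back-solved to match the ~400 number above, not the actual stats.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def clicks_lost(weekly_clicks, downtime_minutes):
    """Expected missed clicks, assuming a uniform click rate all week."""
    return weekly_clicks * downtime_minutes / MINUTES_PER_WEEK


# Hypothetical weekly total, chosen so a 15 minute window gives about 400.
weekly = 268_800
print(round(clicks_lost(weekly, 15)))
```

Scheduling the window at a low-traffic time should bring the real number below this uniform-rate figure.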
Flags: needinfo?(cnovak)
Depends on: 974155
Now that I've completed the dev|stage migrations, we're ready to proceed with the production migration

We'd like to schedule this migration for next week. Our two options for scheduling are

1. Tue, Wed, or Thu (3/4,5,6) at 5:30AM PST
2. Sat, 3/8 @ 7AM PST

Depending on when a developer is available to perform testing and QA post migration

The migration will take 60 minutes, during which the site will be redirected to hardhat.mozilla.net. The plan of action is roughly

* rsync code and uploaded files to generic cluster
 * I'll have a copy of production finished tomorrow, so the sync up will be quick
* dump database, copy, and import into generic cluster
* put new settings files in place and test push with chief
* do testing via /etc/hosts
* cutover DNS

Rollback plan is

* Fail DNS back to existing cluster

Please let me know which of the two time options will work for developer availability so that we can schedule this

Thanks
Assignee: server-ops-webops → bburton
Flags: needinfo?(mkelly)
Chelsea has spoken, and the migration shall happen on SATURDAY SATURDAY SATURDAY
Flags: needinfo?(mkelly)
Great, I've sent a meeting invite for a reminder

I'll join #affiliates and coordinate testing and such there
Whiteboard: [MIGRATION: 2014-03-08 10:00AM PST]
Whiteboard: [MIGRATION: 2014-03-08 10:00AM PST] → [MIGRATION: 2014-03-08 07:00AM PST]
affiliates.mozilla.org 95% complete:
------------------------------

* puppet updates
 - apache config
 - crontab
 - manifest bits for weblogs
* copied /src/affiliates.mozilla.org directory to genericadm
* dumped and moved the database to generic1.db.phx1
* NFS for user uploads migrated and content copied
* rabbitmq config pushed with puppet and django config updated
* chief config copied
 * chief push works: http://genericadm.private.phx1.mozilla.com/chief/affiliates.prod/logs/7d22304ada7ffdb893fe8305630ec11eb84cfab5.1393868816
* celeryd manifests copied and deployed with puppet
* local.py updated with db, memcache, celery configs
* commander_settings.py updated and confirmed working with chief push above
* cronjobs are running as expected
* deploy worked as noted above

----------------

You can test out the site by adding the following to your /etc/hosts[1][2] file

63.245.217.86 affiliates.mozilla.org

Let me know if you have any questions, but everything looks good for the push tomorrow


[1] http://osxdaily.com/2012/08/07/edit-hosts-file-mac-os-x/
[2] http://helpdeskgeek.com/windows-7/windows-7-hosts-file/
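Before testing, it may help to confirm the override is actually present in your hosts file. The following is a minimal sketch, not part of the migration tooling: `hosts_override` is a hypothetical helper that parses hosts-file text and returns the IP mapped to a hostname.

```python
def hosts_override(hosts_text, hostname):
    """Return the IP mapped to hostname in hosts-file text, or None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into IP followed by names.
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and hostname in entry[1:]:
            return entry[0]
    return None


# Sample content; on a real machine, read the text from /etc/hosts instead.
sample = "127.0.0.1 localhost\n63.245.217.86 affiliates.mozilla.org\n"
print(hosts_override(sample, "affiliates.mozilla.org"))
```

If this returns None for affiliates.mozilla.org, your browser is still hitting the old cluster.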
(In reply to Brandon Burton [:solarce] from comment #8)
> Let me know if you have any questions, but everything looks good for the
> push tomorrow

Tomorrow? Don't you mean Saturday?
(In reply to Michael Kelly [:mkelly,:Osmose] from comment #9)
> (In reply to Brandon Burton [:solarce] from comment #8)
> > Let me know if you have any questions, but everything looks good for the
> > push tomorrow
> 
> Tomorrow? Don't you mean Saturday?

This is not reps.mo?! IS THIS SPARRRTTTTAAAA??!?!?!?!?!

(yes, saturday)
Site migrated to generic

* media files rsync'd to new NFS location
* database copied to new db cluster
* DNS updated
* Mana updated: https://mana.mozilla.org/wiki/display/websites/affiliates.mozilla.org
* leaderboard crons run

Post-flight checks from solarce and Osmose are all green
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Verified, mothership and Facebook app look good from my testing.
Status: RESOLVED → VERIFIED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard