Closed Bug 863268 Opened 10 years ago Closed 9 years ago

Migrate buildapi off of kvm in scl1 and onto the releng cluster

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: dustin)

References

Details

(Whiteboard: [2013Q4])

Attachments

(1 file)

We need to migrate the following vms off of kvm in scl1:

buildapi01.build.scl1.mozilla.com
redis01.build.scl1.mozilla.com

Catlee: is redis used for anything *but* buildapi?
Flags: needinfo?(catlee)
It's also used to store tokens and nonces for the signing server.
Flags: needinfo?(catlee)
Depends on: 884837
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
Whiteboard: [2013Q4]
Assignee: server-ops-releng → dustin
BuildAPI has unimplemented support for using Memcached as a backend, so most likely we can just implement that and stop using redis for BuildAPI.
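As a rough illustration only (not buildapi's actual code), a memcached-backed key/value helper along these lines would be enough to cover that kind of usage; the class name, interface, and server list are assumptions, using the python-memcached client:

# A hypothetical memcached-backed k/v helper; buildapi's real (unimplemented)
# Memcached support may look quite different.
import memcache  # python-memcached client

class MemcachedStore(object):
    def __init__(self, servers=('127.0.0.1:11211',)):
        self.client = memcache.Client(list(servers))

    def get(self, key):
        # returns None if the key is missing or has expired
        return self.client.get(key)

    def put(self, key, value, expire=0):
        # expire=0 means "no expiry" in the memcached protocol
        self.client.set(key, value, time=expire)

    def delete(self, key):
        self.client.delete(key)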
Depends on: 934593
Depends on: 934594
I'm going to morph this slightly, since we don't actually want to migrate redis.  We want to kill it.  Also, we need to migrate signing off of redis, which will be a new bug.
Summary: Migrate redis and buildapi off of kvm in scl1 → Migrate buildapi off of kvm in scl1 and onto the releng cluster
No longer depends on: 884837, 934593
Blocks: 735293
Depends on: 934593
No longer depends on: 934594
Blocks: 937781
Per bhearsum in bug 804334:
Note to whoever does this: please update the deployment docs at https://wiki.mozilla.org/ReleaseEngineering/BuildAPI#Updating_code
Depends on: 946334
Bug 946334 switches buildapi over to the new messaging backend, without moving the web service or changing the k/v store.
Since using mod_wsgi is new, I want to test it out in a staging environment first, so I'm knocking out bug 841345.
Depends on: 841345
Depends on: 957386
Depends on: 952266
Depends on: 957382
Depends on: 957384
Depends on: 957385
OK, the dep bugs have a bunch of patches that get buildapi into a shape where, for me at least, it works on mod_wsgi and talks reliably to RabbitMQ.

A few other code notes:

 * it's OK if mod_wsgi spawns multiple processes that all run LoggingJobDoneConsumers, since the consumers are all equivalent (they just log the job completion in the DB).  Whether kombu correctly survives such forks, I don't know, and will test - see the consumer sketch after these notes.
 * logging needs to go somewhere

And deployment notes:

 * the buildapi DB needs to be hosted on MySQL, not SQLite as it is now
   * this can be done from buildapi01
 * crontasks need to get pulled out and run on the admin host
   * once the DB is in MySQL, these can be set up before the rest is migrated

If we play our cards carefully, we can actually run the old and new buildapi instances in parallel, and transition from one to the other (and back, if necessary) with Apache config changes.
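As a rough sketch of the consumer arrangement described above (not the real buildapi code), a kombu ConsumerMixin along these lines is safe to run in several mod_wsgi processes at once, since each copy just records job completion in the DB; the queue/exchange names, message schema, and record_job_done() call are all assumptions:

# A simplified job-done consumer in the spirit of LoggingJobDoneConsumer.
# Queue/exchange names, the message schema, and record_job_done() are
# hypothetical; the real buildapi code differs.
from kombu import Connection, Exchange, Queue
from kombu.mixins import ConsumerMixin

job_exchange = Exchange('buildapi.control', type='topic')      # hypothetical name
finished_queue = Queue('jobrequest-finished', job_exchange,    # hypothetical name
                       routing_key='finished')

class LoggingJobDoneConsumer(ConsumerMixin):
    def __init__(self, connection, record_job_done):
        self.connection = connection
        self.record_job_done = record_job_done

    def get_consumers(self, Consumer, channel):
        return [Consumer(queues=[finished_queue], callbacks=[self.on_message])]

    def on_message(self, body, message):
        # Multiple equivalent consumers are fine: each just logs the job
        # completion in the DB, so it doesn't matter which process gets it.
        self.record_job_done(body)
        message.ack()

if __name__ == '__main__':
    def record_job_done(body):
        print('job done: %r' % (body,))

    with Connection('amqp://guest:guest@localhost//') as conn:
        LoggingJobDoneConsumer(conn, record_job_done).run()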
Depends on: 958297
Kombu seems fine with mod_wsgi.  Logging is set up (paster's logging config was full of fail, so I just configured it directly in buildapi.wsgi).  

I have the deployment largely figured out in staging.  I'm going to add a fake selfserve agent that will run on the admin node, so we can test the kombu stuff in staging.  Other than that, I'm waiting on flows for the DB and crontask changes above.
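For reference, a minimal sketch of configuring logging directly in buildapi.wsgi, rather than through paster's logging sections, might look like the following; the log path and ini path are assumptions, not the deployed values:

# A sketch of buildapi.wsgi with logging configured directly in the file
# instead of via paster's logging config.  Paths are assumptions.
import logging
import logging.handlers

from paste.deploy import loadapp

handler = logging.handlers.RotatingFileHandler(
    '/var/log/buildapi/buildapi.log',          # assumed path
    maxBytes=10 * 1024 * 1024, backupCount=5)
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(process)d %(name)s %(levelname)s %(message)s'))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

# mod_wsgi looks for a module-level callable named 'application'
application = loadapp('config:/etc/buildapi/production.ini')   # assumed path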
Blocks: 945927
I've disabled the automatic updates of buildapi01 from hg, before landing the patches in the dep bug.
Attached patch bug863268.patch
Attachment #8363708 - Flags: review?
Comment on attachment 8363708 [details] [diff] [review]
bug863268.patch

I'll need to apply this on the old instance, too, so that it can talk to the buildapi DB.
Comment on attachment 8363708 [details] [diff] [review]
bug863268.patch

r+ via irc (with an added comment)
Attachment #8363708 - Flags: review? → review+
Attachment #8363708 - Flags: checked-in+
OK, deployed on the production instance.  It's now storing all data outside of the VM - either in MySQL, Redis, or RabbitMQ.
Oh, and the new job IDs are 600000 and higher, so you can recognize them.
I enabled mod_wsgi on the production cluster, and set up buildapi at /buildapi_new.  It needs some flows before it will actually show data, but everything up to the point of connecting to the DB works fine.
(And an extra note to self: the prod instance is configured to use the staging instance's AMQP vhost, to avoid it consuming from the prod queue and making a mess.  Once everything else looks good, I'll fix that and verify proper consumption)
I just switched the prod instance to use the prod AMQP vhost, and changed the /buildapi URI to point to that instance.

nginx and paster are still running on buildapi01, as I believe some of the crontasks talk to localhost.  As a result, buildapi01 is also still consuming jobrequest-finished messages and recording them in the DB.  Redis is still running because some of those crontasks still talk to it.
I had missed adding `allowed_origins` to the paster config.  That's fixed.  It seems there was some caching somewhere along the line that caused that fix to take a while to "sink in".
Ergh, and gviz_api isn't installed either.  Fixing.
Depends on: 964370
Blocks: 960054
I just created a 'buildapi' user in LDAP so that we can assign ownership to it on the relengweb netapp share.
Depends on: 970513
At this point, most of the crontasks are running in parallel on the releng web admin host.  I can't run the tasks that talk to the buildapi HTTP service until bug 970513 is closed (there's always something).

Once that's done, we can make the cutover to serve this content directly from the webheads, at which point buildapi01 and redis01 can go away.
OK, I believe we're ready for the cutover.  There are essentially two copies of the 'buildjson' directory now - one on buildapi01, which the production builddata URLs are proxying to, and one on an NFS volume, which the staging builddata URLs are proxying to:
  http://builddata-pub-build.allizom.org/builddata/buildjson
  https://secure-pub-build.allizom.org/builddata/buildjson
(you can tell it's the netapp share by the presence of '@@@NETAPP' at the top - just a flag I added temporarily for my own sanity).

It's just an apache change to "cut over" production to the NFS share.  I'll organize when we can do that.
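A quick way to verify which copy a builddata URL is serving is to look for that temporary '@@@NETAPP' marker; a minimal check (written for Python 3 for convenience, using the staging URLs quoted above) might look like:

# Check whether the builddata URLs are serving the netapp/NFS copy by looking
# for the temporary '@@@NETAPP' marker.
import urllib.request

URLS = [
    'http://builddata-pub-build.allizom.org/builddata/buildjson',
    'https://secure-pub-build.allizom.org/builddata/buildjson',
]

for url in URLS:
    body = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    state = 'netapp copy' if '@@@NETAPP' in body else 'NOT the netapp copy'
    print('%s -> %s' % (url, state))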
Depends on: 971168
The switch is complete, with no ill effects that I can see.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 973922
Depends on: 976050