Replace cruncher with one or more managed hosts

Status: NEW (Unassigned)
Component: Release Engineering :: General
Opened: 3 years ago
Last updated: a year ago
Reporter: coop; Assignee: unassigned
Firefox tracking flags: (Not tracked)

(Reporter)

Description

3 years ago
We run a lot of different mission-critical services on cruncher. We should have better deployment stories for those services (bug 919718) and be able to apply updates to the host more easily (bug 1231891).

Let's start by getting a list of the services we currently run on cruncher, figure out where our deployment documentation/process is lacking, and then figure out what set of managed hosts we need as replacements.

Here is the list of services I know of that run on cruncher:

* reportor
* slave health
* allthethings.json generation

...but there may be more.

Comment 1

3 years ago
coop is this something that you'd like alin and vlad to work on, they are looking for more projects
Flags: needinfo?(coop)
(Reporter)

Comment 2

3 years ago
(In reply to Kim Moir [:kmoir] from comment #1)
> coop is this something that you'd like alin and vlad to work on, they are
> looking for more projects

Yes, if they can start looking into the deployment story for the known tools and figuring out what other services are missing from my list, that would be great.
Flags: needinfo?(coop)

Comment 3

I checked the cruncher server and found the following applications:
* postfix
* httpd

I also checked the buildduty home directory and found the tools written by coop above.

Comment 4

This host is something of a dumping ground for side-projects and one-offs, so I think it will have a deep list of "things" on it, but I don't think we need to go very far down that list.

Most of those "things" are crontasks, so look in /etc/crontab, /etc/cron*/, and /var/spool/cron/* to see what else you can find.  The rest you can find by looking for crontasks on relengwebadm.private.scl3.mozilla.com that rsync from cruncher:

modules/webapp/manifests/admin/releng.pp:    # synchronizing dirty talos profiles - copy from cruncher per bug 750184
modules/webapp/manifests/admin/releng.pp:            command => "rsync -r syncbld@cruncher.build.mozilla.org:/var/www/html/talos_profiles/ /mnt/netapp/relengweb/talos-bundles/ 2>&1 |logger -t pull_talos_profiles",

modules/webapp/manifests/admin/releng.pp:    # copy reports from cruncher
modules/webapp/manifests/admin/releng.pp:            command => "rsync -r --links --delete syncbld@cruncher.build.mozilla.org:/var/www/html/builds/ /mnt/netapp/relengweb/builddata/reports/ 2>&1 | logger -t pull_builddata_reports",
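
A quick way to enumerate those crontab locations might look like this sketch (paths as mentioned above; adjust as needed):

```shell
# list_cron_jobs FILE... : print the non-comment, non-blank lines from
# each crontab file that exists, prefixed with the file name, so stray
# jobs are easy to spot.
list_cron_jobs() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        echo "== $f =="
        grep -Ev '^[[:space:]]*(#|$)' "$f"
    done
}

# e.g.: list_cron_jobs /etc/crontab /etc/cron.d/* /var/spool/cron/*
```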

I suspect that the best path would be to migrate the things we know about (starting by filing bugs):

 * reportor
 * slavehealth
 * allthethings.json
 * dirty talos profiles
 * mapper (bug 1153045)
 * "important" things from the crontabs

...then turn off all of the services except sshd (i.e., disable cron and httpd) and see who screams. Wait for them to migrate.

...then back up users' home directories in encrypted tarballs, along with the rest of the system in another encrypted tarball, and put those all on S3. Then shut off and delete the VM.
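
That backup step might look something like this sketch (the bucket name, passphrase file, and encryption tool are all hypothetical; the plan above doesn't specify them):

```shell
# backup_encrypted SRC OUT PASSFILE : tar up a directory and encrypt the
# stream symmetrically, producing one encrypted tarball ready for upload.
backup_encrypted() {
    local src="$1" out="$2" passfile="$3"
    tar czf - "$src" | openssl enc -aes-256-cbc -pbkdf2 -pass "file:$passfile" -out "$out"
}

# Then, per home directory (bucket is a placeholder):
#   backup_encrypted /home/buildduty buildduty.tgz.enc /root/backup.pass
#   aws s3 cp buildduty.tgz.enc s3://releng-cruncher-backups/
```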
No longer blocks: 962870
Depends on: 1153045
(Reporter)

Comment 5

2 years ago
I've been digging into cruncher this week. AFAICT here's what's actually mission-critical:

* allthethings.json
* reportor
* slavehealth

With the trees currently closed, even those services aren't really required.

Joel already confirmed that the talos dirty profiles don't need updating any more.

Hal: can you confirm that the mapper services on cruncher are no longer required? You also have the following crontab entries set up. Do you still need them?

MAILTO=hwine+cruncher-vsl@mozilla.com
*/5 * * * *  apache rsync -a vcs-sync:/tmp/vcs-sync/ /var/www/html/vcs-sync/
*/5 * * * *  apache rsync -a vsl:/tmp/vcs-sync/ /var/www/html/vcs-sync/

The only other user crontabs worth saving belong to buildduty, and cover the above services. Here they are, for posterity:

MAILTO=release@mozilla.com
#m h d mth dow

# Slave Health
*/20 * * * * /home/buildduty/slave_health/slave_health_cron.sh 2>&1 | logger -t slave_health_cron.sh
19   0 * * * source /home/buildduty/buildduty/bin/activate && python /home/buildduty/slave_health/slave_health/scripts/generate_chart_objects.py -v 
22   0 1 * * source /home/buildduty/buildduty/bin/activate && python /home/buildduty/slave_health/slave_health/scripts/generate_chart_objects.py -m -v
6    * * * * /home/buildduty/slave_health/buildduty_report.sh 2>&1 | logger -t buildduty_report.sh

# Reportor!
2  0 * * *   find /var/www/html/builds/reportor -maxdepth 1 -type d -mtime +14 -exec rm -rf {} \;
30 0 * * *   $HOME/reportor/bin/reportor.sh -m reportor/reports/reports.yaml -o '/var/www/html/builds/\%Y-\%m-\%d' -s /var/www/html/builds/reportor/daily -l reportor.log daily
1  * * * *   find /var/www/html/builds/reportor -maxdepth 1 -type d -name '*:*' -mtime +2 -exec rm -rf {} \;
2  * * * *   $HOME/reportor/bin/reportor.sh -m reportor/reports/reports.yaml -o '/var/www/html/builds/\%Y-\%m-\%d:\%H' -s /var/www/html/builds/reportor/hourly -l reportor.log hourly

# Generate allthethings.json
*/15 * * * * $HOME/braindump/community/generate_allthethings_json.sh
# Remove older files that show differences between commits
5    0 * * * find /var/www/html/builds/ -type f -name "allthethings._*" -mtime +90 -delete
Flags: needinfo?(hwine)
(Reporter)

Comment 6

2 years ago
(In reply to Chris Cooper [:coop] from comment #5)
> * allthethings.json
> * reportor
> * slavehealth

I've set up all these services locally to see what's involved. Details below:

> * allthethings.json

Makes lots of assumptions about running on cruncher, but those can be abstracted away easily. Only needs to run when there are changes made to the buildbot-related repos.

If we set up a new host, we'd need to be able to sync data from it to the relengwebadm host.

> * reportor

Has a setup.py, but also needs a venv set up first. The shell script that's run from cron (reportor.sh) doesn't exist in the repo, but looks similar to this:

#!/bin/bash
ROOT_DIR=$HOME/reportor
source "${ROOT_DIR}/bin/activate"
umask 002
exec reportor "$@"
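
For reference, a bootstrap matching that wrapper might look like the following sketch. The venv lives at $HOME/reportor itself, which is why bin/activate resolves there; `python3 -m venv` here stands in for whatever virtualenv invocation was used originally:

```shell
# make_reportor_venv ROOT : create a virtualenv at ROOT and print the
# path of the activate script that the cron wrapper sources.
make_reportor_venv() {
    local root="$1"
    python3 -m venv "$root"
    echo "$root/bin/activate"
}

# Afterwards, inside the venv:
#   source "$(make_reportor_venv "$HOME/reportor")"
#   pip install /path/to/reportor   # reportor has a setup.py
```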

If we set up a new host, we'd need to be able to sync data from it to the relengwebadm host.

> * slavehealth

I've made some changes to slave_health this week to make bootstrapping it easier. It's still not pretty, but it's definitely easier to set up now. If we set up a new host for this, it will require flows to the buildbot and slavealloc databases.

We'll also need to be able to sync the output to the relengwebadm host.

Comment 7

Could we fix allthethings.json to run either as a taskcluster task (perhaps just a hook, if it's OK for the information to lag by a few hours) or a travis job, deploying allthethings.json to an S3 bucket?

Similarly for reportor -- in fact, perhaps it can be split into several hooks that use similarly-shaped task definitions?

I don't recall whether slave_health is built from crontasks or has interactive code. If it's crontasks, then maybe we could do the same thing. Otherwise, if we do need to stand up a host to run the service, let's do that with an AWS instance and then use `aws s3` to sync the result to a bucket instead of the releng web cluster.

For all of these "sync to S3" plans, we can use CloudFront to serve them over HTTPS.
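
Under that plan, a tool's cron entry might shrink to a single publish step, along the lines of this hypothetical crontab line (the bucket name and paths are placeholders):

```shell
# m    h  dom mon dow  command
*/15   *  *   *   *    aws s3 sync --delete /var/www/html/builds/ s3://releng-cruncher-output/builds/ 2>&1 | logger -t s3_sync_builds
```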

Comment 8

(In reply to Chris Cooper [:coop] from comment #5)
> 
> Hal: can you confirm that the mapper services on cruncher are no longer
> required? You also have the following crontab entries setup. Do you still
> need them?
> 
> MAILTO=hwine+cruncher-vsl@mozilla.com
> */5 * * * *  apache rsync -a vcs-sync:/tmp/vcs-sync/ /var/www/html/vcs-sync/
> */5 * * * *  apache rsync -a vsl:/tmp/vcs-sync/ /var/www/html/vcs-sync/
> 
> The only other user crontabs worth saving belong to buildduty, and cover the
> above services. Here they are, for posterity:

These are needed until b2g_bumper and all releng manipulation of b2g manifests are stopped on all branches.

The legacy mapper service ran on cruncher. Some setup will be needed. See http://people.mozilla.org/~hwine/tmp/vcs2vcs/mapper_support.html
Flags: needinfo?(hwine)
(Reporter)

Comment 9

2 years ago
I met with Dustin this morning to dig into comment #7. Here are my notes:

* With the advent of hooks in TC that (in theory) allow us to schedule regular, recurring tasks, we should start using tasks on periodic hooks rather than cronjobs. I say "in theory" only because we have but a single example of hooks being used right now: TC VCS regeneration.

* This will require a small pool of identical workers (3 per region?) that are specially set up for releng tasks. They will have the basic environment required to run the job types we know about now (slave health, reportor, allthethings). We can amend this basic environment in the future, but in general, the tasks should be self-contained.

* We can't (or shouldn't) use existing build or test containers for this. There is the real threat of those changing over time in a way that could break our stuff. 

* Workers should live on instances that are within the releng network, so that the necessary flows to databases and other resources already exist.

* Any credentials required by a task should be stored in the secrets service.

* Output from these tasks should go into an S3 bucket. If it needs to be served over HTTP(S), we can use CloudFront for that. Tools that rely on this data should change to access it from the S3 bucket/CloudFront rather than the relengweb cluster. 

* This could take a while to set up. In the interim, we'll need to stand up a cruncher analog in AWS so we can continue feeding data to existing systems.
(Reporter)

Updated

2 years ago
Depends on: 1261974
(Reporter)

Updated

2 years ago
Depends on: 1263626
(Reporter)

Updated

2 years ago
Depends on: 1263636
(Reporter)

Comment 10

2 years ago
(In reply to Chris Cooper [:coop] from comment #9)
> * This will require a small pool of identical workers (3 per region?) that
> are specially setup for releng tasks. They will have the basic environment
> required to run the job types we know about now (slave health, reportor,
> allthethings). We can amend this basic environment in the future, but in
> general, the tasks should be self-contained. 
> 
> * We can't (or shouldn't) use existing build or test containers for this.
> There is the real threat of those changing over time in a way that could
> break our stuff. 
> 
> * Workers should live on instances that are within the releng network so
> that the necessary flows to databases and other resources should already
> exist.

Filed bug 1263626 for this. ^^

> * Output from these tasks should go into an S3 bucket. If it needs to be
> served over HTTP(S), we can use CloudFront for that. Tools that rely on this
> data should change to access it from the S3 bucket/CloudFront rather than
> the relengweb cluster. 

Filed bug 1263636 for this. ^^
(Assignee)

Updated

a year ago
Component: Tools → General
Product: Release Engineering → Release Engineering