Closed Bug 618369 Opened 14 years ago Closed 14 years ago

db-backed slave allocator

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Whiteboard: [buildslaves][slavealloc])

Attachments

(1 file)

The next phase after bug 616344 is to write a db-backed slave allocator that generates buildbot.tac files on the fly, but based on a table giving the desired slave-master correspondence. Once this is done, we can work on making the allocator reactive, so that it will reassign slaves as they become available, rather than reassigning them and waiting for them to restart.
Blocks: 605278
Here's what I've got so far: http://hg.mozilla.org/users/dmitchell_mozilla.com/tools/file/ca271db05496/lib/python/buildtools/slavealloc install tools in a virtualenv and you've got a 'slavealloc' command that can initialize a database (use --slave-data lib/python/buildtools/slavealloc/slave-data.csv) and calculate the silos. I'm going to set up a masters table and a static allocation between masters and slaves, then write s 'slavealloc get-tac' command that will generate a tac file given a slavename. Throw in a few unit tests, and I can start doing dynamic allocation. I'm open to comments on use of sqlalchemy as well as ways to incorporate some kind of config file in here so there's no need to specify '-D some://database/url' for every command. Aki, any lessons from mozharness that would be useful here?
I created code in cancellator.py that reads a config file just for this very reason - would love to see it become part of a small releng-tools module or somesuch
Whiteboard: [buildslaves]
I've moved it out of the tools repo for the moment, and I'm tracking it at: http://hg.mozilla.org/users/dmitchell_mozilla.com/slavealloc This has both a command-line tool and a Twisted daemon. See https://intranet.mozilla.org/RelEngWiki/index.php/User:Dmitchell@mozilla.com/Slave_Allocator_Proposal for the what I'm planning here.
More progress on this yesterday: * hand out passwords (I've put fake passwords in the repo) * add basedir to buildbot.tac (from a column in the slaves table) * ability to "lock" a slave to a particular master: slavealloc lock linux-ix-slave13 sm03 * ability to disable slaves: slavealloc disable linux-ix-slave13 * optimizations The locking is done by returning a .tac that calls sys.exit(0). It remains to be seen whether that works with twistd.
Here we go: http://hg.mozilla.org/users/dmitchell_mozilla.com/slavealloc == Implemented features: == - Three-part Twisted Daemon: - 1. allocator (http at /gettac/SLAVENAME) 2. REST API (http at /api) 3. JS Management UI (http at /ui) These are available at http://euclid.r.igoro.us:8012 for the moment. The API is always included, but the allocator and ui can be run individually. The intent is to run them as separate processes, so that the allocator (which should be up most of the time) need not be taken down for updates to the UI backend. The UI frontend can be changed without a restart. You can run one of the .tac files manually if you want to try it out. - Command-line UI - E.g., |slavealloc lock win32-slave10 tm03|. These commands access the db directly right now, but will eventually do so via the REST API. You will probably want to use 'slavealloc dbinit' with the various *-data.csv files to populate the database. - Allocation - See https://intranet.mozilla.org/RelEngWiki/index.php/User:Dmitchell@mozilla.com/Slave_Allocator_Proposal for the details here. - Management UI - This is highly client-side - basically the DB is downloaded to the browser on load, and then displayed as necessary. I've never written an app like this before, although I'm pretty proud of the result. Feedback is appreciatd. - Disable, Lock - Slaves can be disabled (meaning they will be fed a tac that will exit immediately), and can be "locked" to a particular master. The password data is faked for obvious reasons. - Sketchy Data - The data in the .csv files is known to be kinda inaccurate. I will build an accurate DB by scraping master configs and buildbot.tac's on slaves when this goes into production. == Future Projects == - more reports (silos X masters, pools X masters, etc.) - UI fixes (better menu, "loading" throbber, dynamically update data, efficiency, error handling) - database table indices == Open Questions == - Where should I land this? - How should I deploy it? Bare? Nginx frontend? (c.f. bug 628797) - Does it make more sense to denormalize the schema and use a 'siloid' rather than using the (distro, bits, datacenter, trustlevel, environment, purpose) tuple explicitly? - Do we need to add 'speed' (vm, mini, ix) to the silo criteria?
> == Future Projects == > > - more reports (silos X masters, pools X masters, etc.) > - UI fixes (better menu, "loading" throbber, dynamically update data, > efficiency, error handling) > - database table indices - weighted balancing between masters. e.g. master2 is on far beefier hardware than master1. > == Open Questions == > > - Where should I land this? ...build/tools? I'm starting to worry about the cost of having build slaves clone useless (to them) server-side code from tools though. > > - How should I deploy it? Bare? Nginx frontend? (c.f. bug 628797) > > - Does it make more sense to denormalize the schema and use a 'siloid' rather > than using the (distro, bits, datacenter, trustlevel, environment, purpose) > tuple explicitly? +0 > - Do we need to add 'speed' (vm, mini, ix) to the silo criteria? +1
(In reply to comment #6) > - weighted balancing between masters. e.g. master2 is on far beefier hardware > than master1. noted. > ...build/tools? I'm starting to worry about the cost of having build slaves > clone useless (to them) server-side code from tools though. Which is why I've been going back and forth on this. However, with my hopes of making build/tools available everywhere (e.g., bug 623845), slave-side tools should probably go somewhere else, or find some means of making the clones/updates cheaper. So, build/tools it is. > > - Does it make more sense to denormalize the schema and use a 'siloid' rather > > than using the (distro, bits, datacenter, trustlevel, environment, purpose) > > tuple explicitly? > > +0 I'm at -0.5 since it's a denormalization, so no go on that idea. > > - Do we need to add 'speed' (vm, mini, ix) to the silo criteria? > > +1 OK, sounds good.
Whiteboard: [buildslaves] → [buildslaves][slavealloc]
This patch is against the tip of build/tools, but I've preserved history, and you can see the history at http://hg.mozilla.org/users/dmitchell_mozilla.com/tools/graph/afd2a61118c3 = complete in this patch = * merge into build/tools * speed as a silo param * add documentation in UI landing page * usePTY=False *everywhere* * add DB indices * UI error handling (very basic) * fix dbinit to read from a tarfile (or just init empty) * add dbdump command to dump to a tarfile * command line scripts talk to the API rather than DB = farmed out new bugs = * tests (633605) * set up on a real box (628797) * db (633374) * run runslave.py on Windows systems (616351, 616352) * ability to disable masters (633363) * master scraping (upness, slaves' status) (633364) * nagios checks (633365) * weighted master balance (633366) * pulse events (633368) * UI fixes, including more reports (633371)
Attachment #511891 - Flags: review?(catlee)
Comment on attachment 511891 [details] [diff] [review] m618369-tools-r1.patch I'm going to add a few extra r?'s to this - there are many forks in the road to this point where I just "guessed", so I'd appreciate feedback at every level from "you spelled slave 'salve' in this comment" to "I don't think Twisted is the right tool". By the way, I was torn about not adding tests. I'd like to get this deployed in staging to test it out and so that I can start moving on the slave-side pieces of this pie. I've informally promised myself that it will need tests before it goes to production.
Attachment #511891 - Flags: review?(bhearsum)
Attachment #511891 - Flags: review?(bear)
Comment on attachment 511891 [details] [diff] [review] m618369-tools-r1.patch Some random comments/questions: * Many Text columns in the model should be Strings instead. Especially slaves.name, which I think needs an index on it since it's the primary lookup for many operations. * Typo in # DATA FORMAT comment, "file format wil change" * I just skimmed over the JS code. If the API works, the JS is easy to fix later. * Can logging for write operations (render_PUT) include information to indicate who is doing it? For self-serve we have the user being forwarded in the X-Remote-User header by the proxy on build.m.o. * Is there a way to disable a master i.e. have slaves move off of a master as they reboot? * How do new slaves get added to the system? <featurecreep>It might be nice to have them get recorded by slavealloc and provide UI to accept them as official slaves, or reject them as mistakes</featurecreep> I really like how you've organized the code. You still haven't convinced me that twisted's the way to go here though ;) I think the only blocker here before going into staging is the last point above, how to deal with brand new slaves.
Attachment #511891 - Flags: review?(catlee) → review+
(In reply to comment #10) > * Many Text columns in the model should be Strings instead. Especially > slaves.name, which I think needs an index on it since it's the primary lookup > for many operations. OK, sounds good. > * Can logging for write operations (render_PUT) include information to indicate > who is doing it? For self-serve we have the user being forwarded in the > X-Remote-User header by the proxy on build.m.o. There's no authentication at all at the moment. > * Is there a way to disable a master i.e. have slaves move off of a master as > they reboot? Bug 633363 > * How do new slaves get added to the system? <featurecreep>It might be nice to > have them get recorded by slavealloc and provide UI to accept them as official > slaves, or reject them as mistakes</featurecreep> Current plan is to add them one at a time, by hand. Yes, that would be a good feature. > I really like how you've organized the code. You still haven't convinced me > that twisted's the way to go here though ;) Rough consensus and working code .. > I think the only blocker here before going into staging is the last point > above, how to deal with brand new slaves. Not sure I see why that's a blocker. The reason I haven't worked too hard on it is that I expect to use ad-hoc fabric scripts and scraping of puppet configs and whatnot to generate and ensure the accuracy of the initial db contents - and it's easy enough to 'import sqlalchemy' to make that happen.
bear, bhearsum - your reviews?
Comment on attachment 511891 [details] [diff] [review] m618369-tools-r1.patch I'm going to have to bow out here, I don't think I'll have the time to dig into this soon.
Attachment #511891 - Flags: review?(bhearsum)
My mom always told me not to poke bears, but .. bear?
Comment on attachment 511891 [details] [diff] [review] m618369-tools-r1.patch I was working up a long winded set of why I don't like having the presentation code (the html generation) being done in the same twisted code (because I think twisted's web layer sucks) but in the end I realize that this is version 0.1 stuff and we can refactor it later to split the two. Because you are generating json data it can even be handled by another layer. I'm giving it a r+ now after reading it a couple of times already and not finding anything that just leaps at me as "NO!" code - the real review will be in the iterating done during testing.
Attachment #511891 - Flags: review?(bear) → review+
Checked in: 8fe4dbc09d03 For the record, bear is concerned with serving the REST interface from Twisted, which is probably reasonable, but honestly no other Python system is any easier to deploy, and we're used to running Twisted daemons (buildmasters). I'll document how this gets deployed in https://wiki.mozilla.org/ReleaseEngineering/Applications and in particular I will be putting nginx in front of it, and configuring it to serve static content, so this should be pretty snappy.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: