Closed Bug 1020202 Opened 10 years ago Closed 9 years ago

Machines should be discovered automatically; we should not need to manage machine names in our configs

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2110] )

An example of this:

https://bug1013035.bugzilla.mozilla.org/attachment.cgi?id=8425210

This change was required by the move of machines from scl1 to scl3.

Managing things like this by hand is incredibly error-prone.

I propose that we adopt naming conventions for hosts and, as soon as hosts are registered in DHCP, use DHCP as the source of truth for which machines of each class exist on our network.

There's a whole bunch of automation that would simplify our lives massively. For example, if you put a new machine on the network following the naming convention, puppet could automatically detect it and apply the appropriate image, slavealloc could automatically add it, and buildbot configs could be updated automatically. Before you know it, adding a machine to our infrastructure is just a case of giving it a hostname that matches the naming convention, and everything else happens automatically.

There is absolutely no reason for us to maintain lists like this: https://bug1013035.bugzilla.mozilla.org/attachment.cgi?id=8425210.
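To make the idea concrete, here is a minimal sketch of convention-based discovery: machines are grouped into pools purely by matching their hostnames against a pattern, rather than by maintaining hand-written lists. The pool names and the regex below are hypothetical, not Mozilla's actual conventions.

```python
import re

# Hypothetical naming convention: <pool>-<platform>-<seq>, e.g. "bld-linux64-0001".
POOL_PATTERN = re.compile(r"^(?P<pool>bld|tst|try)-(?P<platform>[a-z0-9]+)-(?P<seq>\d{4})$")

def classify(hostnames):
    """Group hostnames into pools by naming convention; collect non-conforming names."""
    pools, rejected = {}, []
    for name in hostnames:
        m = POOL_PATTERN.match(name)
        if m:
            pools.setdefault(m.group("pool"), []).append(name)
        else:
            rejected.append(name)
    return pools, rejected

hosts = ["bld-linux64-0001", "bld-linux64-0002", "try-win32-0007", "weird-host"]
pools, rejected = classify(hosts)
print(pools)     # {'bld': [...], 'try': [...]}
print(rejected)  # ['weird-host']
```

A list like the attachment above would then be generated from whatever host enumeration DHCP (or inventory) provides, with non-conforming names flagged instead of silently configured.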
I like this!

On-site, inventory would be a good source of this info.  On-demand/reserved instances in AWS could work with a query to EC2.  But how would it work with spot?
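For the EC2 side, spot instances do show up in a DescribeInstances result alongside on-demand ones; they are distinguished by the InstanceLifecycle field. The sketch below works on a DescribeInstances-shaped dict (with boto3 this data would come from something like ec2.describe_instances(Filters=[{"Name": "tag:moz-type", "Values": [...]}])); the tag values are hypothetical.

```python
def slaves_from_response(response):
    """Return (on_demand, spot) hostname lists from a DescribeInstances-shaped dict."""
    on_demand, spot = [], []
    for reservation in response["Reservations"]:
        for inst in reservation["Instances"]:
            if inst["State"]["Name"] != "running":
                continue
            # Prefer the Name tag; fall back to the instance id.
            name = next((t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"),
                        inst["InstanceId"])
            # Spot instances carry InstanceLifecycle == "spot"; on-demand omits the key.
            (spot if inst.get("InstanceLifecycle") == "spot" else on_demand).append(name)
    return on_demand, spot

sample = {"Reservations": [{"Instances": [
    {"InstanceId": "i-0abc", "State": {"Name": "running"},
     "Tags": [{"Key": "Name", "Value": "bld-linux64-ec2-0001"}]},
    {"InstanceId": "i-0def", "State": {"Name": "running"},
     "InstanceLifecycle": "spot",
     "Tags": [{"Key": "Name", "Value": "bld-linux64-spot-0031"}]},
    {"InstanceId": "i-0bad", "State": {"Name": "terminated"}, "Tags": []},
]}]}
on_demand, spot = slaves_from_response(sample)
print(on_demand, spot)
```

So spot could work the same way as on-demand, at the cost of the list being more volatile between polls.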
Historically, the reason we've done the mapping is to allow relatively quick re-adjustment of pool sizes. The theory was that we could quickly move additional builders into the try pool.

Before opining on whether this is a good idea, I'd like to understand how our existing use cases are handled under this approach. Note that:
 a) too many of our use cases are unwritten lore
 b) "no longer a business need" is a valid answer

Use cases that I believe I was told the old mapping supported were:
 - easy to move machines from one pool to another without messing up relops & dcops (pre-inventory days)
(maybe that's it -- other team members with better lore should be consulted)

In general, I love the idea, as I hate those list comprehensions. I do not like "encoding" information into hostnames -- I believe in a uid for host names, handling the rest with various lookups (inventory, CNAMEs, relengapi). "Encoding" has many shortcomings that have bitten both us and groups trying to work with us (think builder names, branch names).
I worry about:
* staging/dev, where we may have non-standard naming conventions and might otherwise want to point at non-production slave lists. As long as we can manage that easily, this should be ok.
* aiui, spot instances don't have dns, by design.
* test-masters.sh may become [more?] dependent on a network connection. I'd love to keep these things runnable on a laptop without network; we may be able to help with this by pointing at a flatfile.
* new slave additions will still require a reconfig. This probably isn't an issue, since it isn't a new requirement; just noting that adding new machines with the right name isn't sufficient by itself.
We chatted a bit about this today. I'd like to propose that we treat slavealloc as the Source of Truth for these names.

For production, we add something to 'make update' that downloads the relevant bits from slavealloc and saves them locally. master.cfg (or some file it imports) then references that local data to construct the list of buildbot slave objects.
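A minimal sketch of that local-cache idea, assuming a hypothetical JSON schema for the downloaded slavealloc data (the real dump format and file name would differ):

```python
import json

# 'make update' would download this file from slavealloc; master.cfg then reads it
# instead of hard-coding hostnames. File name and schema are illustrative.
CACHE = "slavealloc-cache.json"

def load_slave_names(path, pool):
    """Return enabled slave names for one pool from the locally cached slavealloc dump."""
    with open(path) as f:
        data = json.load(f)
    return sorted(s["name"] for s in data["slaves"]
                  if s["pool"] == pool and s["enabled"])

# Write a small sample cache so the sketch is self-contained.
with open(CACHE, "w") as f:
    json.dump({"slaves": [
        {"name": "bld-linux64-0001", "pool": "bld-linux64", "enabled": True},
        {"name": "bld-linux64-0002", "pool": "bld-linux64", "enabled": False},
        {"name": "try-win32-0007", "pool": "try-win32", "enabled": True},
    ]}, f)

names = load_slave_names(CACHE, "bld-linux64")
print(names)  # ['bld-linux64-0001']
# In master.cfg these names would then be wrapped in buildbot slave objects,
# e.g. c['slaves'] = [BuildSlave(n, password) for n in names].
```

Because master.cfg only reads the local file, masters stay startable even when slavealloc is unreachable; the data is just as stale as the last 'make update'.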

For dev/staging, we could do something similar. TBD what to do with test-masters.sh.

The drawback here is if a slave is pulled from production to do some testing, then a reconfig happens, then the slave is put back into production in slavealloc, it will require another reconfig to be 'live'. IMO, this is a pretty minor downside.
Slavealloc also isn't trustworthy.

Maybe it's time to fix that, perhaps by moving slavealloc into relengapi?
We're certainly treating it as trustworthy from buildbot and puppet.
No, we were actually *very* careful about that in puppet for exactly this reason.  Slave trustlevels, which are the important bit of data in terms of trustworthiness, are not gathered from slavealloc.  In fact, the only thing puppet currently uses is the environment.

I agree that we should have a single list of slaves in a database somewhere -- just not in slavealloc as it's currently defined.  I think that the easiest fix to this is to move the administrative bits of slavealloc into relengapi, which has proper authentication.  The allocator itself is so simple that it could also be rewritten easily, thereby freeing slavealloc of Twisted, but that's optional.

Relatedly, would it make sense to synchronize that list of slaves from higher sources of truth, even though there are several of those (inventory for on-site, AWS for EC2, probably another vendor for cloud Macs)? That could certainly be a subsequent step in developing this fix.
See Also: → 1087013
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2101]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2101] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2110]
TC FTW
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: General Automation → General