Today we were having trouble with Amazon's use1 zone, so we decided to disable all of the masters in that zone. Unfortunately, slaves are locked to a zone, which prevents them from connecting to masters elsewhere; instead, a slave sticks to the last master it was connected to. The reason is that slavealloc returns "no allocation available" rather than a buildbot.tac, and I assume runslave.py then falls back to the existing buildbot.tac. Perhaps we should give each slave a prioritized list of pools, so that it tries each pool in turn and connects to whichever one has an available master.

http://hg.mozilla.org/build/tools/file/default/lib/python/slavealloc/daemon/http/gettac.py#l29
http://hg.mozilla.org/build/puppet-manifests/file/tip/modules/buildslave/files/runslave.py#l141
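A minimal sketch of the prioritized-pool idea described above: instead of failing with "no allocation available" when the slave's single pool has no enabled masters, walk an ordered list of pools and take the first one that can serve. The function name, the pool/master data shapes, and `current_slaves` are all hypothetical, not the actual slavealloc schema or API.

```python
def allocate_from_pools(pool_priority, masters_by_pool):
    """Return a master from the first pool that has any enabled master.

    pool_priority: list of pool names, most-preferred first.
    masters_by_pool: dict mapping pool name -> list of enabled master dicts.
    """
    for pool in pool_priority:
        candidates = masters_by_pool.get(pool, [])
        if candidates:
            # Pick the least-loaded master within the winning pool.
            return min(candidates, key=lambda m: m["current_slaves"])
    # No pool had an available master; the caller would then emit
    # "no allocation available", as gettac.py does today.
    return None
```

With this, disabling every master in a slave's preferred pool (as happened with use1) would let the slave spill over to its next-priority pool instead of reusing a stale buildbot.tac.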
Pools weren't originally intended for location-based allocation, although they were used for that purpose almost immediately once slavealloc was put into production. I think the better solution would be to make slavealloc's allocation algorithm more sophisticated: load masters equally, but prefer "local" masters. I believe there's already a DC column giving the datacenter of each host and master.
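The "prefer local, balance load" allocation could be sketched as a single sort key: locality first, current load second. The `Master` class and `allocate_master` function here are illustrative assumptions, not slavealloc's real model; the DC comparison stands in for the existing datacenter column.

```python
from dataclasses import dataclass

@dataclass
class Master:
    name: str
    datacenter: str       # stands in for slavealloc's DC column
    current_slaves: int   # current load on this master

def allocate_master(slave_dc, masters):
    """Pick the least-loaded master, preferring the slave's own datacenter.

    Sorting on (is_remote, load) means any local master beats any remote
    one, and within each group the least-loaded master wins, so masters
    still fill evenly.
    """
    if not masters:
        return None
    return min(masters,
               key=lambda m: (m.datacenter != slave_dc, m.current_slaves))
```

If the slave's datacenter has no masters at all (everything there disabled), the key degrades gracefully to plain load balancing across the remaining remote masters.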
I like Dustin's idea, because it would also result in Windows hosts in our datacenters talking to local masters, minimising the number of interrupted jobs when the network has a little glitch on the way to AWS.
Component: General Automation → Tools
QA Contact: catlee → hwine
Component: Tools → General
Product: Release Engineering → Release Engineering
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → WONTFIX