So, slaveapi uses gevent while http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py does not. The issue, though, is that there is at least one bug in slaveapi that makes a request from reboot-idle-slaves take its full 5 hours before it times out and fails. reboot-idle-slaves also has workers that do nothing while they wait for a response: each one blocks in a time.sleep(30) (see line 72). To add to the hassle, reboot-idle-slaves runs from a 4hr cron with a lockfile, so as the workers fill up waiting on a bug like that, it takes longer and longer to do the work. We end up with a mostly idle slaveapi and a mostly idle slave-rebooter, but with a lot of work that should be done. A good fix to allow more work through the queue more often would be to take advantage of gevent in reboot-idle-slaves.py as well, which would allow us to use gevent's sleep() and do other work during that sleep period.
Copied/pasted from https://bugzilla.mozilla.org/show_bug.cgi?id=1040150#c7:

Before too much work is done on this, let's discuss at the next build duty meeting. From looking at the code, it appears to me that:

Slave Rebooter observations
===========================

1) Slave Rebooter is smart enough about keeping x number of threads alive at any time. Every half a second it checks if any threads are finished, so it can add a new thread to the queue.
(http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#135)

2) Maybe the limit of 16 worker threads should be increased (I believe threads should be much less resource consuming than processes - I can imagine we could bump this up a lot - e.g. hundreds of threads).
(http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#25)

3) Slave Rebooter checks the status of the Buildbot slave graceful shutdown request every 30 seconds - with no limit on the number of times to recheck - it simply keeps querying Slave API until state is not 'PENDING' nor 'RUNNING' - probably Slave Rebooter should "give up" after a reasonable while, and push forward with the machine reboot even if Buildbot slave shutdown has not completed.
(http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#72)

Slave API observations
======================

1) Maybe Slave API could have a lower timeout for waiting for the slave to shutdown - certainly in the case that the slave is known not to be running a job, or if a job can be seen as stalled (e.g. Buildbot slave has not touched its log file in a certain period). Perhaps observation of the max timeout (currently 5 hours) could be limited only to the case when the slave is known to be running a job, and that job appears to be active.
(https://github.com/bhearsum/slaveapi/blob/master/slaveapi/actions/shutdown_buildslave.py#L13)

2) I'd propose to rename the method shutdown_buildslave to shutdown_buildbot_slave - otherwise it can be interpreted as shutting down the slave machine, as opposed to rebooting the slave machine.
(http://mozilla-slaveapi.readthedocs.org/en/latest/api/)
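
To make the "give up after a reasonable while" idea from Slave Rebooter observation 3 concrete, here is a minimal sketch of a bounded polling loop. It is not the code in reboot-idle-slaves.py: the function name wait_for_shutdown, the get_state callable, and the max_wait default are all hypothetical, and only the 30 second interval and the 'PENDING'/'RUNNING' states come from the observations above.

```python
import time

# States that mean the graceful shutdown request is still in flight
# (taken from the observation above).
IN_FLIGHT_STATES = ("PENDING", "RUNNING")


def wait_for_shutdown(get_state, max_wait=300, interval=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it leaves PENDING/RUNNING or max_wait elapses.

    get_state is a hypothetical callable that returns the current state
    string reported by Slave API for the shutdown request.
    Returns (last_state, timed_out).
    """
    deadline = clock() + max_wait
    state = get_state()
    while state in IN_FLIGHT_STATES:
        if clock() >= deadline:
            # Give up: the caller can push forward with the reboot.
            return state, True
        sleep(interval)
        state = get_state()
    return state, False
```

A caller could treat timed_out as the signal to go ahead with the machine reboot instead of holding a worker thread indefinitely. The sleep and clock parameters exist only so the loop can be exercised without real waiting, or swapped for a cooperative sleep if one is adopted.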
(In reply to Justin Wood (:Callek) from comment #0)
> A good fix to allow more work through the queue more often would be to take
> advantage of gevent in reboot-idle-slaves.py as well, which would allow us
> to use gevents sleep() and do other work during that sleep period.

Just to call out that we try to avoid gevent where we can:
https://wiki.mozilla.org/ReleaseEngineering/Services_Best_Practices#Server-side

If we're already tainted by slaveapi though, maybe I don't care. Maybe it's also time to revisit this particular best practice.