slaverebooter should take full advantage of slaveapi's workers, and issue work requests even while work is pending

NEW
Unassigned

Status

Release Engineering
Tools
3 years ago
2 years ago

People

(Reporter: Callek, Unassigned)

Details

(Reporter)

Description

3 years ago
So, slaveapi uses gevent while http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py does not.

The issue, though, is that there is at least one bug in slaveapi that makes a request from reboot-idle-slaves take its full 5 hours before it times out and fails.

reboot-idle-slaves also has workers that do nothing while they wait for a response: each one just blocks in a time.sleep(30) (see line 72).

To add to the hassle, reboot-idle-slaves has a 4hr cron with a lockfile...

So as the workers fill up waiting on a bug like that, it takes longer and longer to do the work, and we end up with a mostly idle slaveapi and a mostly idle slave-rebooter, but with a lot of work that should be done.

A good fix to allow more work through the queue more often would be to take advantage of gevent in reboot-idle-slaves.py as well, which would allow us to use gevent's sleep() and do other work during that sleep period.
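
To illustrate the idea of doing other work during the sleep period, here is a minimal sketch. It uses asyncio from the standard library as a stand-in for gevent, so it is not what reboot-idle-slaves.py does today and not a drop-in patch. fake_get_state, reboot_one, reboot_all, MAX_IN_FLIGHT and the tiny POLL_INTERVAL are all hypothetical names and values made up for the example.

```python
import asyncio

MAX_IN_FLIGHT = 16      # mirrors the current worker limit (assumption: could be raised)
POLL_INTERVAL = 0.01    # stands in for the real 30-second recheck


async def fake_get_state(slave, polls_needed, counters):
    """Hypothetical stand-in for asking slaveapi about a pending request."""
    counters[slave] = counters.get(slave, 0) + 1
    return "SUCCESS" if counters[slave] >= polls_needed else "PENDING"


async def reboot_one(slave, polls_needed, counters, limiter):
    # The semaphore caps how many requests are in flight at once.
    async with limiter:
        while True:
            state = await fake_get_state(slave, polls_needed, counters)
            if state not in ("PENDING", "RUNNING"):
                return slave, state
            # Cooperative sleep: control goes back to the event loop,
            # so the other slaves make progress while this one waits.
            await asyncio.sleep(POLL_INTERVAL)


async def reboot_all(slaves, polls_needed=3):
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    counters = {}
    tasks = [reboot_one(s, polls_needed, counters, limiter) for s in slaves]
    # gather() returns results in the same order as the input list.
    return await asyncio.gather(*tasks)


def run(slaves):
    return asyncio.run(reboot_all(slaves))
```

With gevent the same shape would presumably use gevent.sleep() and a bounded pool of greenlets instead of the semaphore. The point is only that a waiting request no longer holds a whole worker hostage.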
(Reporter)

Updated

3 years ago
Depends on: 981039
Copied/pasted from https://bugzilla.mozilla.org/show_bug.cgi?id=1040150#c7:

Before too much work is done on this, let's discuss at the next build duty meeting.

From looking at the code, it appears to me that:

Slave Rebooter observations
===========================

  1) Slave Rebooter is smart enough about keeping x number of threads alive at any time. Every half a second it checks if any threads are finished, so it can add a new thread to the queue. (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#135)
  2) Maybe the limit of 16 worker threads should be increased (I believe threads should be much less resource consuming than processes - I can imagine we could bump this up a lot - e.g. hundreds of threads). (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#25)
  3) Slave Rebooter checks the status of the Buildbot slave graceful shutdown request every 30 seconds, with no limit on the number of rechecks: it simply keeps querying Slave API until the state is neither 'PENDING' nor 'RUNNING'. Slave Rebooter should probably "give up" after a reasonable while and push forward with the machine reboot even if the Buildbot slave shutdown has not completed. (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#72)
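
As a rough illustration of observation 3, a bounded version of the polling loop could look like the sketch below. This is not the existing code: wait_for_shutdown, get_state and the 600-second default for max_wait are hypothetical, and only the 30-second interval and the 'PENDING'/'RUNNING' states come from the script. The sleep and clock arguments exist only so the loop can be exercised without really waiting.

```python
import time

IN_PROGRESS = ("PENDING", "RUNNING")


def wait_for_shutdown(get_state, interval=30, max_wait=600,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it leaves IN_PROGRESS or max_wait elapses.

    Returns the final state, or None if we gave up waiting.
    max_wait=600 is an arbitrary placeholder, not a recommended value.
    """
    deadline = clock() + max_wait
    while True:
        state = get_state()
        if state not in IN_PROGRESS:
            return state
        if clock() >= deadline:
            # Give up; the caller can push forward with the reboot.
            return None
        sleep(interval)
```

A caller would treat a None result as "graceful shutdown did not finish in time" and go ahead with the machine reboot anyway.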

Slave API observations
======================

  1) Maybe Slave API could have a lower timeout for waiting for the slave to shut down, certainly in the case that the slave is known not to be running a job, or if a job can be seen as stalled (e.g. Buildbot slave has not touched its log file in a certain period). Perhaps the max timeout (currently 5 hours) could apply only when the slave is known to be running a job and that job appears to be active. (https://github.com/bhearsum/slaveapi/blob/master/slaveapi/actions/shutdown_buildslave.py#L13)
  2) I'd propose to rename the method shutdown_buildslave to shutdown_buildbot_slave - otherwise it can be interpreted as shutting down the slave machine, as opposed to rebooting the slave machine. (http://mozilla-slaveapi.readthedocs.org/en/latest/api/)
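
To make Slave API observation 1 concrete, the timeout choice could be sketched roughly as follows. This is only an assumption about how it might be structured, not how shutdown_buildslave.py works. shutdown_timeout, IDLE_TIMEOUT and STALLED_AFTER are hypothetical names with placeholder values; only the 5-hour maximum comes from the existing code.

```python
MAX_TIMEOUT = 5 * 60 * 60   # the current 5-hour maximum
IDLE_TIMEOUT = 5 * 60       # hypothetical: short wait when no active job
STALLED_AFTER = 60 * 60     # hypothetical: log untouched this long => stalled


def shutdown_timeout(running_job, seconds_since_log_touched):
    """Pick how long to wait for the Buildbot slave to shut down."""
    if not running_job:
        # Slave is known not to be running a job: no reason to wait long.
        return IDLE_TIMEOUT
    if seconds_since_log_touched >= STALLED_AFTER:
        # A job exists but looks stalled: treat it like an idle slave.
        return IDLE_TIMEOUT
    # Job is running and appears active: keep the full timeout.
    return MAX_TIMEOUT
```

How Slave API would actually learn whether a job is running, or when the log file was last touched, is left open here.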

Comment 2

3 years ago
(In reply to Justin Wood (:Callek) from comment #0)
> A good fix to allow more work through the queue more often would be to take
> advantage of gevent in reboot-idle-slaves.py as well, which would allow us
> to use gevents sleep() and do other work during that sleep period.

Just to call out that we try to avoid gevent where we can:

https://wiki.mozilla.org/ReleaseEngineering/Services_Best_Practices#Server-side

If we're already tainted by slaveapi though, maybe I don't care. Maybe it's also time to revisit this particular best practice.