
slaverebooter should take full advantage of slaveapi's workers, and issue work requests even while work is pending

Status: NEW
Assignee: Unassigned
Product: Release Engineering
Component: General
Opened: 3 years ago
Last updated: 3 months ago
Reporter: Callek

(Reporter)

Description

3 years ago
So, slaveapi uses gevent while http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py does not.

The issue, though, is that there is at least one bug in slaveapi that makes a request from reboot-idle-slaves take its full 5 hours before it times out and fails.

reboot-idle-slaves also has workers that do nothing while they wait for a response: they just block in a time.sleep(30) (see line 72).

To add to the hassle, reboot-idle-slaves runs from a 4hr cron with a lockfile...

So as the workers fill up waiting on a bug like that, it takes longer and longer to do the work, and we end up with a mostly idle slaveapi and a mostly idle slave-rebooter, but with a lot of work that should be done.

A good fix to allow more work through the queue more often would be to take advantage of gevent in reboot-idle-slaves.py as well, which would allow us to use gevent's sleep() and do other work during that sleep period.
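To illustrate the cooperative-sleep idea, here is a minimal sketch. It is not the reboot-idle-slaves.py code, and it swaps in the standard library's asyncio for gevent so that it runs without extra dependencies; the principle (a sleeping worker yields so others can make progress) is the same. The slave names, poll counts and the short POLL_INTERVAL are made up for the demo.

```python
import asyncio

POLL_INTERVAL = 0.01  # demo value; the real script sleeps 30 seconds per check


async def wait_for_result(slave, polls_needed, finished):
    """Poll a simulated pending request, yielding control on every sleep."""
    for _ in range(polls_needed):
        # Cooperative sleep: other workers run while this one waits.
        await asyncio.sleep(POLL_INTERVAL)
    finished.append(slave)


async def reboot_all(slaves):
    """Start one worker per slave and return names in completion order."""
    finished = []
    await asyncio.gather(
        *(wait_for_result(name, polls, finished) for name, polls in slaves)
    )
    return finished


def run_demo():
    # "slow-slave" stands in for a request that is stuck on the slaveapi side.
    slaves = [("slow-slave", 20), ("fast-slave-1", 1), ("fast-slave-2", 2)]
    return asyncio.run(reboot_all(slaves))
```

Calling run_demo() should show the stuck request finishing last even though it was started first, which is the behaviour a blocking time.sleep() in a full worker pool cannot give.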
(Reporter)

Updated

3 years ago
Depends on: 981039
Copied/pasted from https://bugzilla.mozilla.org/show_bug.cgi?id=1040150#c7:

Before too much work is done on this, let's discuss at the next build duty meeting.

From looking at the code, it appears to me that:

Slave Rebooter observations
===========================

  1) Slave Rebooter is smart enough about keeping x number of threads alive at any time. Every half a second it checks if any threads are finished, so it can add a new thread to the queue. (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#135)
  2) Maybe the limit of 16 worker threads should be increased (I believe threads should be much less resource consuming than processes - I can imagine we could bump this up a lot - e.g. hundreds of threads). (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#25)
  3) Slave Rebooter checks the status of the Buildbot slave graceful shutdown request every 30 seconds, with no limit on the number of rechecks: it simply keeps querying Slave API until the state is neither 'PENDING' nor 'RUNNING'. Slave Rebooter should probably "give up" after a reasonable while and push forward with the machine reboot even if the Buildbot slave shutdown has not completed. (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#72)
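
The "give up after a reasonable while" idea from point 3 could look roughly like the sketch below. This is an illustration only, not the existing script: wait_for_shutdown, get_state and the max_wait default are hypothetical names and values, and any state other than 'PENDING' or 'RUNNING' (for example 'SUCCESS' in the usage note) is an assumption.

```python
import time


def wait_for_shutdown(get_state, max_wait=300, interval=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it leaves PENDING/RUNNING or max_wait elapses.

    Returns the final state, or None if the deadline passed first, in which
    case the caller can go ahead with the machine reboot anyway.
    """
    deadline = clock() + max_wait
    while True:
        state = get_state()
        if state not in ("PENDING", "RUNNING"):
            return state  # shutdown request finished, one way or another
        if clock() >= deadline:
            return None  # give up waiting instead of polling forever
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; a caller would normally pass just get_state, e.g. a function that asks Slave API for the request state and returns a string such as 'SUCCESS'.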

Slave API observations
======================
  1) Maybe Slave API could have a lower timeout for waiting for the slave to shut down, certainly in the case that the slave is known not to be running a job, or if a job can be seen as stalled (e.g. the Buildbot slave has not touched its log file in a certain period). Perhaps the max timeout (currently 5 hours) could be applied only when the slave is known to be running a job and that job appears to be active. (https://github.com/bhearsum/slaveapi/blob/master/slaveapi/actions/shutdown_buildslave.py#L13)
  2) I'd propose to rename the method shutdown_buildslave to shutdown_buildbot_slave - otherwise it can be interpreted as shutting down the slave machine, as opposed to rebooting the slave machine. (http://mozilla-slaveapi.readthedocs.org/en/latest/api/)
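
As a rough sketch of the timeout rule suggested in point 1 (not slaveapi code): only the 5 hour maximum comes from the discussion above; the function name, its parameters, the short timeout and the stall threshold are all assumptions made for illustration.

```python
MAX_SHUTDOWN_WAIT = 5 * 60 * 60   # current maximum: 5 hours, in seconds
SHORT_SHUTDOWN_WAIT = 5 * 60      # assumed value for idle or stalled slaves
STALL_THRESHOLD = 30 * 60         # assumed: log untouched this long = stalled


def shutdown_timeout(running_job, seconds_since_log_write):
    """Pick how long to wait for a Buildbot slave to shut down gracefully."""
    job_is_active = running_job and seconds_since_log_write < STALL_THRESHOLD
    if job_is_active:
        # Only an active job justifies waiting the full maximum.
        return MAX_SHUTDOWN_WAIT
    # No job, or a job that looks stalled: do not tie up a worker for hours.
    return SHORT_SHUTDOWN_WAIT
```

With a rule like this, an idle slave would hold a worker for minutes rather than hours, which is the queue-clogging case described in comment 0.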

Comment 2

3 years ago
(In reply to Justin Wood (:Callek) from comment #0)
> A good fix to allow more work through the queue more often would be to take
> advantage of gevent in reboot-idle-slaves.py as well, which would allow us
> to use gevents sleep() and do other work during that sleep period.

Just to call out that we try to avoid gevent where we can:

https://wiki.mozilla.org/ReleaseEngineering/Services_Best_Practices#Server-side

If we're already tainted by slaveapi though, maybe I don't care. Maybe it's also time to revisit this particular best practice.
(Assignee)

Updated

3 months ago
Component: Tools → General
Product: Release Engineering → Release Engineering