The pending count for Windows test slaves is quite high. This *may* be related to the same slaveapi issue mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1039313#c3. Another hypothesis is that this is related to the password change markco announced in #mozbuild:

markco: Heads up to anyone working with Windows. The passwords are being change this morning to what is currently in the password file.
Boxes not marked as broken in slave_health are being rebooted now - let's see if that improves the situation. Thanks Callek for your help!
fwiw, many of them had never attempted a graceful shutdown, and others had failed graceful shutdown.

NOTE that slaverebooter currently only works through a max of 16 machines in parallel, and when we issue the "graceful shutdown" command to a machine, we wait up to ~6 hours to verify it has shut down. So on Windows, if we have 50 machines needing a reboot, churning through those 6-hour waits can take quite a while.
(In reply to Justin Wood (:Callek) from comment #2)
> fwiw many of them had never attempted a graceful shutdown, and others had
> failed graceful shutdown.
>
> NOTE that with slaverebooter we currently only go through a max of 16
> machines in parallel, and when we issue the "graceful shutdown" command to a
> machine, we wait until we can verify it is shutdown with a max of ~6 hours.
> So with windows that means if we have 50 machines needing reboot, that can
> take a while while we wait to churn through those 6 hours.

Do we fire off new requests as each of the 16 complete, or does everything block on one failed slave?
(In reply to Justin Wood (:Callek) from comment #2)
> NOTE that with slaverebooter we currently only go through a max of 16
> machines in parallel, and when we issue the "graceful shutdown" command to a
> machine, we wait until we can verify it is shutdown with a max of ~6 hours.
> So with windows that means if we have 50 machines needing reboot, that can
> take a while while we wait to churn through those 6 hours.

In case it's not clear, that's an unacceptable turnaround. If we end up waiting the full 6 hours for 16 slaves, as it seems we were last week, we will barely manage to *try* to reboot all 390 Windows testers *once* in a given week, much less all the other slaves. This is a step backwards from kittenherder. Until we fix the broken parallelism here, I advocate returning to the kittenherder model, i.e. creating separate .ini files for each slavetype on bm74 and running a separate invocation of slaverebooter for each type.
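The back-of-the-envelope math behind that claim, as a sketch (assuming the worst case where every graceful-shutdown attempt runs out the full timeout; the batch-of-16 model is a simplification of the real refill behaviour):

```python
import math

# Numbers from this thread: 16 parallel workers, ~6 h worst-case
# wait per graceful-shutdown attempt.
WORKERS = 16
TIMEOUT_HOURS = 6

def worst_case_hours(num_slaves):
    """Hours to churn through num_slaves if every slave hits the timeout."""
    batches = math.ceil(num_slaves / WORKERS)
    return batches * TIMEOUT_HOURS

print(worst_case_hours(50))   # 4 batches * 6 h = 24 h for 50 stuck slaves
print(worst_case_hours(390))  # 25 batches * 6 h = 150 h, most of a 168 h week
```

150 of the 168 hours in a week just to attempt each of the 390 Windows testers once, which matches the "barely once a week" concern above.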
It looks like the initial issue of high pending Windows slaves is under control now, and we have two follow-up bugs to help keep this from happening again:

full solution -- parallelize slaverebooter: bug 978898 (existing bug)
tmp solution -- spawn many slaverebooter procs: no bug yet

I can close this bug and file a new one for the tmp solution. I'm happy to try hacking on it while on buildduty. However, Callek, any idea how many person-hours bug 978898 would take? I'm wondering if it makes sense to just tackle the full solution and skip the band-aid. I can try and help with that as well; coop's concerns in comment 4 make it sound like this will be an issue again.
I'm going to work on tackling the full solution this week; I suspect it won't be any more effort [albeit, yes, more time] than the tmp solution. (I could have the tmp solution done in a day or two; the full solution will likely take a week or two.)

The need for the tmp solution can also be mitigated by simply going through slave_health and rebooting obstinate slaves manually (manual as in via the slaveapi links, instead of using graceful shutdown) -- which is my current daily-AM practice, in order to assist buildduty. I skip any AM without pending (e.g. yesterday, a weekend).

I'd personally prefer to leave this bug open as a reminder that it's important, of course.
Before too much work is done on this, let's discuss at the next build duty meeting. From looking at the code, it appears to me that:

Slave Rebooter observations
===========================
1) Slave Rebooter is smart enough to keep x number of threads alive at any time. Every half a second it checks whether any threads have finished, so it can add a new thread to the queue.
(http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#135)
2) Maybe the limit of 16 worker threads should be increased (threads should be much less resource-consuming than processes - I can imagine we could bump this up a lot, e.g. to hundreds of threads).
(http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#25)
3) Slave Rebooter checks the status of the Buildbot slave graceful shutdown request every 30 seconds, with no limit on the number of rechecks - it simply keeps querying Slave API until the state is neither 'PENDING' nor 'RUNNING'. Slave Rebooter should probably "give up" after a reasonable while and push forward with the machine reboot even if the Buildbot slave shutdown has not completed.
(http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#72)

Slave API observations
======================
1) Maybe Slave API could have a lower timeout for waiting for the slave to shut down - certainly in the case where the slave is known not to be running a job, or where a job can be seen to have stalled (e.g. the Buildbot slave has not touched its log file in a certain period). Perhaps the max timeout (currently 5 hours) could be honored only when the slave is known to be running a job, and that job appears to be active.
(https://github.com/bhearsum/slaveapi/blob/master/slaveapi/actions/shutdown_buildslave.py#L13)
2) I'd propose renaming the method shutdown_buildslave to shutdown_buildbot_slave - otherwise it can be interpreted as shutting down the slave machine, as opposed to rebooting the slave machine.
(http://mozilla-slaveapi.readthedocs.org/en/latest/api/)
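Observation 3 above could look something like this. A minimal sketch, not the actual reboot-idle-slaves.py code; `get_shutdown_state` and the 1-hour cap are made-up placeholders for illustration:

```python
import time

# Hypothetical sketch: poll Slave API for the shutdown request's state
# every 30 s, but give up after a cap instead of looping forever.
# `get_shutdown_state` stands in for the real Slave API status query.
POLL_INTERVAL = 30    # seconds between state checks (matches the current code)
MAX_WAIT = 60 * 60    # proposed cap: give up after 1 hour (made-up value)

def wait_for_graceful_shutdown(get_shutdown_state, sleep=time.sleep):
    waited = 0
    while waited < MAX_WAIT:
        state = get_shutdown_state()
        if state not in ("PENDING", "RUNNING"):
            return state        # request finished (or failed) - done waiting
        sleep(POLL_INTERVAL)
        waited += POLL_INTERVAL
    return "TIMED_OUT"          # push on with the machine reboot anyway

# e.g. a slave whose shutdown request completes on the third poll:
states = iter(["PENDING", "RUNNING", "SUCCESS"])
print(wait_for_graceful_shutdown(lambda: next(states), sleep=lambda s: None))
```

The only behavioural change from the current loop is the `MAX_WAIT` bound; everything else mirrors the 30-second recheck described in observation 3.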
> Do we fire off new requests as each of the 16 complete, or does everything
> block on one failed slave?

We fire off new requests as each completes. At any time, 16 requests are active; within 0.5 seconds of any one of them completing, another thread fires up with the next request in the queue. So we only block completely for 5 hours if all 16 are blocked for 5 hours.

In any case, the 5-hour timeout in Slave API is intended to allow jobs to complete - but Slave Rebooter only reboots slaves that aren't taking jobs, so it should never need to wait for a job to complete before shutting down the build slave.
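The refill behaviour described above is the standard worker-pool pattern; a minimal illustration (the slave names and the `reboot` stub are made up, and this uses a stock thread pool rather than the script's own polling loop):

```python
import concurrent.futures

# Illustration of the refill behaviour: with 16 workers, a new reboot
# request starts as soon as any one finishes, so one slow slave only
# ties up one slot, not the whole pool.
def reboot(slave):
    # stand-in for the real graceful-shutdown + reboot call
    return "rebooted %s" % slave

slaves = ["t-w864-ix-%03d" % i for i in range(1, 51)]  # made-up names
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(reboot, slaves))
print(len(results))  # 50 - all slaves processed, at most 16 at a time
```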
There are some other issues:

1) Slave API fix: If a Buildbot slave has already shut down, Slave API will return a "FAILED" response to the shutdown request. This is because the operation to test the state of a slave via the Buildbot master is expensive, so Slave API does not check the state before attempting to shut the slave down; when the shutdown then fails, it returns the failure code. We could work around this in several ways:
i) for darwin/linux, check the /builds/slave/twistd.pid file for a process id, then query the process table to check that the process is still running and really is the twistd process (process ids can get reused by the OS); and
ii) for windows, check the last-modified date of 'C:\builds\moz2_slave\twistd.log', or maybe check whether its last line is the shutdown message: "<timestamp> [-] Server Shut Down."

2) Slave Rebooter fix: Maybe we can skip the shutdown_buildslave step in Slave Rebooter altogether - since in theory it only runs against slaves that aren't taking jobs - or, at a minimum, specify a shorter timeout than the default 5 hours.
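Sketches of the two "is buildbot already down?" checks from 1i) and 1ii). The paths come from the comment above; the function names, the `/proc` lookup, and the 15-minute staleness window are assumptions for illustration:

```python
import os
import time

def buildbot_running_posix(pid_file="/builds/slave/twistd.pid"):
    """darwin/linux: read twistd.pid and verify the process still exists
    and really is twistd (PIDs get reused by the OS). The /proc lookup
    is linux-only; darwin would need e.g. `ps` output instead."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False  # no pid file (or garbage in it): not running
    try:
        with open("/proc/%d/cmdline" % pid, "rb") as f:
            return b"twistd" in f.read()
    except IOError:
        return False  # no such process

def buildbot_running_windows(log_file=r"C:\builds\moz2_slave\twistd.log",
                             stale_after=15 * 60):
    """windows: treat a twistd.log untouched for `stale_after` seconds
    as a sign the buildbot slave is no longer running."""
    try:
        age = time.time() - os.path.getmtime(log_file)
    except OSError:
        return False  # no log file at all
    return age < stale_after
```

Either check would let Slave Rebooter (or Slave API) skip the expensive buildbot-master query and avoid issuing a shutdown that is guaranteed to come back "FAILED".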
We're in much better shape here after the slaverebooter changes. Further fixes to the slave configs themselves, slaveapi, and hardware capacity concerns will indeed have to happen, but this bug as a problem tracker has outlived its usefulness, imo.