Bug 1040150 - High pending count for Windows test slaves
Status: RESOLVED FIXED
Product: Release Engineering
Classification: Other
Component: Buildduty
Version: unspecified
Hardware: x86 Mac OS X
Importance: -- normal
Target Milestone: ---
Assigned To: Nobody; OK to take it and work on it
: Justin Wood (:Callek)
: Chris AtLee [:catlee]
Depends on: 1045458 1048358
Reported: 2014-07-17 09:20 PDT by Simone Bruno [:simone]
Modified: 2014-09-08 08:39 PDT
Description Simone Bruno [:simone] 2014-07-17 09:20:04 PDT
The pending count for Windows test slaves is quite high.
This *may* be related to the same slaveapi issue mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1039313#c3.

Another hypothesis is that this is related to a password change announced by markco in #mozbuild:

markco: Heads up to anyone working with Windows. The passwords are being changed this morning to what is currently in the password file.
Comment 1 Simone Bruno [:simone] 2014-07-18 09:30:19 PDT
Boxes not marked as broken in slave_health are being rebooted now - let's see if that improves the situation. Thanks Callek for your help!
Comment 2 Justin Wood (:Callek) 2014-07-18 09:33:08 PDT
FWIW, many of them had never attempted a graceful shutdown, and others had failed their graceful shutdown.

NOTE that with slaverebooter we currently only go through a max of 16 machines in parallel, and when we issue the "graceful shutdown" command to a machine, we wait until we can verify it is shut down, with a max of ~6 hours. So with Windows that means if we have 50 machines needing a reboot, it can take quite a while to churn through those 6-hour waits.
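For readers following along, the per-slave flow described above boils down to roughly this (a minimal sketch; the helper names are hypothetical stand-ins, not the real slaveapi/slaverebooter functions):

    # Hypothetical sketch of the per-slave flow described above; the three
    # helpers are stand-ins, not the real slaveapi/slaverebooter API.
    def request_graceful_shutdown(slave): ...    # ask buildbot to stop after its current job
    def wait_for_shutdown(slave, timeout): ...   # poll until the slave is verified down
    def reboot(slave): ...                       # power-cycle the machine via slaveapi

    def reboot_one(slave):
        # slaverebooter runs at most 16 of these in parallel, and the wait
        # alone can hold one of those 16 slots for up to ~6 hours.
        request_graceful_shutdown(slave)
        wait_for_shutdown(slave, timeout=6 * 3600)
        reboot(slave)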
Comment 3 Chris Cooper [:coop] 2014-07-19 12:02:42 PDT
(In reply to Justin Wood (:Callek) from comment #2)
> fwiw many of them had never attempted a graceful shutdown, and others had
> failed graceful shutdown.
> 
> NOTE that with slaverebooter we currently only go through a max of 16
> machines in parallel, and when we issue the "graceful shutdown" command to a
> machine, we wait until we can verify it is shutdown with a max of ~6 hours.
> So with windows that means if we have 50 machines needing reboot, that can
> take a while while we wait to churn through those 6 hours.

Do we fire off new requests as each of the 16 complete, or does everything block on one failed slave?
Comment 4 Chris Cooper [:coop] 2014-07-21 07:04:49 PDT
(In reply to Justin Wood (:Callek) from comment #2)
> NOTE that with slaverebooter we currently only go through a max of 16
> machines in parallel, and when we issue the "graceful shutdown" command to a
> machine, we wait until we can verify it is shutdown with a max of ~6 hours.
> So with windows that means if we have 50 machines needing reboot, that can
> take a while while we wait to churn through those 6 hours.

In case it's not clear, that's an unacceptable turnaround. If we end up waiting the full 6 hours for every 16 slaves, as we apparently were last week, we will barely manage to *try* to reboot all 390 Windows testers *once* in a given week, much less all the other slaves.
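This estimate checks out with some back-of-the-envelope arithmetic (a simplification that treats the rolling pool as fixed batches and assumes the worst case where every slave blocks for the full timeout):

    # Worst-case turnaround: every slave holds a worker for the full 6 hours.
    import math

    slaves = 390      # Windows testers
    workers = 16      # slaverebooter parallelism
    timeout_h = 6     # worst-case wait per slave

    hours = math.ceil(slaves / workers) * timeout_h
    print(hours, "hours ~=", round(hours / 24, 1), "days")   # 150 hours ~= 6.2 days

i.e. barely one full pass over the Windows pool per week in the worst case.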

This is a step backwards from kittenherder. Until we fix the broken parallelism here, I advocate returning to the kittenherder model, i.e. creating separate .ini files for each slavetype on bm74 and running a separate invocation of slaverebooter for each type.
Comment 5 Jordan Lund (:jlund) 2014-07-21 12:40:44 PDT
It looks like the initial issue of the high pending count for Windows slaves is under control now, and we have two follow-ups to help keep this from happening again:

full solution -- parallelize slaverebooter: bug 978898 (existing bug)
tmp solution -- spawn many slaverebooter procs: no bug yet

I can close this bug and file a new one for the tmp solution. I'm happy to try hacking on it while on buildduty.

However, Callek, any idea how many man-hours bug 978898 would take? I'm wondering if it makes sense to just tackle the full solution and skip the band-aid. I can try and help with that as well; coop's concerns in comment 4 make it sound like this will be an issue again.
Comment 6 Justin Wood (:Callek) 2014-07-21 12:49:42 PDT
I'm going to work on tackling the full solution this week; I suspect it won't be any more effort [albeit, yes, more time] than the tmp solution.

(I could have the tmp solution in a day or two; the full solution I anticipate will take a week or two.) The need for the tmp solution can also be mitigated by simply going through slave health and rebooting obstinate slaves manually (manual as in via the slaveapi links, instead of using a graceful shutdown) -- which is my current daily-AM practice, in order to assist buildduty.

I skip any AM without pending jobs (e.g. yesterday, a weekend).

I'd personally prefer to leave this bug open as a reminder that it's important, of course.
Comment 7 Pete Moore [:pmoore][:pete] 2014-07-23 03:28:13 PDT
Before too much work is done on this, let's discuss at the next build duty meeting.

From looking at the code, it appears to me that:

Slave Rebooter observations
===========================

  1) Slave Rebooter keeps a fixed number of worker threads alive at any time. Every half second it checks whether any threads have finished, so it can start a new thread for the next slave in the queue. (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#135)
  2) Maybe the limit of 16 worker threads should be increased (I believe threads should be much less resource-consuming than processes - I can imagine we could bump this up a lot, e.g. hundreds of threads). (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#25)
  3) Slave Rebooter checks the status of the Buildbot slave graceful shutdown request every 30 seconds, with no limit on the number of rechecks - it simply keeps querying Slave API until the state is neither 'PENDING' nor 'RUNNING'. Slave Rebooter should probably "give up" after a reasonable while and push forward with the machine reboot even if the Buildbot slave shutdown has not completed (see the bounded-poll sketch at the end of this comment). (http://mxr.mozilla.org/build/source/tools/buildfarm/maintenance/reboot-idle-slaves.py#72)

Slave API observations
======================
  1) Maybe Slave API could use a lower timeout when waiting for the slave to shut down - certainly in the case that the slave is known not to be running a job, or when a job can be seen as stalled (e.g. the Buildbot slave has not touched its log file in a certain period). Perhaps the max timeout (currently 5 hours) should only be honoured when the slave is known to be running a job and that job appears to be active. (https://github.com/bhearsum/slaveapi/blob/master/slaveapi/actions/shutdown_buildslave.py#L13)
  2) I'd propose renaming the method shutdown_buildslave to shutdown_buildbot_slave - otherwise it can be read as shutting down the slave machine itself, rather than just stopping the Buildbot slave process. (http://mozilla-slaveapi.readthedocs.org/en/latest/api/)
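A rough sketch of what the "give up after a reasonable while" idea (Slave Rebooter observation 3, plus the shorter timeout from Slave API observation 1) might look like; the get_shutdown_status helper and the 15-minute figure are assumptions, not the real Slave API:

    # Hypothetical bounded poll: stop waiting after a deadline instead of
    # polling indefinitely, then let the caller reboot the machine anyway.
    import time

    def get_shutdown_status(slave): ...   # stand-in for the Slave API status query

    def wait_for_graceful_shutdown(slave, poll_interval=30, timeout=15 * 60):
        # 15 minutes is an assumed "reasonable while" for an idle slave; only a
        # slave with an active job would justify anything near the 5-6 hour waits.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if get_shutdown_status(slave) not in ("PENDING", "RUNNING"):
                return True               # shutdown request finished (or failed)
            time.sleep(poll_interval)
        return False                      # gave up; caller proceeds with the reboot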
Comment 8 Pete Moore [:pmoore][:pete] 2014-07-23 07:21:00 PDT
> Do we fire off new requests as each of the 16 complete, or does everything
> block on one failed slave?

We fire off new requests as each one completes; nothing blocks on a single failed slave. At any time, 16 requests are active. Within 0.5 seconds of any one of them completing, another thread will fire up with the next request in the queue. So we only block completely for 5 hours if all 16 are blocked for 5 hours.

In any case, the 5 hour timeout in slave api is intended to allow jobs to complete - but slave rebooter is only rebooting slaves that aren't taking jobs - so it should not need to ever wait for a job to complete before shutting down the build slave.
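To illustrate the rolling-pool behaviour described above, here is a toy demonstration using concurrent.futures (the real script manages raw threads with a 0.5 s poll, per comment 7; the slave names and timings are made up):

    # Toy demo: one "stuck" task occupies one worker; the other 15 workers keep
    # cycling through the remaining tasks, so only that one task is delayed.
    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fake_reboot(n):
        time.sleep(3600 if n == 0 else 1)   # slave 0 simulates a shutdown stuck for an hour
        return n

    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = [pool.submit(fake_reboot, n) for n in range(50)]
        for fut in as_completed(futures):
            print("finished slave", fut.result())
    # The 49 fast slaves finish within a few seconds; total runtime is bounded
    # by the single stuck slave, not by 50 slaves queued behind it.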
Comment 9 Pete Moore [:pmoore][:pete] 2014-07-23 07:28:10 PDT
There are some other issues:

1) Slave API fix: If a buildbot slave has already shut down, Slave API will return a "FAILED" response to the shutdown request. This is because checking the state of a slave via the buildbot master is expensive, so Slave API does not check the state before attempting to shut it down; when the shutdown then fails, it returns the failure code. We could work around this in several ways (a rough sketch follows point 2): i) for darwin/linux we can check the /builds/slave/twistd.pid file to see if there is a known process, and then query the process table to check it is still running and is the correct process (process ids can get reused by the OS, so we want to make sure it really is the twistd process); and ii) for windows we could check the last modified date of 'C:\builds\moz2_slave\twistd.log', or maybe see if the last line is the shutdown marker: "<timestamp> [-] Server Shut Down."

2) Slave Rebooter fix: Maybe we can skip the shutdown_buildslave step in slave rebooter altogether - since in theory it only runs against slaves that aren't taking jobs - or, at a minimum, specify a shorter timeout than the default 5 hours.
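A rough sketch of the checks proposed in point 1 (the paths and the shutdown marker come from the comment above; the linux-only /proc lookup, the process-name check, and the staleness threshold are assumptions):

    # Hypothetical pre-check: is the buildbot slave already down? If so, skip
    # the graceful-shutdown request and go straight to the reboot.
    import os
    import time

    PIDFILE = "/builds/slave/twistd.pid"             # darwin / linux
    TWISTD_LOG = r"C:\builds\moz2_slave\twistd.log"  # windows

    def posix_slave_running():
        try:
            pid = int(open(PIDFILE).read().strip())
        except (IOError, ValueError):
            return False
        try:
            cmdline = open("/proc/%d/cmdline" % pid).read()  # linux; darwin would shell out to ps
        except IOError:
            return False              # pid is no longer in the process table
        return "twistd" in cmdline    # guard against the pid having been reused

    def windows_slave_running(stale_after=4 * 3600):  # staleness threshold is an assumption
        try:
            if time.time() - os.path.getmtime(TWISTD_LOG) > stale_after:
                return False          # log untouched for hours: treat the slave as down
            last_line = open(TWISTD_LOG).readlines()[-1]
        except (IOError, IndexError):
            return False
        return "Server Shut Down." not in last_line   # shutdown marker => already down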
Comment 10 Justin Wood (:Callek) 2014-09-08 08:39:55 PDT
We're much better here after the slaverebooter changes. More ongoing fixes to the slave configs themselves, to slaveapi, and to hardware capacity concerns will indeed have to happen, but this bug as a problem tracker has outlived its usefulness, imo.
