Users should be able to "borrow" TC workers for work such as debugging test failures. Since these are exclusively releng hardware, I'm hoping we can just modify the existing releng loaner process to suit taskcluster.
The loaner process is at https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave. I think we'll only need to modify the cleaning process. Maybe https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Loan_a_Slave&diff=prev&oldid=1167558 is enough? Kim, are you the right person to ask about buildduty stuff? Or do you at least know who the right person is? :D
So are these tc workers just the ones that will have jobs scheduled on them via bbb? If so, they can be disabled via slavealloc, and the documentation changes look good. If they are not scheduled through bbb but are pure tc workers, how are they prevented from picking up production jobs?
Good point. They would be disabled, I think, simply by not having the worker installed. We don't have a good way to prevent a host from running tasks if the borrower is keen on doing so. That said, these are only testers, so stealing a few test jobs is not the end of the world. I should add some text on how to select a worker in the first place and how to wait until it is idle. Actually, rather than waiting until it's idle, I'll just suggest rebooting it -- that may terminate a job with claim-expired, but the job will be automatically re-run. Thanks!
https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Loan_a_Slave&diff=1167656&oldid=1167558 ** For TaskCluster, interrupted jobs are automatically retried, so pick an arbitrary host, kill the generic-worker or taskcluster-worker process, and proceed. Hopefully that's sufficient :) As releng develops more tooling around taskcluster on hardware (slave-health, for example), maybe we can do a better job of disabling workers. Kim, does that sound reasonable?
I think you will need some more text around two phrases. "pick an arbitrary host" <- pick a machine from the existing mac test pool. "kill the generic-worker or taskcluster-worker process" <- ssh as root to the machine, and kill the generic-worker or taskcluster-worker process. Are there new steps needed after the developer is done with the loaner to return it to service? Just to make it clear for the buildduty folks.
Thanks! I think I addressed all of that.
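For reference, the "ssh as root and kill the worker process" step discussed above could be sketched roughly as follows. This is only an illustration, not the documented procedure: the helper name, the example hostname, and the choice of `pkill` as the kill command are all assumptions, and whether `pkill` matches the worker as it appears in the process table on a given host has not been verified here. The script only builds and prints the commands; it does not run them.

```python
import shlex

# The two worker process names mentioned in the thread.
WORKER_PROCESSES = ("generic-worker", "taskcluster-worker")


def build_kill_commands(host, user="root"):
    """Return one ssh argv per worker process name (hypothetical helper).

    Assumption: `pkill <name>` is a suitable way to stop the worker on the
    loaned host. The wiki text only says to "kill" the process.
    """
    return [["ssh", f"{user}@{host}", "pkill", name] for name in WORKER_PROCESSES]


if __name__ == "__main__":
    # "loaner-host.example.com" is a placeholder, not a real pool machine.
    for argv in build_kill_commands("loaner-host.example.com"):
        # Print for review instead of executing; a human runs it deliberately.
        print(" ".join(shlex.quote(part) for part in argv))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; since interrupted jobs are retried automatically, no extra bookkeeping is shown for the task that gets killed.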