Closed Bug 1353508 Opened 7 years ago Closed 7 years ago

Create a mechanism for loaning Mac OS X TC workers

Categories

(Release Engineering :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

Users should be able to "borrow" TC workers for work debugging test failures, etc.

Since these are exclusively releng hardware, I'm hoping we can just modify the existing releng loaner process to suit taskcluster.
The loaner process is at
  https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave

I think we'll only need to modify the cleaning process.  Maybe
  https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Loan_a_Slave&diff=prev&oldid=1167558
is enough?

Kim, are you the right person to ask about buildduty stuff?  Or at least, know who is the right person? :D
Flags: needinfo?(kmoir)
So these TC workers are not only ones that will have jobs scheduled on them via bbb, is that correct?  If they are scheduled via bbb, they can be disabled via slavealloc, and in that case the documentation changes look good.

If they are not scheduled through bbb but are pure TC workers, how are they disabled from being used by production jobs?
Flags: needinfo?(kmoir)
Good point.  They would be disabled, I think, simply by not having the worker installed.  We don't have a good way to prevent a host from running tasks if the borrower is keen on doing so.  That said, these are only testers, so stealing a few test jobs is not the end of the world.

I should include some updates on how to select a worker in the first place and wait until it is idle.  Rather than waiting until it's idle, I think I'll just suggest rebooting it -- that may terminate a job with claim-expired, but the job will be automatically re-run.

Thanks!
https://wiki.mozilla.org/index.php?title=ReleaseEngineering/How_To/Loan_a_Slave&diff=1167656&oldid=1167558
** For TaskCluster, interrupted jobs are automatically retried, so pick an arbitrary host, kill the generic-worker or taskcluster-worker process, and proceed.

Hopefully that's sufficient :)

As releng develops more tooling around taskcluster on hardware (slave-health, for example), maybe we can do a somewhat better job of disabling workers.

Kim, does that sound reasonable?
Flags: needinfo?(kmoir)
I think you will need some more text around:
"pick an arbitrary host" <- pick a machine from the existing Mac test pool
"kill the generic-worker or taskcluster-worker process" <- so ssh as root to the machine, and kill the generic-worker or taskcluster-worker process
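
The ssh-and-kill step might be scripted roughly as below. This is only a sketch and not part of the documented loaner process: the hostname is a placeholder, the helper names are hypothetical, and using `pkill -x` to match the process names is an assumption about what is available on the worker. It prints the command by default instead of running it.

```python
import shlex
import subprocess

# Worker process names mentioned in this bug.
WORKER_PROCESSES = ("generic-worker", "taskcluster-worker")


def build_kill_command(host, user="root"):
    """Build the ssh argv that kills either worker process on `host`.

    Assumption: `pkill -x` (exact process-name match) exists on the worker.
    """
    remote = "; ".join(
        "pkill -x " + shlex.quote(name) for name in WORKER_PROCESSES
    )
    return ["ssh", f"{user}@{host}", remote]


def stop_worker(host, dry_run=True):
    """Print the command (dry run) or execute it over ssh."""
    cmd = build_kill_command(host)
    if dry_run:
        print(shlex.join(cmd))
        return None
    # pkill exits non-zero when no process matched, so do not raise on failure.
    return subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Placeholder hostname; pick a real machine from the Mac test pool.
    stop_worker("mac-tester.example")
```

Killing the process only interrupts the current task, which TaskCluster retries; the sketch does nothing to stop the worker from being started again later.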

Are there new steps needed after the developer is done with the loaner to return it to service?

Just to make it clear for the buildduty folks.
Flags: needinfo?(kmoir)
Thanks!  I think I addressed all of that.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: General Automation → General