Closed Bug 1480398 Opened 7 years ago Closed 4 years ago

Workers that fail OCC validation checks should be terminated

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: pmoore, Unassigned)

Details

In bug 1480368 there was an issue which made workers unable to take jobs. Despite the workers taking no jobs, they continued to run for more than 24 hours, at which point I terminated them. If workers are not taking jobs, they shouldn't be allowed to run for so long without being terminated.

In this case, the worker pool was limited to a maximum capacity of 4, and this resulted in the following provisioned workers:

Worker type: gecko-t-win10-64-cu
================================
Region: us-east-1
i-062fc704cc1988f03
Region: us-west-1
i-0a4e06463a9a540d3
Region: us-west-2
i-0359899ac09a14448
Region: eu-central-1
i-09ec082685fd41dbf

From papertrail it can be seen that these servers ran for an entire day before being terminated, yet never took a job.

i removed the age check from HaltOnIdle because of a request by the taskcluster team asking that occ not interfere in any way with running tasks. occ doesn't know if generic worker is performing a task. it only knows if the generic worker process is running. HaltOnIdle will still terminate machines if neither occ nor generic-worker is found running.

if a machine is running for more than 24 hours, but not taking tasks, it means that either occ or generic-worker is running but not doing anything (perhaps hung). in the case of occ, it may have just crashed and failed to remove the semaphore which indicates it's running.

it's also possible that HaltOnIdle has crashed but i've never seen that happen. the other two scenarios i've seen plenty.

the only way i can see to fix this issue from within occ is to reinstate the age check which terminates instances which are 24 hours old.

at the taskcluster work week, it was suggested that worker health would be managed by the provisioner and termination of instances should occur only when the provisioner decides it is appropriate. in light of those comments, i'm going to move this bug into the provisioner component.

if sentiment has changed and taskcluster folks do want occ to manage termination of old instances, feel free to move it back to the occ component and i'll reinstate the age check in HaltOnIdle.

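For readers following the thread, the HaltOnIdle decision described in this comment can be sketched roughly as below. This is an illustrative Python sketch only, not the actual OCC code (OCC itself is PowerShell): the `InstanceState` type, its field names, and the `max_age_hours` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceState:
    """Hypothetical snapshot of what a HaltOnIdle-style check can observe."""
    occ_running: bool             # OCC semaphore present (may be stale after a crash)
    generic_worker_running: bool  # a generic-worker process exists (may be hung)
    uptime_hours: float


def should_halt(state: InstanceState, max_age_hours: Optional[float] = None) -> bool:
    """Return True if the instance should be terminated.

    With max_age_hours=None this mirrors the behaviour described above:
    halt only when neither OCC nor generic-worker appears to be running.
    Passing a number models the (removed) age check.
    """
    if not state.occ_running and not state.generic_worker_running:
        return True
    if max_age_hours is not None and state.uptime_hours >= max_age_hours:
        return True
    return False


# A stale OCC semaphore keeps the instance alive indefinitely without an age check.
stale = InstanceState(occ_running=True, generic_worker_running=False, uptime_hours=30.0)
print(should_halt(stale))                    # no age check
print(should_halt(stale, max_age_hours=24))  # age check reinstated
```

The sketch shows why the two failure modes mentioned (a hung process or a leftover semaphore) are invisible to a check that only looks at whether something appears to be running.
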
Component: Relops: OpenCloudConfig → AWS-Provisioner
Product: Infrastructure & Operations → Taskcluster
QA Contact: rthijssen
Version: Production → unspecified
(In reply to Rob Thijssen (:grenade UTC+2) from comment #2)
> i removed the age check from HaltOnIdle because of a request by the
> taskcluster team asking that occ not interfere in any way with running
> tasks. occ doesn't know if generic worker is performing a task. it only
> knows if the generic worker process is running. HaltOnIdle will still
> terminate machines if neither occ nor generic-worker is found running.

generic-worker never ran on these instances, because the NSSM download failed, so generic-worker was never installed as a service. It sounds like this HaltOnIdle check is therefore not working. See bug 1480368 for more details.

> if a machine is running for more than 24 hours, but not taking tasks, it
> means that either occ or generic-worker is running but not doing anything
> (perhaps hung). in the case of occ, it may have just crashed and failed to
> remove the semaphore which indicates it's running.
>
> it's also possible that HaltOnIdle has crashed but i've never seen that
> happen. the other two scenarios i've seen plenty.
>
> the only way i can see to fix this issue from within occ is to reinstate the
> age check which terminates instances which are 24 hours old.

Another solution is to shut down a worker if any of the OCC validation steps fail.

> at the taskcluster work week, it was suggested that worker health would be
> managed by the provisioner and termination of instances should occur only
> when the provisioner decides it is appropriate. in light of those comments,
> i'm going to move this bug into the provisioner component.

Indeed, this is being done in https://github.com/taskcluster/taskcluster-rfcs/blob/master/rfcs/0124-worker-manager.md but will be several months of work. It would be nice to have an OCC failure check in place in the interim.

Note that the request is not that OCC monitor generic-worker; it is that OCC terminate workers that fail its own validation steps.

> if sentiment has changed and taskcluster folks do want occ to manage
> termination of old instances, feel free to move it back to the occ component
> and i'll reinstate the age check in HaltOnIdle.

We don't really want an age-based check; perhaps workers should live more than 24 hours. The validation checks already in OCC should be sufficient to decide whether a worker should be terminated.

Many thanks.

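As a rough illustration of the request in this comment (terminate a worker when any OCC validation step fails, independent of instance age), the logic might look something like the following. This is a hypothetical Python sketch, not OCC's implementation: the check names and the `failed_checks` and `should_terminate` helpers are invented for the example, and the real validation steps live in OCC's own code.

```python
from typing import Callable, Dict, List


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named validation check and return the names of those that failed.

    A check that raises an exception is treated as a failure, so a broken
    check cannot silently keep an unusable worker alive.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def should_terminate(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Terminate if any validation step failed, regardless of instance age."""
    return len(failed_checks(checks)) > 0


# Hypothetical checks modelling the bug 1480368 scenario: the NSSM download
# failed, so generic-worker was never installed as a service.
example_checks = {
    "nssm_downloaded": lambda: False,
    "generic_worker_service_installed": lambda: False,
    "disk_formatted": lambda: True,
}

print(failed_checks(example_checks))
print(should_terminate(example_checks))
```

Unlike an age-based check, this decision depends only on OCC's own validation results, so a healthy worker could live longer than 24 hours while a misconfigured one would be shut down promptly.
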
Component: AWS-Provisioner → Relops: OpenCloudConfig
Product: Taskcluster → Infrastructure & Operations
QA Contact: rthijssen
Summary: Ineffective workers not terminated → Workers that fail OCC validation checks should be terminated

Not actively working on this at the moment.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INACTIVE