Closed Bug 1159965 Opened 9 years ago Closed 9 years ago

[aws-provisioner] AwsManager.prototype.rougeKiller needs to kill illegal instanceTypes

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1145688

People

(Reporter: jonasfj, Unassigned)

Details

We currently crash with:
  AssertionError: false == true
  at [object Object].WorkerType.getInstanceType (/app/provisioner/data.js:159:3) 
  at [object Object].WorkerType.capacityOfType (/app/provisioner/data.js:186:15) 
  at /app/provisioner/aws-manager.js:392:32 
  at Array.forEach (native) 
  at /app/provisioner/aws-manager.js:391:22 
  at Array.forEach (native) 
  at AwsManager.capacityForType (/app/provisioner/aws-manager.js:381:20) 
  at /app/provisioner/provision.js:202:43

The assertion fires when an instance exists with an instanceType that its workerType is no longer allowed to have.
This happens whenever we change the list of allowed instanceTypes for a workerType, and it forces us to go in and manually kill the instances with the now-disallowed instanceType.

Note: The workerType for which this happens does not get provisioned because of this issue.
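For context, this is roughly the shape of the failing path in provisioner/data.js. It is a minimal reconstruction, not the actual source; the property and helper names (instanceTypes, capacity) are assumptions for illustration only.

  // Hypothetical reconstruction of the failing path in provisioner/data.js.
  // Property and method names are assumptions, not the real source.
  var assert = require('assert');

  WorkerType.prototype.getInstanceType = function (instanceType) {
    var matches = this.instanceTypes.filter(function (t) {
      return t.instanceType === instanceType;
    });
    // This is the kind of assertion that throws "AssertionError: false == true"
    // when a running EC2 instance reports an instanceType that the workerType
    // definition no longer lists.
    assert(matches.length === 1);
    return matches[0];
  };

  WorkerType.prototype.capacityOfType = function (instanceType) {
    // Never reached for a disallowed instanceType; getInstanceType throws first.
    return this.getInstanceType(instanceType).capacity;
  };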

We should probably fix it in AwsManager.prototype.rougeKiller, by killing instances with an illegal instanceType or region.
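Very roughly, something like the sketch below, assuming rougeKiller gets the current workerType definitions; the helpers (listManagedInstances, allowsInstanceType, allowsRegion, killInstances) are placeholders, not the real API.

  // Sketch only: terminate instances whose instanceType or region is no longer
  // allowed by their workerType definition. All helper names are hypothetical.
  AwsManager.prototype.rougeKiller = function (workerTypes) {
    var toKill = [];

    this.listManagedInstances().forEach(function (instance) {
      var workerType = workerTypes[instance.workerTypeName];

      var isRogue =
        !workerType ||                                            // workerType no longer exists
        !workerType.allowsInstanceType(instance.InstanceType) ||  // instanceType removed from config
        !workerType.allowsRegion(instance.region);                // region removed from config

      if (isRogue) {
        toKill.push(instance.InstanceId);
      }
    });

    return this.killInstances(toKill);
  };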
jhford makes the following comment:
> Killing them immediately doesn't feel right to me. If we have 1000 r3.xlarge jobs which are
> underway but we want to switch to r3.2xlarge only because we're not getting xlarge ones anymore,
> why would we want to kill those 1000 jobs?

Another view is that in this scenario we shouldn't remove r3.xlarge from the list of allowed
instanceTypes in the first place. If we remove r3.xlarge, it must be because we've deemed the
instanceType incapable of reliably running the tasks we want to run at the speed we want to run them.
Or something like that, i.e. the instanceType is no longer allowed.

Certainly, removing an instanceType or region from the allowed list could mean that configuration
is no longer working reliably (a region perhaps because a proxy is missing; an instanceType because of resource constraints).

---
To me both views have merit. It comes down to whether we want to:
 A) have immediate effect of configuration changes, or
 B) let configuration changes slowly propagate.

Considering that we can't assess capacity when an instance has a disallowed instanceType, I tend to favor (A),
to avoid vaguely defined behaviour. Whatever we do, I would like a concise and accurate behavioural
definition in the docs. We probably need this since quite a few people will be creating workerTypes.
This is also an edge case...  I think we've set the expectation that the provisioner creates instances and then gets out of the way.  How each worker type handles updates to the workerType definition is up to it, but if we kill all instances of the now-missing type, the worker can't make that decision on its own.

I have been in this situation many times, where we wanted a gradual rollout but couldn't.  Being able to do gradual rollouts instead of nuking everything is highly desirable to me.
@jhford,
Just to be clear, the provisioner has ownership of all AWS resources with its keyNamePrefix :)
But how the provisioner manages this responsibility is open to interpretation.
Clearly it should track such resources and ensure they are eventually cleaned up.

Anyway, you have a point that we wanted workers to listen for the workerType-changed messages on pulse,
and then shut down at the end of their billing cycle. This actually sounds sane.
Let's fold this into bug 1145688 then; that's obviously where it belongs.
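For the record, a worker-side sketch of that idea. The exchange and routing-key names below are made up (those messages don't exist yet; that's what bug 1145688 is about), and amqplib stands in for whatever pulse client workers end up using.

  // Sketch: a worker reacting to a (hypothetical) workerType-updated pulse
  // message by draining and shutting down at the end of its billing cycle.
  var amqp = require('amqplib');

  async function watchWorkerTypeChanges(pulseUrl, workerTypeName, onChanged) {
    var conn = await amqp.connect(pulseUrl);
    var ch = await conn.createChannel();
    var q = await ch.assertQueue('', { exclusive: true });

    // Hypothetical exchange and routing key; the real names would come out of bug 1145688.
    await ch.bindQueue(q.queue, 'exchange/aws-provisioner/worker-type-updated',
                       workerTypeName);

    await ch.consume(q.queue, function (msg) {
      if (msg) {
        ch.ack(msg);
        onChanged(JSON.parse(msg.content.toString()));
      }
    });
  }

  // Usage: stop claiming new tasks now, terminate near the end of the billing hour.
  watchWorkerTypeChanges(process.env.PULSE_URL, 'my-worker-type', function () {
    stopClaimingTasks();                    // placeholder for the worker's own drain logic
    scheduleShutdownAtEndOfBillingCycle();  // placeholder
  }).catch(console.error);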

Besides, if we need to do a hard kill (to reset immediately because of a bad workerType),
the provisioner has an API to kill all instances; we just need to add it to the UI...
Presumably with 3 confirm dialogs asking if you really want to do this :)
(or we can use the AWS console)
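If we do add it to the UI, it could look roughly like this; the method name terminateAllInstancesOfWorkerType and the three-confirm flow are assumptions here, not a confirmed part of the provisioner API.

  // Illustrative only: the method name and the confirm flow are assumptions.
  async function nukeWorkerType(provisioner, workerTypeName) {
    for (var i = 1; i <= 3; i++) {
      var ok = window.confirm('Really terminate ALL instances of ' +
                              workerTypeName + '? (' + i + '/3)');
      if (!ok) {
        return;
      }
    }
    await provisioner.terminateAllInstancesOfWorkerType(workerTypeName);
  }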
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Component: TaskCluster → AWS-Provisioner
Product: Testing → Taskcluster
Component: AWS-Provisioner → Services