Closed Bug 1374978 Opened 7 years ago Closed 3 years ago

Stop granting and requiring assume:worker-id:*

Categories

(Taskcluster :: Services, enhancement, P5)

enhancement

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: pmoore, Unassigned)

References

Details

There are no roles prefixed with "worker-id:" so API endpoints requiring "assume:worker-id:*" make no sense. We should therefore clean this up.

In general we should probably clean up all the API definitions that depend on "assume:" scopes, since we deprecated their use a loooong time ago.

This one is particularly useless since no such "worker-id:<workerId>" roles exist.

Note, the provisioner also grants this role to all worker types, so once our APIs don't depend on it, we can also remove the granting of assume:worker-id:* to temp creds generated by provisioner for workers.

This originally came up in bug 1374366.
See Also: → 1374981
We didn't actually deprecate this:
  https://docs.taskcluster.net/manual/design/devel/best-practices/scopes

but recent changes to the queue have introduced `queue:worker-id:..` which can serve the same purpose.  I suspect we could replace assume:worker-id with queue:worker-id.  We can't get rid of it entirely, though, as it serves as a means to prevent worker B from messing with a task claimed by worker A.  Currently the provisioner doesn't do a great job with these (it hands out assume:worker-id:*) due to issues with out-of-date information in the EC2 API (no way to determine the instanceId of a host, even after it has started).  The host-secrets configuration *does* support more specific worker-ids:
  https://tools.taskcluster.net/auth/roles/#project:releng:host-secrets:host:com.mozilla.scl3.releng.test.moonshot-*
and we should eventually do something similar in the provisioner.

So I think the action here is to finish the conversion to use the most up-to-date scope scheme for the queue (including priorities and schedulerIds, too).  Jonas, is there a bug or RFC I can link to for that?
Flags: needinfo?(jopsen)
I believe these are the affected endpoints, that currently require "assume:worker-id:<workerGroup>/<workerId>":

  * queue.claimTask
  * queue.reclaimTask
  * queue.reportCompleted
  * queue.reportFailed
  * queue.reportException
  * queue.createArtifact
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Currently the provisioner doesn't do a great job with these (it hands out
> assume:worker-id:*) due to issues with out-of-date information in the EC2
> API (no way to determine the instanceId of a host, even after it has
> started).

What about if the provisioner continued to provide credentials, as it currently does, but instead of granting the current scopes it grants:

  * assume:worker-id:*
  * assume:worker-type:aws-provisioner-v1/<workerType>

it could instead grant the single scope:

  * queue:register-worker:<provisionerId>/<workerType>/<slugid>

Then once the worker starts up, it could call queue.registerWorker(provisionerId, workerType, slugId, workerGroup, workerId) using the provisioner-provided credentials, and this queue call would return new temporary credentials for the worker, iff no credentials have previously been provided for this slugId (otherwise would return a HTTP 403).

The queue would need to maintain a list of "used" aws-provisioner-v1 slugIds for at least 96 hours (since 96 hours is currently maximum lifetime of an aws-provisioner-v1 worker). Optionally, a worker could "deregister" as a last step before terminating (on a best effort basis). Deregistering might be overkill.

Another thing I like about this approach, is that the register-worker is a great placeholder to provide other pertinent metadata, such as a json schema for the worker's task payload, version information about the worker ({"name": "generic-worker", "version": "v10.0.5", ....}). Worker registration would logically therefore only be necessary for *provisioned* workers that require temporary credentials, so people can still write their throwaway/dev workers without needing to make this call (since they can create their own credentials for the worker).

This would be a generic solution independent of provisioner (e.g. can be used by scl3-puppet, aws-provisioner-v1, packet-net-v1, ....), and removes on thing that all provisioners might otherwise need to implement, placing it in the queue which already is somewhat of a worker administrator.

Lastly, it somewhat simplifies the analysis of provisioned worker pools, since all workers are registered and can be easily counted (although this can be currently calculated as a side effect of task claim/reclaims and claimWork polling, counting registration/deregistration calls is simpler).
Yeah, something like that would be neat.  Sounds like an RFC! :)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Yeah, something like that would be neat.  Sounds like an RFC! :)

https://github.com/taskcluster/taskcluster-rfcs/issues/71
I don't have a bug deprecating old scope patterns. So far I've been putting it off forever, but it's plausible that we should stop doing that.
Flags: needinfo?(jopsen)
Priority: -- → P5
Pete: does this cleanup still need to happen?
Flags: needinfo?(pmoore)
(In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> Pete: does this cleanup still need to happen?

Yes. We made it P5, so it shouldn't get picked up in triage, but it is a genuine issue that would be nice to solve when we get the time to do so.

Note, most of the discussion of this topic went into 

https://github.com/taskcluster/taskcluster-rfcs/issues/71

That RFC was closed as we had more important things to work on, but contains relevant information if anyone does get the chance to pick this up.
Flags: needinfo?(pmoore)
QA Contact: dustin
Component: Authentication → Services
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → MOVED
You need to log in before you can comment on or make changes to this bug.