Closed Bug 1134342 Opened 10 years ago Closed 9 years ago

queue: Issue temporary credentials with claimTask and reclaimTask

Categories

(Taskcluster :: Services, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: jonasfj)

References

Details

Attachments

(1 file)

Currently all tasks carry a "task.scopes" property. The idea with this property is that the task-specific payload should be allowed to use these scopes when executing. They may either be employed in a declarative manner, where docker-worker checks that the task has a scope required to unlock a feature, or they may be used in imperative code, such as task-specific code calling into taskcluster APIs through the auth proxy.

Because the worker needs to authenticate requests with any scope set listed in task.scopes, the worker is basically required to have a set of TC credentials with the '*' scope. To avoid this I suggest that the queue returns a set of temporary credentials with every call to queue.claimTask and queue.reclaimTask, such that these APIs in the future return:

{
  status:       <task status structure>,
  runId:        ...,
  workerGroup:  ...,
  workerId:     ...,
  takenUntil:   ...,
  // temp credentials covering task.scopes, set to expire at takenUntil
  credentials:  {...}
}

This way the worker doesn't need to have '*' scopes, and someone compromising the worker becomes a much smaller issue. Once a workerType is compromised an attacker can try to claim the task with the largest (or most interesting) task.scopes set, but only for the given workerType.

Note, in another bug later, we can make the provisioner issue temporary credentials in user-data and completely remove the need for taskcluster credentials to be baked into docker-worker, narrowing down the scopes dynamically.

There are two implementation proposals, with different pros/cons:

### Option A) Issue temp creds in queue.claimTask and queue.reclaimTask

The response for both:
 * queue.claimTask, and
 * queue.reclaimTask
contains temporary credentials that expire at takenUntil.

Pros/cons:
 + Credentials are short-lived
 + takenUntil is a natural expiration, because the claim needs to be renewed anyways
 - Credentials in the auth-proxy must be renewed (without restarting it)
 - URLs signed with these credentials will have a very limited expiration (if created just before reclaimTask is called, it could be minutes)
 - If injected into the task container through environment variables, how are we supposed to update them? (we could create a meta-data service API)

### Option B) Issue temp creds in queue.claimTask only

The response for:
 * queue.claimTask
contains temporary credentials that expire at task.deadline.

Pros/cons:
 - Credentials live a lot longer than the run of the task (a lot less secure than the option above; deadline can be days)
 + No need to update these credentials in the auth-proxy
 + We can inject these temp credentials into the task container using env vars
 + Signed URLs can be valid for an extended period of time; we could even create some that are valid past the task execution, for a few days
 - Deadline affects security, which is sad, as deadline should be unrelated to this. Too short a deadline == high security == high risk of exception.

garndt and I previously discussed injecting temporary credentials into the task container, which seems like a potentially nice way of solving a series of problems. With option (A) this would be hard; it would work perfectly with option (B). Granted, temporary credentials could also be made available through a special end-point on the auth-proxy, just like how you added signing to it. Note, access to temp creds should probably be hidden behind a feature flag and require a special scope.

Anyways, these temporary credentials are just as much about reducing the need for docker-worker to have '*' credentials. Instead docker-worker will have two sets of credentials:
 I)  temporary credentials from the provisioner, specific to workerType and workerId, sufficient to claim and resolve tasks, i.e. the task life cycle (expiration would be something like 72 hours)
 II) temporary credentials from the queue, specific to the task being executed, covering things needed to upload artifacts as well as task.scopes (expiration is either takenUntil or task.deadline)

What are your thoughts on this? If you hurry up and give me the right answer I'll do this as part of the queue rewrite. It should be easy to do.
Flags: needinfo?(garndt)
I am +1 to A but I don't think this is particularly important to do immediately (we have a pretty big stack of other stuff pending)
Blocks: 1137821
I do like the idea that the expiration of creds is as short as possible (by issuing them with reclaims as well), but the one downside I do see (as mentioned above) is the signing of URLs. We are using these signed URLs for providing builds to testdroid for flashing a device. Of course the time from getting the signed URL to them pulling down the file is short, so right now it's not a problem. I believe the default for takenUntil is 20 minutes and we reclaim every ~15 minutes, so this probably wouldn't be an issue.
Flags: needinfo?(garndt)
Oh, and if option B is pursued, would it be possible to use maxRuntime instead of deadline? It would be a shorter window, and the creds shouldn't live outside of the task run time anyways. Is there a way to invalidate credentials as well? For instance, if the worker is done with a task it could invalidate them? Not sure if it matters too much.
 - maxRuntime is specified in task.payload for docker-worker and cannot be used by the queue
 - Temporary credentials cannot be invalidated (they have no storage overhead; instead they rely on a certificate)

Anyways, to better facilitate signed URLs and lower the risk of expired credentials, we could just up the expiration to takenUntil + 15 min. If we go with (A), temp creds will be accessed through the proxy anyways, so they won't be leaked accidentally (not even sure if we would allow them to be accessed intentionally).

Option (A) with expires = takenUntil + <fixed delay> seems like a solid compromise with high security, and good robustness against clock drift and network latency.
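The compromise above is simple to state in code. This is a sketch under the discussion's assumptions; the 15-minute slack is the figure mentioned in the comment and the helper name is hypothetical:

```javascript
// Option (A) compromise: temp credentials expire a fixed delay after
// takenUntil, giving signed URLs, clock drift, and network latency some
// slack without letting the credentials live anywhere near task.deadline.
const FIXED_DELAY_MS = 15 * 60 * 1000; // the "+ 15 min" from the discussion

function tempCredentialExpiry(takenUntil, delayMs = FIXED_DELAY_MS) {
  return new Date(takenUntil.getTime() + delayMs);
}
```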
Component: TaskCluster → Queue
Product: Testing → Taskcluster
We probably need to remap:
  assume:worker-id:<workerGroup>/<workerId>
to:
  assume:task-run:<taskId>/<runId>
which is then issued in task-specific temporary credentials by the queue.
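The remapping amounts to swapping which identity scope the credentials carry. A sketch, with hypothetical helper names (the real queue would embed the resulting scope in the temp-credential certificate):

```javascript
// The worker authenticates with a worker-id scope; the queue issues
// task-specific temp credentials carrying a task-run scope instead.
function workerIdScope(workerGroup, workerId) {
  return `assume:worker-id:${workerGroup}/${workerId}`;
}

function taskRunScope(taskId, runId) {
  return `assume:task-run:${taskId}/${runId}`;
}
```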
I think we will need to do this soon, as part of bug 1137827. Even if we could limit a workerType's scopes (via temp creds from aws-provisioner) to those potentially used by its tasks, it means that the scopes available to a task are the intersection of the workerType's scopes and the task's scopes, which makes modeling a little difficult.
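The intersection problem above can be made concrete with a simplified model of taskcluster scope satisfaction (a scope ending in `*` satisfies any scope with that prefix). The helper names are mine, not an existing API:

```javascript
// Simplified scope satisfaction: exact match, or trailing-* prefix match.
function satisfies(scope, required) {
  return scope === required ||
    (scope.endsWith('*') && required.startsWith(scope.slice(0, -1)));
}

// Without task-specific temp credentials, a task can only use the
// intersection of the workerType's scopes and its own task.scopes.
function availableScopes(workerScopes, taskScopes) {
  return taskScopes.filter(req => workerScopes.some(s => satisfies(s, req)));
}
```

Modeling is hard precisely because a scope listed in task.scopes silently does nothing if the workerType's scopes don't cover it.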
Blocks: 1145802
Attached file Github PR
Let me know if this seems good with you. The old scopes are still supported, we can remove them when all workers have moved to use this...
Assignee: nobody → jopsen
Status: NEW → ASSIGNED
Attachment #8680166 - Flags: review?(garndt)
Comment on attachment 8680166 [details] [review]
Github PR

Left a couple of questions in the PR that are nothing major, mostly just clarification for myself.
Attachment #8680166 - Flags: review?(garndt) → review+
Depends on: 1220738
Blocks: 1220738
No longer depends on: 1220738
Rolled out... And in production...
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Queue → Services