Closed Bug 1278986 Opened 8 years ago Closed 7 years ago

users' temporary credentials should include repo-specific route scopes

Categories: Taskcluster :: Services (defect)

Status: RESOLVED FIXED

People: Reporter: dustin, Assigned: dustin

Currently, users cannot schedule exactly the tasks that their pushes create, because users only have assume:moz-tree:level:1, and that does not include tree-specific routes like

queue:route:coalesce.v1.builds.alder.*
queue:route:index.buildbot.branches.alder.*
queue:route:index.buildbot.revisions.*
queue:route:index.docker.images.v1.alder.*
queue:route:index.gecko.v1.alder.*
queue:route:index.gecko.v2.alder.*
queue:route:tc-treeherder-stage.alder.*
queue:route:tc-treeherder.alder.*

The decision not to include these was pragmatic: it means that tree A can't accidentally post to routes for tree B, and that users fooling around with the task creator can't accidentally make colored letters appear in someone else's treeherder results.

So I don't mind fixing this, but I need to think more about how to do so.  That's a lot of routes, and they're "hard-coded" in that lots of things are listening for those exact routes.  With 10 twigs, that will be 80 scopes for level 2.

Maybe some of these could be changed to include a level component, in which case we could give something like queue:route:index.gecko.v2.level-2.alder.* to level-2 users and repos?  But I suppose that would be v3, and we haven't yet killed v1..

Jonas, I'm open to ideas here..
Flags: needinfo?(jopsen)
So if we take a look at:
  https://tools.taskcluster.net/index/#gecko.v2/gecko.v2
we see a lot of branches and twigs from [1].

Clearly:
  Routes for "mozilla-central", "mozilla-inbound", etc. are associated with commit-level 3, see [2].
  Routes for "try" could be associated with commit-level 1.
  Routes for "alder" could be associated with the LDAP group used for access control, assuming access
  is controlled with LDAP (I would suspect so, as that's how try works, afaik).
Though:
  I don't find any of this ideal. Controlling which routes are added at the post-commit hook (mozilla-taskcluster)
  is really powerful, as it prevents users from polluting the index.
   - notice how even mshal managed to pollute it with a task in "gecko.v2.hey.there" :)

But honestly, what are the scenarios where people run into this?
 - retriggering
 - backfilling
 - manual testing
 - bisection
 - one-click-loaner
 - ?

Thinking of "task.routes", it's effectively a collection of side-effects to trigger after the task.
In cases such as manual testing and bisection, do you really want those side-effects?
I'm not even sure you do in the case of retrigger and backfill; if so, we probably want a service doing the
retrigger and backfill on your behalf, preventing you from changing the task.

The one-click-loaner button strips all "task.routes" to avoid any side-effects.
In many cases I suspect that side-effects such as indexing are simply not desired.

In what scenarios do we want to maintain side-effects?
 - retriggering     (probably, in order to index the retriggered build)
 - backfilling      (maybe)
I'm pretty sure that retriggering has to be done through a service anyways,
to handle retriggering of dependent tasks (ie. retrigger build, also retriggers test tasks).
A gecko-retrigger service could require a scope like:
  retrigger:scheduler-id:<schedulerId>/<taskGroupId>/<taskId>
or
  login:email:<email> // where email is from task.metadata.owner

taskcluster-login could easily issue "login:email:<email>" scopes to people.
This would be valuable even if the queue didn't require the scope for creating the task,
though adding such a feature to the queue could be quite awesome :)
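
A minimal sketch of how such a retrigger service could check its caller (the function, the
satisfies() helper, and the exact scope forms are illustrative assumptions, not an existing API):

  // Sketch only: neither the service nor these scopes exist yet.
  const satisfies = (have, required) =>
    have.some(s => s === required ||
      (s.endsWith('*') && required.startsWith(s.slice(0, -1))));

  function mayRetrigger(callerScopes, task, taskId) {
    // Either a retrigger scope tied to this scheduler/task-group/task...
    const retriggerScope =
      `retrigger:scheduler-id:${task.schedulerId}/${task.taskGroupId}/${taskId}`;
    // ...or proof (issued by taskcluster-login) that the caller owns the task.
    const ownerScope = `login:email:${task.metadata.owner}`;
    return satisfies(callerScopes, retriggerScope) || satisfies(callerScopes, ownerScope);
  }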

Note:
  Let's try and open this discussion in London if possible. It's not just routes that are a problem
  like this. The scope to create a task in a task-group that already exists is another example.
  Options are:
   A) letting people pollute, which could be a security risk.
   B) A combination of:
      B.1) Strip routes and use a fresh taskGroupId to avoid side-effects (and scope issues)
      B.2) Have a service do the operation on your behalf, like a gecko-retrigger service.
           I'm not sure if such a service would be gecko specific, or if it could be generic. 

  Option (A) has a lot of downsides. I mean, what is the point of granular scopes if we don't strive to
  lock them down? And what is the value of the index if every now and then someone manages to index
  a hacky development task? Then we might as well go back to FTP scraping :)

[1] https://wiki.mozilla.org/ReleaseEngineering/DisposableProjectRepositories
[2] https://www.mozilla.org/en-US/about/governance/policies/commit/access-policy/
Flags: needinfo?(jopsen)
Jonas:

We're building option B.2 here.  The issue is that I would like that service not to be a point where scope escalation might take place.  Everywhere else, we have adhered to the rule that if you can do it in hg, you can do it in taskcluster.  This is a very simple and un-surprising principle, and users have been surprised several times now to find that route scopes are an exception.

I don't want to have to build a service that verifies that the user clicking a button in treeherder has write access to a particular repository.  I would prefer to simply use that user's TC credentials (or authorizedScopes based on those credentials) to perform the operation, falling back to the .

The suggestion to build a service that requires `retrigger:scheduler-id:<schedulerId>/<taskGroupId>/<taskId>` is interesting.  We could grant, for example, `mozilla-group:scm_level_1` the scope `retrigger:scheduler-id:g-1-*` (short for "gecko-level-1-*").  We would still violate the if-you-can-do-it-in-hg principle, but at least we would have granular control over retriggering with a reasonably simple rule for the retriggering process to follow.

Yes, let's talk in London.
We didn't talk in London.

I think B.2 is the best option here.  We will always need tools to handle these scenarios, and I think we can legitimately say that those tools should be hosted and possess their own credentials so that they can protect the integrity of the indexes and treeherder results.

How those tools should authenticate requests is an open question, and perhaps one that we could answer as we continue to untangle mozilla-taskcluster.  It might be best to just define a best practice:

 1. Determine the repo scope: `assume:repo:hg.mozilla.org/<tree>:<whatever>`
 2. Make an auth.expandScopes query with authorizedScopes=[repoScope]
 3. For each returned scope S with the prefix "assume:moz-tree:",
    reject the request if the user's expanded scopes do not include S
 4. Otherwise, accept: use authorizedScopes=[repoScope] for operations on behalf of this user

This is fairly easy to encapsulate and review, and covers the case of non-level repositories as well (nss, for example).
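
A rough sketch of steps 2-4 in Node (assuming taskcluster-client's Auth client; the exact
call shape and the satisfies() helper are simplifications of real scope matching):

  const taskcluster = require('taskcluster-client');
  const auth = new taskcluster.Auth();

  const satisfies = (have, required) =>
    have.some(s => s === required ||
      (s.endsWith('*') && required.startsWith(s.slice(0, -1))));

  async function mayActOnRepo(userExpandedScopes, repoScope) {
    // repoScope is the assume:repo:hg.mozilla.org/... scope from step 1
    const {scopes} = await auth.expandScopes({scopes: [repoScope]});
    const treeScopes = scopes.filter(s => s.startsWith('assume:moz-tree:'));
    // Reject unless the user's expanded scopes cover every moz-tree scope.
    return treeScopes.every(s => satisfies(userExpandedScopes, s));
  }
  // On success, perform the operation with authorizedScopes = [repoScope].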

Jonas, Armen, does this sound OK?  If so, I'll add it to the docs.
Flags: needinfo?(jopsen)
Flags: needinfo?(armenzg)
This plan sounds good to me.
I can't determine the long-term implications of such decisions, but from what I understand this should work.

Thanks for bringing this up!
Flags: needinfo?(armenzg)
Hmm, I'm confused :( sorry.

At this point my observation is that
> the if-you-can-do-it-in-hg principle
is nice, but not necessarily violated by having the post-commit-hook (mozilla-taskcluster) add scopes.
Because you can't push any revision to hg, and you can't use hg to create result-sets or index entries
for revisions that haven't been pushed to hg.
Flags: needinfo?(jopsen)
Neither of you noticed the security flaw in that pseudocode :)
Or two.  The easy one is this: "For each returned scope" would fail to reject if the repo had no such scopes.

The more difficult one is that we might someday define repositories with *more* scopes, that happen to also include level 3.  For example:

 assume:repo:hg.mozilla.org/shadow-central   # it was a thing once - bug 577633
    assume:moz-tree:level:3
    secrets:get:shadow-stuff/*
    queue:create-private-task
    queue:route:coalesce.v1.builds.shadow-central.*
    queue:route:index.buildbot.branches.shadow-central.*
    queue:route:index.buildbot.revisions.*
    queue:route:index.docker.images.v1.shadow-central.*
    queue:route:index.gecko.v1.shadow-central.*
    queue:route:index.gecko.v2.shadow-central.*
    queue:route:tc-treeherder-stage.shadow-central.*
    queue:route:tc-treeherder-stage.v2.shadow-central.*
    queue:route:tc-treeherder.shadow-central.*
    queue:route:tc-treeherder.v2.shadow-central.*

This is a perfectly reasonable role definition: shadow-central can build level-3 tasks, but with some extra access including shadow-stuff/* secrets.  Basically, "level 3 and more".  Yet the algorithm above would allow anyone with level 3 access to operate on this repository, since it wouldn't take any notice of the "and more" scopes.
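
To make the second flaw concrete, a sketch of what the check above actually looks at for this
hypothetical role:

  const expanded = [
    'assume:moz-tree:level:3',
    'secrets:get:shadow-stuff/*',
    'queue:create-private-task',
    // ...plus the shadow-central routes above
  ];
  const checked = expanded.filter(s => s.startsWith('assume:moz-tree:'));
  // checked === ['assume:moz-tree:level:3'], which any level-3 user satisfies;
  // the "and more" scopes are never compared against the user's scopes.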

So, this won't work.
In both flaws, the difficulty is in relying on information in the repo scope to accurately represent the permissions required to push to that repo.  The conversation here has been limited to hg.mozilla.org, but we could ask the same of github repos (and in fact it's github repos that would make users' scope lists explode if we just gave them assume:repo:X for all repos X they have write access to).

What if we had a generic service (or two, one for hg and one for github) that could take a set of scopes and a repo string and determine whether the scopes are enough to push to that repo?  That would be based on ground-truth about the repo, so for example the service for hg would consult the required LDAP group to push to that repo (we'd probably need an hgmo API for that) and look for mozilla-group:<thatGroup>.  Similarly, github would figure out github-user:<username> and then use github APIs to see if that user has access.
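
A rough sketch of the hg flavour of that check (the hgmo endpoint does not exist yet, as noted
above; names, URL, and response shape are all made up for illustration):

  const fetch = require('node-fetch');

  const satisfies = (have, required) =>
    have.some(s => s === required ||
      (s.endsWith('*') && required.startsWith(s.slice(0, -1))));

  // repo is e.g. 'hg.mozilla.org/projects/graphics'
  async function canPushTo(scopes, repo) {
    const res = await fetch(`https://${repo}/json-push-access`);  // hypothetical hgmo API
    const {group} = await res.json();                             // e.g. 'scm_level_2'
    return satisfies(scopes, `mozilla-group:${group}`);
  }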

I can see a few ways of organizing that service:

 - requires no scopes itself; just takes a scopeset and returns a boolean
 - takes TC credentials and a repo name and returns temporary creds with the union of the cred scopes and the repo scope
   so basically, a deliberate, limited privilege escalation

I like the second option, but it would mean that if we want to use this for retriggering, the frontend (treeherder) would need to do the escalation, and then use the resulting temporary credentials to call the retrigger service.

OK, my brain hurts a bit.  Brian, do you see this meshing with any of your thoughts about TC-github?  I'd be a lot more motivated to put work into this if I thought it improved our github story too.
Flags: needinfo?(bstack)
I think this works well with my thoughts for tc-gh. Braindump of things that come to mind when I read this:

1. When we mention "github-user:<username>" this means that somewhere in this process there will be a way to link your github account to tc in some way, right? I guess we should try to keep it from being a strict 1-1 mapping e.g. I have multiple github accounts (imbstack and taskclusterrobot) that I operate as in taskcluster somewhat frequently. We could either do that by making tc-login support logging in with github and making me log out and log in again to switch accounts, or by making "github-user:<usernames>" a list.

2. Would the logic for doing this sort of thing be added to tc-lib-scopes and tc-lib-api, etc or would it be up to each service that wants to deal with this sort of thing to check manually?
Flags: needinfo?(bstack)
(In reply to Brian Stack [:bstack] from comment #9)
> 1. When we mention "github-user:<username>" this means that somewhere in
> this process there will be a way to link your github account to tc in some
> way, right? I guess we should try to keep it from being a strict 1-1 mapping
> e.g. I have multiple github accounts (imbstack and taskclusterrobot) that I
> operate as in taskcluster somewhat frequently. We could either do that by
> making tc-login support logging in with github and making me log out and log
> in again to switch accounts, or by making "github-user:<usernames>" a list.

The idea is that tc-login would allow signing in with github auth, and create temporary creds with github-user:<username>.  A list wouldn't work so well (since you can't match against it), but we could certainly have a credential with both github-user:imbstack and github-user:taskclusterrobot.  That's sort of a separate issue, though -- one of combining multiple credentials into a single super-credential.

> 2. Would the logic for doing this sort of thing be added to tc-lib-scopes
> and tc-lib-api, etc or would it be up to each service that wants to deal
> with this sort of thing to check manually?

I think it would be up to each service, since there are only a few and it's a fairly specific process.
A refinement of the second option, into a kind of taskcluster "repo-sudo":

Add the capability for tc-login to "amend" temporary credentials for access to a specific repo.  This would be an API call, something like:

  let repoCreds = login.repoAccessCredentials("hg.mozilla.org/try:default")

Then TreeHerder could automatically make this call in the frontend and use the resulting credentials to call the retrigger API method.  We could add a tools option to do the same thing.

The behind-the-scenes implementation would look at the scopes passed with the API call, pulling out those which login created (mozilla-user:dmitchell@mozilla.com, github-user:imbstack, etc.) and use the authorizers infrastructure (LDAP, mozillians, github) to check for access to the given repository.  If there's a match, it returns a new temporary credential with the same expiration time as the existing, and the union of the existing scopes and ["assume:repo:" + repoString], and an amended clientId.  Requests for `*`-suffixed repos should be rejected.

Then it's up to those authorizers to make the right kind of decisions, be that by checking the scm level of an hgmo repository or looking at the user's access to that github repository.  We can easily cover bitbucket or gitmo or whatever with this model as well.
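
A sketch of that amendment (everything here is hypothetical except createTemporaryCredentials,
whose option names follow my reading of taskcluster-client and may differ):

  const taskcluster = require('taskcluster-client');

  // Placeholder for the LDAP / mozillians / github authorizer lookup.
  async function checkRepoAccess(identityScopes, repoString) { /* ... */ }

  async function repoAccessCredentials(callerScopes, callerExpiry, repoString, issuerCreds) {
    if (repoString.includes('*')) {
      throw new Error('wildcard repo strings are rejected');
    }
    // Pull out the identity scopes that tc-login itself issued.
    const identities = callerScopes.filter(
      s => s.startsWith('mozilla-user:') || s.startsWith('github-user:'));
    if (!await checkRepoAccess(identities, repoString)) {
      throw new Error('no push access to ' + repoString);
    }
    return taskcluster.createTemporaryCredentials({
      clientId: 'repo-sudo/' + repoString,  // amended clientId (illustrative)
      start: new Date(),
      expiry: callerExpiry,                 // same expiration as the existing creds
      scopes: callerScopes.concat(['assume:repo:' + repoString]),
      credentials: issuerCreds,             // tc-login's own credentials
    });
  }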

Jonas, this seems just crazy enough to appeal to you.  If you like it, I'll run it by some of the vcs folks..

(aside: shoot, at one point this bug boiled down to a few paragraphs of documentation.. now it's a lot more work)
Jonas, curious for your opinion here.  I'm also noting that this is probably not on our critical path right now, so I'm going to stop working on it.
Assignee: dustin → nobody
Flags: needinfo?(jopsen)
Blocks: 1273096
Summary:
  I'm now leaning towards following the if-you-can-do-it-in-hg principle. 
  After thinking through all of this, and looking for nice solutions I'm inclined to say
  that the added complexity isn't worth it.

---------
I note that we now have 3 options:

Options:
  A) Follow the if-you-can-do-it-in-hg principle
  B) Parameterized scope elevation:
     1) Special snowflake services for retriggering etc.
     2) Generic scope-elevation service (implies scope persistence issues)
        Also this is rather complicated.
  C) repo-sudo: temporary scope elevation by returning temporary credentials
     if you have commit access to repository.


Outline of B.2)
  Decision task would create a reqCert and include it in actions.yaml.
  Calling auth.executeReqCert(reqCert) would execute the verb, url, body given in reqCert
  assuming the caller has reqCert.requiredScopes. A reqCert would look like this:
  reqCert = {
    verb, url, body,       // Request this certificate would allow
    clientId, certificate, // clientId (and optional certificate) that signed this reqCert 
    authorizedScopes,      // List of scopes verb, url, body will be executed with
    requiredScopes,        // Scopes required from caller to use this reqCert
    start, expires,        // time interval reqCert is valid
    signature,             // HMAC(all-of-the-above, accessToken)
  }
  Basically, it's a way that one can freeze all parameters for a request and permit someone
  with fewer scopes than normally required by the request to execute it. Hence, a privilege
  escalation is allowed, given that verb, url, and body are frozen.
  
  Could be useful for quite a few use-cases, but it's also rather complicated. In particular,
  bstack and I ended up being unhappy with the design's ability to persist authority.
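
  A sketch of how such a reqCert could be signed with Node's built-in crypto (the serialization
  and exact HMAC construction are assumptions, not a spec):

    const crypto = require('crypto');

    // Sign the frozen request with the issuer's accessToken.
    const signReqCert = (reqCert, accessToken) =>
      crypto.createHmac('sha256', accessToken)
        .update(JSON.stringify([
          reqCert.verb, reqCert.url, reqCert.body,
          reqCert.clientId, reqCert.certificate,
          reqCert.authorizedScopes, reqCert.requiredScopes,
          reqCert.start, reqCert.expires,
        ]))
        .digest('base64');

    // auth.executeReqCert would recompute this signature, check the validity
    // window, verify the caller satisfies reqCert.requiredScopes, and only then
    // execute verb/url/body with reqCert.authorizedScopes.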


Pros/cons of Option (A) following the if-you-can-do-it-in-hg principle:
  Pros:
    + Following the if-you-can-do-it-in-hg principle is simple
    + It's easy to understand for all our users
    + It's super simple to reason about
  Cons:
    - Accidental index or treeherder poisoning is not super hard to do
      (we could tweak task-creator to warn about routes, in particular treeherder/index routes)
  Realization:
    We currently grant assume:repo:try:* to the decision task, meaning
    that it gets the scope: queue:route:index.gecko.v2.try.*
    So anyone with commit-level 1 can easily push a compromised
    decision task (by modifying .taskcluster.yml) and poison the
    gecko.v2.try.* index namespace.
    (similar follows for level 2 and level 3)

    Thus, we don't actually get any additional security from restricting
    access to these index namespaces. We just make accidental poisoning harder.
    If we hardcoded index routing into mozilla-taskcluster, it would be a
    different story. But this makes things more complicated again.


Given that we're talking about making accidental poisoning harder, and that this doesn't
have security implications (as I thought it did in Comment 1), I'm now inclined to suggest that
we go with option (A).

All the other options seem to add a lot of complexity, just to make poisoning harder.
Probably we could warn about poisoning in the task-creator, which would likely stop most of it :)

We are currently not enforcing the index and treeherder routes in mozilla-taskcluster,
so unless we hardcode that we don't have any security gains from all this complexity.
(obviously, hardcoding would lead to even more complexity of its own)
Flags: needinfo?(jopsen)
The other con to (A) is that it introduces a lot of scopes -- someone with mozilla-group:scm_level_3 will have an expandedScopes list with 347 entries.

Still, maybe this is the appropriate choice, and we can make changes to the auth service to handle the increased scope count better.
Assignee: nobody → dustin
Or we could change gecko.v2.try to gecko.v3.level-1-try
That way we can give out scopes for: gecko.v3.level-1-*
Blocks: 1325657
I've been thinking about this on and off.  While I think we're inevitably going to get to the place where we redesign auth to handle 100's of scopes for most requests, I think we should do everything we can to limit the number of scopes required, just to help with human comprehension.

So in principle, I like comment 15.  However, it took many months to get rid of the buildbot and gecko.v1 routes.  So we would not get immediate satisfaction.

Mike, you were involved in that work -- what do you think about comment 15?
Flags: needinfo?(mshal)
(noting that if we had done this 7 months ago, gecko.v2 routes would probably have the level in them!)
I don't totally understand the discussion in this bug, but I'll try to summarize what I understood:

1) We need different scopes based on the branch:
   try [level 1]
   twigs [level 2]
   m-c and other important branches [level 3]

2) There are lots of scopes-per-branch (though buildbot.* and gecko.v1.* are gone now, so maybe that helps some?)

3) Combining 1 * 2 makes for way too many scopes

So the idea is instead of gecko.v2.branch, we have gecko.v3.level.branch, and then we only have 3 things to grant scopes for in part 1 instead of number-of-branches, right?

The only real downside to that approach is that I don't think most people have any idea what "level" a branch is, so when they go to look for something in the index they'll probably have to do more poking around than they do now. (And of course the actual work of backfilling gecko.v3 and updating the tools to use it)

How do scopes work when multiple things fit? Does it use the most specific one first? Eg, if I have:

queue:route:index.gecko.v2.try: level 1
queue:route:index.gecko.v2.mozilla-central.*: level 3
queue:route:index.gecko.v2.*: level 2

Would the wildcard at the end apply level 2 to everything that's not try or mozilla-central? Or does it override one or both of the previous ones?
Flags: needinfo?(mshal)
Yes, that's exactly the idea.  I suspect the backfilling and tool updates would be rather annoying.

As for scopes, the mapping sort of goes the other way -- a level 3 repository has the role `moz-tree:level:3`, which under this proposal would expand to include `queue:route:index.gecko.v3.level-3.*`, and that would match e.g., `queue:route:index.gecko.v3.level-3.mozilla-central.<etc>`.
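
A simplified sketch of how that matching works (trailing `*` is a prefix wildcard; role expansion
and authorizedScopes add more machinery on top of this):

  const satisfiesScope = (held, required) =>
    held === required ||
    (held.endsWith('*') && required.startsWith(held.slice(0, -1)));

  const has = (scopeset, required) =>
    scopeset.some(held => satisfiesScope(held, required));

  // e.g. 'queue:route:index.gecko.v3.level-3.*' satisfies
  //      'queue:route:index.gecko.v3.level-3.mozilla-central.<etc>'
  // Scopes never override one another: a request is allowed when each required
  // scope is satisfied by at least one scope the caller holds.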
Here are the tree-specific scopes for `repo:hg.mozilla.org/projects/graphics:*`, a typical repo:

queue:route:coalesce.v1.builds.graphics.*
queue:route:index.buildbot.branches.graphics.*
queue:route:index.buildbot.revisions.*
queue:route:index.docker.images.v1.graphics.*
queue:route:index.gecko.v1.graphics.*
queue:route:index.gecko.v2.graphics.*
queue:route:tc-treeherder-stage.v2.graphics.*
queue:route:tc-treeherder.v2.graphics.*

And in fact a bunch of repo roles have, against better judgement, acquired additional scopes that they probably shouldn't have.

So the gecko routes are but one among many of our concerns.

- coalescing -- this path just needs to correspond to the superseder URL, so it's easy to change
- buildbot -- nothing we can do here (barring changing buildbot, ugh)
- docker.images -- also an in-tree change, so not too hard
- treeherder -- would require a change to tc-treeherder too

One thing that might unblock this a little: once we've introduced gecko.v3 and moved any particularly sensitive uses of the index to use v3, we can add 

queue:route:coalesce.v1.builds.*
queue:route:index.buildbot.branches.*
queue:route:index.buildbot.revisions.*
queue:route:index.gecko.v2.*
queue:route:tc-treeherder-stage.v2.*
queue:route:tc-treeherder.v2.*

to moz-tree:level:1 so that everyone has them.  That's risky, but the alternative is to wait a really long time until all uses of the old routes are expired before we get any benefit here.

Backing up, we have three directions we can go:
 1. Use level-qualified scopes and eliminate use of unqualified scopes
 2. Build some kind of repo-sudo support (comment 11)
 3. Modify the auth service to efficiently handle 1000's of scopes for every request
They all kind of stink.
We talked about this at length yesterday.

Jonas made the observation that user temporary credentials are a tiny fraction of the load on the auth service, so if those are the only case where we have long lists of scopes, it's not a huge difference overall.

We decided on a version of #3.  Specifically, we're going to just fix up the roles (so that role `mozilla-group:scm_level_N` has `assume:repo:hg.mozilla.org...` for each level-N repo).  We're confident that the auth service will not explode immediately, and this way we will be able to see any issues more clearly.
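
Illustratively (the repo list and exact scope forms are abbreviated/assumed, not the literal
production role), the fixed-up roles look something like:

  const role = {
    roleId: 'mozilla-group:scm_level_1',
    scopes: [
      'assume:moz-tree:level:1',
      'assume:repo:hg.mozilla.org/try:*',
      // ...one assume:repo:hg.mozilla.org/<repo>:* entry per level-1 repo
    ],
  };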
I did this for scm_level_1.  I'll do the rest soon.

https://tools.taskcluster.net/auth/roles/#mozilla-group:scm_level_1
OK, fixed for scm_level_2 and scm_level_3.  I messed up in between, and in fixing that mess-up added some trailing spaces to the roles for moz-tree:level:2.  Anyway, now mozilla-group:scm_level_3 has 222 associated scopes.

That leaves open questions of how to maintain these lists, but we'll deal for now.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Authentication → Services