[tracker] Namespace workerTypes

RESOLVED FIXED

(Reporter: dustin, Assigned: dustin)

In Bug 1216306, I enumerated a list of workerTypes that are allowed for in-tree tasks.  That doesn't make much sense -- instead, we should use a prefix-based namespace.
Blocks: 1219943
Blocks: 1226240
moz-tree:level:1

queue:create-task:aws-provisioner-v1/android-api-11
queue:create-task:aws-provisioner-v1/b2g-desktop-*
queue:create-task:aws-provisioner-v1/b2gbuild*
queue:create-task:aws-provisioner-v1/b2gtest*
queue:create-task:aws-provisioner-v1/balrog
queue:create-task:aws-provisioner-v1/build-c4-2xlarge
queue:create-task:aws-provisioner-v1/dbg-*
queue:create-task:aws-provisioner-v1/desktop-test*
queue:create-task:aws-provisioner-v1/dolphin
queue:create-task:aws-provisioner-v1/emulator-*
queue:create-task:aws-provisioner-v1/flame-kk*
queue:create-task:aws-provisioner-v1/gecko-decision
queue:create-task:aws-provisioner-v1/mulet-opt
queue:create-task:aws-provisioner-v1/opt-*
queue:create-task:aws-provisioner-v1/rustbuild
queue:create-task:aws-provisioner-v1/spidermonkey
queue:create-task:aws-provisioner-v1/symbol-upload
queue:create-task:aws-provisioner-v1/taskcluster-images
queue:create-task:aws-provisioner-v1/test-c4-2xlarge
queue:create-task:aws-provisioner-v1/win2012r2
queue:define-task:aws-provisioner-v1/build-c4-2xlarge
queue:define-task:aws-provisioner-v1/taskcluster-images
queue:define-task:aws-provisioner-v1/test-c4-2xlarge

moz-tree:level:2 adds
queue:create-task:aws-provisioner-v1/testdroid-device

moz-tree:level:3 adds nothing further.

There's some cleanup to do around those define-task scopes (I think that was supposed to happen in bug 1216306).  Other than that, a few questions to ponder here:

 - do we always want to share permissions for all workerTypes between projects of the same level?  For example, gecko try pushes can use (and thus possibly compromise) the rustbuild workerType.  I would think rust would want to segregate itself from gecko-related workerTypes.

 - within decision-task driven build graphs, what are the variables where we must separate workerTypes (due to different configuration), and where we might *not* (for better caching)
   - level -- definitely want to separate here
   - tree -- could go either way
   - task type (build vs. test vs. images vs. linting) -- some need different config, very little cache overlap
   - task subtype (e.g., opt build or reftest) -- probably same workertype

 - how can we exploit scope prefixes to filter permissions down from the decision task to individual tasks?

Overall, I'm thinking that we should assign workerTypes to trees (the `repo:...` roles), rather than to scm levels (`moz-tree:level:N`), and use the same `level-{{level}}-{{project}}-*` pattern we use for docker-worker caches, with the suffix being arbitrary.  But that needs more thought, because it forces us to use separate workerTypes for each tree.  And doesn't address non-gecko, non-gaia builds.
That proposal also doesn't allow users to log in to tools and run interactive versions of tasks, which is something we want to support.
First, let's make the general form be {{project}}-*, so:

gecko-{{level}}-build-linux64
               -build-linux32
               -build-win2008
               -test-linux-small
               -test-linux-large
               -util          (lint, symbol-upload, etc.)
               -images        (image-building, with privileged enabled)
rust-*                        (rustbuild, arbitrary)
gaia-*                        (b2g*)  
bmo-*                         (..and so on)
taskcluster-tutorial          (for the TC tutorial)
taskcluster-github            (for github repos without anything more specific)

(this looks forward to making gecko a project like all the rest, but we don't need to jump into that just yet)

Then in terms of roles, we would have:

 moz-tree:level:1 has queue:create-task:aws-provisioner-v1/gecko-1-*
 moz-tree:level:2 adds queue:create-task:aws-provisioner-v1/gecko-2-*
 moz-tree:level:3 adds queue:create-task:aws-provisioner-v1/gecko-3-*
 project-member:bmo has queue:create-task:aws-provisioner-v1/bmo-*
 repo:github.com/bugzilla/bugzilla:* has queue:create-task:aws-provisioner-v1/bmo-*
 repo:github.com/* has queue:create-task:aws-provisioner-v1/taskcluster-github
 project:taskcluster:tutorial has queue:create-task:aws-provisioner-v1/taskcluster-tutorial
 etc.
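The proposed role-to-scope mapping can be exercised with the usual Taskcluster scope-satisfaction rule (a scope ending in `*` satisfies any scope it is a prefix of). A minimal Python sketch under that assumption; the table hard-codes a few of the roles listed above, role inheritance (level 2 "adds" to level 1) is elided, and the function names are illustrative:

```python
# Hypothetical model of the role -> scope mapping proposed above; in a real
# deployment these live in the Taskcluster auth service, not in code.
ROLE_SCOPES = {
    "moz-tree:level:1": ["queue:create-task:aws-provisioner-v1/gecko-1-*"],
    "moz-tree:level:3": ["queue:create-task:aws-provisioner-v1/gecko-3-*"],
    "repo:github.com/*": ["queue:create-task:aws-provisioner-v1/taskcluster-github"],
}

def scope_satisfies(have: str, need: str) -> bool:
    """A scope ending in '*' satisfies any scope it is a prefix of."""
    if have.endswith("*"):
        return need.startswith(have[:-1])
    return have == need

def can_create(roles, provisioner_id, worker_type):
    """Can a client holding these roles create a task on this workerType?"""
    need = f"queue:create-task:{provisioner_id}/{worker_type}"
    return any(
        scope_satisfies(scope, need)
        for role in roles
        for scope in ROLE_SCOPES.get(role, [])
    )

# A level-3 push may use gecko-3 workerTypes, but not rust's:
print(can_create(["moz-tree:level:3"], "aws-provisioner-v1", "gecko-3-build-linux64"))  # True
print(can_create(["moz-tree:level:3"], "aws-provisioner-v1", "rust-build"))             # False
```

This is also the mechanism that lets a decision task pass a narrowed subset of its own `*`-suffixed scopes down to the individual tasks it creates.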

We can even offer projects the ability to create their own workerTypes down the road.

This will

 * get repos permission to run the workerTypes they need
 * adequately isolate non-gecko stuff from gecko
 * give interactive logins the scopes they need to re-run tasks and/or delegate access
 * gecko builds, within a level, are free to create/destroy

@jonasfj, @garndt, does this sound totally insane?  I feel like when I looked at this 2 months ago, the constraints left no good solution, but what I wrote above seems like a good solution.  Did I forget a constraint?
Flags: needinfo?(jopsen)
Flags: needinfo?(garndt)
..are free to create/destroy workerTypes as caching, configuration, cost, etc. dictate

Comment 5

3 years ago
This is looking good.  I love the idea of separating out level 1 (try) from the rest.  This also takes a step toward making it easier to know which worker types to use if you're standing up new stuff.

The only downside is that we will now have a lot more gecko worker types, because of the 3 levels for each, which is a pain in the provisioner UI right now.  I guess I need to learn how to make that manage script work for updating worker types. :)
Flags: needinfo?(garndt)
Yeah, I have been thinking about putting together a taskcluster-admin-scripts repo with some useful management scripts packaged up with a nice CLI wrapper.  We need one to manage project roles too.
> @jonasfj, @garndt, does this sound totally insane? 
Yes :) Okay, maybe... it depends on the intention.
IMO it's critical to efficiency that we don't create too many workerTypes.

I strongly believe that:
  The TC team should manage some workerTypes.
  These workerTypes should be the **primary** work horses for all projects.
  Projects can get scoped caches, etc, on these workerTypes.
  These workerTypes have decent task isolation.

Any generic task should use these managed workerTypes.
  (linters, simple unit tests, tasks without specific requirements, things people set up self-serve)
  (tutorial and most github repos should also use these work horses)

Whenever a special case arises, like frequently building the same branch
(so a hot cache is a requirement),
we should decide carefully just how well we really want to isolate it.


Whenever we split tasks over two different workerTypes, some real thought has to be given
as to whether or not that is necessary. Projects shouldn't set up workers for the sake of
just doing so.


-----
That being said, if we want to give a namespace like this so people can self-serve and play with
making their own workerTypes, that is totally nice to have. And I totally support this.

I would suggest:
  aws-provisioner-v1/p-<project>-*         (For project-specific workers under aws-provisioner)
  zero-provisioner/p-<project>-*           (For project-specific workers without provisioner) 
  p-<project>-*                            (For project-specific workers with custom provisioner)

And then we come up with good names for the managed workerTypes that we create,
and we manually assign access to create tasks on those. Most of these work horses
probably won't have any restrictions.
Flags: needinfo?(jopsen)
Jonas -- I don't think you read the whole bug :)

We have reasons both for running fewer workerTypes (cache sharing, cost reduction) and more workerTypes (hot caches, strong isolation, host configuration differences).  My proposal balances those reasons.

As you see, all github repos build in the same workerType.  Rust and Gecko should have strong isolation from one another.  BMO could probably use project-taskcluster-github.

I think it would be cool to give *some* projects the ability to manage their own workerTypes, but that would not be every github repo, for sure, and it would probably only come with some means of measuring those projects' expenses and a lot more thought -- not immediately!

As for naming, I'm amused that you've invented yet *another* name for "without provisioner", as we already have "null-provisioner" and "no-provisioner-nope" at least.  But this particular bug is about the workerTypes within aws-provisioner.  If we start getting more provisioners, we can think about namespacing those, as well.

Finally, these workerTypes need to fit into 22 characters in routing keys, so I think adding a static "p-" just wastes two characters, with little improvement in readability.  TaskCluster has a habit of not using terse abbreviations (that's why we have "project-" or "project:" or "project/" everywhere else) and I don't think we should break that habit.

Based on the irc conversation, I think there are two weaknesses with my proposal.  First, it prohibits sharing workerTypes across levels for gecko tests.  Second, it creates isolated workerTypes for level 2, which are somewhere around 15% of the task load [1].  I'm willing to accept the latter (if it was 5%, maybe not), and it's easy to adjust for the former.

gecko-{{level}}-build-*
               -util          (lint, symbol-upload, etc.)
               -images        (image-building, with privileged enabled)
      -test-*                 (non-artifact-producing tasks)
rust-*                        (rustbuild, isolated from gecko)
gaia-*                        (new name for b2g*, isolated from gecko)
<project>-*                   (created as necessary, balancing concerns as above)
tutorial                      (for the TC tutorial)
github                        (for github repos without anything more specific)

Overall, this would leave us with *fewer* workerTypes than we have now, and certainly with fewer workerType-specific scopes to handle.

[1] http://relengofthenerds.blogspot.com/search?updated-max=2015-06-16T13:52:00-07:00; fx-team is level-2, as is most of "other"
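The 22-character routing-key constraint mentioned above is easy to check mechanically. A minimal sketch; the allowed-character pattern here is an assumption inferred from existing names, not a documented rule:

```python
import re

# Hypothetical validator: workerType names must fit in 22 characters
# (they appear in AMQP routing keys, per the constraint cited above).
# The character class is an assumption based on existing names.
NAME_RE = re.compile(r"^[a-zA-Z0-9-]{1,22}$")

def valid_worker_type(name: str) -> bool:
    return bool(NAME_RE.match(name))

for name in ["gecko-3-b-linux64", "gecko-t-linux-large",
             "this-name-is-way-too-long-for-a-routing-key"]:
    print(name, valid_worker_type(name))
```

All of the names in the proposal above fit comfortably under the limit, which is part of the argument against spending two of those characters on a static "p-" prefix.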
Flags: needinfo?(jopsen)
oh, fx-team is level 3, actually.  So we're more like 5% at level 2.  Catlee, what do you think about this?  Could we run all level-2 repos at level 1 or level 3?
Flags: needinfo?(catlee)
Okay... I'm starting to like this :)
Especially, not doing per-level gecko-test-* workerTypes, that should reduce the number a lot.

Bikeshedding names here, but can we add -docker-v1 to tutorial and github?
  tutorial-docker-v1
  github-docker-v1
(to make room for other platforms, and breaking docker-worker releases, etc)

> I'm amused that you've invented yet *another* name for "without provisioner",
> as we already have "null-provisioner" and "no-provisioner-nope" at least.
Yes, it was hard to find a new one (I did it because I fear those might be abused in unit tests).
I guess we should probably clean up the abuse instead.

> Finally, these workerTypes need to fit into 22 characters in routing keys,
> so I think adding a static "p-" just wastes two characters,
Yeah, okay, so just to be clear:
  giving custom workerTypes to a project is the exception, not the rule, right?
I.e., this won't be a template that projects automatically follow, unless a project needs workerTypes.
Flags: needinfo?(jopsen)
Absolutely, it's not the rule.  But it may be important to projects for security reasons, and that's a good excuse (thinking of rust, bmo, and nss/nspr, all of which are pretty critical infrastructure for us and which don't want to share worker hosts with random github repos).

Yeah, adding those suffixes is fine.
Proposed renamings, so far:

Gecko                                                  
 android-api-15           ->  gecko-L-b-android        
 b2gbuild                 ->  gecko-L-b-b2gbuild       
 b2gtest                  ->  gecko-test-b2g           
 b2gtest-emulator         ->  gecko-test-b2g-emul   
 dbg-linux32              ->  gecko-L-b-linux32        
 dbg-linux64              ->  gecko-L-b-linux64        
 dbg-macosx64             ->  gecko-L-b-macosx64       
 desktop-test             ->  gecko-test-m3medium   
 desktop-test-xlarge      ->  gecko-test-xlarge        
 emulator-ics             ->  gecko-L-b-emul-ics       
 emulator-ics-debug       ->  gecko-L-b-emul-ics       
 emulator-jb              ->  gecko-L-b-emul-jb        
 emulator-jb-debug        ->  gecko-L-b-emul-jb        
 emulator-kk              ->  gecko-L-b-emul-kk        
 emulator-kk-debug        ->  gecko-L-b-emul-kk        
 emulator-l               ->  gecko-L-b-emul-l         
 emulator-l-debug         ->  gecko-L-b-emul-l         
 emulator-x86-kk          ->  gecko-L-b-emul-x86-kk 
 flame-kk                 ->  gecko-L-b-flame-kk       
 gecko-decision           ->  gecko-L-decision         
 mulet-debug              ->  gecko-L-b-mulet          
 mulet-opt                ->  gecko-L-b-mulet          
 opt-linux32              ->  gecko-L-b-linux32        
 opt-linux64              ->  gecko-L-b-linux64        
 opt-macosx64             ->  gecko-L-b-macosx64       
 spidermonkey             ->  gecko-L-b-sm             
 symbol-upload            ->  gecko-L-symbol-upload 
 taskcluster-images       ->  gecko-L-docker-img       
                                                       
Gaia                                                   
 gaia                     ->  gaia-build               
 gaia-cache               ->  gaia-cache               
 gaia-decision            ->  gaia-decision            
                                                       
Funsize                                                
 funsize-balrog           ->  gecko-fnsz-balrog        
 funsize-balrog-dev       ->  gecko-fnsz-balrog        
 funsize-mar-generator    ->  gecko-fnsz-mar-gen    
                                                       
Rust                                                   
 cratertest               ->  rust-cratertest          
 rustbuild                ->  rust-build               
                                                       
TC                                                     
 ami-test                 ->  tc-ami-test              
 ami-test-pv              ->  tc-ami-test-pv           
 github-worker            ->  github-docker            
 tutorial                 ->  tutorial-docker          
 worker-ci-test           ->  tc-worker-ci             
 win2012r2                ->  tc-win2012r2             
 y-2012                   ->  tc-y-2012                
                                                       
??                                                     
 balrog                   ->  ??                       
 cli                      ->  ??                       
 tcvcs-cache-device       ->  ??                       
                                                       
Unused?                                                
 android-api-11                                        
 b2g-desktop-debug                                     
 b2g-desktop-opt                                       
 b2gtest-legacy                                        
 dolphin
(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> oh, fx-team is level 3, actually.  So we're more like 5% at level 2. 
> Catlee, what do you think about this?  Could we run all level-2 repos at
> level 1 or level 3?

I think the deciding factor should be whether code from level 2 ever gets merged directly into level 3 repositories, and how much scrutiny it gets when that happens. I suspect we can probably run all level-2 repos at level 1 and be fine.
Flags: needinfo?(catlee)
Component: Integration → Platform and Services
Depends on: 1287436
It seems to me the biggest reason to segregate worker types between try/non-try or across scm levels is to avoid sharing caches. Could this not also be done, where needed, by including the level in the cache name, and enforcing via scopes? Then maybe we wouldn't require separate worker types for try/non-try?

The concept of worker type incorporates two different concerns: i) the environment definition of the host, and ii) the pool of machines. I think the reason we are considering having separate worker types for try/non-try is because we want to split up the pool, not because we want to change the environment. If the worker type just defined the environment, and we had a separate concept for "worker pool", that would allow us to partition the workers into distinct pools, without increasing the number of worker types.

Many pools is a good thing for e.g. keeping hot caches (think RelEng jacuzzis). But many worker types is a bad thing for administrative overhead. With worker pools, you solve both problems, by keeping number of worker types limited (and thus less data redundancy), but having a lightweight mechanism to set up distinct pools for maximum cache/disk utilisation etc, just based on name. This would however require some engineering effort to introduce such a fundamental change (as tasks would need to be able to specify a pool to run in, perhaps as an optional field in the task definition).
(In reply to Pete Moore [:pmoore][:pete] from comment #14)
> It seems to me the biggest reason to segregate worker types between
> try/non-try or across scm levels is to avoid sharing caches. Could this not
> also be done, where needed, by including the level in the cache name, and
> enforcing via scopes? Then maybe we wouldn't require separate worker types
> for try/non-try?

We do that, too.  This is an added level of security in case of a kernel vulnerability that would allow escape from a container.

> The concept of worker type incorporates two different concerns: i) the
> environment definition of the host, and ii) the pool of machines. I think
> the reason we are considering having separate worker types for try/non-try
> is because we want to split up the pool, not because we want to change the
> environment. If the worker type just defined the environment, and we had a
> separate concept for "worker pool", that would allow us to partition the
> workers into distinct pools, without increasing the number of worker types.

This is true as well.

> Many pools is a good thing for e.g. keeping hot caches (think RelEng
> jacuzzis). But many worker types is a bad thing for administrative overhead.
> With worker pools, you solve both problems, by keeping number of worker
> types limited (and thus less data redundancy), but having a lightweight
> mechanism to set up distinct pools for maximum cache/disk utilisation etc,
> just based on name. This would however require some engineering effort to
> introduce such a fundamental change (as tasks would need to be able to
> specify a pool to run in, perhaps as an optional field in the task
> definition).

I think you're suggesting dynamically created pools?  So I could have a task with:

 provisionerId: aws-provisioner-v1,
 workerType: win2012,
 workerPool: gecko-try,

where only the workerType is explicitly configured in the provisioner, and the pool is dynamically created as tasks are encountered?

I don't think this is as easy as it seems: you'd still need configuration for pool sizes.  For hardware resources, you'd need a mapping from pool to machine.  It would be a nice refactoring (separating pool config from environment config, much as we're working on separating AWS config from environment config), but I don't think we could get away with no configuration for pools.
(In reply to Dustin J. Mitchell [:dustin] from comment #15)

> I think you're suggesting dynamically created pools?  So I could have a task
> with:
> 
>  provisionerId: aws-provisioner-v1,
>  workerType: win2012,
>  workerPool: gecko-try,
> 
> where only the workerType is explicitly configured in the provisioner, and
> the pool is dynamically created as tasks are encountered?

Either that, or having a list of allowed pools. For example, the worker type definition could define a "poolSet" which might be e.g. "scm-level", and then somewhere (globally) we could define the general purpose poolSet "scm-level" to look like:


definition of "gecko-X" worker type:
{
  "description": ".....",
  
  "poolSet": "scm-level",
  ....
}

definition of "scm-level" poolSet:
{
  "description": "This pool set is useful for gecko tasks for partitioning worker types by scm level of the push they relate to",
  "pools": [
    "level-1",
    "level-2",
    "level-3"
  ]
}

Notice here that the definition of the poolSet is something global, not particular to a worker type, so you'd only have to create the pool names once but could reuse them across multiple worker types. Other designs are also possible, of course, but intuitively for me it feels about the right complexity, not being unnecessarily flexible.

If you think it has merit, we can stick it in a different bug. It might be a nice way to solve this problem that we've been avoiding dealing with, but it requires work on the workers, the provisioner and the queue, so it would not be a small change! One nice property of this design, though, is that it is backward compatible, so no need to "break all the things".
Logically, too, we can grant teams permissions to affect worker pools, without giving control over the full worker type. This would keep Jonas happy too, as teams can self-serve pools without needing to invent new worker types. =)
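Resolving which pools a workerType may use under this poolSet design would be a simple lookup. A hedged sketch mirroring the JSON shapes above; none of this is an existing Taskcluster API:

```python
# Global poolSet definitions, keyed by name (mirrors the JSON above).
# Hypothetical data model, not an existing Taskcluster structure.
POOL_SETS = {
    "scm-level": {
        "description": "partition worker types by scm level",
        "pools": ["level-1", "level-2", "level-3"],
    },
}

def pools_for(worker_type_def):
    """Return the pool names a workerType may run in, per its poolSet."""
    pool_set = worker_type_def.get("poolSet")
    if pool_set is None:
        return []  # no partitioning; a single implicit pool
    return POOL_SETS[pool_set]["pools"]

gecko_x = {"description": "...", "poolSet": "scm-level"}
print(pools_for(gecko_x))  # ['level-1', 'level-2', 'level-3']
```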
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> For hardware resources, you'd need a mapping from pool to machine.

At the moment workers declare provisionerId and workerType when claiming tasks; they would also need workerPool (provided by the provisioner). In other words, workers could claim work against a worker pool, or just the worker type, (e.g. "aws-provisioner-v1/desktop-test" and "aws-provisioner-v1/desktop-test:try" would both be valid).
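That claim rule can be sketched as follows, using the `workerType:pool` string form from the example; this is a hypothetical illustration of the proposal, not implemented behavior:

```python
def claim_matches(worker_type, worker_pool, task_queue):
    """True if a worker (type, optional pool) may claim from task_queue,
    where task_queue is either 'workerType' or 'workerType:pool'.
    Sketch of the proposal above; actual semantics would need deciding."""
    if ":" in task_queue:
        t, pool = task_queue.split(":", 1)
        return worker_type == t and worker_pool == pool
    # A bare workerType queue is claimable regardless of pool.
    return worker_type == task_queue

# A desktop-test worker in the "try" pool can claim both forms:
print(claim_matches("desktop-test", "try", "desktop-test:try"))  # True
print(claim_matches("desktop-test", "try", "desktop-test"))      # True
```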
We seem to be converging on the naming in comment 12.  Roughly, that is

 project-*
 gecko-t-* for gecko tests
 gecko-L-b-* for per-level gecko builds

I should write that into the namespaces doc.
OK, so we'll make this a tracker to get all gecko workerTypes switched over.
Depends on: 1307771, 1311810
Summary: Namespace workerTypes → [tracker] Namespace workerTypes
Depends on: 1320328
This is largely complete, at least for gecko!  I suspect with the setup of multiple provisioners that we'll see workerTypes change again, so we can do a better job of naming when that happens.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED