Bug 1220686: [tracker] Namespace workerTypes
Opened 9 years ago; closed 8 years ago
Component: Taskcluster :: Services (defect)
Status: RESOLVED FIXED
Reporter: dustin; Assignee: dustin
In Bug 1216306, I enumerated a list of workerTypes that are allowed for in-tree tasks. That doesn't make much sense -- instead, we should use a prefix-based namespace.
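For reference, the scope-satisfaction rule that makes a prefix namespace work: a scope ending in "*" satisfies any scope beginning with the part before the "*". A rough sketch in Python (illustrative, not the actual auth implementation):

  def satisfies(have, required):
      # A trailing "*" makes `have` a prefix pattern; otherwise exact match.
      if have.endswith('*'):
          return required.startswith(have[:-1])
      return have == required

  # One prefix scope covers a whole family of workerTypes:
  assert satisfies('queue:create-task:aws-provisioner-v1/gecko-1-*',
                   'queue:create-task:aws-provisioner-v1/gecko-1-build-linux64')
  assert not satisfies('queue:create-task:aws-provisioner-v1/gecko-1-*',
                       'queue:create-task:aws-provisioner-v1/rustbuild')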
Comment 1 (assignee) • 9 years ago
moz-tree:level:1 has:
queue:create-task:aws-provisioner-v1/android-api-11
queue:create-task:aws-provisioner-v1/b2g-desktop-*
queue:create-task:aws-provisioner-v1/b2gbuild*
queue:create-task:aws-provisioner-v1/b2gtest*
queue:create-task:aws-provisioner-v1/balrog
queue:create-task:aws-provisioner-v1/build-c4-2xlarge
queue:create-task:aws-provisioner-v1/dbg-*
queue:create-task:aws-provisioner-v1/desktop-test*
queue:create-task:aws-provisioner-v1/dolphin
queue:create-task:aws-provisioner-v1/emulator-*
queue:create-task:aws-provisioner-v1/flame-kk*
queue:create-task:aws-provisioner-v1/gecko-decision
queue:create-task:aws-provisioner-v1/mulet-opt
queue:create-task:aws-provisioner-v1/opt-*
queue:create-task:aws-provisioner-v1/rustbuild
queue:create-task:aws-provisioner-v1/spidermonkey
queue:create-task:aws-provisioner-v1/symbol-upload
queue:create-task:aws-provisioner-v1/taskcluster-images
queue:create-task:aws-provisioner-v1/test-c4-2xlarge
queue:create-task:aws-provisioner-v1/win2012r2
queue:define-task:aws-provisioner-v1/build-c4-2xlarge
queue:define-task:aws-provisioner-v1/taskcluster-images
queue:define-task:aws-provisioner-v1/test-c4-2xlarge
moz-tree:level:2 adds
queue:create-task:aws-provisioner-v1/testdroid-device
moz-tree:level:3 adds nothing further.
There's some cleanup to do around those define-task scopes (I think that was supposed to happen in bug 1216306). Other than that, a few questions to ponder here:
- do we always want to share permissions for all workerTypes between projects of the same level? For example, gecko try pushes can use (and thus possibly compromise) the rustbuild workerType. I would think rust would want to segregate itself from gecko-related workerTypes.
- within decision-task driven build graphs, what are the variables where we must separate workerTypes (due to different configuration), and where we might *not* (for better caching)
- level -- definitely want to separate here
- tree -- could go either way
- task type (build vs. test vs. images vs. linting) -- some need different config, very little cache overlap
- task subtype (e.g., opt build or reftest) -- probably same workertype
- how can we exploit scope prefixes to filter permissions down from the decision task to individual tasks?
Overall, I'm thinking that we should assign workerTypes to trees (the `repo:...` roles), rather than to scm levels (`moz-tree:level:N`), and use the same `level-{{level}}-{{project}}-*` pattern we use for docker-worker caches, with the suffix being arbitrary. But that needs more thought, because it forces us to use separate workerTypes for each tree. And doesn't address non-gecko, non-gaia builds.
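On the scope-prefix question above: one way to exploit prefixes is to filter the scopes requested for a subtask down to those the decision task itself holds. A rough sketch of the idea (Python, illustrative only; not how the queue actually enforces this today):

  def satisfied_by(have, s):
      # True if any held scope satisfies s (trailing "*" = prefix match).
      return any(h == s or (h.endswith('*') and s.startswith(h[:-1]))
                 for h in have)

  def filter_task_scopes(decision_scopes, requested):
      # Drop any requested scope the decision task cannot itself grant.
      return [s for s in requested if satisfied_by(decision_scopes, s)]

  decision = ['queue:create-task:aws-provisioner-v1/gecko-1-*']
  print(filter_task_scopes(decision, [
      'queue:create-task:aws-provisioner-v1/gecko-1-build-linux64',  # kept
      'queue:create-task:aws-provisioner-v1/rustbuild',              # dropped
  ]))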
Comment 2 (assignee) • 9 years ago
That proposal also doesn't allow users to log in to tools and run interactive versions of tasks, which is something we want to support.
Comment 3 (assignee) • 9 years ago
First, let's make the general form be {{project}}-*, so:
gecko-{{level}}-build-linux64
               -build-linux32
               -build-win2008
               -test-linux-small
               -test-linux-large
               -util (lint, symbol-upload, etc.)
               -images (image-building, with privileged enabled)
rust-* (rustbuild, arbitrary)
gaia-* (b2g*)
bmo-* (..and so on)
taskcluster-tutorial (for the TC tutorial)
taskcluster-github (for github repos without anything more specific)
(this looks forward to making gecko a project like all the rest, but we don't need to jump into that just yet)
Then in terms of roles, we would have:
moz-tree:level:1 has queue:create-task:aws-provisioner-v1/gecko-1-*
moz-tree:level:2 adds queue:create-task:aws-provisioner-v1/gecko-2-*
moz-tree:level:3 adds queue:create-task:aws-provisioner-v1/gecko-3-*
project-member:bmo has queue:create-task:aws-provisioner-v1/bmo-*
repo:github.com/bugzilla/bugzilla:* has queue:create-task:aws-provisioner-v1/bmo-*
repo:github.com/* has queue:create-task:aws-provisioner-v1/taskcluster-github
project:taskcluster:tutorial has queue:create-task:aws-provisioner-v1/taskcluster-tutorial
etc.
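A sketch of how these roles would expand (the role table below is hypothetical and just mirrors the list above; "adds" is modeled by having each level assume the one below it):

  ROLES = {
      'moz-tree:level:1': ['queue:create-task:aws-provisioner-v1/gecko-1-*'],
      'moz-tree:level:2': ['assume:moz-tree:level:1',
                           'queue:create-task:aws-provisioner-v1/gecko-2-*'],
      'moz-tree:level:3': ['assume:moz-tree:level:2',
                           'queue:create-task:aws-provisioner-v1/gecko-3-*'],
  }

  def expand(scopes):
      # Iterate to a fixed point, adding the scopes of any assumed role.
      scopes = set(scopes)
      while True:
          extra = {s for a in scopes if a.startswith('assume:')
                   for s in ROLES.get(a[len('assume:'):], ())}
          if extra <= scopes:
              return sorted(scopes)
          scopes |= extra

  # A level-3 client ends up with create-task on gecko-1-*, -2-*, and -3-*:
  print(expand(['assume:moz-tree:level:3']))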
We can even offer projects the ability to create their own workerTypes down the road.
This will
* get repos permission to run the workerTypes they need
* adequately isolate non-gecko stuff from gecko
* give interactive logins the scopes they need to re-run tasks and/or delegate access
* gecko builds, within a level, are free to create/destroy
@jonasfj, @garndt, does this sound totally insane? I feel like when I looked at this 2 months ago, the constraints left no good solution, but what I wrote above seems like a good solution. Did I forget a constraint?
Flags: needinfo?(jopsen)
Flags: needinfo?(garndt)
Comment 4 (assignee) • 9 years ago
..are free to create/destroy workerTypes as caching, configuration, cost, etc. dictate
Comment 5 • 9 years ago
This is looking good. I love the idea of separating out level 1 (try) from the rest. This is also a step toward making it easier to know what worker types to use if you're standing up new stuff.
The only downside is that we will now have a lot more gecko worker types because of the 3 levels for each, which is a pain in the provisioner UI right now. I guess I need to learn how to make that manage script work for updating worker types. :)
Flags: needinfo?(garndt)
Comment 6 (assignee) • 9 years ago
Yeah, I have been thinking about putting together a taskcluster-admin-scripts repo with some useful management scripts packaged up with a nice CLI wrapper. We need one to manage project roles too.
Comment 7 • 9 years ago
> @jonasfj, @garndt, does this sound totally insane?
Yes :) Okay, maybe... it depends on the intention.
IMO it's critical to efficiency that we don't create too many workerTypes.
I strongly believe that:
The TC team should manage some workerTypes.
These workerTypes should be the **primary** work horses for all projects.
Projects can get scoped caches, etc., on these workerTypes.
These workerTypes have decent task isolation.
Any generic task should use these managed workerTypes.
(linters, simple unit tests, tasks without specific requirements, things people set up self-serve)
(the tutorial and most github repos should also use these work horses)
Whenever a special case arises, like frequently building the same branch
(so a hot cache is a requirement),
we should decide carefully just how well we really want to isolate it.
Whenever we split tasks over two different workerTypes, some real thought has to be given
as to whether or not that is necessary. Projects shouldn't set up workers for the sake of
just doing so.
-----
That being said, if we want to offer a namespace like this so people can self-serve and play with
making their own workerTypes, that is totally nice to have, and I totally support it.
I would suggest:
aws-provisioner-v1/p-<project>-* (for project-specific workers under aws-provisioner)
zero-provisioner/p-<project>-* (for project-specific workers without a provisioner)
p-<project>-* (for project-specific workers with a custom provisioner)
And then we come up with a good name for the managed workerTypes that we create,
and we manually assign access to create tasks on those. Most of these work horses
probably won't have any restrictions.
Flags: needinfo?(jopsen)
Comment 8 (assignee) • 9 years ago
Jonas -- I don't think you read the whole bug :)
We have reasons both for running fewer workerTypes (cache sharing, cost reduction) and more workerTypes (hot caches, strong isolation, host configuration differences). My proposal balances those reasons.
As you see, all github repos build in the same workerType. Rust and Gecko should have strong isolation from one another. BMO could probably use project-taskcluster-github.
I think it would be cool to give *some* projects the ability to manage their own workerTypes, but that would not be every github repo, for sure, and it would probably only come with some means of measuring those projects' expenses and a lot more thought -- not immediately!
As for naming, I'm amused that you've invented yet *another* name for "without provisioner", as we already have "null-provisioner" and "no-provisioner-nope" at least. But this particular bug is about the workerTypes within aws-provisioner. If we start getting more provisioners, we can think about namespacing those, as well.
Finally, these workerTypes need to fit into 22 characters in routing keys, so I think adding a static "p-" just wastes two characters, with little improvement in readability. TaskCluster has a habit of not using terse abbreviations (that's why we have "project-" or "project:" or "project/" everywhere else) and I don't think we should break that habit.
Based on the irc conversation, I think there are two weaknesses with my proposal. First, it prohibits sharing workerTypes across levels for gecko tests. Second, it creates isolated workerTypes for level 2, which are somewhere around 15% of the task load [1]. I'm willing to accept the latter (if it was 5%, maybe not), and it's easy to adjust for the former.
gecko-{{level}}-build-*
               -util (lint, symbol-upload, etc.)
               -images (image-building, with privileged enabled)
gecko-test-* (non-artifact-producing tasks, shared across levels)
rust-* (rustbuild, isolated from gecko)
gaia-* (new name for b2g*, isolated from gecko)
<project>-* (created as necessary, balancing concerns as above)
tutorial (for the TC tutorial)
github (for github repos without anything more specific)
Overall, this would leave us with *fewer* workerTypes than we have now, and certainly with fewer workerType-specific scopes to handle.
[1] http://relengofthenerds.blogspot.com/search?updated-max=2015-06-16T13:52:00-07:00; fx-team is level-2, as is most of "other"
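A quick check of that 22-character limit against a few of the names in play (with level 1 substituted for the placeholder). Notably, the longer spellings from comment 3 wouldn't fit, which is presumably part of why the later renamings abbreviate "build" to "b" and drop the level from test workerTypes:

  MAX_LEN = 22  # workerTypes must fit in routing keys
  for name in ['gecko-1-test-linux-large',  # 24 chars: too long
               'gecko-1-b-linux64',         # 17 chars: fits
               'gecko-test-m3medium']:      # 19 chars: fits
      print(name, len(name), 'ok' if len(name) <= MAX_LEN else 'TOO LONG')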
Flags: needinfo?(jopsen)
Comment 9 (assignee) • 9 years ago
oh, fx-team is level 3, actually. So we're more like 5% at level 2. Catlee, what do you think about this? Could we run all level-2 repos at level 1 or level 3?
Flags: needinfo?(catlee)
Comment 10 • 9 years ago
Okay... I'm starting to like this :)
Especially not doing per-level gecko-test-* workerTypes; that should reduce the number a lot.
Bikeshedding names here, but can we add -docker-v1 to tutorial and github?
tutorial-docker-v1
github-docker-v1
(to make room for other platforms, and for breaking docker-worker releases, etc.)
> I'm amused that you've invented yet *another* name for "without provisioner",
> as we already have "null-provisioner" and "no-provisioner-nope" at least.
Yes, it was hard to find a new one (I did it because I fear those might be abused in unit tests).
I guess we should probably clean up the abuse instead.
> Finally, these workerTypes need to fit into 22 characters in routing keys,
> so I think adding a static "p-" just wastes two characters,
Yeah, okay, so just to be clear:
Giving custom workerTypes to a project is the exception, not the rule, right?
I.e., this won't be a template that we automatically follow unless a project needs workerTypes.
Flags: needinfo?(jopsen)
Comment 11 (assignee) • 9 years ago
Absolutely, it's not the rule. But it may be important to projects for security reasons, and that's a good excuse (thinking of rust, bmo, and nss/nspr, all of which are pretty critical infrastructure for us and which don't want to share worker hosts with random github repos).
Yeah, adding those suffixes is fine.
Comment 12 (assignee) • 9 years ago
Proposed renamings, so far:
Gecko
android-api-15 -> gecko-L-b-android
b2gbuild -> gecko-L-b-b2gbuild
b2gtest -> gecko-test-b2g
b2gtest-emulator -> gecko-test-b2g-emul
dbg-linux32 -> gecko-L-b-linux32
dbg-linux64 -> gecko-L-b-linux64
dbg-macosx64 -> gecko-L-b-macosx64
desktop-test -> gecko-test-m3medium
desktop-test-xlarge -> gecko-test-xlarge
emulator-ics -> gecko-L-b-emul-ics
emulator-ics-debug -> gecko-L-b-emul-ics
emulator-jb -> gecko-L-b-emul-jb
emulator-jb-debug -> gecko-L-b-emul-jb
emulator-kk -> gecko-L-b-emul-kk
emulator-kk-debug -> gecko-L-b-emul-kk
emulator-l -> gecko-L-b-emul-l
emulator-l-debug -> gecko-L-b-emul-l
emulator-x86-kk -> gecko-L-b-emul-x86-kk
flame-kk -> gecko-L-b-flame-kk
gecko-decision -> gecko-L-decision
mulet-debug -> gecko-L-b-mulet
mulet-opt -> gecko-L-b-mulet
opt-linux32 -> gecko-L-b-linux32
opt-linux64 -> gecko-L-b-linux64
opt-macosx64 -> gecko-L-b-macosx64
spidermonkey -> gecko-L-b-sm
symbol-upload -> gecko-L-symbol-upload
taskcluster-images -> gecko-L-docker-img
Gaia
gaia -> gaia-build
gaia-cache -> gaia-cache
gaia-decision -> gaia-decision
Funsize
funsize-balrog -> gecko-fnsz-balrog
funsize-balrog-dev -> gecko-fnsz-balrog
funsize-mar-generator -> gecko-fnsz-mar-gen
Rust
cratertest -> rust-cratertest
rustbuild -> rust-build
TC
ami-test -> tc-ami-test
ami-test-pv -> tc-ami-test-pv
github-worker -> github-docker
tutorial -> tutorial-docker
worker-ci-test -> tc-worker-ci
win2012r2 -> tc-win2012r2
y-2012 -> tc-y-2012
??
balrog -> ??
cli -> ??
tcvcs-cache-device -> ??
Unused?
android-api-11
b2g-desktop-debug
b2g-desktop-opt
b2gtest-legacy
dolphin
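(In the renamings above, "L" is a placeholder for the scm level. A trivial sketch of the expansion, with RENAMES as just a small sample of the table above:)

  RENAMES = {
      'opt-linux64': 'gecko-L-b-linux64',
      'dbg-linux64': 'gecko-L-b-linux64',
      'spidermonkey': 'gecko-L-b-sm',
      'gecko-decision': 'gecko-L-decision',
  }

  def new_worker_type(old, level):
      # Substitute the actual level (1-3) for the L placeholder.
      return RENAMES[old].replace('-L-', '-%d-' % level)

  assert new_worker_type('opt-linux64', 3) == 'gecko-3-b-linux64'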
Comment 13 • 9 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> oh, fx-team is level 3, actually. So we're more like 5% at level 2.
> Catlee, what do you think about this? Could we run all level-2 repos at
> level 1 or level 3?
I think the deciding factor should be whether code from level 2 ever gets merged directly into level 3 repositories, and how much scrutiny it gets when that happens. I suspect we can probably run all level-2 repos at level 1 and be fine.
Flags: needinfo?(catlee)
Updated • 8 years ago
Component: Integration → Platform and Services
Comment 14 • 8 years ago
It seems to me the biggest reason to segregate worker types between try/non-try or across scm levels is to avoid sharing caches. Could this not also be done, where needed, by including the level in the cache name, and enforcing via scopes? Then maybe we wouldn't require separate worker types for try/non-try?
The concept of worker type incorporates two different concerns: i) the environment definition of the host, and ii) the pool of machines. I think the reason we are considering having separate worker types for try/non-try is because we want to split up the pool, not because we want to change the environment. If the worker type just defined the environment, and we had a separate concept for "worker pool", that would allow us to partition the workers into distinct pools, without increasing the number of worker types.
Many pools are a good thing for e.g. keeping hot caches (think RelEng jacuzzis). But many worker types are a bad thing for administrative overhead. With worker pools, you solve both problems, by keeping the number of worker types limited (and thus less data redundancy), but having a lightweight mechanism to set up distinct pools for maximum cache/disk utilisation etc., just based on name. This would however require some engineering effort to introduce such a fundamental change (as tasks would need to be able to specify a pool to run in, perhaps as an optional field in the task definition).
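To make the split concrete, a rough sketch (the names here are illustrative, not an actual API):

  from dataclasses import dataclass

  @dataclass
  class WorkerType:
      # The environment definition: what a host looks like.
      name: str
      environment: dict  # AMI, instance config, etc.

  @dataclass
  class WorkerPool:
      # A named partition of the machines running one workerType.
      worker_type: str   # e.g. 'desktop-test'
      pool: str          # e.g. 'try' or 'level-3'
      capacity: int      # pool sizing still has to be configured somewhere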
Comment 15 (assignee) • 8 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #14)
> It seems to me the biggest reason to segregate worker types between
> try/non-try or across scm levels is to avoid sharing caches. Could this not
> also be done, where needed, by including the level in the cache name, and
> enforcing via scopes? Then maybe we wouldn't require separate worker types
> for try/non-try?
We do that, too. This is an added level of security in case of a kernel vulnerability that would allow escape from a container.
> The concept of worker type incorporates two different concerns: i) the
> environment definition of the host, and ii) the pool of machines. I think
> the reason we are considering having separate worker types for try/non-try
> is because we want to split up the pool, not because we want to change the
> environment. If the worker type just defined the environment, and we had a
> separate concept for "worker pool", that would allow us to partition the
> workers into distinct pools, without increasing the number of worker types.
This is true as well.
> Many pools is a good thing for e.g. keeping hot caches (think RelEng
> jacuzzis). But many worker types is a bad thing for administrative overhead.
> With worker pools, you solve both problems, by keeping number of worker
> types limited (and thus less data redundancy), but having a lightweight
> mechanism to set up distinct pools for maximum cache/disk utilisation etc,
> just based on name. This would however require some engineering effort to
> introduce such a fundamental change (as tasks would need to be able to
> specify a pool to run in, perhaps as an optional field in the task
> definition).
I think you're suggesting dynamically created pools? So I could have a task with:
provisionerId: aws-provisioner-v1,
workerType: win2012,
workerPool: gecko-try,
where only the workerType is explicitly configured in the provisioner, and the pool is dynamically created as tasks are encountered?
I don't think this is as easy as it seems: you'd still need configuration for pool sizes. For hardware resources, you'd need a mapping from pool to machine. It would be a nice refactoring (separating pool config from environment config, much as we're working on separating AWS config from environment config), but I don't think we could get away with no configuration for pools.
Comment 16 • 8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> I think you're suggesting dynamically created pools? So I could have a task
> with:
>
> provisionerId: aws-provisioner-v1,
> workerType: win2012,
> workerPool: gecko-try,
>
> where only the workerType is explicitly configured in the provisioner, and
> the pool is dynamically created as tasks are encountered?
Either that, or having a list of allowed pools. For example, the worker type definition could define a "poolSet" which might be e.g. "scm-level", and then somewhere (globally) we could define the general purpose poolSet "scm-level" to look like:
definition of "gecko-X" worker type:
{
"description": ".....",
"poolSet": "scm-level",
....
}
definition of "scm-level" poolSet:
{
"description": "This pool set is useful for gecko tasks for partitioning worker types by scm level of the push they relate to",
"pools": [
"level-1",
"level-2",
"level-3"
]
}
Notice that the definition of the poolSet is global, not particular to a worker type, so you'd only have to create the pool names once, but could reuse them across multiple worker types. Other designs are also possible, of course, but intuitively this feels about the right complexity, without being unnecessarily flexible.
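A sketch of the validation this design implies (illustrative names, matching the JSON above):

  POOL_SETS = {'scm-level': ['level-1', 'level-2', 'level-3']}
  WORKER_TYPES = {'gecko-X': {'poolSet': 'scm-level'}}

  def pool_is_valid(worker_type, pool):
      # A task's declared pool must belong to its workerType's poolSet.
      pool_set = WORKER_TYPES[worker_type].get('poolSet')
      return pool_set is not None and pool in POOL_SETS[pool_set]

  assert pool_is_valid('gecko-X', 'level-1')
  assert not pool_is_valid('gecko-X', 'try')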
If you think it has merit, we can stick it in a different bug. It might be a nice way to solve this problem that we've been avoiding dealing with, but it requires work on the workers, the provisioner, and the queue, so it would not be a small change! One nice property of this design, though, is that it is backward compatible, so there's no need to "break all the things".
Comment 17 • 8 years ago
Logically, too, we can grant teams permissions to affect worker pools, without giving control over the full worker type. This would keep Jonas happy too, as teams can self-serve pools without needing to invent new worker types. =)
Comment 18 • 8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> For hardware resources, you'd need a mapping from pool to machine.
At the moment workers declare provisionerId and workerType when claiming tasks; they would also need workerPool (provided by the provisioner). In other words, workers could claim work against a worker pool, or just the worker type (e.g. "aws-provisioner-v1/desktop-test" and "aws-provisioner-v1/desktop-test:try" would both be valid).
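(Sketching the two claim forms, with claim_key as a hypothetical helper:)

  def claim_key(provisioner_id, worker_type, pool=None):
      # Pool-qualified if a pool is given, otherwise just the worker type.
      base = provisioner_id + '/' + worker_type
      return base + ':' + pool if pool else base

  assert claim_key('aws-provisioner-v1', 'desktop-test') == \
         'aws-provisioner-v1/desktop-test'
  assert claim_key('aws-provisioner-v1', 'desktop-test', 'try') == \
         'aws-provisioner-v1/desktop-test:try'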
Comment 19 (assignee) • 8 years ago
We seem to be converging on the naming in comment 12. Roughly, that is
project-*
gecko-t-* for gecko tests
gecko-L-b-* for per-level gecko builds
I should write that into the namespaces doc.
Comment 20 (assignee) • 8 years ago
Comment 21 • 8 years ago
Commits pushed to master at https://github.com/taskcluster/taskcluster-docs
https://github.com/taskcluster/taskcluster-docs/commit/c2d6f665c739781bacb07c2ebe8b337a50a576b7
Bug 1220686: document worker types
https://github.com/taskcluster/taskcluster-docs/commit/5b5eaf0f1eab4d811d2d51ad4ca633171a1f5bce
Merge pull request #131 from djmitche/bug1220686
Bug 1220686: document worker types
Comment 22 (assignee) • 8 years ago
OK, so we'll make this a tracker to get all gecko workerTypes switched over.
Comment 23 (assignee) • 8 years ago
This is largely complete, at least for gecko! I suspect with the setup of multiple provisioners that we'll see workerTypes change again, so we can do a better job of naming when that happens.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Component: Platform and Services → Services