workerType Max Length is too low, hindering usability
Categories
(Taskcluster :: General, enhancement)
Tracking
(Not tracked)
People
(Reporter: mhentges, Assigned: ojaswin25111998, Mentored)
Details
With the 22-character limit, the releng team is incurring costs to readability and having basic naming patterns.
A great example is this PR, which has:
- A not-immediately-obvious abbreviation (
b
for "build") - We have the term "components" instead of "android-components": there's a lot of "components" throughout mozilla, it's tough to know that this is for android
- We can't fit a
-g
suffix like other workers, forcing us to either be inconsistent or to now update the other worker - The PR is blocked on us agreeing on the perfect shortening of the name, which wouldn't be as big of an issue if we had more space
TL;DR: can we increase this max limit?
Comment 1•6 years ago
|
||
The issue is around AMQP routes, which have a fixed maximum length of 255 characters.
Here's the breakdown of the task routing keys
characters |
8 | routingKeyKind
22 | taskId
1 | runId
22 | workerId
22 | provisionerId
22 | workerType
22 | schedulerId
22 | taskGroupId
0 | reserved
plus 7 intervening dots = 148. So there's some room to spare. We could potentially make the workerType up to 45 characters, and still have plenty of headroom.
Comment 2•6 years ago
|
||
What about workerType, provisionerId, and schedulerId all are set to 40? I feel like we run into issues with those the most and 40 should be plenty while still keeping us far away from 255.
Comment 3•6 years ago
|
||
I think provisionerId and schedulerId will matter less soon, but yeah, 40 doesn't hurt. I chose 45 since it's 22+dot+22, but that's basically arbitrary anyway.
I think this will be a good mentored bug. It involves carefully searching for everywhere that we limit the length of any of those three fields and adjusting the length limit, and modifying tests to use some longer identifiers.
Let's be careful and add some checks to tc-lib-pulse that sum up the max length of each of the components and bail out if it's >255, too.
Comment 5•6 years ago
|
||
Awesome! The two places you are likely to find length limitations are in services//schemas/v1/ and in services/*/src/{v1,api}.js. Have at it!
I was thinking of creating a new constant(something like identifier-extended-max-length, open for a better name) and then editing the schemas and the api parameters. Is that okay?
Comment 7•6 years ago
|
||
That sounds good. You'll need to define the constant in a few different places, but don't let that deter you!
Okay, I created a PR here: https://github.com/taskcluster/taskcluster/pull/110
Comment 10•6 years ago
|
||
Remaining parts:
- add some checks to tc-lib-pulse that sum up the max length of each of the components and bail out if it's >255
Comment 11•6 years ago
|
||
tc-lib-pulse already checks the max length.
Pete, you indicated you had some other concerns. What's up?
Comment 12•6 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #11)
tc-lib-pulse already checks the max length.
Pete, you indicated you had some other concerns. What's up?
-
The main concern I had seems to be addressed in comment 1.
-
I wasn't aware of this bug, perhaps a reference to this bug in the PR would have helped with this.
-
I think this was a significant enough change that it could have warranted an RFC. At a minimum a message could have been shared via IRC/email/team meeting - it wasn't widely communicated.
-
For changes to our API interface, please always alert me so (as module owner of taskcluster clients and workers) since any interface changes can impact both. I'll take an action to check both, which we always need to do, in the light of interface changes.
Thanks!
Comment 13•6 years ago
|
||
Is there also a bug for updating the provisioner? I'm pretty sure you'll need to involve jhford with changes like this, for the same reason (module owner of the provisioner).
Comment 14•6 years ago
|
||
One concern that I have about this is that it will lead to information being encoded into the worker type names, which is then used later on down the taskgraph or other places to make decisions on how a task executes or an image is built. We should be able to set the worker-type name to, for example, a slugid, and have everything work 100% as expected.
I agree with Pete's comment that this is a large enough change that I think more broad communication would've been warranted.
Reporter | ||
Comment 15•6 years ago
|
||
FWIW, in releng:
For the worker type, we may need to encode:
- Whether the type is for "mobile" projects or not
- The level (1, 2, 3)
- The project it's being applied to (e.g.: "android-components")
The worker ID also has a nontrivial amount of information (this is the same as the worker's hostname):
- Whether the worker is for testing (dep) or not
- Whether the worker applies to "mobile" projects or not
- What kind of releng script is being run (e.g.: beetmover)
- An incrementing number to disambiguate each additional worker that does the same work
Right now, it's tough to squeeze these values into the 22-character limits.
John, is this the kind of information that you're worried about being encoded in the AMQP? Perhaps these should be mapped to generated IDs?
Comment 16•6 years ago
|
||
(responding to john's comment in the PR but consolidating discussion here):
If we're switching to a longer format, do we need to keep the old length around? All old IDs are valid new IDs and my understanding is that this patch is designed to update everything to the new format in one go.
Could we have a single field for max identifier length? If not, maybe we should hold off on landing until we can have a single field? I'm concerned that the old identifier (indentifier-max-length) will end up living in perpetuity and will be confusing to future implementers.
As written, the patch is not changing every identifier, just the three in the title. So there are still things using "identifier-max-length". But Mitch's comment suggests that we should change workerId as well (and probably workerGroup).
This was jonas's concern: if we have different length limits for different fields, it can cause user confusion. But if we settle on a pattern for all identifiers, then we end up with the lowest-common-denominator (22).
A distinction we could make is, things that must always a slugId (taskId, taskGroupId) are 22 characters, while everything else is 40. Then the protections on AMQP route lengths will alert us quickly to any potential overage.
So that would be:
40 characters:
- provisionerId, workerType, workerGroup, workerId
- schedulerId
- organization, repository (in tc-github)
22 characters (no change):
- taskGroupId, taskId
Comment 17•6 years ago
|
||
I would prefer to avoid back-filling an RFC for this if we can.
At this point, we are all in agreement that this is a good and required change that will require some follow-up work to make sure tools, etc are changed accordingly. What's more, we have a patch in-hand from a new contributor, and I would like to encourage forward progress here.
Yes, if we could go back in time, this should have started with an RFC, but back-filling it now feels like busy work.
Pete, John: are you OK with skipping the RFC for this?
Comment 18•6 years ago
|
||
(In reply to Mitchell Hentges [:mhentges] from comment #15)
John, is this the kind of information that you're worried about being
encoded in the AMQP? Perhaps these should be mapped to generated IDs?
Sort of. My concern is that the name of the worker type will be used as input to control flow in tasks. This would mean that a build script which works on a worker type "linux-1-b-linux" wouldn't run correctly if the worker type were set instead to "abcdef12345". This would make it hard to rename worker types or add new worker types.
This is a general concern I have about worker-type names, even at 22. I'm raising this concern here because this is nearly doubling the size of the field, which means even more can be encoded into it and makes it more tempting.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #16)
As written, the patch is not changing every identifier, just the three in
the title. So there are still things using "identifier-max-length". But
Mitch's comment suggests that we should change workerId as well (and
probably workerGroup).This was jonas's concern: if we have different length limits for different
fields, it can cause user confusion. But if we settle on a pattern for all
identifiers, then we end up with the lowest-common-denominator (22).A distinction we could make is, things that must always a slugId (taskId,
taskGroupId) are 22 characters, while everything else is 40. Then the
protections on AMQP route lengths will alert us quickly to any potential
overage.So that would be:
40 characters:
- provisionerId, workerType, workerGroup, workerId
- schedulerId
- organization, repository (in tc-github)
22 characters (no change):
- taskGroupId, taskId
So then what we're having is identifiers and slug ids, not two types of identifiers. Since slug ids have specific requirements, and those fields are absolutely required to be slug ids, perhaps they should use a slug id parameter in the schemas and general identifiers can use an 'identifier' parameter rather than 'identifier-extended' for identifiers.
Having "identifier-max-length" which is effectively "slugid-length", but then then having "identifier-extended-max-length" as the actual identifier maximum length is confusing to me.
Comment 19•6 years ago
|
||
So then what we're having is identifiers and slug ids, not two types of identifiers. Since slug ids have specific requirements, and those fields are absolutely required to be slug ids, perhaps they should use a slug id parameter in the schemas and general identifiers can use an 'identifier' parameter rather than 'identifier-extended' for identifiers.
That makes lots of sense!
Regarding the first concern, can you elaborate on why that's a problem? Particularly noting that workerTypes are used in scope expressions like aws-provisioner-v1/gecko-3-*
(and this is both an important part of the Firefox security model and encouraged design) based on the structure Mitch outlined above.
We don't have workers that "decode" their workerType to determine what to do, but I'm not sure why it would be problematic for the platform if a worker implementer did so.
Comment 20•6 years ago
|
||
I'm guessing the fix for the first concern would be to switch workertype/schedudlerId/etc to a slugid and then store human-readable metadata about them for display purposes. I could see that change making sense at some point but perhaps that is the thing that should be rfc'd and this specific change can happen sooner since it will un-frustrate some of our users and doesn't stop us from changing to a slugid later?
Comment 21•6 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #19)
Regarding the first concern, can you elaborate on why that's a problem? Particularly noting that workerTypes are used in scope expressions like
aws-provisioner-v1/gecko-3-*
(and this is both an important part of the Firefox security model and encouraged design) based on the structure Mitch outlined above.
+1. We also do some workerType
allowlisting in scriptworker chain of trust verification, and plan on using the level in workerType
names in ci-admin privilege escalation tests. It also tells us, at a glance, if we can expect working live logs, since level 3 workers have live logging disabled.
If and when we introduce metadata for the workerType
s, we can adjust these unit tests and web tools to use the metadata, and moving to slug ids would be less disruptive at that point.
Comment 22•6 years ago
|
||
OK, I think we're agreed about identifier lengths. I also don't think this is a good time in TC's trajectory to revisit how we identify resources. So, I'll write an RFC for the former, and leave the latter alone.
Comment 24•6 years ago
|
||
Please look at https://github.com/taskcluster/taskcluster-rfcs/pull/139 and leave any additional comments there. I'd like to take this to final-comment this week and move back to the implementation stage next Monday.
Comment 25•6 years ago
|
||
It looks good to me although I guess I was already in support of this.
Thanks for writing it up for discussion!
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Assignee | ||
Comment 27•6 years ago
|
||
Alright so I have some doubts regarding the changes to make in ec2-manager:-
-
I noticed that
ec2-manager
doesn't make any assumptions about theworkerType
right now. Should I add a pattern and max-length property to it in the schemas? [1] -
Should the data type be changed to something like char(38) for the
workerType
in the SQL statements as well? [2]
Comment 29•6 years ago
|
||
(In reply to Ojaswin from comment #27)
Alright so I have some doubts regarding the changes to make in ec2-manager:-
- I noticed that
ec2-manager
doesn't make any assumptions about the
workerType
right now. Should I add a pattern and max-length property to it
in the schemas? [1]
That sounds fine to do, but I think adding that limit only in aws-provisioner is also fine. The only place where EC2-Manager sets workerType values is based on AWS-Provisioner requests. The name in ec2-manager really should be something other than "workerType", since it's not strictly speaking a worker type.
- Should the data type be changed to something like char(38) for the
workerType
in the SQL statements as well? [2]
Nope, that's not needed and likely not desirable, oddly enough:
Tip
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length- constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
https://www.postgresql.org/docs/current/datatype-character.html
Assignee | ||
Comment 30•6 years ago
|
||
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #29)
(In reply to Ojaswin from comment #27)
That sounds fine to do, but I think adding that limit only in aws-provisioner is also fine. The only place where EC2-Manager sets workerType values is based on AWS-Provisioner requests. The name in ec2-manager really should be something other than "workerType", since it's not strictly speaking a worker type.
Ah ok I'll just change the provisioner then.
Nope, that's not needed and likely not desirable, oddly enough:
Tip
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length- constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
https://www.postgresql.org/docs/current/datatype-character.html
Ohh, that's interesting! Thanks for the info
Comment 31•6 years ago
|
||
From the PR it sounds like this was deployed about a week ago. Can we consider this bug closed or are there any follow-ups to address?
Updated•6 years ago
|
Description
•