Closed Bug 1520579 Opened 6 years ago Closed 6 years ago

workerType Max Length is too low, hindering usability

Categories

(Taskcluster :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mhentges, Assigned: ojaswin25111998, Mentored)

Details

With the 22-character limit, the releng team is incurring costs to readability and having basic naming patterns.

A great example is this PR, which has:

  • A not-immediately-obvious abbreviation (b for "build")
  • We have the term "components" instead of "android-components": there's a lot of "components" throughout mozilla, it's tough to know that this is for android
  • We can't fit a -g suffix like other workers, forcing us to either be inconsistent or to now update the other worker
  • The PR is blocked on us agreeing on the perfect shortening of the name, which wouldn't be as big of an issue if we had more space

TL;DR: can we increase this max limit?

The issue is around AMQP routes, which have a fixed maximum length of 255 characters.

Here's the breakdown of the task routing keys

characters | 
 8         | routingKeyKind 
 22        | taskId 
 1         | runId  
 22        | workerId   
 22        | provisionerId  
 22        | workerType 
 22        | schedulerId    
 22        | taskGroupId    
 0         | reserved   

plus 7 intervening dots = 148. So there's some room to spare. We could potentially make the workerType up to 45 characters, and still have plenty of headroom.

Component: Worker → General
QA Contact: pmoore
Summary: Identifier Max Length is too low, hindering usability → workerType Max Length is too low, hindering usability

What about workerType, provisionerId, and schedulerId all are set to 40? I feel like we run into issues with those the most and 40 should be plenty while still keeping us far away from 255.

I think provisionerId and schedulerId will matter less soon, but yeah, 40 doesn't hurt. I chose 45 since it's 22+dot+22, but that's basically arbitrary anyway.

I think this will be a good mentored bug. It involves carefully searching for everywhere that we limit the length of any of those three fields and adjusting the length limit, and modifying tests to use some longer identifiers.

Let's be careful and add some checks to tc-lib-pulse that sum up the max length of each of the components and bail out if it's >255, too.

Mentor: dustin

Hey Dustin, I would like to take this up!

Awesome! The two places you are likely to find length limitations are in services//schemas/v1/ and in services/*/src/{v1,api}.js. Have at it!

Assignee: nobody → ojaswin25111998

I was thinking of creating a new constant(something like identifier-extended-max-length, open for a better name) and then editing the schemas and the api parameters. Is that okay?

That sounds good. You'll need to define the constant in a few different places, but don't let that deter you!

Haha yeah I noticed. I'll get to work then :)

Remaining parts:

  • add some checks to tc-lib-pulse that sum up the max length of each of the components and bail out if it's >255

tc-lib-pulse already checks the max length.

Pete, you indicated you had some other concerns. What's up?

Flags: needinfo?(pmoore)

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #11)

tc-lib-pulse already checks the max length.

Pete, you indicated you had some other concerns. What's up?

  1. The main concern I had seems to be addressed in comment 1.

  2. I wasn't aware of this bug, perhaps a reference to this bug in the PR would have helped with this.

  3. I think this was a significant enough change that it could have warranted an RFC. At a minimum a message could have been shared via IRC/email/team meeting - it wasn't widely communicated.

  4. For changes to our API interface, please always alert me so (as module owner of taskcluster clients and workers) since any interface changes can impact both. I'll take an action to check both, which we always need to do, in the light of interface changes.

Thanks!

Flags: needinfo?(pmoore)

Is there also a bug for updating the provisioner? I'm pretty sure you'll need to involve jhford with changes like this, for the same reason (module owner of the provisioner).

One concern that I have about this is that it will lead to information being encoded into the worker type names, which is then used later on down the taskgraph or other places to make decisions on how a task executes or an image is built. We should be able to set the worker-type name to, for example, a slugid, and have everything work 100% as expected.

I agree with Pete's comment that this is a large enough change that I think more broad communication would've been warranted.

FWIW, in releng:

For the worker type, we may need to encode:

  • Whether the type is for "mobile" projects or not
  • The level (1, 2, 3)
  • The project it's being applied to (e.g.: "android-components")

The worker ID also has a nontrivial amount of information (this is the same as the worker's hostname):

  • Whether the worker is for testing (dep) or not
  • Whether the worker applies to "mobile" projects or not
  • What kind of releng script is being run (e.g.: beetmover)
  • An incrementing number to disambiguate each additional worker that does the same work

Right now, it's tough to squeeze these values into the 22-character limits.

John, is this the kind of information that you're worried about being encoded in the AMQP? Perhaps these should be mapped to generated IDs?

(responding to john's comment in the PR but consolidating discussion here):

If we're switching to a longer format, do we need to keep the old length around? All old IDs are valid new IDs and my understanding is that this patch is designed to update everything to the new format in one go.

Could we have a single field for max identifier length? If not, maybe we should hold off on landing until we can have a single field? I'm concerned that the old identifier (indentifier-max-length) will end up living in perpetuity and will be confusing to future implementers.

As written, the patch is not changing every identifier, just the three in the title. So there are still things using "identifier-max-length". But Mitch's comment suggests that we should change workerId as well (and probably workerGroup).

This was jonas's concern: if we have different length limits for different fields, it can cause user confusion. But if we settle on a pattern for all identifiers, then we end up with the lowest-common-denominator (22).

A distinction we could make is, things that must always a slugId (taskId, taskGroupId) are 22 characters, while everything else is 40. Then the protections on AMQP route lengths will alert us quickly to any potential overage.

So that would be:

40 characters:

  • provisionerId, workerType, workerGroup, workerId
  • schedulerId
  • organization, repository (in tc-github)

22 characters (no change):

  • taskGroupId, taskId

I would prefer to avoid back-filling an RFC for this if we can.

At this point, we are all in agreement that this is a good and required change that will require some follow-up work to make sure tools, etc are changed accordingly. What's more, we have a patch in-hand from a new contributor, and I would like to encourage forward progress here.

Yes, if we could go back in time, this should have started with an RFC, but back-filling it now feels like busy work.

Pete, John: are you OK with skipping the RFC for this?

Flags: needinfo?(pmoore)
Flags: needinfo?(jhford)

(In reply to Mitchell Hentges [:mhentges] from comment #15)

John, is this the kind of information that you're worried about being
encoded in the AMQP? Perhaps these should be mapped to generated IDs?

Sort of. My concern is that the name of the worker type will be used as input to control flow in tasks. This would mean that a build script which works on a worker type "linux-1-b-linux" wouldn't run correctly if the worker type were set instead to "abcdef12345". This would make it hard to rename worker types or add new worker types.

This is a general concern I have about worker-type names, even at 22. I'm raising this concern here because this is nearly doubling the size of the field, which means even more can be encoded into it and makes it more tempting.

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #16)

As written, the patch is not changing every identifier, just the three in
the title. So there are still things using "identifier-max-length". But
Mitch's comment suggests that we should change workerId as well (and
probably workerGroup).

This was jonas's concern: if we have different length limits for different
fields, it can cause user confusion. But if we settle on a pattern for all
identifiers, then we end up with the lowest-common-denominator (22).

A distinction we could make is, things that must always a slugId (taskId,
taskGroupId) are 22 characters, while everything else is 40. Then the
protections on AMQP route lengths will alert us quickly to any potential
overage.

So that would be:

40 characters:

  • provisionerId, workerType, workerGroup, workerId
  • schedulerId
  • organization, repository (in tc-github)

22 characters (no change):

  • taskGroupId, taskId

So then what we're having is identifiers and slug ids, not two types of identifiers. Since slug ids have specific requirements, and those fields are absolutely required to be slug ids, perhaps they should use a slug id parameter in the schemas and general identifiers can use an 'identifier' parameter rather than 'identifier-extended' for identifiers.

Having "identifier-max-length" which is effectively "slugid-length", but then then having "identifier-extended-max-length" as the actual identifier maximum length is confusing to me.

Flags: needinfo?(jhford)

So then what we're having is identifiers and slug ids, not two types of identifiers. Since slug ids have specific requirements, and those fields are absolutely required to be slug ids, perhaps they should use a slug id parameter in the schemas and general identifiers can use an 'identifier' parameter rather than 'identifier-extended' for identifiers.

That makes lots of sense!

Regarding the first concern, can you elaborate on why that's a problem? Particularly noting that workerTypes are used in scope expressions like aws-provisioner-v1/gecko-3-* (and this is both an important part of the Firefox security model and encouraged design) based on the structure Mitch outlined above.

We don't have workers that "decode" their workerType to determine what to do, but I'm not sure why it would be problematic for the platform if a worker implementer did so.

I'm guessing the fix for the first concern would be to switch workertype/schedudlerId/etc to a slugid and then store human-readable metadata about them for display purposes. I could see that change making sense at some point but perhaps that is the thing that should be rfc'd and this specific change can happen sooner since it will un-frustrate some of our users and doesn't stop us from changing to a slugid later?

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #19)

Regarding the first concern, can you elaborate on why that's a problem? Particularly noting that workerTypes are used in scope expressions like aws-provisioner-v1/gecko-3-* (and this is both an important part of the Firefox security model and encouraged design) based on the structure Mitch outlined above.

+1. We also do some workerType allowlisting in scriptworker chain of trust verification, and plan on using the level in workerType names in ci-admin privilege escalation tests. It also tells us, at a glance, if we can expect working live logs, since level 3 workers have live logging disabled.

If and when we introduce metadata for the workerTypes, we can adjust these unit tests and web tools to use the metadata, and moving to slug ids would be less disruptive at that point.

OK, I think we're agreed about identifier lengths. I also don't think this is a good time in TC's trajectory to revisit how we identify resources. So, I'll write an RFC for the former, and leave the latter alone.

(clearing needinfo to reset it)

Flags: needinfo?(pmoore)

Please look at https://github.com/taskcluster/taskcluster-rfcs/pull/139 and leave any additional comments there. I'd like to take this to final-comment this week and move back to the implementation stage next Monday.

Flags: needinfo?(pmoore)
Flags: needinfo?(mhentges)
Flags: needinfo?(jhford)
Flags: needinfo?(bstack)
Flags: needinfo?(aki)

It looks good to me although I guess I was already in support of this.

Thanks for writing it up for discussion!

Flags: needinfo?(bstack)

RFC looks fantastic :)

Flags: needinfo?(mhentges)
Flags: needinfo?(aki)
Flags: needinfo?(jhford)
Flags: needinfo?(pmoore)

Alright so I have some doubts regarding the changes to make in ec2-manager:-

  1. I noticed that ec2-manager doesn't make any assumptions about the workerType right now. Should I add a pattern and max-length property to it in the schemas? [1]

  2. Should the data type be changed to something like char(38) for the workerType in the SQL statements as well? [2]

[1] https://github.com/taskcluster/ec2-manager/blob/bfc9a313f731b7686ed3f2be23a7b612374437b3/schemas/v1/errors.yml#L28

[2] https://github.com/taskcluster/ec2-manager/blob/bfc9a313f731b7686ed3f2be23a7b612374437b3/sql/create-db.sql#L27-L30

Hopefully John can help answer those questions.

Flags: needinfo?(jhford)

(In reply to Ojaswin from comment #27)

Alright so I have some doubts regarding the changes to make in ec2-manager:-

  1. I noticed that ec2-manager doesn't make any assumptions about the
    workerType right now. Should I add a pattern and max-length property to it
    in the schemas? [1]

That sounds fine to do, but I think adding that limit only in aws-provisioner is also fine. The only place where EC2-Manager sets workerType values is based on AWS-Provisioner requests. The name in ec2-manager really should be something other than "workerType", since it's not strictly speaking a worker type.

  1. Should the data type be changed to something like char(38) for the
    workerType in the SQL statements as well? [2]

Nope, that's not needed and likely not desirable, oddly enough:

Tip

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length- constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

https://www.postgresql.org/docs/current/datatype-character.html

Flags: needinfo?(jhford)

(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #29)

(In reply to Ojaswin from comment #27)
That sounds fine to do, but I think adding that limit only in aws-provisioner is also fine. The only place where EC2-Manager sets workerType values is based on AWS-Provisioner requests. The name in ec2-manager really should be something other than "workerType", since it's not strictly speaking a worker type.

Ah ok I'll just change the provisioner then.

Nope, that's not needed and likely not desirable, oddly enough:

Tip

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length- constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

https://www.postgresql.org/docs/current/datatype-character.html

Ohh, that's interesting! Thanks for the info

From the PR it sounds like this was deployed about a week ago. Can we consider this bug closed or are there any follow-ups to address?

Flags: needinfo?(dustin)
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(dustin)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.