Bugzilla

Comment 1

•

6 years ago

The issue is around AMQP routes, which have a fixed maximum length of 255 characters.

Here's the breakdown of the task routing keys

characters | 
 8         | routingKeyKind 
 22        | taskId 
 1         | runId  
 22        | workerId   
 22        | provisionerId  
 22        | workerType 
 22        | schedulerId    
 22        | taskGroupId    
 0         | reserved

plus 7 intervening dots = 148. So there's some room to spare. We could potentially make the workerType up to 45 characters, and still have plenty of headroom.

Component: Worker → General

QA Contact: pmoore

Summary: Identifier Max Length is too low, hindering usability → workerType Max Length is too low, hindering usability

Brian Stack [:bstack]

Comment 2

•

6 years ago

What about workerType, provisionerId, and schedulerId all are set to 40? I feel like we run into issues with those the most and 40 should be plenty while still keeping us far away from 255.

Comment 3

•

6 years ago

I think provisionerId and schedulerId will matter less soon, but yeah, 40 doesn't hurt. I chose 45 since it's 22+dot+22, but that's basically arbitrary anyway.

I think this will be a good mentored bug. It involves carefully searching for everywhere that we limit the length of any of those three fields and adjusting the length limit, and modifying tests to use some longer identifiers.

Let's be careful and add some checks to tc-lib-pulse that sum up the max length of each of the components and bail out if it's >255, too.

Mentor: dustin

Assignee

Comment 4

•

6 years ago

Hey Dustin, I would like to take this up!

Comment 5

•

6 years ago

Awesome! The two places you are likely to find length limitations are in services//schemas/v1/ and in services/*/src/{v1,api}.js. Have at it!

Assignee: nobody → ojaswin25111998

Assignee

Comment 6

•

6 years ago

I was thinking of creating a new constant(something like identifier-extended-max-length, open for a better name) and then editing the schemas and the api parameters. Is that okay?

Comment 7

•

6 years ago

That sounds good. You'll need to define the constant in a few different places, but don't let that deter you!

Assignee

Comment 8

•

6 years ago

Haha yeah I noticed. I'll get to work then :)

Assignee

Comment 9

•

6 years ago

Okay, I created a PR here: https://github.com/taskcluster/taskcluster/pull/110

Comment 10

•

6 years ago

Remaining parts:

add some checks to tc-lib-pulse that sum up the max length of each of the components and bail out if it's >255

Pete Moore [:pmoore][:pete]

Comment 11

•

6 years ago

tc-lib-pulse already checks the max length.

Pete, you indicated you had some other concerns. What's up?

Flags: needinfo?(pmoore)

Comment 12

•

6 years ago

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #11)

tc-lib-pulse already checks the max length.

Pete, you indicated you had some other concerns. What's up?

The main concern I had seems to be addressed in comment 1.
I wasn't aware of this bug, perhaps a reference to this bug in the PR would have helped with this.
I think this was a significant enough change that it could have warranted an RFC. At a minimum a message could have been shared via IRC/email/team meeting - it wasn't widely communicated.
For changes to our API interface, please always alert me so (as module owner of taskcluster clients and workers) since any interface changes can impact both. I'll take an action to check both, which we always need to do, in the light of interface changes.

Thanks!

Flags: needinfo?(pmoore)

Pete Moore [:pmoore][:pete]

Comment 13

•

6 years ago

Is there also a bug for updating the provisioner? I'm pretty sure you'll need to involve jhford with changes like this, for the same reason (module owner of the provisioner).

Mitchell Hentges [:mhentges] 🦀

Comment 14

•

6 years ago

One concern that I have about this is that it will lead to information being encoded into the worker type names, which is then used later on down the taskgraph or other places to make decisions on how a task executes or an image is built. We should be able to set the worker-type name to, for example, a slugid, and have everything work 100% as expected.

I agree with Pete's comment that this is a large enough change that I think more broad communication would've been warranted.

Reporter

Comment 15

•

6 years ago

FWIW, in releng:

For the worker type, we may need to encode:

Whether the type is for "mobile" projects or not
The level (1, 2, 3)
The project it's being applied to (e.g.: "android-components")

The worker ID also has a nontrivial amount of information (this is the same as the worker's hostname):

Whether the worker is for testing (dep) or not
Whether the worker applies to "mobile" projects or not
What kind of releng script is being run (e.g.: beetmover)
An incrementing number to disambiguate each additional worker that does the same work

Right now, it's tough to squeeze these values into the 22-character limits.

John, is this the kind of information that you're worried about being encoded in the AMQP? Perhaps these should be mapped to generated IDs?

Chris Cooper [:coop] (he/him)

Comment 16

•

6 years ago

(responding to john's comment in the PR but consolidating discussion here):

If we're switching to a longer format, do we need to keep the old length around? All old IDs are valid new IDs and my understanding is that this patch is designed to update everything to the new format in one go.

Could we have a single field for max identifier length? If not, maybe we should hold off on landing until we can have a single field? I'm concerned that the old identifier (indentifier-max-length) will end up living in perpetuity and will be confusing to future implementers.

As written, the patch is not changing every identifier, just the three in the title. So there are still things using "identifier-max-length". But Mitch's comment suggests that we should change workerId as well (and probably workerGroup).

This was jonas's concern: if we have different length limits for different fields, it can cause user confusion. But if we settle on a pattern for all identifiers, then we end up with the lowest-common-denominator (22).

A distinction we could make is, things that must always a slugId (taskId, taskGroupId) are 22 characters, while everything else is 40. Then the protections on AMQP route lengths will alert us quickly to any potential overage.

So that would be:

40 characters:

provisionerId, workerType, workerGroup, workerId
schedulerId
organization, repository (in tc-github)

22 characters (no change):

taskGroupId, taskId

Comment 17

•

6 years ago

I would prefer to avoid back-filling an RFC for this if we can.

At this point, we are all in agreement that this is a good and required change that will require some follow-up work to make sure tools, etc are changed accordingly. What's more, we have a patch in-hand from a new contributor, and I would like to encourage forward progress here.

Yes, if we could go back in time, this should have started with an RFC, but back-filling it now feels like busy work.

Pete, John: are you OK with skipping the RFC for this?

Flags: needinfo?(pmoore)

Flags: needinfo?(jhford)

Comment 18

•

6 years ago

(In reply to Mitchell Hentges [:mhentges] from comment #15)

John, is this the kind of information that you're worried about being
encoded in the AMQP? Perhaps these should be mapped to generated IDs?

Sort of. My concern is that the name of the worker type will be used as input to control flow in tasks. This would mean that a build script which works on a worker type "linux-1-b-linux" wouldn't run correctly if the worker type were set instead to "abcdef12345". This would make it hard to rename worker types or add new worker types.

This is a general concern I have about worker-type names, even at 22. I'm raising this concern here because this is nearly doubling the size of the field, which means even more can be encoded into it and makes it more tempting.

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #16)

As written, the patch is not changing every identifier, just the three in
the title. So there are still things using "identifier-max-length". But
Mitch's comment suggests that we should change workerId as well (and
probably workerGroup).

This was jonas's concern: if we have different length limits for different
fields, it can cause user confusion. But if we settle on a pattern for all
identifiers, then we end up with the lowest-common-denominator (22).

A distinction we could make is, things that must always a slugId (taskId,
taskGroupId) are 22 characters, while everything else is 40. Then the
protections on AMQP route lengths will alert us quickly to any potential
overage.

So that would be:

40 characters:

provisionerId, workerType, workerGroup, workerId

schedulerId

organization, repository (in tc-github)

22 characters (no change):

taskGroupId, taskId

So then what we're having is identifiers and slug ids, not two types of identifiers. Since slug ids have specific requirements, and those fields are absolutely required to be slug ids, perhaps they should use a slug id parameter in the schemas and general identifiers can use an 'identifier' parameter rather than 'identifier-extended' for identifiers.

Having "identifier-max-length" which is effectively "slugid-length", but then then having "identifier-extended-max-length" as the actual identifier maximum length is confusing to me.

Flags: needinfo?(jhford)

Comment 19

•

6 years ago

So then what we're having is identifiers and slug ids, not two types of identifiers. Since slug ids have specific requirements, and those fields are absolutely required to be slug ids, perhaps they should use a slug id parameter in the schemas and general identifiers can use an 'identifier' parameter rather than 'identifier-extended' for identifiers.

That makes lots of sense!

Regarding the first concern, can you elaborate on why that's a problem? Particularly noting that workerTypes are used in scope expressions like aws-provisioner-v1/gecko-3-* (and this is both an important part of the Firefox security model and encouraged design) based on the structure Mitch outlined above.

We don't have workers that "decode" their workerType to determine what to do, but I'm not sure why it would be problematic for the platform if a worker implementer did so.

Brian Stack [:bstack]

Comment 20

•

6 years ago

I'm guessing the fix for the first concern would be to switch workertype/schedudlerId/etc to a slugid and then store human-readable metadata about them for display purposes. I could see that change making sense at some point but perhaps that is the thing that should be rfc'd and this specific change can happen sooner since it will un-frustrate some of our users and doesn't stop us from changing to a slugid later?

Aki Sasaki (not active)

Comment 21

•

6 years ago

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #19)

Regarding the first concern, can you elaborate on why that's a problem? Particularly noting that workerTypes are used in scope expressions like aws-provisioner-v1/gecko-3-* (and this is both an important part of the Firefox security model and encouraged design) based on the structure Mitch outlined above.

+1. We also do some workerType allowlisting in scriptworker chain of trust verification, and plan on using the level in workerType names in ci-admin privilege escalation tests. It also tells us, at a glance, if we can expect working live logs, since level 3 workers have live logging disabled.

If and when we introduce metadata for the workerTypes, we can adjust these unit tests and web tools to use the metadata, and moving to slug ids would be less disruptive at that point.

Comment 22

•

6 years ago

OK, I think we're agreed about identifier lengths. I also don't think this is a good time in TC's trajectory to revisit how we identify resources. So, I'll write an RFC for the former, and leave the latter alone.

Comment 23

•

6 years ago

(clearing needinfo to reset it)

Flags: needinfo?(pmoore)

Mitchell Hentges [:mhentges] 🦀

Comment 24

•

6 years ago

Please look at https://github.com/taskcluster/taskcluster-rfcs/pull/139 and leave any additional comments there. I'd like to take this to final-comment this week and move back to the implementation stage next Monday.

Flags: needinfo?(pmoore)

Flags: needinfo?(mhentges)

Flags: needinfo?(jhford)

Flags: needinfo?(bstack)

Flags: needinfo?(aki)

Brian Stack [:bstack]

Comment 25

•

6 years ago

It looks good to me although I guess I was already in support of this.

Thanks for writing it up for discussion!

Flags: needinfo?(bstack)

Reporter

Comment 26

•

6 years ago

RFC looks fantastic :)

Flags: needinfo?(mhentges)

Aki Sasaki (not active)

Updated

•

6 years ago

Flags: needinfo?(aki)

Pete Moore [:pmoore][:pete]

Updated

•

6 years ago

Flags: needinfo?(jhford)

Updated

•

6 years ago

Flags: needinfo?(pmoore)

[1] https://github.com/taskcluster/ec2-manager/blob/bfc9a313f731b7686ed3f2be23a7b612374437b3/schemas/v1/errors.yml#L28

Assignee

Comment 27

•

6 years ago

Alright so I have some doubts regarding the changes to make in ec2-manager:-

I noticed that ec2-manager doesn't make any assumptions about the workerType right now. Should I add a pattern and max-length property to it in the schemas? [1]
Should the data type be changed to something like char(38) for the workerType in the SQL statements as well? [2]

[2] https://github.com/taskcluster/ec2-manager/blob/bfc9a313f731b7686ed3f2be23a7b612374437b3/sql/create-db.sql#L27-L30

Comment 28

•

6 years ago

Hopefully John can help answer those questions.

Flags: needinfo?(jhford)

https://www.postgresql.org/docs/current/datatype-character.html

Comment 29

•

6 years ago

(In reply to Ojaswin from comment #27)

Alright so I have some doubts regarding the changes to make in ec2-manager:-

I noticed that ec2-manager doesn't make any assumptions about the
workerType right now. Should I add a pattern and max-length property to it
in the schemas? [1]

That sounds fine to do, but I think adding that limit only in aws-provisioner is also fine. The only place where EC2-Manager sets workerType values is based on AWS-Provisioner requests. The name in ec2-manager really should be something other than "workerType", since it's not strictly speaking a worker type.

Should the data type be changed to something like char(38) for the
workerType in the SQL statements as well? [2]

Nope, that's not needed and likely not desirable, oddly enough:

Tip

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length- constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

Flags: needinfo?(jhford)

Mihai Tabara [:mtabara]⌚️GMT

Assignee

Comment 30

•

6 years ago

(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #29)

(In reply to Ojaswin from comment #27)
That sounds fine to do, but I think adding that limit only in aws-provisioner is also fine. The only place where EC2-Manager sets workerType values is based on AWS-Provisioner requests. The name in ec2-manager really should be something other than "workerType", since it's not strictly speaking a worker type.

Ah ok I'll just change the provisioner then.

Nope, that's not needed and likely not desirable, oddly enough:

Tip

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length- constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

https://www.postgresql.org/docs/current/datatype-character.html

Ohh, that's interesting! Thanks for the info

Comment 31

•

6 years ago

From the PR it sounds like this was deployed about a week ago. Can we consider this bug closed or are there any follow-ups to address?

Flags: needinfo?(dustin)