Closed Bug 1164203 Opened 9 years ago Closed 6 years ago

generic-worker: Upload usage.js to references.t.n/workers/<provisionerId>/<workerType>/v1/

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: jonasfj, Assigned: pmoore)

References

Details

(Whiteboard: [generic-worker])

@pmoore, feel free to close this bug if you don't like it.
If we implement it we should do it for docker-worker and buildbot-bridge too.
But since you seem enthusiastic about it, I'll propose this in the context of
the generic-worker first.

The idea is:
  1) generic-worker starts up
  2) generic-worker knows its provisionerId and workerType
  3) generic-worker uploads usage.js to:
     references.taskcluster.net/workers/<provisionerId>/<workerType>/v1/usage.js

usage.js:
 * a UMD module: https://github.com/umdjs/umd/blob/master/commonjsStrict.js
   (commonjsStrict pattern seems solid)
 * usage.js exports:
   - schema() -> 'http://schemas.taskcluster.net/...'
      - Returns URL for task.payload schema
      - Useful for schema validation
   - explain(payload, taskId)
       - Returns GitHub-flavored markdown explaining the given task.payload
       - Possibly including a bash script illustrating how to run the task locally

We can always extend usage.js with more methods etc.
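A rough sketch of what such a usage.js could look like, using the commonjsStrict
UMD pattern linked above (the schema URL, payload fields and markdown below are
placeholder assumptions, not a real worker definition):

// usage.js - hypothetical sketch
(function (root, factory) {
  if (typeof define === 'function' && define.amd) {
    define(['exports'], factory);          // AMD
  } else if (typeof exports === 'object') {
    factory(exports);                      // CommonJS
  } else {
    factory((root.usage = {}));            // browser / WebWorker global
  }
}(this, function (exports) {
  // URL of the task.payload schema for this provisionerId/workerType (placeholder)
  exports.schema = function () {
    return 'http://schemas.taskcluster.net/example-provisioner/example-worker/v1/payload.json#';
  };

  // GitHub-flavored markdown explaining the given task.payload
  exports.explain = function (payload, taskId) {
    return [
      '## Task ' + taskId,
      '',
      'Run the following locally (enforce maxRunTime of ' +
        payload.maxRunTime + 's yourself with ctrl + c):',
      '',
      '```bash',
      (payload.command || []).join(' '),
      '```'
    ].join('\n');
  };
}));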

On tools.taskcluster.net:
 * We can use schema() to validate payload in task-creator!
 * We can use explain(payload, taskId) to illustrate how to run the task.
   Or at least give people an idea of what the task.payload means; in the
   case of, say, buildbot-bridge we explain as much of it as we can.
   But docker-worker and generic-worker can show a shell script, and
   generic-worker can even detect what platform it's running on and upload
   a script that illustrates whether this is Windows, Linux or Mac.
   Note, it's GitHub-flavored markdown, so there is room for explaining things
   like what platform it runs on, or that you should enforce maxRunTime by
   pressing ctrl + c if it takes more than X minutes :)
 * We'll obviously load this in a WebWorker, so malicious code in usage.js won't
   be able to steal TC creds from localStorage, etc...
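   To make the sandboxing idea concrete, the tools site might do something
   along these lines (the wrapper script name, message shape and helpers are
   illustrative assumptions):

// Hypothetical sketch of sandboxed loading on tools.taskcluster.net.
// The Web Worker has no access to the page's localStorage, so TC creds stay safe.
var sandbox = new Worker('/usage-sandbox.js');       // assumed wrapper script

sandbox.onmessage = function (e) {
  renderMarkdown(e.data.markdown);                   // assumed page helper
};

sandbox.postMessage({
  usageUrl: 'https://references.taskcluster.net/workers/' +
            provisionerId + '/' + workerType + '/v1/usage.js',
  payload: taskPayload,
  taskId: taskId
});

// Inside /usage-sandbox.js the wrapper would roughly do:
//   onmessage = function (msg) {
//     importScripts(msg.data.usageUrl);             // loads the UMD module as `usage`
//     postMessage({ markdown: usage.explain(msg.data.payload, msg.data.taskId) });
//   };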

Anyways, before we care to implement this on tools.taskcluster.net we should
have usage.js for at least one <provisionerId>/<workerType> :)

@pmoore, Is this cool or crazy? Do we have generic-worker deployed, and if so
would uploading a usage.js at start-up be a quick hack?
Flags: needinfo?(pmoore)
Summary: generic-worker: Upload usage.js to references.taskcluster.net/workers/<provisionerId>/<workerType>/explain.js → generic-worker: Upload usage.js to references.taskcluster.net/workers/<provisionerId>/<workerType>/usage.js
Summary: generic-worker: Upload usage.js to references.taskcluster.net/workers/<provisionerId>/<workerType>/usage.js → generic-worker: Upload usage.js to references.t.n/workers/<provisionerId>/<workerType>/v1/
Additional function for usage.js:

usage.js
 - example() -> returns a sane example task.payload for use in task-creator
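For instance, extending the usage.js sketch above (payload fields purely illustrative):

// Hypothetical addition to usage.js
exports.example = function () {
  return {
    command: ['echo', 'hello world'],
    maxRunTime: 600
  };
};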
This is an interesting and cool idea. So essentially the worker types declare a JSON schema for their payload, together with a function for generating an example payload, and a function to translate a given payload into a human-readable explanation of what this payload represents.

This certainly has a lot of value and ties up some loose ends that were otherwise open, i.e. making it simpler for people to construct valid tasks, and giving them insight into the syntax of the payload for a given worker.

A couple of thoughts:
1) I wonder if worker startup is the right place to publish the schemas. I see pros and cons. A pro is that you can be quite sure the publish step has not been overlooked as a separate deployment step, i.e. it is unlikely or near impossible that the published version is out of sync with the live operational workers. A con might be that several workers with a given provisionerId and workerType may compete to publish the schemas, although that is quite trivial to solve. So maybe it is the right place to do it; I will give this a little more thought. The worker will also need scopes to publish, but the taskcluster proxy would probably handle this, so I guess no major issue here.
2) the "v1" in the url to usage.js maybe could be replaced with something more concrete such as a sha representing the version of the workertype? "v1" feels a bit fragile/breakable. Currently when we define a task I believe we specify a workerType and a provisionerId but not a version, so introducing a version here makes me feel like maybe the task definition should also include the same version number.

Maybe it is best if we talk these things through in person - maybe the choices are fine, but would be good to talk them through anyway.

The obvious alternative workflow is to separate the publish of the schema from the runtime startup of the worker, i.e. binding the publish of the schema to the generation of the worker type, inside the provisioner. I suspect this would be a worse workflow and a little more brittle than publishing at worker startup - I tend to favour your solution at worker startup, but maybe we can talk through both of them just to make sure we're happy with the choice.

Obviously downstream there are gains to be made in the Task Creator web ui - e.g. runtime discovery of provisioners and worker types, such that provisioners/worker types could be presented as drop down fields / auto-complete text entry, and there could be some wizardry around real-time validation per keystroke whilst text editing.

Thanks for raising the bug! :)
Flags: needinfo?(pmoore)
1)
  - Cost/time (every worker start has to do at least one HEAD request, to check the file MD5 or so)
    (even if it's just 500ms, it's a lot of 500ms we lose, because it happens on every worker start)
  - Racing is no issue, this is S3 and all workers write the same contents
  - All workers must have S3 access (we can delegate with auth.awsS3Credentials)
    - auth.awsS3Credentials locks down S3 bucket and prefix in bucket
    - Compromised worker could overwrite usage.js and inject code into tools.taskcluster.net
      (we sandbox with WebWorker but it's still not nice to have risk of code injection)
  Note, I don't like this (uploading at runtime), so I'm definitely open to other reliable solutions.
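  To make the cost and credentials trade-offs concrete, the upload-at-startup step might look roughly
  like this (the bucket name, prefix and ETag-as-MD5 comparison are assumptions; auth.awsS3Credentials
  is the delegation mentioned above, and the Taskcluster credentials are assumed to come from the
  environment or the taskcluster proxy):

// Hypothetical sketch of a worker publishing usage.js at startup (node.js style)
var crypto      = require('crypto');
var fs          = require('fs');
var taskcluster = require('taskcluster-client');
var AWS         = require('aws-sdk');

async function publishUsage(provisionerId, workerType) {
  var bucket = 'taskcluster-references';                          // assumed bucket
  var prefix = 'workers/' + provisionerId + '/' + workerType + '/v1/';
  var body   = fs.readFileSync('usage.js');
  var md5    = crypto.createHash('md5').update(body).digest('hex');

  // Temporary, prefix-scoped S3 credentials from the auth service
  var auth  = new taskcluster.Auth();
  var grant = await auth.awsS3Credentials('read-write', bucket, prefix);
  var s3    = new AWS.S3({
    accessKeyId:     grant.credentials.accessKeyId,
    secretAccessKey: grant.credentials.secretAccessKey,
    sessionToken:    grant.credentials.sessionToken,
  });

  // The HEAD request mentioned above: skip the upload if the published copy already matches
  try {
    var head = await s3.headObject({ Bucket: bucket, Key: prefix + 'usage.js' }).promise();
    if (head.ETag === '"' + md5 + '"') {
      return;                                                     // already up to date
    }
  } catch (err) {
    // 404 -> nothing published yet, fall through to upload
  }

  await s3.putObject({
    Bucket: bucket,
    Key: prefix + 'usage.js',
    Body: body,
    ContentType: 'application/javascript',
  }).promise();
}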

2)
  The "v1/" part is hardcoded and never supposed to change. Unless we dramatically change what is
  uploaded and decide that we can do a gradual migration.
  Too many version numbers get very complicated very quickly, I considered it initially but we would
  have to develop a lot of framework to support it. So we decided not to.

> Maybe it is best if we talk these things through in person - maybe the choices are fine,
> but would be good to talk them through anyway.
Yeah, probably best. Feel free to arrange something.
Depends on: 1173556
Assigning all generic worker bugs to myself for now. If anyone wants to take this bug, feel free to add a comment to request it. I can provide context.
Assignee: nobody → pmoore
I thought about this a little more.

I don't like engineering too much worker-specific logic into the provisioner, because at the moment the provisioner knows nothing about what is running on the instances it spins up; it can be used to spawn essentially anything. It is nice to keep it generic like this.

However, it does already specify UserData to hand over:
* access to TC credentials
* worker-type-specific UserData

Arguably, the provisioner could specify a setting in the UserData to denote whether the worker should publish its usage. For example, include something like `{ "usage": { "publish": true, "url": "https://....." }}` in the UserData.
The workflow might look like this:

1) The provisioner keeps a record of which AMIs have a published usage. When spawning a worker, if the AMI it is based on has no published usage, the provisioner includes the "publish-usage" property in the UserData, which is a key to an object containing the URL to publish to. There could either be a separate feedback URL for updating the provisioner to advise that the publish was successful, or the provisioner could monitor the publish location to see this for itself.
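In that workflow, the worker-side check could be as simple as the following (the usage key shape matches the example above; fetchUserData and publishUsage are assumed helpers):

// Hypothetical: worker inspects provisioner-supplied UserData at startup
async function maybePublishUsage() {
  var userData = await fetchUserData();      // e.g. read from the EC2 metadata endpoint
  if (userData.usage && userData.usage.publish) {
    await publishUsage(userData.usage.url);  // upload usage.js to the provisioner-chosen URL
    // optionally report success back to the provisioner here
  }
}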

Advantages
==========
1) The publish URL is not burned into the worker, so we don't need to create a new AMI if we change the publish location (e.g. when restructuring our assets) -- admittedly this is a small gain. Putting the provisioner in charge of deciding the publish URL means bookkeeping of usage for workers is simpler (each worker type doesn't need to decide its own URL, we can see them all in one place).
2) This can be done per AMI rather than per worker type, meaning potentially less unnecessary redundancy (several worker types might use the same AMI).

Disadvantages
=============
1) Maybe the concept of provisioner/worker-type is a more logical separation, even if it is less efficient: an AMI is an AWS concept, so other provisioners might not use them (e.g. if we host in other clouds).
2) It makes the provisioner more worker-implementation-aware, which is probably a bad thing - not a clean separation of concerns.

Since devs could be creating worker types, we don't want to enforce an overly complex worker-type publish workflow - so having the generic worker runtime handle the publish still makes a lot of sense to me.


I'm still not sure if I think any of this should go in the provisioner or not. I tend to favour correctness/cleanness of architecture over marginal performance gains, but I'm not sure...
Component: TaskCluster → Generic-Worker
Product: Testing → Taskcluster
Whiteboard: [taskcluster-q32015-meeting]
See Also: → 1245384
Component: Generic-Worker → Worker
Whiteboard: [taskcluster-q32015-meeting] → [generic-worker]
Component: Worker → Generic-Worker
This has been superseded by actions in the queue for provisionerIds...
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Component: Generic-Worker → Workers