Closed Bug 1502371 Opened 6 years ago Closed 5 years ago

Support building (public) worker images in tc-builder

Categories

(Taskcluster :: Services, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dustin, Assigned: dustin)

When we "ship" a version of TC, we should ship worker images for the various supported clouds, too.  That's a tall order right now, but we can build some infrastructure to support it, and embed that in tc-builder.  

Bug 1502183:

> There is stuff to build new images in the taskcluster-mozilla-terraform repo under the workers dir.
Blocks: 1502183
Rough plan:

 * set up taskcluster-builder environments in various clouds, using tc-infrastructure
   * dedicated gcp project + service account
   * production AWS account + permission-limited IAM user
   * packet.net? spoon.net? digitalocean? etc.
 * put secrets for all of those into passwordstore
 * run packer (via docker) as part of tc-builder
   * for gcp, use the image exporter post-processor to write the image to a gcs bucket
 * include the outputs (dumped via the manifest post-processor) in the taskcluster.tf.json output
 * set up taskcluster-terraform to re-import the images from the gcs bucket (somehow)
   * we need to think about how to connect these with the runtime config of worker manager
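
As a rough illustration of the "include the outputs" step above, here is a minimal sketch that reads a manifest shaped like the output of packer's manifest post-processor (a `builds` list plus a `last_run_uuid`) and folds the newest artifact ids into the build output. The `worker_images` key, the build name, and the artifact ids are assumptions made up for the example; the real layout of taskcluster.tf.json is not specified in this bug.

```python
import json


def images_from_manifest(manifest: dict) -> dict:
    """Map each build name to its artifact id for the most recent packer run."""
    last_run = manifest.get("last_run_uuid")
    images = {}
    for build in manifest.get("builds", []):
        # The manifest file accumulates builds across runs, so skip
        # anything that does not belong to the latest run.
        if last_run and build.get("packer_run_uuid") != last_run:
            continue
        images[build["name"]] = {
            "builder_type": build["builder_type"],
            "artifact_id": build["artifact_id"],
        }
    return images


def add_worker_images(tf_json: dict, manifest: dict) -> dict:
    """Return a copy of the build output with a (hypothetical) worker_images key."""
    merged = dict(tf_json)
    merged["worker_images"] = images_from_manifest(manifest)
    return merged


# Hypothetical manifest contents; names and ids are placeholders.
example_manifest = {
    "builds": [
        {"name": "generic-worker-gcp", "builder_type": "googlecompute",
         "artifact_id": "old-image", "packer_run_uuid": "run-1"},
        {"name": "generic-worker-gcp", "builder_type": "googlecompute",
         "artifact_id": "new-image", "packer_run_uuid": "run-2"},
    ],
    "last_run_uuid": "run-2",
}

print(json.dumps(add_worker_images({}, example_manifest), indent=2))
```

Filtering on `last_run_uuid` matters because a stale entry from an interrupted or earlier run would otherwise overwrite, or be confused with, the image that was just built.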

I also want to be careful to get packer to tag everything it creates in these accounts, and have a "cleanup" mode in tc-builder that will seek and destroy old, abandoned resources.  Otherwise it's too easy to ctrl-c and leave an instance running for months.
Worker / provisioning people, thoughts on the above?

I noticed that the docker-worker build process is baking in a lot of secrets.  We probably don't want to do that with a public image?  Is it reasonable to think we can get to a point where all secrets a worker needs to operate are supplied to it at startup, and thus not baked in?

I will likely hack something together (that just uses brian's hacked generic-worker in GCP) so we can get workers in dev/staging environments.  Maybe the rest should be in an RFC?  Any initial guidance is appreciated :)
Flags: needinfo?(wcosta)
Flags: needinfo?(pmoore)
Flags: needinfo?(jhford)
In general, I think image building should be distinct and separate from provisioning.  It sounds like that's the case here.

As you mentioned, garbage collecting is important here.  If I understand the plan correctly, this would be done in a dedicated account?  If so, we could tag things with a "shutoff after" timestamp, and then go through the account daily and terminate anything whose "shutoff after" timestamp is in the past, or that is untagged and older than a day.
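
The cleanup rule described here could be sketched as a small pure function that a daily job applies to every resource in the account. This is only an illustration: the tag name `shutoff-after`, the ISO 8601 timestamp format, and the resource dictionaries are assumptions, and the actual listing and termination calls for each cloud are left out.

```python
from datetime import datetime, timedelta, timezone

SHUTOFF_TAG = "shutoff-after"  # hypothetical tag name


def should_terminate(tags: dict, created: datetime, now: datetime,
                     max_untagged_age: timedelta = timedelta(days=1)) -> bool:
    """Decide whether a resource should be cleaned up.

    Tagged resources go once their shutoff timestamp has passed;
    untagged resources go once they are older than max_untagged_age.
    """
    shutoff = tags.get(SHUTOFF_TAG)
    if shutoff is not None:
        # Timestamps are assumed to be ISO 8601 with an explicit UTC offset.
        return datetime.fromisoformat(shutoff) < now
    return now - created > max_untagged_age


now = datetime(2018, 11, 1, 12, 0, tzinfo=timezone.utc)
resources = [
    {"id": "expired", "created": now - timedelta(hours=13),
     "tags": {SHUTOFF_TAG: "2018-11-01T00:00:00+00:00"}},
    {"id": "active", "created": now - timedelta(days=3),
     "tags": {SHUTOFF_TAG: "2018-11-02T00:00:00+00:00"}},
    {"id": "stale-untagged", "created": now - timedelta(days=2), "tags": {}},
    {"id": "fresh-untagged", "created": now - timedelta(hours=1), "tags": {}},
]

doomed = [r["id"] for r in resources
          if should_terminate(r["tags"], r["created"], now)]
print(doomed)
```

Keeping the decision separate from the cloud API calls makes the policy easy to test, and the one-day grace period for untagged resources avoids killing something that is still in the middle of being created and tagged.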

Is the idea here to keep building images as a manual process?
Flags: needinfo?(jhford)
Regarding provisioning, yes they are separate.  But, I want to make sure we produce something that is easy to set up with provisioning and where it's easy to "upgrade" a deployment and get the newest worker images, just like such an upgrade would automatically use the newest service images.

Maybe we could have a concept of some "built-in" images that can be configured in the worker configuration rules, as an alternative to referencing explicit images such as images custom-built for a particular deployment.  Or maybe the upgrade process calls worker-manager APIs directly to update some well-known rules with the new information?  I'd like to avoid the case where on every upgrade of a deployment someone needs to go copy/paste a whole bunch of identifiers into the correct worker configuration rules.  I realize neither side of this is implemented yet so it's all a little vague, but if you have ideas on what the best approach would be I'd love to hear them.
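
To make the "built-in" idea a bit more concrete, here is a minimal sketch of how a worker configuration rule might name either an explicit image or a built-in one that is resolved against whatever the current deployment shipped. Everything in it is hypothetical: the rule keys (`image`, `builtin`, `provider`), the image names, and the lookup table are invented for illustration, since neither side of this exists yet.

```python
def resolve_image(rule: dict, builtin_images: dict) -> str:
    """Return the image id a rule should use.

    An explicit "image" wins; otherwise the "builtin" name is looked up
    in the table of images shipped with this deployment.
    """
    if "image" in rule:
        return rule["image"]
    name = rule["builtin"]
    provider = rule["provider"]
    try:
        return builtin_images[name][provider]
    except KeyError:
        raise ValueError(
            f"no built-in image {name!r} for provider {provider!r}"
        ) from None


# Hypothetical table produced by the build; an upgrade would replace it.
builtin_images = {"generic-worker": {"gcp": "generic-worker-v2"}}

rules = [
    {"provider": "gcp", "builtin": "generic-worker"},
    {"provider": "gcp", "image": "my-custom-image"},
]
for rule in rules:
    print(resolve_image(rule, builtin_images))
```

With this shape, an upgrade only swaps the lookup table, so rules that reference a built-in image pick up the new identifiers without anyone copy/pasting them, while rules that pin a custom image are left alone.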

And yes, at the moment building everything is done locally (./taskcluster-builder [options]) but the door is open to doing so in some kind of automation -- perhaps on a tag of the taskcluster-builder repo (or some other repo containing a build spec).  I think we'll have a clearer idea later of what the best choice will be.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #2)
> Worker / provisioning people, thoughts on the above?
> 
> I noticed that the docker-worker build process is baking in a lot of
> secrets.  We probably don't want to do that with a public image?  Is it
> reasonable to think we can get to a point where all secrets a worker needs
> to operate are supplied to it at startup, and thus not baked in?
> 
> I will likely hack something together (that just uses brian's hacked
> generic-worker in GCP) so we can get workers in dev/staging environments. 
> Maybe the rest should be in an RFC?  Any initial guidance is appreciated :)

The docker-worker deployment scripts are very gecko/AWS specific, but docker-worker itself is not so much. What do you have in mind, exactly?
Flags: needinfo?(wcosta)
We'll need to tease that apart.  I see some gecko-specific stuff in the generic-worker repo, too.  I'd like the team to think about what the "generic" image(s) should look like, and how we can support building "custom" images.  And how we can make deployment of those images more user-friendly (the question about worker configs I was asking John).

These are probably more important questions for generic-worker than docker-worker.  At the moment, we're still not planning to include docker-worker in new deployments.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #7)
> We'll need to tease that apart.  I see some gecko-specific stuff in the
> generic-worker repo, too.  I'd like the team to think about what the
> "generic" image(s) should look like, and how we can support building
> "custom" images.  And how we can make deployment of those images more
> user-friendly (the question about worker configs I was asking John).
> 
> These are probably more important questions for generic-worker than
> docker-worker.  At the moment, we're still not planning to include
> docker-worker in new deployments.

I don't think we should mess with worker deployment. I believe we should provide the executable and detailed instructions on how to run it; creating cloud images should be left to the user (we can provide samples).
For the most part I expect users to set up their own builders with configuration which is specific to them. This is especially true for Windows/Mac where the worker host environment is also the host environment of the tasks, so e.g. toolchains specific to the user's tasks are typically installed on the host. If we provide images that don't have the necessary tools installed, they won't be usable.

I think the correct approach here is along the lines of https://github.com/taskcluster/taskcluster-rfcs/issues/122 - this still needs some fleshing out, but I believe that the best way of managing worker type host environments is to make it possible to create a worker type image from inside a taskcluster task, and leave it up to users to run appropriate tasks (and we can provide some sample formulas for them to use, which they can adapt).

The taskcluster platform is at the moment very nicely decoupled from the concerns of setting up workers (anybody can set up workers in any way they choose, and plug them into the taskcluster platform easily and securely), so I'm keen to avoid introducing dependencies here, or assumptions in our platform that workers have been set up in any particular way. I also see the advantage of providing some usable "bare bones" worker types for the purposes of trying out the platform, though. Perhaps the simple solution is that we just create a few example AMIs (and whatever the GCP equivalent is) and publish them publicly, with a link to a doc about how they were created. But I envision that users will (quite rightly) want to be in complete control of setting up their own workers, without that being bound to the platform.
Flags: needinfo?(pmoore)
OK, sounds good.  I'll close this for now, then.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Well, I will build these "bare bones" AMIs (which, running docker worker, are probably 100% sufficient).
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
From today's meeting, it seems a good plan is to ship docker-engine/docker-worker images as part of the build process, to get users started (and perhaps be enough for most users).  We will have additional support for users building custom workers, but of course that needs lots of flexibility.
Per discussion at this week's all-hands, workers are not part of the Taskcluster Platform product, so need not be included in tc-builder.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 5 years ago
Resolution: --- → WONTFIX
Component: Redeployability → Services