Add support for aws-provider to generic-worker.
Categories: Taskcluster :: Workers, enhancement
Tracking: Not tracked
People: Reporter: tomprince; Assigned: pmoore
References: Blocks 1 open bug
Attachments: 4 files, 1 obsolete file
Generic worker has support for running under aws-provisioner and worker-manager/gcp-provider but not worker-manager/aws-provider. This would be solved by Bug 1558532, but reading the requirements there, it seems like it may be a large amount of work to support that. In light of needing to migrate to aws-provider before Nov 9, adding the support to generic-worker directly seems like a good interim solution.
Assignee
Comment 1 • 5 years ago
Agreed, that is certainly the shorter path.
Assignee
Comment 2 • 5 years ago
FWIW I see from the taskcluster-worker-runner implementation that we can tell if we are running under AWS Provider or AWS Provisioner based on the properties in the user data. They use completely different property names.
AWS Provisioner:
- data
- workerType
- provisionerId
- region
- taskclusterRootUrl
- securityToken
- capacity
AWS Provider:
- workerPoolId
- providerId
- rootUrl
- workerGroup
So it is pretty straightforward to determine at runtime which service spawned the worker, without needing to add additional command line options to generic-worker (we can reuse the existing --configure-for-aws option to handle both).
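The runtime check described above can be sketched as follows. This is an illustrative probe of the userdata JSON based on the property lists in this comment, not generic-worker's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// detectSpawner probes the instance userdata JSON for the distinguishing
// property names listed above: AWS Provider sets workerPoolId, while AWS
// Provisioner sets workerType.
func detectSpawner(userData []byte) (string, error) {
	var probe struct {
		WorkerPoolID string `json:"workerPoolId"`
		WorkerType   string `json:"workerType"`
	}
	if err := json.Unmarshal(userData, &probe); err != nil {
		return "", err
	}
	switch {
	case probe.WorkerPoolID != "":
		return "aws-provider", nil
	case probe.WorkerType != "":
		return "aws-provisioner", nil
	default:
		return "", fmt.Errorf("userdata matches neither service")
	}
}

func main() {
	spawner, err := detectSpawner([]byte(`{"workerPoolId":"gecko-t/t-win10-64","rootUrl":"https://example.com"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(spawner) // aws-provider
}
```

Since the two property sets are disjoint, no extra command line flag is needed: the same code path can branch on whichever set is present.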
Reporter
Comment 3 • 5 years ago
From :dustin in slack:
one thing to check when implementing this is that TASKCLUSTER_WORKER_LOCATION is set correctly, otherwise sccache won't work
and be really sure it's the same format as implemented in worker runner, or we'll be sadfaces later :slightly_smiling_face:
otherwise, I think the registerWorker code that landed in gcp.go can be copy/pasta'd to aws.go
Assignee
Comment 4 • 5 years ago
(In reply to Tom Prince [:tomprince] from comment #0)
Generic worker has support for running under aws-provisioner and worker-manager/gcp-provider but not worker-manager/aws-provider. This would be solved by Bug 1558532, but reading the requirements there, it seems like it may be a large amount of work to support that. In light of needing to migrate to aws-provider before Nov 9, adding the support to generic-worker directly seems like a good interim solution.
I think all generic-worker worker types that will be migrated to the community cluster and currently run in AWS can already run in Google Cloud, so maybe this isn't a blocker for the migration.
Tom, which worker types were you thinking of?
Reporter
Comment 5 • 5 years ago
This is for all the stuff managed by OCC for firefox-ci.
Assignee
Comment 6 • 5 years ago
Agreed, it would be good to have an AWS fallback to mitigate any of the following potential scenarios:
- We have problems with licensing Windows 7 / Windows 10 workers in Google Cloud
- We have problems under load in GCP
- We have problems greening up jobs in GCP
We can only do without AWS Provider support if none of these things goes wrong, which is a considerable risk to take.
The alternatives to adding support in generic-worker natively are also considerably more complex:
- Getting generic-worker working with worker-runner on Windows (and having Windows releases of worker-runner that run as a Windows service, plus the changes needed to OpenCloudConfig etc.)
- Migrating AWS Provisioner to the new firefox-ci cluster (a huge change)
- Continuing to run AWS Provisioner under taskcluster.net but getting it talking to the firefox CI cluster (huge job, lot of risk)
So I agree that adding the support in generic-worker natively is relatively straightforward, key to mitigating risk of the above listed issues, and much easier than the alternative approaches to achieving the same.
In other words, we should totally do this - so I will look into it in the coming days.
Although first we might need bug 1518507 to be completed (including child bug 1588625).
Comment 7 • 5 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
Agreed, it would be good to have an AWS fallback to mitigate any of the following potential scenarios:
- We have problems with licensing Windows 7 / Windows 10 workers in Google Cloud
- We have problems under load in GCP
- We have problems greening up jobs in GCP
We can only do without AWS Provider support if none of these things goes wrong, which is a considerable risk to take.
No Firefox CI workloads are migrating to GCP before Nov 9, with the possible exception of some builds (pending hg optimizations).
However, we're planning to turn off aws-provisioner and ec2-manager on Nov 9. Generic-worker will absolutely need to support AWS provider so that we can continue running existing workloads in AWS.
Reporter
Comment 8 • 5 years ago
Note that I hope to do a staging release in a staging cluster late Friday, which is blocked on this.
Assignee
Comment 9 • 5 years ago
I've implemented this, and am currently testing.
Most of the diff is shuffling code around.
I think I should be able to release on Monday.
Note, the generic-worker repo currently has an integration with the taskcluster.net taskcluster-github and the community taskcluster-github, so I expect the community taskcluster tasks to fail.
Assignee
Comment 10 • 5 years ago
Assignee
Comment 11 • 5 years ago
This deploys generic-worker 16.4.0 to production.
Successful try push here.
Assignee
Comment 12 • 5 years ago
This migrates the generic-worker linux CI worker pool from GCP to AWS Provider, running generic-worker 16.4.0, which is the first release to support running under AWS Provider.
Assignee
Comment 13 • 5 years ago
Many thanks Rob. I've merged the changes, but not deployed as I'm not sure I have the latest L3 chain of trust key.
Are you happy to deploy it?
deploy: gecko-1-b-win2012 gecko-2-b-win2012 gecko-3-b-win2012 gecko-t-win10-64-gpu gecko-t-win10-64 gecko-t-win7-32-gpu gecko-t-win7-32
Many thanks.
Comment 14 • 5 years ago
deployment in progress: https://tools.taskcluster.net/groups/QpowuSeYRjujBW5EsjeYdw
Assignee
Comment 15 • 5 years ago
Tom, as you migrate worker types in AWS Provisioner to worker pools running under AWS Provider, it might be a good opportunity to rename the staging worker types we have for Windows. These are typically used for testing generic-worker updates.
What do you think of the following names?
Production Windows worker pools
===============================
aws-provisioner-v1/gecko-1-b-win2012 => gecko-1/b-win2012
aws-provisioner-v1/gecko-t-win10-64 => gecko-t/t-win10-64
aws-provisioner-v1/gecko-t-win10-64-gpu => gecko-t/t-win10-64-gpu
aws-provisioner-v1/gecko-t-win7-32 => gecko-t/t-win7-32
aws-provisioner-v1/gecko-t-win7-32-gpu => gecko-t/t-win7-32-gpu
Staging Windows worker pools
============================
aws-provisioner-v1/gecko-1-b-win2012-beta => staging-gecko-1/b-win2012
aws-provisioner-v1/gecko-t-win10-64-beta => staging-gecko-t/t-win10-64
aws-provisioner-v1/gecko-t-win10-64-gpu-b => staging-gecko-t/t-win10-64-gpu
aws-provisioner-v1/gecko-t-win7-32-beta => staging-gecko-t/t-win7-32
aws-provisioner-v1/gecko-t-win7-32-gpu-b => staging-gecko-t/t-win7-32-gpu
The following are used by the generic-worker CI to run the generic-worker unit and integration tests on production-like worker environments:
aws-provisioner-v1/gecko-t-win10-64-cu
aws-provisioner-v1/gecko-t-win7-32-cu
aws-provisioner-v1/win2012r2-cu
We could potentially run these workers in the community cluster rather than the firefox-ci cluster, although it would be beneficial if we have a means to keep the images in sync with the firefox-ci counterpart images, in order that we detect integration issues as swiftly/efficiently as possible. Since the mercurial ci-configuration repository is separate from the github community cluster config repository (whose name I've unfortunately forgotten), I'm not sure how easy it will be to keep the two in sync, and therefore it may be more practical to leave workers in the firefox-ci deployment for generic-worker CI so that the images can be easily shared across the worker pools. What are your thoughts on this?
Note, we need a separate worker pool from the staging worker pool, since the generic-worker config is different on the staging pool and the CI pool (in CI, generic-worker needs to run tasks as root/LocalSystem, so runs with config setting runTasksAsCurrentUser set to true, unlike the staging workers, which have this setting set to false, as does production).
If it is ok to have some worker pools in firefox-ci for the specific purpose of integration testing worker changes, I would propose the following names. What do you think?
aws-provisioner-v1/gecko-t-win10-64-cu => taskcluster-ci/gecko-t-win10-64
aws-provisioner-v1/gecko-t-win7-32-cu => taskcluster-ci/gecko-t-win7-32
aws-provisioner-v1/win2012r2-cu => taskcluster-ci/gecko-1-b-win2012
Assignee
Comment 16 • 5 years ago
Hey Dustin, please see comment 15. I realise I should have requested your feedback on this too.
Comment 17 • 5 years ago
Having dedicated workers for testing worker changes makes sense -- I'll leave it to Tom whether those make sense in the staging deployment or the firefox-ci deployment, and if the latter whether "staging" is a confusing name for them.
We discussed the integration testing in slack and pete is going to go ahead with it.
Reporter
Comment 18 • 5 years ago
Reporter
Comment 19 • 5 years ago
I haven't had a chance to think about naming yet. I did try out the worker image, but it seemed that occ didn't like something about the config.
I've attached the ci-config patch I used to create it.
Comment 20 • 5 years ago
i think we'll need to patch occ to work with worker-manager's ec2 provider. it's probably not going to just work without tweaking a few bits around where occ looks for instance metadata.
Reporter
Comment 21 • 5 years ago
A couple of things (none of which need to be addressed here, but would be nice for the future):
- Most of the generic-worker config appears to be about how the AMI is configured, and isn't really sensible for somebody using the AMI to configure. I think these configuration options should be baked into the AMI.
- I'm guessing OCC looks at the aws-provisioner/worker-manager metadata to determine the worker type. If this is being used to pull configuration from OCC, I think it would be useful to de-couple the identifier used for that from the worker name. I'd like to be able to use an AMI+config on a staging worker, and then after verification, switch the production workers to point at the same AMI+config, rather than having a separate OCC config for the two worker types.
Assignee
Comment 22 • 5 years ago
(In reply to Tom Prince [:tomprince] from comment #19)
but it seemed that occ didn't like something about the config.
What did it not like? Do you have the logs?
Reporter
Comment 23 • 5 years ago
:markco discovered that (one) issue was the calls getting http://169.254.169.254/latest/meta-data/public-keys. Since aws-provider based images don't have keys set, that gives a 404, which causes the call to fail, and thus the script. I tried https://github.com/mozilla-releng/OpenCloudConfig/commit/0e3ce7f3ab8c2be14936b1deef8cddaf200824ba to handle it, but it looks like that isn't enough to handle the errors from the requests.
Comment 24 • 5 years ago
i don't think the public key lookup is the issue. we only use that metadata url to determine if the instance occ is running on is an ami-build instance (a special case for image building). when occ is running on a worker instance, we expect a 404 on that metadata http lookup and fall back to the instance userdata where we look for a json object containing a workerType property.
i see that the gecko-t/t-win10-64-beta-3 worker pool definition now has an additionalUserData.workerType field which i'm guessing is worker-manager's syntax for sending userdata to the instances it spawns. that should work well but in my testing this morning i haven't observed worker-manager spawning one of these instance types and i see no errors in trying to spawn this worker type. to debug this we'll need to watch the instance logs when worker-manager starts one of these, which it doesn't seem to be doing right now, for reasons i don't understand.
please ping me this evening (morning in america) and maybe we can figure this out if we can get worker-manager to fire up an instance.
Assignee
Comment 25 • 5 years ago
(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #24)
i don't think the public key lookup is the issue. we only use that metadata url to determine if the instance occ is running on is an ami-build instance (a special case for image building). when occ is running on a worker instance, we expect a 404 on that metadata http lookup and fall back to the instance userdata where we look for a json object containing a workerType property.
Hey Rob,
Indeed, the metadata for AWS Provider is a little different from the metadata from AWS Provisioner, and unfortunately no longer contains the workerType property.
In AWS Provider, the JSON object in userdata contains these properties:
- workerPoolId
- providerId
- workerGroup
- rootUrl
- workerConfig
For AWS Provisioner, the JSON object in userdata contained these properties:
- data
- capacity
- workerType
- provisionerId
- region
- availabilityZone
- instanceType
- spotBid
- price
- launchSpecGenerated
- lastModified
- provisionerBaseUrl
- taskclusterRootUrl
- securityToken
Note that the workerPoolId is essentially <provisionerId>/<workerType>, so if you need the worker type, it should be possible to scrape it from the worker pool ID (if that helps).
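The scraping suggested above amounts to splitting on the first slash. A minimal sketch (illustrative helper, not OCC's or generic-worker's code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitWorkerPoolID recovers the legacy provisionerId/workerType pair from
// an AWS Provider workerPoolId of the form "<provisionerId>/<workerType>".
func splitWorkerPoolID(workerPoolID string) (provisionerID, workerType string, err error) {
	parts := strings.SplitN(workerPoolID, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("malformed workerPoolId %q", workerPoolID)
	}
	return parts[0], parts[1], nil
}

func main() {
	provisionerID, workerType, err := splitWorkerPoolID("gecko-t/t-win10-64")
	if err != nil {
		panic(err)
	}
	fmt.Println(provisionerID, workerType) // gecko-t t-win10-64
}
```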
Reporter
Comment 26 • 5 years ago
It looks like https://github.com/mozilla-releng/OpenCloudConfig/commit/f2fbeae37e42d71341b2934ba656897f2ced505d is enough to get generic worker running.
Reporter
Comment 27 • 5 years ago
OpenCloudConfig looks for a top-level workerType key in the instance user-data. Since AWS Provider doesn't set this, we set it here to tell OpenCloudConfig the appropriate manifest.
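As a sketch, the relevant portion of a worker pool definition might look like the following. The additionalUserData.workerType field is the one comment 24 identified; the surrounding field names and values are assumptions about the worker-manager AWS provider config shape, and a real launch config carries EC2 launch parameters as well:

```json
{
  "launchConfigs": [
    {
      "region": "us-east-1",
      "capacityPerInstance": 1,
      "additionalUserData": {
        "workerType": "gecko-t-win10-64-beta"
      }
    }
  ]
}
```

With this in place, OCC finds the top-level workerType key it expects in the instance user-data, even though AWS Provider does not set it natively.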
Comment 28 • 5 years ago
ami deployment in progress for worker types:
- gecko-3-b-win2012
- gecko-3-b-win2012-c4
- gecko-3-b-win2012-c5
- mpd001-3-b-win2012
Comment 29 • 5 years ago
Comment on attachment 9103652 [details]
Bug 1588834: [WIP] g-w on worker-manager
Revision D50290 was moved to bug 1589706. Setting attachment 9103652 [details] to obsolete.