[generic-worker-managed] Run generic-worker via tc-worker-runner
Categories
(Taskcluster :: Workers, task)
Tracking
(Not tracked)
People
(Reporter: dustin, Assigned: dustin)
References
(Blocks 3 open bugs)
Details
Attachments
(2 files)
This will require establishing feature parity with the panoply of current generic-worker features, including:
- support caching configuration over restarts
- support setting permissions for files
- support unpacking files in secrets
- support starting workers as another user
- stay running, reboot or halt when worker exits (based partially on exit code)
- manage autologin
- support termination notification
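As a rough illustration of the "stay running, reboot or halt" item, the decision can be modeled as a configured default action that specific exit codes override. This is a minimal sketch under stated assumptions: the names `Action`, `action_for_exit`, and both exit-code constants are hypothetical and are not taken from generic-worker or tc-worker-runner.

```python
from enum import Enum


class Action(Enum):
    RESTART = "restart"  # stay running: start the worker again
    REBOOT = "reboot"
    HALT = "halt"


# Hypothetical exit codes, chosen only for this illustration.
EXIT_WANTS_REBOOT = 10
EXIT_WANTS_HALT = 11


def action_for_exit(exit_code: int, default: Action = Action.RESTART) -> Action:
    """Pick the runner's next step after the worker process exits.

    The configured default applies unless the exit code asks for
    something specific, so the result depends only partially on the code.
    """
    overrides = {
        EXIT_WANTS_REBOOT: Action.REBOOT,
        EXIT_WANTS_HALT: Action.HALT,
    }
    return overrides.get(exit_code, default)
```

Keeping the mapping in one small pure function makes the policy easy to test without starting a real worker.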
Assignee
Comment 1•6 years ago
The dependencies for this are in place. I've now heard from several people "I should probably look at the tc-worker-runner source", so, well, now's your chance, Miles!
Comment 2•6 years ago
I've made initial steps towards adding the genericworker worker, namely:
- created the worker package, basing it off of the dockerworker package
- wrote a simple Go script (cp.go) that fakes running generic-worker
- wrote a test that runs the worker, invoking cp.go, and verifies that it has run
It's a start, but we still need to verify that running generic-worker itself via worker-runner works.
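The testing pattern described above (substitute a trivial fake for the worker, launch it, then check that it ran) can be sketched as follows. This is not the actual test from the pull request, which is written in Go around cp.go; the fake worker body and the names `FAKE_WORKER`, `run_fake_worker`, and `fake_worker_ran` are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for the real worker: it only records that it was started.
FAKE_WORKER = "import sys, pathlib; pathlib.Path(sys.argv[1]).write_text('ran')"


def run_fake_worker(marker: Path) -> int:
    """Launch the fake worker as a child process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", FAKE_WORKER, str(marker)])
    return result.returncode


def fake_worker_ran() -> bool:
    """Run the fake worker and verify the evidence it leaves behind."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "worker-ran"
        exit_code = run_fake_worker(marker)
        return exit_code == 0 and marker.exists() and marker.read_text() == "ran"
```

A test like this only proves that the launch path works; as the comment notes, running the real generic-worker still needs separate verification.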
Assignee
Updated•6 years ago
Assignee
Comment 3•5 years ago
Note that we need this working on Windows, too. Feel free to fork that to a new bug, but let's make sure it's tracked to completion in the tc-cloudops timeframe.
Assignee
Comment 4•5 years ago
https://github.com/taskcluster/taskcluster-worker-runner/pull/28 covered running g-w on Linux, but I think Windows support is still open.
Assignee
Comment 5•5 years ago
Note that this also prevents us from using the static provider type with generic-worker.
Assignee
Comment 6•5 years ago
This still blocks a bunch of things, but notably not the TCW on November 9.
Assignee
Updated•5 years ago
Assignee
Comment 7•5 years ago
I think that the remaining bit here is to get this working on Windows images. At the moment, I think "Windows images" means OCC and the generic-worker repo, but not monopacker. I'll ignore OCC since that's going out of style anyway /cc grenade.
Then we can start getting this deployed. Once it's deployed more-or-less everywhere, we can remove the now-redundant support for clouds from generic-worker (but of course keep the ability to run independently).
Then, we can start adding more detailed support, first to catch up to docker-worker:
- support for monitoring for instance termination
and then on to more exciting across-the-board stuff
- error reporting for startup errors
- automatic self-termination in cases where worker startup fails
- registration updates ("hey w-m, I'm still here and running")
- etc.
Assignee
Comment 9•5 years ago
I've added some builds of 0.7.0 for Windows, and they seem to work fine as far as starting generic-worker.
Assignee
Comment 10•5 years ago
So my next step is to update the imagesets support in community-tc-config to use worker-runner. Once that's in place, grenade can do a similar thing for all firefox-ci worker pools. We will probably start adding features that require worker-runner pretty quickly after that.
Assignee
Comment 11•5 years ago
Assignee
Comment 12•5 years ago
I'm off for today, but here's the update: I've modified the staging ubuntu worker-type to use images I built, and they are not starting up. Papertrail is down, so I can't see what's wrong with the instances worker-manager is starting, but on an instance created from the AMI I see
root@ip-172-31-84-232:~# cat /var/log/generic-worker.log
2019/12/09 22:31:40 Loading taskcluster-worker-runner configuration from /etc/start-worker.yml
2019/12/09 22:31:40 Configuring with provider aws
2019/12/09 22:31:40 Could not query user data: (Permanent) HTTP response code 404
HTTP/1.0 404 Not Found
Content-Length: 337
Connection: close
Content-Type: text/html
Date: Mon, 09 Dec 2019 22:31:40 GMT
Server: EC2ws
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>404 - Not Found</title>
</head>
<body>
<h1>404 - Not Found</h1>
</body>
</html>
which suggests that at least the start-worker part of the startup process is working.
I created https://community-tc.services.mozilla.com/tasks/X7iHNL7fTS2oUuKOGegP0g to test this.
Assignee
Comment 13•5 years ago
Ah, they don't have papertrail anyway :)
I added my key to the instances and started a new one.
root@ip-10-0-20-63:~# cat /var/log/generic-worker.log
2019/12/10 22:06:28 Loading taskcluster-worker-runner configuration from /etc/start-worker.yml
2019/12/10 22:06:29 Configuring with provider aws
2019/12/10 22:06:35 Identified as worker community-tc-workers-aws/i-023f748e871a22b14
2019/12/10 22:06:35 Getting secrets from secrets service
2019/12/10 22:06:35 WARNING: No worker secrets for worker pool proj-taskcluster/gw-ci-ubuntu-18-04-staging.
2019/12/10 22:06:35 Configuring for worker implementation generic-worker
2019/12/10 22:06:35 provider metadata zone not available; not setting config zone
2019/12/10 22:06:35 Writing files
2019/12/10 22:06:35 Starting worker
2019/12/10 22:06:42 UTC Loading generic-worker config file '/etc/generic-worker/config'...
2019/12/10 22:06:42 UTC Error loading configuration
2019/12/10 22:06:42 UTC Root cause: Error unmarshaling generic worker config file /etc/generic-worker/config as JSON: json: unknown field "genericWorker"
2019/12/10 22:06:42 UTC &errors.errorString{s:"Error unmarshaling generic worker config file /etc/generic-worker/config as JSON: json: unknown field \"genericWorker\""} (*errors.errorString)
2019/12/10 22:06:42 exit status 64
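The "unknown field" failure in this log is what a strict config loader produces when it meets a key it does not recognize. A minimal sketch of that behaviour, assuming a hypothetical two-key schema (`KNOWN_FIELDS` and `load_strict` are names invented for this example and are not generic-worker's real loader):

```python
import json

# Hypothetical subset of accepted top-level keys; both names appear in this bug.
KNOWN_FIELDS = {"wstAudience", "wstServerURL"}


def load_strict(text: str) -> dict:
    """Parse a JSON config and reject any top-level key that is not known."""
    config = json.loads(text)
    unknown = sorted(set(config) - KNOWN_FIELDS)
    if unknown:
        raise ValueError(f'unknown field "{unknown[0]}"')
    return config
```

With a loader like this, a config wrapped in an extra `genericWorker` object is rejected outright even though the settings inside it are valid.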
Assignee
Comment 14•5 years ago
Bug 1602960 fixes the issues identified in comment 13, and should be ready shortly. The workaround for now is to put the generic-worker settings directly under workerConfig in the worker pool definition:
workerConfig:
  wstAudience: "communitytc"
  wstServerURL: "https://community-websocktunnel.services.mozilla.com"
omitting the intermediate genericWorker: { config: ... } wrapper.
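The workaround amounts to moving the settings up one level. A sketch of that transformation, for illustration only: `unwrap_worker_config` is a hypothetical helper, and nothing here is claimed to match what Bug 1602960 actually changes.

```python
def unwrap_worker_config(worker_config: dict) -> dict:
    """Lift settings out of a genericWorker.config wrapper, if one is present.

    Keys that already sit at the top level are kept; wrapped keys win on conflict.
    """
    flat = {k: v for k, v in worker_config.items() if k != "genericWorker"}
    nested = worker_config.get("genericWorker", {}).get("config", {})
    flat.update(nested)
    return flat
```

For example, `unwrap_worker_config({"genericWorker": {"config": {"wstAudience": "communitytc"}}})` yields a dict with `wstAudience` at the top level, which is the shape shown in the workaround.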
Assignee
Comment 15•5 years ago
Assignee
Comment 16•5 years ago
With the updated linux approach, the start-worker logs (and thus the worker logs) are visible in the "System Log" in the AWS console. This should make debugging startup issues a lot easier!
Assignee
Comment 17•5 years ago
The bits out for review here cover getting generic-worker on linux using worker-runner in community-tc:
- https://github.com/mozilla/community-tc-config/pull/153
- https://github.com/taskcluster/taskcluster-worker-runner/pull/85
Remaining:
- Similar changes to community-tc-config and docs in taskcluster-worker-runner for start-worker startup on windows
(the latter won't be too hard -- I've already done the legwork to figure out how it works, so it just remains to write the scripts)
Assignee
Comment 18•5 years ago
https://serverfault.com/questions/856496/what-and-where-is-the-system-log-for-ec2-windows-instances suggests writing to the serial port from Windows would not be easy.
Assignee
Comment 19•5 years ago
Assignee
Comment 20•5 years ago
https://github.com/mozilla/community-tc-config/pull/175
When trying to create these images, I get
./imageset.sh aws update generic-worker-win2016
...
An error occurred (InvalidAMIName.Duplicate) when calling the CreateImage operation: AMI name generic-worker-win2016 version ce2f461f0862ae487e1841c971e2a9f631d88af4 is already in use by AMI ami-03feff875e68e0359
Pete, does that point to something I'm doing wrong? The first time 'round, I thought maybe a snapshot process had been left running when something else crashed, so I ran it again, but got the same result.
Assignee
Comment 21•5 years ago
Ah, it seems it was b/c I forgot to commit, and it is using the git sha.
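The duplicate-name error follows from deriving the image name from the current commit: uncommitted edits leave the SHA unchanged, so a second build reuses the same name. A sketch of that logic, assuming the "NAME version SHA" format seen in the error message above; how imageset.sh really computes the name is not shown in this bug, and the helper names are hypothetical.

```python
import subprocess


def image_name(imageset: str, revision: str) -> str:
    """Build an image name in the 'NAME version SHA' form seen in the error."""
    return f"{imageset} version {revision}"


def head_revision(repo_dir: str = ".") -> str:
    """Return the SHA of the current commit; uncommitted edits do not change it."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def has_uncommitted_changes(repo_dir: str = ".") -> bool:
    """True if the working tree differs from HEAD, meaning the name would be stale."""
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return bool(out.stdout.strip())
```

Checking `has_uncommitted_changes()` before building would surface the forgotten commit earlier than the CreateImage call does.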
Assignee
Comment 22•5 years ago
This is generating a top-level "zone" config value for generic-worker, which it does not like.
Assignee
Comment 23•5 years ago
Assignee
Comment 24•5 years ago
These are now up and running in AWS. I'm building new GCP images with the above PR now.
Assignee
Comment 25•5 years ago
Assignee
Comment 26•5 years ago
So, I think we can call this done!