Closed Bug 1558532 Opened 5 years ago Closed 5 years ago

[generic-worker-managed] Run generic-worker via tc-worker-runner

Product: Taskcluster :: Workers
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: dustin
Assigned: dustin
References: Blocks 3 open bugs
Attachments: 2 files

This will require establishing feature parity with the panoply of current generic-worker features, including

  • support caching configuration over restarts
  • support setting permissions for files
  • support unpacking files in secrets
  • support starting workers as another user
  • stay running, reboot or halt when worker exits (based partially on exit code)
  • manage autologin
  • support termination notification
Depends on: 1561957
Depends on: 1561958
Blocks: 1562285
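
The "stay running, reboot or halt" item in the list above amounts to a lookup from the worker's exit code to a runner action. A minimal sketch of that idea follows, assuming a simple table-driven policy. The type names and the exit codes in the table are hypothetical placeholders, not generic-worker's actual exit codes or tc-worker-runner's actual implementation.

```go
package main

import "fmt"

// Action is what the runner does once the worker process has exited.
type Action string

const (
	Restart Action = "restart-worker" // "stay running": start the worker again
	Reboot  Action = "reboot"
	Halt    Action = "halt"
)

// ExitPolicy maps worker exit codes to actions. This type is hypothetical
// and exists only to illustrate the feature described in the list.
type ExitPolicy struct {
	ByCode  map[int]Action
	Default Action
}

// Decide returns the action listed for exitCode, or the default if the
// code is not listed.
func (p ExitPolicy) Decide(exitCode int) Action {
	if a, ok := p.ByCode[exitCode]; ok {
		return a
	}
	return p.Default
}

func main() {
	// The codes 0 and 1 below are placeholders chosen for illustration.
	p := ExitPolicy{
		ByCode:  map[int]Action{0: Restart, 1: Reboot},
		Default: Halt,
	}
	for _, code := range []int{0, 1, 64} {
		fmt.Printf("exit %d -> %s\n", code, p.Decide(code))
	}
}
```

Keeping the mapping in data rather than in a switch statement would let each worker pool pick its own policy, which fits the "based partially on exit code" wording, since other inputs could still override the table.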

The dependencies for this are in place. I've now heard from several people "I should probably look at the tc-worker-runner source", so, well, now's your chance, Miles!

Assignee: dustin → miles
Blocks: 1573977

I've made initial steps towards adding the genericworker worker, namely:

  • created the worker package, basing it off of the dockerworker package
  • wrote a simple go script (cp.go) that fakes running generic-worker
  • wrote a test that runs the worker, invoking cp.go, and verifies it has run

It's a start, but we still need to verify that running generic-worker itself via worker-runner works.

Summary: Run generic-worker via tc-worker-runner → [generic-worker-managed] Run generic-worker via tc-worker-runner

Note that we need this working on Windows, too. Feel free to fork that to a new bug, but let's make sure it's tracked to completion in the tc-cloudops timeframe.

https://github.com/taskcluster/taskcluster-worker-runner/pull/28 covered running g-w on linux, but I think windows support is still open.

Note that this also prevents us from using the static provider type with generic-worker.

Blocks: 1592844

This still blocks a bunch of things, but notably not the TCW on November 9.

No longer blocks: 1573977
Blocks: 1593482
Blocks: 1598444
Assignee: miles → dustin

I think that the remaining bit here is to get this working on windows images. At the moment, I think "windows images" means OCC and generic-worker repo, but not monopacker. I'll ignore OCC since that's going out of style anyway /cc grenade.

Then we can start getting this deployed. Once it's deployed more-or-less everywhere, we can remove the now-redundant support for clouds from generic-worker (but of course keep the ability to run independently).

Then, we can start adding more detailed support, first to catch up to docker-worker:

  • support for monitoring for instance termination

and then on to more exciting across-the-board stuff

  • error reporting for startup errors
  • automatic self-termination in cases where worker startup fails
  • registration updates ("hey w-m, I'm still here and running")
  • etc.

I've added some builds of 0.7.0 for Windows, and they seem to work fine as far as starting generic-worker.

So my next step is to update the imagesets support in community-tc-config to use worker-runner. Once that's in place, grenade can do a similar thing for all firefox-ci worker pools. We will probably start adding features that require worker-runner pretty quickly after that.

I'm off for today, but here's the update: I've modified the staging ubuntu worker-type to use images I built, and they are not starting up. Papertrail is down, so I can't see what's wrong with the instances worker-manager is starting, but on an instance created from the AMI I see

root@ip-172-31-84-232:~# cat /var/log/generic-worker.log 
2019/12/09 22:31:40 Loading taskcluster-worker-runner configuration from /etc/start-worker.yml
2019/12/09 22:31:40 Configuring with provider aws
2019/12/09 22:31:40 Could not query user data: (Permanent) HTTP response code 404
HTTP/1.0 404 Not Found
Content-Length: 337
Connection: close
Content-Type: text/html
Date: Mon, 09 Dec 2019 22:31:40 GMT
Server: EC2ws

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

which suggests that at least the start-worker part of the startup process is working.

I created https://community-tc.services.mozilla.com/tasks/X7iHNL7fTS2oUuKOGegP0g to test this.

Ah, they don't have papertrail anyway :)

I added my key to the instances and started a new one.

root@ip-10-0-20-63:~# cat /var/log/generic-worker.log 
2019/12/10 22:06:28 Loading taskcluster-worker-runner configuration from /etc/start-worker.yml
2019/12/10 22:06:29 Configuring with provider aws
2019/12/10 22:06:35 Identified as worker community-tc-workers-aws/i-023f748e871a22b14
2019/12/10 22:06:35 Getting secrets from secrets service
2019/12/10 22:06:35 WARNING: No worker secrets for worker pool proj-taskcluster/gw-ci-ubuntu-18-04-staging.
2019/12/10 22:06:35 Configuring for worker implementation generic-worker
2019/12/10 22:06:35 provider metadata zone not available; not setting config zone
2019/12/10 22:06:35 Writing files
2019/12/10 22:06:35 Starting worker
2019/12/10 22:06:42 UTC Loading generic-worker config file '/etc/generic-worker/config'...
2019/12/10 22:06:42 UTC Error loading configuration
2019/12/10 22:06:42 UTC Root cause: Error unmarshaling generic worker config file /etc/generic-worker/config as JSON: json: unknown field "genericWorker"
2019/12/10 22:06:42 UTC &errors.errorString{s:"Error unmarshaling generic worker config file /etc/generic-worker/config as JSON: json: unknown field \"genericWorker\""} (*errors.errorString)
2019/12/10 22:06:42 exit status 64
Depends on: 1602960
Blocks: 1602946

Bug 1602960 fixes the issues identified in comment 13, and should be ready shortly. The workaround for now is to include genericWorker config in the worker pool as

workerConfig:
    wstAudience: "communitytc"
    wstServerURL: "https://community-websocktunnel.services.mozilla.com"

omitting the intermediate genericWorker: { config: ... } level.
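
The error text in the log, json: unknown field "genericWorker", has the shape Go's encoding/json produces when a decoder is set to reject unknown fields. A minimal sketch of that behavior follows, assuming a strict decoder. The workerConfig struct is a hypothetical two-field stand-in, and whether generic-worker loads its config exactly this way is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// workerConfig is a hypothetical, trimmed-down stand-in for the worker's
// config struct; only the two fields from the workaround are modeled.
type workerConfig struct {
	WSTAudience  string `json:"wstAudience"`
	WSTServerURL string `json:"wstServerURL"`
}

// loadStrict decodes data and fails on any field the struct does not declare.
func loadStrict(data []byte) (*workerConfig, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	var c workerConfig
	if err := dec.Decode(&c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	// Nested form: the extra "genericWorker" level is not a known field.
	nested := []byte(`{"genericWorker": {"config": {"wstAudience": "communitytc"}}}`)
	// Flat form: matches the workaround, with fields at the top level.
	flat := []byte(`{"wstAudience": "communitytc"}`)

	if _, err := loadStrict(nested); err != nil {
		fmt.Println("nested rejected:", err)
	}
	if c, err := loadStrict(flat); err == nil {
		fmt.Println("flat accepted, wstAudience =", c.WSTAudience)
	}
}
```

With a strict decoder like this, the flat form of the workaround loads because every key it carries is one the struct declares.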

Depends on: 1604201

With the updated linux approach, the start-worker logs (and thus the worker logs) are visible in the "System Log" in the AWS console. This should make debugging startup issues a lot easier!

The bits out for review here cover getting generic-worker running on linux using worker-runner in community-tc.

Remaining:

  • Similar changes to community-tc-config and docs in taskcluster-worker-runner for start-worker startup on windows

(the latter won't be too hard -- I've already done the legwork to figure out how it works, so it just remains to write the scripts)

https://github.com/mozilla/community-tc-config/pull/175

When trying to create these images, I get

./imageset.sh aws update generic-worker-win2016
...
An error occurred (InvalidAMIName.Duplicate) when calling the CreateImage operation: AMI name generic-worker-win2016 version ce2f461f0862ae487e1841c971e2a9f631d88af4 is already in use by AMI ami-03feff875e68e0359

Pete, does that point to something I'm doing wrong? The first time 'round, I thought maybe a snapshot process had been left running when something else crashed, so I ran it again, but got the same result.

Flags: needinfo?(pmoore)

Ah, it seems it was b/c I forgot to commit, and it is using the git sha.
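
A sketch of why an uncommitted change produces the duplicate-name error, assuming the image name is built as "<imageset> version <git sha>" (inferred from the error message above, not from imageset.sh itself). The registry type is a hypothetical in-memory stand-in for the set of existing AMI names.

```go
package main

import "fmt"

// imageName builds a name in the shape seen in the error message:
// "<imageset> version <git sha>". The format is inferred, not confirmed.
func imageName(imageset, gitSHA string) string {
	return fmt.Sprintf("%s version %s", imageset, gitSHA)
}

// registry is a hypothetical stand-in for the names already taken.
type registry map[string]bool

// create records name, or fails if the name is already taken.
func (r registry) create(name string) error {
	if r[name] {
		return fmt.Errorf("image name %q is already in use", name)
	}
	r[name] = true
	return nil
}

func main() {
	r := registry{}
	sha := "ce2f461f0862ae487e1841c971e2a9f631d88af4"

	// First build at this commit succeeds.
	fmt.Println(r.create(imageName("generic-worker-win2016", sha)))
	// Editing files without committing leaves the sha unchanged, so the
	// second build asks for the same name and is refused.
	fmt.Println(r.create(imageName("generic-worker-win2016", sha)))
}
```

Under that assumption, committing first changes the sha and therefore the name, which would explain why the retry hit the same error until the commit was made.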

Flags: needinfo?(pmoore)

This is generating a top-level "zone" config value for generic-worker, which it does not like.

These are now up and running in aws. I'm building new GCP images with the above PR now.

So, I think we can call this done!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED