Closed Bug 1499054 Opened 6 years ago Closed 5 years ago

Run tasks under unique OS user accounts on Linux

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

(Keywords: good-first-bug, Whiteboard: [sp3])

Attachments

(2 files, 1 obsolete file)

Currently, tasks run as the worker user on Linux.

In https://github.com/taskcluster/generic-worker/pull/62 work is underway to support running tasks under docker, but it would also be worthwhile having support for running tasks on the host machine inside the sandbox of a dedicated user account (like we have for Windows). This would be useful for tasks such as talos tests.
Component: Generic-Worker → Workers

Note, processes can be launched as a different user on both macOS and Linux like this:

https://stackoverflow.com/questions/21705950/running-external-commands-through-os-exec-under-another-user

Whiteboard: good-first-bug
See Also: → 1499051

Note, work is underway to implement task user separation in generic-worker on macOS (bug 1499051), as it is needed for bug 1528374 (macOS pgo builds). When this is complete, we will have task user isolation on Windows and macOS, with only linux remaining (this bug).

However, task user isolation in generic-worker on linux (this bug) currently has a lower priority than docker container isolation on linux in generic-worker (bug 1499055). Therefore (with current planning) bug 1499055 will be tackled before this bug.

This may also block the flatpak work: bug 1278719

Assignee: nobody → pmoore

pmoore: how much work would be involved here?

coop: while it frees up flatpaks (bug 1278719), I don't imagine it should be prioritized higher than the rest of the d-w->g-w migration. That said, if this is 1-3 days of work, it would unblock a planned flatpak Q3 OKR for releng. Do we think we could fit this in sometime in Q3 without putting the rest of g-w at risk?

Flags: needinfo?(pmoore)

It probably isn't a massive amount of work (a few days), but I also know how a couple of days can roll into a couple of weeks if we encounter strange issues on the way.

The multiuser engine in generic-worker is currently designed to create users, configure auto-login on reboot, and then reboot the computer to get an interactive user desktop, which is then controlled by a Windows service / Mac launch daemon. On Linux this may look a little different, since there isn't a de facto desktop environment (e.g. some tasks may not require a graphical context, some may want to use GNOME, others KDE, etc.). So implementing the existing Linux stubs for the multiuser engine would mean either wiring it to a particular desktop environment, faking these methods to pretend to provide a graphical context, or adapting the multiuser engine to be more general purpose so that Linux can behave a little differently from Windows/Mac and allow the task itself to launch a window manager / desktop environment.

Since I'm not too clear at the moment on the specific requirements of the Linux jobs that need to run under separate user accounts, it isn't yet clear to me whether we can get away entirely without a graphical context for the tasks, or whether one is needed; and if one is needed, whether we must be able to choose it flexibly, or whether we can pick an arbitrary desktop environment (e.g. xfce/kde/gnome/cinnamon/...) and wire the tasks to run under it.

So depending on the answers to these questions, the work is probably between a few days' and a few weeks' work.

Note: in addition to the development and test time, there is also the additional liaison work to do with releng in order to have the hosts set up appropriately with required toolchains, build images, create worker types, help troubleshoot/debug any issues during rollout, etc. This can sometimes take more cycles than might be at first anticipated (i.e. swallow up a couple more weeks).

Flags: needinfo?(pmoore)
Attached file GitHub Pull Request for generic-worker (obsolete) —

Note, I started working on this, but am parking this for now.

This PR enables CI testing for linux and provides the stubs that would need to be implemented for configuring (or faking) system auto-login, creating/deleting/listing system user accounts, etc., in runtime/runtime_linux.go.

We should also check if any tests are currently enabled only on mac, that should also run on linux once this is implemented.

I'm parking this PR for now - so if anyone wants to pick it up from here, they are welcome, or if it is decided this should be done before docker engine in generic-worker, I can return to it.

Assignee: pmoore → nobody

Rather than parking an incomplete solution that may bit-rot, I'll put the preparation work (enabling multiuser builds) in a separate bug, and just disable the tests in CI and the releases. We'll get untested builds and no releases, but uncommenting the tests will enable them, and uncommenting a single line will enable releases too...

Depends on: 1564852
Comment on attachment 9075450 [details] [review]
GitHub Pull Request for generic-worker

I've moved the preparation work for this bug into bug 1564852, so this PR has been closed.
Attachment #9075450 - Attachment is obsolete: true
Whiteboard: good-first-bug
Assignee: nobody → pmoore
Attached file GitHub Pull Request

Work in progress...

This bug is currently blocked by AWS support case 6410417131, copied/pasted here for convenience.


Hi!

We create AMIs for our worker machines. The way we create the AMI is to launch a standard base image, pass a bash script as userdata which bootstraps the system installations and shuts down the machine, and then we take a snapshot when we detect the instance has shutdown. We then launch spot instances based on the AMI we created, and everything works as expected, except for the following new situation we are encountering.

In the case that we install ubuntu-gnome-desktop when creating the AMI, spot instances that we launch using this AMI start up with no network, and fail the Instance Status Checks (System Status Checks pass).

If I remove ubuntu-desktop / ubuntu-gnome-desktop from the installed packages, we do not have this problem.

Looking at the system logs, I see the following routing tables:

[   19.231811] cloud-init[831]: Cloud-init v. 19.1-1-gbaa47854-0ubuntu1~18.04.1 running 'init' at Mon, 02 Sep 2019 16:59:20 +0000. Up 19.09 seconds.
[   19.237650] cloud-init[831]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[   19.243737] cloud-init[831]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[   19.249784] cloud-init[831]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
[   19.255828] cloud-init[831]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[   19.261921] cloud-init[831]: ci-info: |  ens3  | False |     .     |     .     |   .   | 06:16:8a:ed:e2:12 |
[   19.267866] cloud-init[831]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
[   19.273801] cloud-init[831]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
[   19.279765] cloud-init[831]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[   19.285710] cloud-init[831]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   19.291030] cloud-init[831]: ci-info: +-------+-------------+---------+-----------+-------+
[   19.296309] cloud-init[831]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   19.301670] cloud-init[831]: ci-info: +-------+-------------+---------+-----------+-------+
[   19.306931] cloud-init[831]: ci-info: +-------+-------------+---------+-----------+-------+

Note, if I start up the original instance that the AMI was taken from, it starts with no problems, has valid routing tables, and I can ssh onto the machine, etc. The problem is only with the spot instances created from the AMI.

Like I say, I've localised the issue to the ubuntu-gnome-desktop / ubuntu-desktop package installations, since without these packages, the machines start up without a problem.

Example instances:

Base instance that I created AMI from:

Instances spawned from the AMI created from this base instance:

From the base instance, you should be able to see the userdata passed in to bootstrap it, but I'll also attach it to the ticket for reference.

As I say, this happens consistently in automation. The AMI that the base instance was created from is ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190814 (ami-064a0193585662d74). Nothing was done on that instance other than launching it with the attached file 1.txt provided as userdata, which includes the shutdown command at the end which caused the instance to stop.

Lastly, I also discovered that this certainly seems to be an AWS issue rather than an Ubuntu issue: if I launch the spot instances based on the same AMI but without the ubuntu-desktop / ubuntu-gnome-desktop packages installed, I am then able to install ubuntu-desktop / ubuntu-gnome-desktop successfully, restart the instances, and they come up with networking working. The problem only occurs if I install the desktop before taking the snapshot. And as I say, the original instance starts up without problems if I restart it; the issue only exhibits itself on the spot instances created from the image, not on the machine that was used to create the image itself.

Many thanks in advance for your help!

FWIW this is the code we use in automation that creates the instance and snapshots it: https://github.com/taskcluster/generic-worker/blob/bug1499054/worker_types/update.sh

Kind regards,
Pete

Attached file 1.txt

The above issue has been resolved; it turned out to be a problem with the ENI network renderer in cloud-init. After switching to the netplan renderer, the problem went away. The solution was to add this as the last machine-setup step before shutting down and snapshotting the EC2 instance:

cat > /etc/cloud/cloud.cfg.d/01_network_renderer_policy.cfg << EOF
system_info:
    network:
      renderers: [ 'netplan', 'eni', 'sysconfig' ]
EOF

The next problem I am hitting is a failing CI test:

[taskcluster 2019-09-04T10:34:51.870Z] Worker Type (test-provisioner/test-mdqa8ms3sl6rc9qdjb7ivq-a) settings:
[taskcluster 2019-09-04T10:34:51.871Z]   {
[taskcluster 2019-09-04T10:34:51.871Z]     "aws": {
[taskcluster 2019-09-04T10:34:51.871Z]       "ami-id": "test-ami",
[taskcluster 2019-09-04T10:34:51.871Z]       "availability-zone": "outer-space",
[taskcluster 2019-09-04T10:34:51.871Z]       "instance-id": "test-instance-id",
[taskcluster 2019-09-04T10:34:51.871Z]       "instance-type": "p3.enormous",
[taskcluster 2019-09-04T10:34:51.871Z]       "local-ipv4": "87.65.43.21",
[taskcluster 2019-09-04T10:34:51.871Z]       "public-ipv4": "12.34.56.78"
[taskcluster 2019-09-04T10:34:51.871Z]     },
[taskcluster 2019-09-04T10:34:51.871Z]     "generic-worker": {
[taskcluster 2019-09-04T10:34:51.871Z]       "go-arch": "amd64",
[taskcluster 2019-09-04T10:34:51.871Z]       "go-os": "linux",
[taskcluster 2019-09-04T10:34:51.871Z]       "go-version": "go1.10.8",
[taskcluster 2019-09-04T10:34:51.871Z]       "release": "test-release-url",
[taskcluster 2019-09-04T10:34:51.871Z]       "version": "15.1.4"
[taskcluster 2019-09-04T10:34:51.871Z]     },
[taskcluster 2019-09-04T10:34:51.871Z]     "machine-setup": {
[taskcluster 2019-09-04T10:34:51.871Z]       "maintainer": "pmoore@mozilla.com",
[taskcluster 2019-09-04T10:34:51.871Z]       "script": "test-script-url"
[taskcluster 2019-09-04T10:34:51.871Z]     }
[taskcluster 2019-09-04T10:34:51.871Z]   }
[taskcluster 2019-09-04T10:34:51.871Z] Task ID: bNxVCQwzRqC48SILduxmiA
[taskcluster 2019-09-04T10:34:51.871Z] === Task Starting ===
[taskcluster 2019-09-04T10:34:52.586Z] Uploading redirect artifact public/logs/live.log to URL http://12.34.56.78:46695/log/bF_yM-ILSISTcp022ej-XA with mime type "text/plain; charset=utf-8" and expiry 2019-09-04T10:51:51.985Z
[taskcluster 2019-09-04T10:34:52.586Z] Executing command 0: go get 'github.com/taskcluster/taskcluster-client-go'
go: disabling cache (/home/task_1567592780/.cache/go-build) due to initialization failure: open /home/task_1567592780/.cache/go-build/log.txt: permission denied
go get github.com/taskcluster/taskcluster-client-go: open /home/task_1567592780/gopath1.10.8/pkg/linux_amd64/github.com/taskcluster/taskcluster-client-go.a: permission denied
[taskcluster 2019-09-04T10:34:52.923Z]    Exit Code: 1
[taskcluster 2019-09-04T10:34:52.923Z]    User Time: 854.891ms
[taskcluster 2019-09-04T10:34:52.923Z]  Kernel Time: 230.757ms
[taskcluster 2019-09-04T10:34:52.923Z]    Wall Time: 336.363485ms
[taskcluster 2019-09-04T10:34:52.923Z]       Result: FAILED
[taskcluster 2019-09-04T10:34:52.923Z] === Task Finished ===
[taskcluster 2019-09-04T10:34:52.923Z] Task Duration: 336.885965ms
[taskcluster:error] Uploading error artifact resolvetask.go from file resolvetask.go with message "Could not read file '/home/task_1567592780/gopath1.10.8/src/github.com/taskcluster/generic-worker/testdata/TestResolveResolvedTask/task_1567592780/resolvetask.go'", reason "file-missing-on-worker" and expiry 2019-09-04T11:34:49.270Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read file '/home/task_1567592780/gopath1.10.8/src/github.com/taskcluster/generic-worker/testdata/TestResolveResolvedTask/task_1567592780/resolvetask.go'
[taskcluster 2019-09-04T10:34:53.800Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/bNxVCQwzRqC48SILduxmiA/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-09-18T10:34:49.000Z
[taskcluster:error] exit status 1
[taskcluster:error] file-missing-on-worker: Could not read file '/home/task_1567592780/gopath1.10.8/src/github.com/taskcluster/generic-worker/testdata/TestResolveResolvedTask/task_1567592780/resolvetask.go'

Troubleshooting this issue now.

(In reply to Pete Moore [:pmoore][:pete] from comment #12)

Troubleshooting this issue now.

Issue resolved, PR ready for review! :-)

Attachment #9080304 - Flags: review?(miles)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

\o/ Great work Pete! This will unblock flatpak work, thank you!

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #14)

\o/ Great work Pete! This will unblock flatpak work, thank you!

Many thanks Mihai! :-)

Released in generic-worker 16.0.0.

Attachment #9080304 - Flags: review?(miles) → review+
Whiteboard: [sp3]