Closed Bug 1391476 Opened 7 years ago Closed 7 years ago

uid/gid mismatch between checkouts can lead to permission failure doing checkout

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla57

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(11 files, 2 obsolete files)

59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
dustin
: review+
Details
59 bytes, text/x-review-board-request
nalexander
: review+
mshal
: review+
Details
From https://public-artifacts.taskcluster.net/f4JHtpnhQJyhTsYEm4G-LQ/0/public/logs/live_backing.log:

[taskcluster 2017-08-17 21:53:49.252Z] Task ID: f4JHtpnhQJyhTsYEm4G-LQ
[taskcluster 2017-08-17 21:53:49.253Z] Worker ID: i-00638e1e6479b6c06
[taskcluster 2017-08-17 21:53:49.253Z] Worker Group: us-east-1
[taskcluster 2017-08-17 21:53:49.253Z] Worker Node Type: c4.2xlarge
[taskcluster 2017-08-17 21:53:49.253Z] Worker Type: gecko-1-b-android
[taskcluster 2017-08-17 21:53:49.253Z] Public IP: 54.91.181.90
[taskcluster 2017-08-17 21:53:49.253Z] using cache "tooltool-cache" -> /home/worker/tooltool-cache
[taskcluster 2017-08-17 21:53:49.253Z] using cache "level-1-checkouts-v1" -> /home/worker/checkouts

[taskcluster 2017-08-17 21:53:50.218Z] Downloading artifact "public/image.tar.zst" from task ID: C1rHajDgRYi2RE5ANX64aQ.
[taskcluster 2017-08-17 21:54:10.231Z] Download Progress: 10.21%
[taskcluster 2017-08-17 21:54:15.230Z] Download Progress: 33.74%
[taskcluster 2017-08-17 21:54:20.230Z] Download Progress: 57.12%
[taskcluster 2017-08-17 21:54:25.230Z] Download Progress: 80.34%
[taskcluster 2017-08-17 21:54:29.362Z] Downloaded artifact successfully.
[taskcluster 2017-08-17 21:54:29.363Z] Downloaded 449.712 mb
[taskcluster 2017-08-17 21:54:29.363Z] Decompressing downloaded image
[taskcluster 2017-08-17 21:54:31.109Z] Loading docker image from downloaded archive.
[taskcluster 2017-08-17 21:55:02.634Z] Image 'public/image.tar.zst' from task 'C1rHajDgRYi2RE5ANX64aQ' loaded.  Using image ID sha256:c40afb824e2f6a00b688d856378b6c059bb420c1bde04d654ad86bbc1733fb9b.
[taskcluster 2017-08-17 21:55:02.834Z] === Task Starting ===
[setup 2017-08-17T21:55:03.058941Z] run-task started
[chown 2017-08-17T21:55:03.060631Z] recursively changing ownership of /home/worker/workspace to worker:worker
[chown 2017-08-17T21:55:03.060708Z] recursively changing ownership of /home/worker/tooltool-cache to worker:worker
[setup 2017-08-17T21:55:03.065489Z] running as worker:worker
[vcs 2017-08-17T21:55:03.065540Z] executing ['hg', 'robustcheckout', '--sharebase', '/home/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '5b394c411b250de133a11512ce97efce97bc045b', 'https://hg.mozilla.org/try', '/home/worker/workspace/build/src']
[vcs 2017-08-17T21:55:03.114611Z] ensuring https://hg.mozilla.org/try@5b394c411b250de133a11512ce97efce97bc045b is available at /home/worker/workspace/build/src
[vcs 2017-08-17T21:55:03.114763Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2017-08-17T21:55:03.888652Z] (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2017-08-17T21:55:03.889747Z] not trusting file /home/worker/checkouts/hg-store/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg/hgrc from untrusted user 500, group 500
[vcs 2017-08-17T21:55:03.896502Z] abort: could not lock repository /home/worker/workspace/build/src: Permission denied
[taskcluster 2017-08-17 21:55:04.077Z] === Task Finished ===
[taskcluster 2017-08-17 21:55:04.146Z] Artifact "public/build" not found at "/home/worker/artifacts/"
[taskcluster 2017-08-17 21:55:04.209Z] Artifact "private/android-sdk" not found at "/home/worker/private/android-sdk"
[taskcluster 2017-08-17 21:55:04.309Z] Artifact "private/java_home" not found at "/home/worker/private/java_home"
[taskcluster 2017-08-17 21:55:04.646Z] Unsuccessful task run with exit code: 255 completed in 75.395 seconds

I reckon this was caused by /home/worker/checkouts being owned by a different uid/gid than what the task ran as. This can happen if 2 tasks with different uid/gid run on the same worker.

Ideally we normalize our uid/gid to 1000:1000 to avoid ownership mismatch. But it can happen. Especially on Try. We should fix it by doing a recursive chown of the version control share base.
Comment on attachment 8898540 [details]
Bug 1391476 - Ensure version control repository have proper permissions;

This is still buggy in Try.
Attachment #8898540 - Flags: review?(cmanchester)
Depends on: 1391789
Attachment #8898539 - Flags: review?(cmanchester)
Attachment #8898539 - Attachment is obsolete: true
Attachment #8898540 - Attachment is obsolete: true
Comment on attachment 8900044 [details]
Bug 1391476 - Require that all cache paths be declared as volumes;

https://reviewboard.mozilla.org/r/171384/#review176556

::: taskcluster/docker/desktop1604-test/Dockerfile:8
(Diff revision 1)
>  
>  RUN useradd -d /home/worker -s /bin/bash -m worker
>  WORKDIR /home/worker
>  
> +VOLUME /home/worker/tooltool-cache
> +VOLUME /home/worker/.cache

This breaks Docker image building because it attempts to `rm -rf /home/worker/.cache` and this fails because you can't remove a mount point. Good times.
Comment on attachment 8900045 [details]
Bug 1391476 - Tell run-task about volumes so it can sanitize them;

https://reviewboard.mozilla.org/r/171386/#review176576

::: taskcluster/docker/recipes/run-task:352
(Diff revision 2)
> +        if os.listdir(volume):
> +            print('error: volume %s is not empty; non-cache volumes are '
> +                  'supposed to be empty; you likely found a TaskCluster '
> +                  'worker bug' % volume)
> +            return 1

This assertion isn't true.

When Docker builds images, a VOLUME will result in copying any files present in the specified path at the time of the VOLUME directive.

However, I thought this wasn't getting preserved by the way TaskCluster does things. At least one of the Android images breaks this assumption.

So we can:

1) Forbid volumes from having files
2) Allow volumes to have files and chown the root directory
3) Allow volumes to have files and recursively chown

I'm not sure what the uid/gid of files copied to the volume will be. Presumably what the Docker image set up. So I would assume they would be properly set and we wouldn't need to chown them. But I wouldn't put anything past Docker.
Comment on attachment 8900045 [details]
Bug 1391476 - Tell run-task about volumes so it can sanitize them;

https://reviewboard.mozilla.org/r/171386/#review176576

> This assertion isn't true.
> 
> When Docker builds images, a VOLUME will result in copying any files present in the specified path at the time of the VOLUME directive.
> 
> However, I thought this wasn't getting preserved by the way TaskCluster does things. At least one of the Android images breaks this assumption.
> 
> So we can:
> 
> 1) Forbid volumes from having files
> 2) Allow volumes to have files and chown the root directory
> 3) Allow volumes to have files and recursively chown
> 
> I'm not sure what the uid/gid of files copied to the volume will be. Presumably what the Docker image set up. So I would assume they would be properly set and we wouldn't need to chown them. But I wouldn't put anything past Docker.

My distrust of Docker is correct: Docker does **NOT** preserve the uid on files in a VOLUME. I tested this by running ``chown <path-in-volume>`` and confirmed from a running container that the files are owned by ``root:root``. And to be extra sure, the ``layer.tar`` file as part of the tarball produced by ``docker save`` confirmed that the uid/gid are ``0:0``.

I'm leaning towards banning files in volumes because it is simpler and more closely approximates how caches work.
> My distrust of Docker is correct

TRUTH
Comment on attachment 8900038 [details]
Bug 1391476 - Inline list append;

https://reviewboard.mozilla.org/r/171372/#review176838
Attachment #8900038 - Flags: review?(dustin) → review+
Comment on attachment 8900040 [details]
Bug 1391476 - Don't use ~ in paths;

https://reviewboard.mozilla.org/r/171376/#review176844
Attachment #8900040 - Flags: review?(dustin) → review+
Comment on attachment 8900041 [details]
Bug 1391476 - Print cache info;

https://reviewboard.mozilla.org/r/171378/#review176846
Attachment #8900041 - Flags: review?(dustin) → review+
Comment on attachment 8900042 [details]
Bug 1391476 - Track whether caches should be used in untrusted environments;

https://reviewboard.mozilla.org/r/171380/#review176848

Are there any caches without this flag?
Attachment #8900042 - Flags: review?(dustin) → review+
Comment on attachment 8900043 [details]
Bug 1391476 - Capture Docker volumes in docker-worker config;

https://reviewboard.mozilla.org/r/171382/#review176852

::: taskcluster/taskgraph/transforms/task.py:197
(Diff revision 1)
>          Required('allow-ptrace', default=False): bool,
>          Required('loopback-video', default=False): bool,
>          Required('loopback-audio', default=False): bool,
>          Required('docker-in-docker', default=False): bool,  # (aka 'dind')
>  
> +        # Docker volumes. Caches common share the same path. But not always.

Not sure what this sentence means -- is it OK for them to share the same path?  What does it mean?  What if they don't share the same path?
Attachment #8900043 - Flags: review?(dustin) → review+
I'd like a little feedback from Wander or Greg as Docker experts, too.
Flags: needinfo?(wcosta)
Flags: needinfo?(garndt)
Comment on attachment 8900044 [details]
Bug 1391476 - Require that all cache paths be declared as volumes;

https://reviewboard.mozilla.org/r/171384/#review176862

::: commit-message-6a1c5:5
(Diff revision 2)
> +Bug 1391476 - Require that all cache paths be declared as volumes; r?dustin
> +
> +See the inline comment for the rationale here.
> +
> +Depending on the order transforms run in, it is possible to evade this

I don't understand what this comment means. The order's not a mystery.. obviously any check written in the decision task could be evaded by just deleting it.

::: taskcluster/docker/android-gradle-build/Dockerfile:7
(Diff revision 2)
>  FROM          taskcluster/centos6-build-upd:0.1.6.20160329195300
>  MAINTAINER    Nick Alexander <nalexander@mozilla.com>
>  
>  # BEGIN ../desktop-build/Dockerfile
>  
>  # TODO remove when base image is updated

hm, what's supposed to be removed here?
Attachment #8900044 - Flags: review?(dustin) → review+
Comment on attachment 8900045 [details]
Bug 1391476 - Tell run-task about volumes so it can sanitize them;

https://reviewboard.mozilla.org/r/171386/#review176868
Attachment #8900045 - Flags: review?(dustin) → review+
Comment on attachment 8900046 [details]
Bug 1391476 - Automatically set cache/volume permissions in run-task;

https://reviewboard.mozilla.org/r/171388/#review176870

I really like this model
Attachment #8900046 - Flags: review?(dustin) → review+
Comment on attachment 8900047 [details]
Bug 1391476 - Add UID and GID to cache parameters;

https://reviewboard.mozilla.org/r/171390/#review176876
Attachment #8900047 - Flags: review?(dustin) → review+
Comment on attachment 8900042 [details]
Bug 1391476 - Track whether caches should be used in untrusted environments;

https://reviewboard.mozilla.org/r/171380/#review176848

There are a handful, actually.

The version control cache is the most notable. We trust robustcheckout to do sane things so we don't need to worry about cache poisoning as much.

There are definitely a few caches whose configuration I question. It would be worthwhile to audit things. I wouldn't be opposed to changing the default so caches are disabled in "untrusted" environments by default. This was slightly beyond my tolerance for scope bloat in this series.
Fair enough.  Everything is isolated to different workers by level now, so I'm not too worried about *malicious* poisoning.
Comment on attachment 8900044 [details]
Bug 1391476 - Require that all cache paths be declared as volumes;

https://reviewboard.mozilla.org/r/171384/#review176862

> hm, what's supposed to be removed here?

These TODO comments are no longer valid given our new VOLUME requirements. I'll mass remove them.
If I understand the volume bits correctly, we're basically declaring some paths that are typically used for caches on non-try as volumes in the image's dockerfile so they have the same performance (outside of AUFS) as the caches do while not being persistent like caches are.  Makes sense.  I think it highlights that this could be made more explicit and cleaner if we moved the concept of caches to being declared as volumes in the task payload, and the type of volume is persistent (for caches) or temporary for things like we're doing here.

These volumes should be cleared up by docker worker when the containers are removed:
https://github.com/taskcluster/docker-worker/blob/master/src/lib/gc.js#L113-L118
Flags: needinfo?(garndt)
Flags: needinfo?(wcosta)
yeah, we could definitely support this better in taskcluster-worker, but this implementation is really nice
At one point I added persistent volumes to taskcluster-worker, but since then, I think it has been removed.  Perhaps the code created then was not suitable to where taskcluster-worker has been taken now.
(In reply to Dustin J. Mitchell [:dustin] from comment #56)
> yeah, we could definitely support this better in taskcluster-worker, but
> this implementation is really nice

Totally agree that this implementation is really nice, wasn't suggesting otherwise :)
Comment on attachment 8900383 [details]
Bug 1391476 - Don't install nexus.xml in a Docker volume;

https://reviewboard.mozilla.org/r/171734/#review177114

LGTM, assuming try is happy.
Attachment #8900383 - Flags: review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f6f4c5e1b725
Inline list append; r=dustin
https://hg.mozilla.org/integration/autoland/rev/78aac0824b7e
Dedent block; r=dustin
https://hg.mozilla.org/integration/autoland/rev/88cd6294bef4
Don't use ~ in paths; r=dustin
https://hg.mozilla.org/integration/autoland/rev/548978f63439
Print cache info; r=dustin
https://hg.mozilla.org/integration/autoland/rev/985a6df057cb
Track whether caches should be used in untrusted environments; r=dustin
https://hg.mozilla.org/integration/autoland/rev/52a96399c1dc
Capture Docker volumes in docker-worker config; r=dustin
https://hg.mozilla.org/integration/autoland/rev/84fd52d2832a
Require that all cache paths be declared as volumes; r=dustin
https://hg.mozilla.org/integration/autoland/rev/047cc1e53f50
Don't install nexus.xml in a Docker volume; r=mshal
https://hg.mozilla.org/integration/autoland/rev/1768e5267bcf
Tell run-task about volumes so it can sanitize them; r=dustin
https://hg.mozilla.org/integration/autoland/rev/06d162a52dcd
Automatically set cache/volume permissions in run-task; r=dustin
https://hg.mozilla.org/integration/autoland/rev/8e3a645ad94b
Add UID and GID to cache parameters; r=dustin
Comment on attachment 8900383 [details]
Bug 1391476 - Don't install nexus.xml in a Docker volume;

https://reviewboard.mozilla.org/r/171734/#review177526

Thanks for doing this.  If `android-dependencies` is green on try, it's good for me.
Attachment #8900383 - Flags: review?(nalexander) → review+
Comment on attachment 8900383 [details]
Bug 1391476 - Don't install nexus.xml in a Docker volume;

https://reviewboard.mozilla.org/r/171734/#review177526

I did get a green job on Try.
Blocks: 1397503
Comment on attachment 8900042 [details]
Bug 1391476 - Track whether caches should be used in untrusted environments;

https://reviewboard.mozilla.org/r/171380/#review196082

::: taskcluster/taskgraph/transforms/job/run_task.py
(Diff revision 2)
>  def docker_worker_run_task(config, job, taskdesc):
>      run = job['run']
>      worker = taskdesc['worker'] = job['worker']
>      common_setup(config, job, taskdesc)
>  
> -    if run.get('cache-dotcache') and int(config.params['level']) > 1:

I don't think the cache-dotcache condition should have been removed. For one, it broke using run-task on the desktop-build docker image (because the desktop-build docker image doesn't have that path as a VOLUME)
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: