Closed Bug 1426445 Opened 2 years ago Closed 2 years ago

Cache poisoning introduced by tasks on maple repository

Categories

(Firefox Build System :: Task Configuration, task)


Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: gps, Assigned: tomprince)

Attachments

(2 files)

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=62b281c39548aa349fd1141caed5d4340700bbb6 has a number of toolchain failures due to cache poisoning. e.g. https://public-artifacts.taskcluster.net/e2ZuCgLVS7K8vIXp4YhFIQ/0/public/logs/live_backing.log says:

[cache 2017-12-20T18:09:18.498Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
error: requirements for populated cache /builds/worker/checkouts differ from this task
cache requirements: gid=1000 uid=1000 version=1
our requirements:   gid=500 uid=500 version=1

There is a UID/GID mismatch on the cache. This likely means:

a) different tasks are running as a different user/group
b) different Docker images have different UID/GID for the same user/group

Our cache policy is that the UID/GID for ALL tasks must be consistent
for the lifetime of the cache. This eliminates permissions problems due
to file/directory user/group ownership.

To make this error go away, ensure that all Docker images are use
a consistent UID/GID and that all tasks using this cache are running as
the same user/group.


audit log:
[2017-12-20T17:04:50.384901Z HTSWdLPsTTye45MzLzGSZQ] created; requirements: gid=1000, uid=1000, version=1
[2017-12-20T18:07:28.926290Z JBUD_5QMT9uqY3Cd5-dyvQ] requirements mismatch; wanted: gid=500, uid=500, version=1
[2017-12-20T18:09:18.498914Z e2ZuCgLVS7K8vIXp4YhFIQ] requirements mismatch; wanted: gid=500, uid=500, version=1
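
The check that produced the error above can be sketched roughly as follows. This is a minimal illustration, not the actual run-task code: the function names and the in-memory representation of the requirements are assumptions made for the example.

```python
def format_requirements(uid, gid, version=1):
    """Return the set of requirement strings a task would record for a cache.

    Hypothetical helper: the real on-disk format is not shown in this bug.
    """
    return {"gid=%d" % gid, "uid=%d" % uid, "version=%d" % version}


def check_cache(cache_requirements, our_requirements):
    """Compare the requirements recorded in a cache with this task's.

    Returns None when they match, otherwise an error message shaped like
    the one in the log above.
    """
    if cache_requirements == our_requirements:
        return None
    return (
        "requirements for populated cache differ from this task\n"
        "cache requirements: %s\n"
        "our requirements:   %s"
        % (
            " ".join(sorted(cache_requirements)),
            " ".join(sorted(our_requirements)),
        )
    )


# The cache was created by a 1000:1000 task; a 500:500 task then reuses it.
recorded = format_requirements(uid=1000, gid=1000)
ours = format_requirements(uid=500, gid=500)
message = check_cache(recorded, ours)
```

The point of comparing whole requirement sets is that the first task to populate a cache fixes its requirements for the lifetime of that cache, so a single task from a differently configured image is enough to make every later task fail.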

If we follow the audit log, HTSWdLPsTTye45MzLzGSZQ was a task on maple. Every other failing task seems to have a maple task as the root task.
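
For anyone triaging similar failures, following the audit log can be automated. The sketch below assumes the line format seen in the log above and uses hypothetical names (`parse_audit_log`, `creating_task`); it is not part of run-task.

```python
import re

# One audit log entry, in the format shown in the failing task's log.
AUDIT_LINE = re.compile(
    r"^\[(?P<time>\S+) (?P<task>[^\]\s]+)\] "
    r"(?P<event>created|requirements mismatch); "
    r"(?:requirements|wanted): (?P<reqs>.*)$"
)


def parse_audit_log(text):
    """Return a list of dicts, one per recognized audit log line."""
    entries = []
    for line in text.splitlines():
        match = AUDIT_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def creating_task(entries):
    """Return the ID of the task that created the cache, or None."""
    for entry in entries:
        if entry["event"] == "created":
            return entry["task"]
    return None


LOG = """\
[2017-12-20T17:04:50.384901Z HTSWdLPsTTye45MzLzGSZQ] created; requirements: gid=1000, uid=1000, version=1
[2017-12-20T18:07:28.926290Z JBUD_5QMT9uqY3Cd5-dyvQ] requirements mismatch; wanted: gid=500, uid=500, version=1
[2017-12-20T18:09:18.498914Z e2ZuCgLVS7K8vIXp4YhFIQ] requirements mismatch; wanted: gid=500, uid=500, version=1
"""

entries = parse_audit_log(LOG)
culprit = creating_task(entries)
```

The `created` entry identifies the task whose image set the cache's UID/GID; the later `requirements mismatch` entries are the victims.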

We are supposed to be using uid/gid 500:500 for the worker:worker user:group. However, some maple tasks seem to be using 1000:1000.
There is a google-play-strings docker image on maple using ubuntu:16.04 as the base image. This image produces a worker:worker user:group with uid:gid 1000:1000 instead of 500:500. This is the source of our cache poisoning.

This Dockerfile will need to do the following:

  groupadd -g 500 worker
  useradd -u 500 -g 500 worker

I'd do this, but I have to run off to a meeting.
Flags: needinfo?(bhearsum)
(In reply to Gregory Szorc [:gps] from comment #1)
> There is a google-play-strings docker image on maple using ubuntu:16.04 as
> the base image. This image produces a worker:worker user:group with uid:gid
> 1000:1000 instead of 500:500. This is the source of our cache poisoning.
> 
> This Dockerfile will need to do the following:
> 
>   groupadd -g 500 worker
>   useradd -u 500 -g 500 worker
> 
> I'd do this, but I have to run off to a meeting.

https://hg.mozilla.org/projects/maple/rev/f5047440978ee8f81507dde1119d3e5dd7e7d03f

Original bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1385401, which I've commented in.
Flags: needinfo?(bhearsum)
See Also: → 1385401
3 toolchain jobs on glandium's push of bug 1426324 also showed uid:gid mismatches. The retriggers succeeded.

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=f41ca59052be9d0a74e3c1d03695dab446a021ae&filter-resultStatus=usercancel&filter-resultStatus=runnable&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&filter-searchStr=toolchains

[taskcluster 2017-12-22 12:57:53.157Z] === Task Starting ===
[setup 2017-12-22T12:57:53.474Z] run-task started
[cache 2017-12-22T12:57:53.476Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
error: requirements for populated cache /builds/worker/checkouts differ from this task
cache requirements: gid=1000 uid=1000 version=1
our requirements:   gid=500 uid=500 version=1

There is a UID/GID mismatch on the cache. This likely means:

a) different tasks are running as a different user/group
b) different Docker images have different UID/GID for the same user/group

Our cache policy is that the UID/GID for ALL tasks must be consistent
for the lifetime of the cache. This eliminates permissions problems due
to file/directory user/group ownership.

To make this error go away, ensure that all Docker images are use
a consistent UID/GID and that all tasks using this cache are running as
the same user/group.


audit log:
[2017-12-22T12:01:15.274135Z Ny19mzYSTgyh1UJne9dXnw] created; requirements: gid=1000, uid=1000, version=1
[2017-12-22T12:57:53.476423Z G9MIn2jFQIKWK4LUdmW0DA] requirements mismatch; wanted: gid=500, uid=500, version=1
I think the problem in Comment 3 is due to using the `lint` image on a gecko-N-b-linux worker. The lint image uses ubuntu:16.04 as a base, which, as Comment 1 suggests, uses 1000:1000 as the UID/GID. This isn't a problem on gecko-t-linux-* workers, as all the images there are based on ubuntu:16.04, so presumably have the same UID/GID. The task in question (Ny19mzYSTgyh1UJne9dXnw[1]) runs on gecko-N-b-linux, though, and the main image used there is `desktop-build`, which is based on a centos6 image.

The solution is probably to adjust all the images based on ubuntu:16.04 to explicitly set the UID/GID. Doing this will require purging all the caches those images use. Alternatively, `run-task` can be changed to verify that the UID/GID of `worker` is always 500:500 (which incidentally causes the names of the caches used to change).
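
To illustrate why changing `run-task` would rename the caches: if cache names include a digest of the script's contents, any edit to the script yields a fresh name and therefore an empty cache. The sketch below is an assumption about the general scheme, with a hypothetical `cache_name` helper and digest length; it is not the real naming code.

```python
import hashlib


def cache_name(base, run_task_source, digest_chars=20):
    """Derive a cache name from a base name and the run-task source bytes.

    Hypothetical scheme: a content digest is appended so that any change
    to the script produces a different cache name.
    """
    digest = hashlib.sha256(run_task_source).hexdigest()[:digest_chars]
    return "%s-%s" % (base, digest)


# Two versions of the script give two distinct cache names.
before = cache_name("checkouts", b"run-task without uid/gid check")
after = cache_name("checkouts", b"run-task with uid/gid check")
```

Under such a scheme, old caches are never reused after the script changes, so they do not need to be purged by hand.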

[1] https://tools.taskcluster.net/groups/YywkNiUjT7CaVXtRvZiF4Q/tasks/Ny19mzYSTgyh1UJne9dXnw/details
Note we're not far from switching off centos for builds, so we should probably switch the centos images to 1000:1000 instead. Or wait for bug 1399679, which, at this point, is almost a review away (I have it all working; I just need to split my patch queue into reviewable form and put it up for review).
I implemented https://github.com/taskcluster/docker-worker/pull/347 to allow run-task to request that the cache be destroyed when it notices inconsistencies like this.
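
One way such a request could be signaled is through a dedicated exit status that the worker is configured to treat as "purge the caches this task used". The sketch below is only an illustration of that idea: the constant's value, the function names, and the worker-side configuration are all assumptions, not the interface added by the pull request.

```python
# Placeholder value chosen for this sketch, not a documented constant.
EXIT_PURGE_CACHE = 72


def exit_status_for_cache_check(mismatch_found, purge_supported):
    """Pick the task's exit status after the cache requirements check."""
    if not mismatch_found:
        return 0
    # With purge support, ask the worker to discard the poisoned cache;
    # otherwise fail normally and leave the cache in place.
    return EXIT_PURGE_CACHE if purge_supported else 1


def worker_should_purge(exit_status, purge_statuses):
    """Worker-side view: purge when the status is in the configured list."""
    return exit_status in purge_statuses


status = exit_status_for_cache_check(mismatch_found=True, purge_supported=True)
purge = worker_should_purge(status, purge_statuses=[EXIT_PURGE_CACHE])
```

The task still fails, but the next task scheduled on that worker starts from a clean cache instead of hitting the same mismatch.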
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;

https://reviewboard.mozilla.org/r/209752/#review215482

This totally makes sense, but I feel like there was some reason we didn't do this already....
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;

https://reviewboard.mozilla.org/r/209752/#review215522
Attachment #8939416 - Flags: review?(dustin) → review+
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;

https://reviewboard.mozilla.org/r/209752/#review216386

I didn't implement this because a) it wasn't needed and b) it is somewhat architecturally impure (ideally we shouldn't have low-level details like the "worker" user/group embedded in run-task). But cache poisoning can cause major headaches and run-task is [currently] Firefox-centric, so I'm OK with teaching run-task about the "worker" user and group so we can fail faster and not poison caches in the process.
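
A fail-fast check of this kind might look like the sketch below. It is not the patch under review: the function name, the messages, and the injectable `lookup` parameter (added so the check can be exercised without a real `worker` account) are assumptions, and it only checks the user's UID and primary GID.

```python
import pwd


def check_worker_ids(user="worker", expected_uid=1000, expected_gid=1000,
                     lookup=pwd.getpwnam):
    """Return a list of problems with the user's UID and primary GID.

    An empty list means the account matches the expected IDs, so the task
    can safely populate or reuse caches.
    """
    try:
        entry = lookup(user)
    except KeyError:
        return ["user %s does not exist" % user]

    problems = []
    if entry.pw_uid != expected_uid:
        problems.append(
            "user %s has uid %d; expected %d" % (user, entry.pw_uid, expected_uid)
        )
    if entry.pw_gid != expected_gid:
        problems.append(
            "user %s has gid %d; expected %d" % (user, entry.pw_gid, expected_gid)
        )
    return problems
```

Running a check like this before touching any cache means a misconfigured image fails in its own task rather than creating a cache that breaks every later task.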
Attachment #8939416 - Flags: review?(gps) → review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/15a9e149f2db
Add sanity check that worker uid/gid is 1000 in run-task; r=dustin,gps
Keywords: leave-open
Backout by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a77c974e4c75
Backed out changeset 15a9e149f2db for build bustage
Comment on attachment 8942814 [details]
Bug 1426445: Purge task caches, when an incompatible cache is found; r=gps

Gregory Szorc [:gps] has approved the revision.

https://phabricator.services.mozilla.com/D395#9589
Attachment #8942814 - Flags: review+
Note that this won't show up on try, as incorrect permissions are ignored there.
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/mozilla-inbound/rev/69b3883f83a4
Purge task caches, when an incompatible cache is found; r=gps
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;

https://reviewboard.mozilla.org/r/209752/#review224688

I'm fine with doing this and with the value 1000:1000. However, unless I'm not seeing a patch that has landed already, debian-base is still using 500:500.

Also, when switching to 1000:1000, I prefer we bump the cache key to force a new cache. Otherwise we'll just incur cache eviction on the VCS cache, since it is shared across repos.
Comment on attachment 8939416 [details]
Bug 1426445: Add sanity check that worker uid/gid is 1000 in run-task;

https://reviewboard.mozilla.org/r/209752/#review224694

Ship it!
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ac157b31db6e
Add sanity check that worker uid/gid is 1000 in run-task; r=dustin,gps
Product: TaskCluster → Firefox Build System
I think this is adequately protected against now.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Assignee: nobody → mozilla
Target Milestone: --- → mozilla60