Allow bigger shared memory mount

RESOLVED FIXED

Status

Taskcluster
Docker-Worker

People

(Reporter: armenzg, Assigned: dustin)

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
This is causing issues when running e10s jobs.
This is currently on the critical path.

From bug 1233554
(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)
> It looks like this is probably caused by the size of /dev/shm being too small.
> In docker it defaults to 64MB but we likely need more.
> 
> See also:
> https://github.com/jvermillard/docker/commit/77faf17586cecacf79765d47983363962827cb45


From docker (pasting from my own machine instead of waiting for interactive):
> shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0

From releng host:
> udev /dev devtmpfs rw,relatime,size=1898856k,nr_inodes=474714,mode=755 0 0
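The difference between those two mounts can be checked programmatically. A minimal sketch (the `shm_size_kib` helper is hypothetical, written only for this illustration) that pulls the `size=` option out of a `/proc/mounts` line:

```python
def shm_size_kib(mounts_text):
    """Return the size= option (in KiB) of the /dev/shm mount, or None.

    /proc/mounts fields: device, mountpoint, fstype, options, dump, pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/dev/shm":
            for opt in fields[3].split(","):
                if opt.startswith("size="):
                    return int(opt[len("size="):].rstrip("k"))
    return None

docker_line = "shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0"
print(shm_size_kib(docker_line))  # 65536 KiB, i.e. the 64 MB docker default
```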
(Assignee)

Comment 1

2 years ago
I think this can be done with --shm-size:
  https://github.com/docker/docker/pull/16168

Sadly, that landed a full four months after 1.6.1 was released.

Greg, any genius ideas?
Component: Docker Images → Docker-Worker
(Assignee)

Updated

2 years ago
Blocks: 1233554
(Reporter)

Updated

2 years ago
Blocks: 1242986

Comment 2

2 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I think this can be done with --shm-size:
>   https://github.com/docker/docker/pull/16168
> 
> Sadly, that landed a full four months after 1.6.1 was released.
> 
> Greg, any genius ideas?

Has this landed in docker yet? It seems it's slated for docker 1.10.0, which comes out in 2 days.

https://github.com/docker/docker/pull/16168#event-470134564
(Assignee)

Comment 3

2 years ago
I wonder if we could hack around this?  With a privileged container, it's possible to remount /dev/shm as root in the container.  We could do that until 1.10.N is stable?
(Assignee)

Updated

2 years ago
Assignee: nobody → dustin

Comment 4

2 years ago
We do have a feature that can be enabled within a task and within a workerType to allow privileged mode (we use it for some special tasks not run from try), but it gives the task a lot of power, so I'm not sure it's something you would want to enable on the test workerTypes that can be accessed from try.
(Assignee)

Comment 5

2 years ago
These are isolated instances which don't produce any binaries that others run, so I think priv mode is OK as a stopgap.  I'm enabling it now for desktop-test and desktop-test-xlarge:

  "userData": {
    "dockerConfig": {
      "allowPrivileged": true
    }
  },

I also added docker-worker:capability:privileged to repo:hg.mozilla.org/try:* (a little scary!)

I'll have a patch in a sec.
(Assignee)

Comment 6

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=37d68db9833a
(Assignee)

Comment 7

2 years ago
I committed an act of EC2 genocide to destroy all of the desktop-test instances that did not have the privileged mode enabled, then made the above try push.

If that works before I'm back online, please feel free to steal the cset into your own try pushes.
(Reporter)

Comment 8

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=649e130b838e
(Reporter)

Comment 9

2 years ago
Not working:

> [taskcluster] Downloaded 2551.338 mb in 636.65 seconds.
> [taskcluster] Loading docker image from downloaded archive.
> [taskcluster] Image 'public/image.tar' from task 'YqckjHV-TXWCXG5RZl5Rnw' downloaded.  Using image ID 12c023d3aa2d6229ab8950c28104f9839f0f6d2f50a8dc5d5ce6c5a5a9fb93ef.
> [taskcluster:error] Docker configuration could not be created.  This may indicate an authentication error when validating scopes necessary for running the task. 
> Error: Insufficient scopes to run task in privileged mode. Try adding docker-worker:capability:privileged to the .scopes array
> [taskcluster] Unsuccessful task run with exit code: -1 completed in 726.836 seconds
(Reporter)

Comment 10

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7649411f0535
(Reporter)

Comment 11

2 years ago
I think adding this should be sufficient:
>   scopes:
>     - 'docker-worker:feature:allowPtrace'
> +    - 'docker-worker:capability:privilege'
(Reporter)

Comment 12

2 years ago
I don't think this has worked [1]
> shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0

So far the original e10s crashtest crash has not cleared up.

dt1 has gone green.
dt2's crash has cleared.
dt6 has probably cleared but we've hit an intermittent orange.
dt8 seems to have cleared.


Still orange: m10, bc1, dt9

[1]
root@taskcluster-worker:~# cat /proc/mounts 
rootfs / rootfs rw 0 0
none / aufs rw,relatime,si=e163718cd3b19ac8,dio 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
/dev/disk/by-label/cloudimg-rootfs /.taskclusterinteractiveexport ext4 rw,relatime,data=ordered 0 0
/dev/disk/by-label/cloudimg-rootfs /.taskclusterinteractivesession.lock ext4 rw,relatime,data=ordered 0 0
/dev/disk/by-label/cloudimg-rootfs /.taskclusterutils ext4 ro,relatime,data=ordered 0 0
/dev/mapper/instance_storage-all /etc/resolv.conf ext4 rw,relatime,data=ordered 0 0
/dev/mapper/instance_storage-all /etc/hostname ext4 rw,relatime,data=ordered 0 0
/dev/mapper/instance_storage-all /etc/hosts ext4 rw,relatime,data=ordered 0 0
devpts /dev/console devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
proc /proc/asound proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/kcore tmpfs rw,nosuid,mode=755 0 0
tmpfs /proc/latency_stats tmpfs rw,nosuid,mode=755 0 0
tmpfs /proc/timer_stats tmpfs rw,nosuid,mode=755 0 0
(Reporter)

Comment 13

2 years ago
My bad. I was looking at a push without the change.

garndt: what did I do wrong in this push?
https://hg.mozilla.org/try/rev/7649411f0535
https://public-artifacts.taskcluster.net/BYgHLVADTFmy9qQgD-VFSQ/3/public/logs/live_backing.log

Comment 14

2 years ago
I think it's because the scope for privileged capabilities was missing for try pushes. I've added it; could you try again?
(Reporter)

Comment 15

2 years ago
I tried again without success:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f86b246827b0

[taskcluster:error] Error calling 'stopped' for extendTaskGraph : Error encountered when attempting to extend task graph. Graph server error while extending task graph id R-Kg2jBVQTep3v-burDXow : You didn't give the task-graph scopes allowing it define tasks on the queue., [{"message":"You do not have sufficient scopes. This request requires you\nto have one of the following sets of scopes:\ndocker-worker:feature:allowPtrace,docker-worker:capability:privilege,docker-worker:capability:device:loop

Comment 16

2 years ago
The scope is "docker-worker:capability:privileged"; I think the scope in your task is missing a 'd' on "privileged".
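The failure above is a plain string mismatch. A toy model of the scope check (a hypothetical simplification of Taskcluster's real satisfaction logic, which also handles role expansion and scope sets) shows why the missing 'd' matters:

```python
def scopes_satisfy(have, required):
    """True if every required scope is covered by a held scope
    (exact match, or a held scope ending in '*' used as a prefix)."""
    def covered(req):
        return any(h == req or (h.endswith("*") and req.startswith(h[:-1]))
                   for h in have)
    return all(covered(r) for r in required)

held = ["docker-worker:feature:allowPtrace",
        "docker-worker:capability:privilege"]      # typo: missing final 'd'
print(scopes_satisfy(held, ["docker-worker:capability:privileged"]))  # False

held[1] = "docker-worker:capability:privileged"    # corrected
print(scopes_satisfy(held, ["docker-worker:capability:privileged"]))  # True
```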
(Reporter)

Comment 17

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f48c40865673
(Reporter)

Comment 18

2 years ago
OK, we passed the scheduling hurdles but we cannot mount:

> + mount -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=1898856k tmpfs /dev/shm
> mount: block device tmpfs is write-protected, mounting read-only
> mount: cannot mount block device tmpfs read-only
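For the record, the failing remount can be expressed as data. A hypothetical helper that builds the exact mount(8) invocation quoted above; constructing the argv is testable even where actually running it (which needs root inside a --privileged container) is not:

```python
def shm_mount_argv(size_kib):
    """argv for remounting /dev/shm as a larger tmpfs (run as root,
    inside a --privileged container, after `umount /dev/shm`)."""
    return ["mount", "-t", "tmpfs",
            "-o", "rw,nosuid,nodev,noexec,relatime,size=%dk" % size_kib,
            "tmpfs", "/dev/shm"]

print(" ".join(shm_mount_argv(1898856)))
# mount -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=1898856k tmpfs /dev/shm
```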
(Assignee)

Comment 19

2 years ago
Hmm, I can't reproduce that in the ubuntu1204-test image on a host running docker-1.6.1:

dustin@dustin-tc-devel ~ $ docker run -ti --rm --privileged taskcluster/ubuntu1204-test:0.1.6
root@taskcluster-worker:~# umount /dev/shm
root@taskcluster-worker:~# mount -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=1898856k tmpfs /dev/shm
root@taskcluster-worker:~#
(Assignee)

Comment 20

2 years ago
Under docker-worker:
  with allowPtrace: FAIL  - https://tools.taskcluster.net/task-inspector/#cGNKuxGaQierxO1arcUAtg/0
  without allowPtrace: OK - https://tools.taskcluster.net/task-inspector/#B_QXdakGQJC0LqGdzcAMPw/0

so apparently we get allowPtrace *or* the ability to remount, but not both.  Which makes sense -- '--privileged' translates, I think, to some AppArmor changes by docker, which are probably overridden by the AppArmor changes for allowPtrace.

I'll try the push again with allowPtrace turned off.
(Assignee)

Comment 21

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=277e6ac11f5b
(Reporter)

Comment 22

2 years ago
The decision task failed.
(Assignee)

Comment 23

2 years ago
Oh, I used the wrong try job of yours -- what was the most recent one that passed the scheduling hurdles?  Can you try applying http://hg.mozilla.org/try/rev/277e6ac11f5b to it?
(Reporter)

Comment 24

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=35191af90bc3
(Assignee)

Comment 25

2 years ago
I see some oranges and some greens.  Does it look like this has addressed the limited-shm failures, at least?
(Reporter)

Comment 26

2 years ago
It seems it has now worked.
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=08286a17a375,d62d093ac73b

We have cleared the two crashes:
--------------------------------
C e10s is not crashing anymore
m4 e10s is not crashing anymore.

We will ignore dt e10s in our analysis since they don't run on Buildbot and need work to be greened up.
We still have failures for Mn, M10, bc1, bc6 and bc7 (these last two are new - maybe intermittent).
(Reporter)

Comment 27

2 years ago
Created attachment 8716263 [details] [diff] [review]
shared_memory.diff

This is what I landed.
What should we do about this patch?
Are we OK to land it on m-c? or carry it around until we can upgrade to the newer version of docker?
Attachment #8716263 - Flags: review?(dustin)
(Reporter)

Comment 28

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1696626be68
(Reporter)

Comment 29

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5fe57bd96728
(Assignee)

Comment 30

2 years ago
Let's carry it around for a while.  If that becomes burdensome, we can reconsider.
Depends on: 1246210
(Assignee)

Comment 31

2 years ago
Comment on attachment 8716263 [details] [diff] [review]
shared_memory.diff

We probably should have made the capability change in fx_test_base.yml, but for try pushes it doesn't make a difference.
Attachment #8716263 - Flags: review?(dustin) → feedback+
(Reporter)

Comment 32

2 years ago
For Firefox desktop tests we're done here.

Thanks dustin, garndt!
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Assignee)

Comment 33

2 years ago
To clarify: the upgrade to docker 1.10.0 (bug 1246210) plus the addition of shmsize:

https://github.com/gregarndt/docker-worker/commit/4a65b86352c5cb42f2c3ab17c405afd7f803e7fd
diff --git a/lib/task.js b/lib/task.js
index e7a44f2..83ed20a 100644
--- a/lib/task.js
+++ b/lib/task.js
@@ -333,7 +333,8 @@ export class Task {
         StdinOnce: false,
         Env: taskEnvToDockerEnv(env),
         HostConfig: {
-          Privileged: privilegedTask
+          Privileged: privilegedTask,
+          ShmSize: 1800000000
         }
       }
     };

has reproduced the results we had with the privileged containers in comment 5, without the allowPtrace incompatibility.  And -- in a first for a Docker .0 release -- nothing else has exploded.

When we switch to taskcluster-worker, we'll want to move that shmsize option into the task definitions, not least so that we have visible evidence of that requirement (aside from the more obvious error messages from bug 1245239).
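For context on the patch above: the Docker Engine create-container payload carries `ShmSize` in bytes under `HostConfig`. A sketch of the equivalent payload from Python (the `container_config` helper is hypothetical; only the `HostConfig` keys mirror the diff):

```python
def container_config(image, shm_size_bytes=1_800_000_000, privileged=False):
    """Partial Docker Engine API create-container payload.
    ShmSize is in bytes; 1800000000 (~1.8 GB) matches the value
    hard-coded in the docker-worker patch above."""
    return {
        "Image": image,
        "HostConfig": {
            "Privileged": privileged,
            "ShmSize": shm_size_bytes,
        },
    }

cfg = container_config("taskcluster/ubuntu1204-test:0.1.6")
print(cfg["HostConfig"]["ShmSize"])  # 1800000000
```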