Closed Bug 1245243 Opened 8 years ago Closed 8 years ago

Allow bigger shared memory mount

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: dustin)

Attachments

(1 file)

This is causing issues when running e10s jobs.
This is right now on the critical path.

From bug 1233554
(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)
> It looks like this probably caused by the size of /dev/shm being too small.
> In docker it defaults to 64MB but we likely need more.
> 
> See also:
> https://github.com/jvermillard/docker/commit/
> 77faf17586cecacf79765d47983363962827cb45


From docker (pasting from my own machine instead of waiting for interactive):
> shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0

From releng host:
> udev /dev devtmpfs rw,relatime,size=1898856k,nr_inodes=474714,mode=755 0 0
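
The two mount lines above differ mainly in their size= option. As a minimal sketch of reading that value programmatically, the following assumes a hypothetical helper, mountSizeKiB, which is not part of docker-worker, and only handles sizes written with a "k" suffix as in the output above:

```javascript
// Hypothetical helper (not part of docker-worker): extract the size= option,
// in KiB, from a single /proc/mounts line for a given mount point.
// Returns null if the line is for another mount point or has no size=NNNk.
function mountSizeKiB(mountsLine, mountPoint) {
  // /proc/mounts fields: device, mount point, fs type, options, dump, pass.
  const fields = mountsLine.trim().split(/\s+/);
  if (fields.length < 4 || fields[1] !== mountPoint) {
    return null;
  }
  const sizeOpt = fields[3].split(",").find((opt) => opt.startsWith("size="));
  if (!sizeOpt) {
    return null;
  }
  const match = /^size=(\d+)k$/.exec(sizeOpt);
  return match ? Number(match[1]) : null;
}

// The line pasted from the docker container above.
const dockerLine =
  "shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0";
console.log(mountSizeKiB(dockerLine, "/dev/shm") / 1024, "MiB");
```

For the docker line, 65536k works out to 64 MiB, which matches the default mentioned in the quoted comment.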
I think this can be done with --shm-size:
  https://github.com/docker/docker/pull/16168

Sadly, that landed a full four months after 1.6.1 was released.

Greg, any genius ideas?
Component: Docker Images → Docker-Worker
Blocks: 1233554
Blocks: 1242986
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I think this can be done with --shm-size:
>   https://github.com/docker/docker/pull/16168
> 
> Sadly, that landed a full four months after 1.6.1 was released.
> 
> Greg, any genius ideas?

Has this landed in docker yet? It seems it's slated for docker 1.10.0, which comes out in 2 days.

https://github.com/docker/docker/pull/16168#event-470134564
I wonder if we could hack around this?  With a privileged container, it's possible to remount /dev/shm as root in the container.  We could do that until 1.10.N is stable?
Assignee: nobody → dustin
We do have a feature, enabled per task and per workerType, that allows privileged mode (we use it for some special tasks not run from try). It gives the task a lot of power, though, so I'm not sure it's something you would want to enable on the test workerTypes that can be accessed from try.
These are isolated instances which don't produce any binaries that others run, so I think priv mode is OK as a stopgap.  I'm enabling it now for desktop-test and desktop-test-xlarge:

  "userData": {
    "dockerConfig": {
      "allowPrivileged": true
    }
  },

I also added docker-worker:capability:privileged to repo:hg.mozilla.org/try:* (a little scary!)

I'll have a patch in a sec.
I committed an act of EC2 genocide to destroy all of the desktop-test instances that did not have the privileged mode enabled, then made the above try push.

If that works before I'm back online, please feel free to steal the cset into your own try pushes.
Not working

> [taskcluster] Downloaded 2551.338 mb in 636.65 seconds.
> [taskcluster] Loading docker image from downloaded archive.
> [taskcluster] Image 'public/image.tar' from task 'YqckjHV-TXWCXG5RZl5Rnw' downloaded.  Using image ID 12c023d3aa2d6229ab8950c28104f9839f0f6d2f50a8dc5d5ce6c5a5a9fb93ef.
> [taskcluster:error] Docker configuration could not be created.  This may indicate an authentication error when validating scopes necessary for running the task. 
> Error: Insufficient scopes to run task in privileged mode. Try adding docker-worker:capability:privileged to the .scopes array
> [taskcluster] Unsuccessful task run with exit code: -1 completed in 726.836 seconds
I think adding this should be sufficient:
>   scopes:
>     - 'docker-worker:feature:allowPtrace'
> +    - 'docker-worker:capability:privilege'
I don't think this has worked [1]
> shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0

So far the original e10s crashtest crash has not cleared up.

dt1 has gone green.
dt2's crash has cleared.
dt6 has probably cleared but we've hit an intermittent orange.
dt8 seems to have cleared.


Still orange: m10, bc1, dt9

[1]
root@taskcluster-worker:~# cat /proc/mounts 
rootfs / rootfs rw 0 0
none / aufs rw,relatime,si=e163718cd3b19ac8,dio 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
/dev/disk/by-label/cloudimg-rootfs /.taskclusterinteractiveexport ext4 rw,relatime,data=ordered 0 0
/dev/disk/by-label/cloudimg-rootfs /.taskclusterinteractivesession.lock ext4 rw,relatime,data=ordered 0 0
/dev/disk/by-label/cloudimg-rootfs /.taskclusterutils ext4 ro,relatime,data=ordered 0 0
/dev/mapper/instance_storage-all /etc/resolv.conf ext4 rw,relatime,data=ordered 0 0
/dev/mapper/instance_storage-all /etc/hostname ext4 rw,relatime,data=ordered 0 0
/dev/mapper/instance_storage-all /etc/hosts ext4 rw,relatime,data=ordered 0 0
devpts /dev/console devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
proc /proc/asound proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/kcore tmpfs rw,nosuid,mode=755 0 0
tmpfs /proc/latency_stats tmpfs rw,nosuid,mode=755 0 0
tmpfs /proc/timer_stats tmpfs rw,nosuid,mode=755 0 0
My bad. I was looking at a push without the change.

garndt: what did I do wrong in this push?
https://hg.mozilla.org/try/rev/7649411f0535
https://public-artifacts.taskcluster.net/BYgHLVADTFmy9qQgD-VFSQ/3/public/logs/live_backing.log
I think it's because the scope for privileged capabilities was missing for try pushes.  I've added it, could you try again?
I tried again without success:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f86b246827b0

[taskcluster:error] Error calling 'stopped' for extendTaskGraph : Error encountered when attempting to extend task graph. Graph server error while extending task graph id R-Kg2jBVQTep3v-burDXow : You didn't give the task-graph scopes allowing it define tasks on the queue., [{"message":"You do not have sufficient scopes. This request requires you\nto have one of the following sets of scopes:\ndocker-worker:feature:allowPtrace,docker-worker:capability:privilege,docker-worker:capability:device:loop
The scope is "docker-worker:capability:privileged". I think the scope in your task might be missing a 'd' on privileged.
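
A one-letter difference is enough to fail the check because scope names are matched as strings. The following is a simplified sketch under that assumption, not docker-worker's actual validation code; the helpers satisfies and missingScopes are hypothetical, and the only pattern handled is a trailing "*":

```javascript
// Simplified, hypothetical scope check; not docker-worker's actual code.
// A held scope satisfies a required one if it matches exactly, or if it
// ends in "*" and the required scope starts with the part before the "*".
function satisfies(heldScope, requiredScope) {
  if (heldScope.endsWith("*")) {
    return requiredScope.startsWith(heldScope.slice(0, -1));
  }
  return heldScope === requiredScope;
}

// Return the required scopes that none of the held scopes satisfy.
function missingScopes(heldScopes, requiredScopes) {
  return requiredScopes.filter(
    (required) => !heldScopes.some((held) => satisfies(held, required))
  );
}

const required = ["docker-worker:capability:privileged"];
const taskScopes = [
  "docker-worker:feature:allowPtrace",
  "docker-worker:capability:privilege", // missing the trailing "d"
];
console.log(missingScopes(taskScopes, required));
```

With the misspelled scope, the privileged capability is reported as missing, which is consistent with the errors in the logs above.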
OK, we passed the scheduling hurdles but we cannot mount:

> + mount -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=1898856k tmpfs /dev/shm
> mount: block device tmpfs is write-protected, mounting read-only
> mount: cannot mount block device tmpfs read-only
Hmm, I can't reproduce that in the ubuntu1204-test image on a host running docker-1.6.1:

dustin@dustin-tc-devel ~ $ docker run -ti --rm --privileged taskcluster/ubuntu1204-test:0.1.6
root@taskcluster-worker:~# umount /dev/shm
root@taskcluster-worker:~# mount -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=1898856k tmpfs /dev/shm
root@taskcluster-worker:~#
Under docker-worker:
  with allowPtrace: FAIL  - https://tools.taskcluster.net/task-inspector/#cGNKuxGaQierxO1arcUAtg/0
  without allowPtrace: OK - https://tools.taskcluster.net/task-inspector/#B_QXdakGQJC0LqGdzcAMPw/0

So apparently we get allowPtrace *or* the ability to remount, but not both.  That makes sense: '--privileged' translates, I think, to some AppArmor changes by docker, which are probably overridden by the AppArmor changes for allowPtrace.

I'll try the push again with allowPtrace turned off.
The decision task failed.
Oh, I used the wrong try job of yours -- what was the most recent one that passed the scheduling hurdles?  Can you try applying http://hg.mozilla.org/try/rev/277e6ac11f5b to it?
I see some oranges and some greens.  Does it look like this has addressed the limited-shm failures, at least?
It seems it has now worked.
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=08286a17a375,d62d093ac73b

We have cleared the two crashes:
--------------------------------
C e10s is not crashing anymore
m4 e10s is not crashing anymore.

We will ignore dt e10s in our analysis since they don't run on Buildbot and need work to be greened up.
We still have failures for Mn, M10, bc1, bc6 and bc7 (these last two are new, maybe intermittent).
This is what I landed.
What should we do about this patch?
Are we OK to land it on m-c, or should we carry it around until we can upgrade to the newer version of docker?
Attachment #8716263 - Flags: review?(dustin)
Let's carry it around for a while.  If that becomes burdensome, we can reconsider.
Depends on: 1246210
Comment on attachment 8716263 [details] [diff] [review]
shared_memory.diff

We probably should have made the capability change in fx_test_base.yml, but for try pushes it doesn't make a difference.
Attachment #8716263 - Flags: review?(dustin) → feedback+
For Firefox desktop tests we're done in here.

Thanks dustin, garndt!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
To clarify: the upgrade to docker 1.10.0 (bug 1246210) plus the addition of shmsize:

https://github.com/gregarndt/docker-worker/commit/4a65b86352c5cb42f2c3ab17c405afd7f803e7fd
diff --git a/lib/task.js b/lib/task.js
index e7a44f2..83ed20a 100644
--- a/lib/task.js
+++ b/lib/task.js
@@ -333,7 +333,8 @@ export class Task {
         StdinOnce: false,
         Env: taskEnvToDockerEnv(env),
         HostConfig: {
-          Privileged: privilegedTask
+          Privileged: privilegedTask,
+          ShmSize: 1800000000
         }
       }
     };

has reproduced the results we had with the privileged containers in comment 5, without the allowPtrace incompatibility.  And -- in a first for a Docker .0 release -- nothing else has exploded.

When we switch to taskcluster-worker, we'll want to move that shmsize option into the task definitions, not least so that we have visible evidence of that requirement (aside from the more obvious error messages from bug 1245239).
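
As a rough sketch of what moving the option into the task definition could look like, the following reads an assumed payload.shmSize field (in bytes) and falls back to the value hardcoded in the commit above. The field name and the buildHostConfig helper are hypothetical; neither docker-worker nor taskcluster-worker is known from this bug to define them.

```javascript
// Hypothetical sketch: `payload.shmSize` is an assumed task-definition field
// (bytes). The commit above hardcodes ShmSize instead of reading it.
const DEFAULT_SHM_SIZE = 1800000000; // value used in the commit above

function buildHostConfig(payload, privilegedTask) {
  const shmSize =
    payload.shmSize === undefined ? DEFAULT_SHM_SIZE : payload.shmSize;
  if (!Number.isInteger(shmSize) || shmSize <= 0) {
    throw new Error(
      `shmSize must be a positive integer number of bytes, got ${shmSize}`
    );
  }
  // Same shape as the HostConfig object in the diff above.
  return {
    Privileged: privilegedTask,
    ShmSize: shmSize,
  };
}

// A task that asks for 512 MiB of shared memory.
console.log(buildHostConfig({ shmSize: 512 * 1024 * 1024 }, false));
```

Keeping the default in one place would preserve today's behaviour for tasks that do not declare a size, while tasks that do declare one make the requirement visible in their definition.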
Component: Docker-Worker → Workers