Closed Bug 1221661 Opened 9 years ago Closed 9 years ago

task cluster test results seem to fail on crash reporter tests- probably a common cause

Categories

(Testing :: Mochitest, defect)

defect
Not set
normal

Tracking

(firefox46 fixed)

RESOLVED FIXED
mozilla46
Tracking Status
firefox46 --- fixed

People

(Reporter: jmaher, Assigned: dustin)

References

Details

Attachments

(2 files)

from this try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=603b29218a37

we have some failures that look suspicious:
10:02 < jmaher> ted: look at m-3: dom/plugins/test/mochitest/test_CrashService_crash.html |
10:03 < jmaher> ted: and bc2: browser/base/content/test/plugins/browser_CTP_crashreporting.j
10:03 < jmaher> ted: and bc6: toolkit/components/thumbnails/test/browser_thumbnails_bg_crash_during_capture.js | dumpID 
                is present and not an empty string 


and test recommends:
10:11 <@ted> so probably the first thing i would try here (i don't have time *right* at the moment)
10:12 <@ted> is running that same docker image, run Xvfb, and run any old firefox build in it
10:12 <@ted> and kill -ABRT it
10:12 < jmaher> ah, got it
10:12 <@ted> with MOZ_CRASHREPORTER=1 MOZ_CRASHREPORTER_NO_REPORT=1 in the env
10:12 <@ted> and see if you get a minidump in $profile/minidumps
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
10:12 <@ted> actually, these tests are crashing the plugin/content process, so maybe kill -ABRT plugin-container
10:14 <@ted> if that works, maybe run as much of 
             https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/scripts/tester/test-linux.sh as possible
10:14 <@ted> since it sets up the environment a bit
while running some tests on a production docker image locally, I see dumps being created- one less headache to figure out.
doing kill -ABRT works; I guess we need to figure out where the missing glue is for getting the actual dump out of there.
trying to reproduce this on the above mentioned tests, I run into errors while waiting for focus.
this might be specific to the lack of a window manager (think dialogs not working, etc.)- will update this when we test with a proper window manager on try.  locally it seems to fix the problem.
Hm, that's interesting, I wouldn't expect that to have any effect here. We don't launch the crash reporter client during these tests. I would expect lots of other tests to fail without a window manager though, honestly.

I know we're running some sort of window manager under Xvfb on the buildbot Linux testers.
I was poking through open Breakpad issues and came across https://bugs.chromium.org/p/google-breakpad/issues/detail?id=644. I did some local testing and it looks like Breakpad is actually just broken under Docker. I'll see if it's something fixable.
Okay, so it turns out this is due to Ubuntu's AppArmor profiles. If I pass `--cap-add SYS_PTRACE --security-opt apparmor:unconfined` to `docker run` on my machine I can run the Breakpad unit tests and they pass inside Docker. We'll have to get Taskcluster's docker-worker to do that.
--security-opt apparmor:unconfined seems .. unconfined

14:41:55 <ted> ¯\_(ツ)_/¯
14:42:07 <ted> that's the best i could get out of that docker issue
14:42:44 <ted> i can give you a super simple command to test and you can try to find something that's more restrictive but still works
14:43:28 <ted> `od -x /proc/$$/auxv`
14:43:35 <ted> if that prints an error, breakpad won't work
14:43:40 <ted> if it prints output, it will
14:44:15 <ted> $ docker run --rm -t -i ubuntu:14.04 /bin/sh -c 'od -x /proc/$$/auxv'
14:44:15 <ted> od: /proc/1/auxv: read error: Permission denied
14:44:51 <ted> $ docker run --rm -t -i --cap-add SYS_PTRACE --security-opt apparmor:unconfined ubuntu:14.04 /bin/sh -c 'od -x /proc/$$/auxv' 
14:44:51 <ted> 0000000 0021 0000 0000 0000 c000 84fe 7ffe 0000
14:44:54 <ted> <...>
I'm trying to figure out where to append --cap-add SYS_PTRACE --security-opt apparmor:unconfined to test jobs and I've failed to find where this could be changed.

Any suggestions? Is starting the docker image part of one of the TC repos?
dustin mentioned on IRC that it is too risky to use --security-opt apparmor:unconfined

Are we OK to having green TC jobs running side by side as tier2 *across* the board?
What would the value of running tests without having a solution here?
have we determined on try that adding --security-opt will fix the tests?  If we had that data, then we can determine which tests pass/fail and document it.  Maybe a mozinfo flag and manifest condition could help out?
This isn't something we can test in try -- it requires changes to the docker-worker which invokes docker.

From the little bit I've been able to read, the proposal is the equivalent of removing your front door and doorframe because your spare key doesn't work in the lock.  Effective?  yes.  A good idea?  maybe not.

Ted gave a pretty good reproduction recipe that we could use to test other, more refined solutions.  Someone -- and it might be me, just not now -- needs to dig into this and look at how AppArmor works, how docker uses it, how that might be configured, and so on, then devise the minimal solution with a full understanding of the security implications.  I suspect that might start by putting AppArmor in complain mode and running Ted's recipe in a docker container.  Then probably duplicate the docker-default profile with the necessary changes (after understanding them -- I'd guess it's just allowing ptrace and/or reading auxv).

https://github.com/docker/docker/blob/master/docs/security/apparmor.md
https://github.com/docker/docker/blob/master/daemon/execdriver/native/apparmor.go#L32

It looks like the profile is pretty basic.  Docker rewrites its own profile on startup, but we could easily copy/paste it to another profile named `tc-docker` and invoke that with `--security-opt apparmor:tc-docker`.
As I understand it, this is a tier1 blocker and the top one atm.
I assume we can continue adding more tier2 jobs for now without this.
Assignee: nobody → dustin
Docker looks to have code to work around this on trunk:
https://github.com/docker/docker/blob/master/daemon/execdriver/native/apparmor.go#L67

Adding `ptrace (trace,read) peer=docker-default,` to /etc/apparmor.d/docker on my local machine and restarting apparmor with `service apparmor restart` makes the simple test (`docker run --rm -t -i ubuntu:14.04 /bin/sh -c 'od -x /proc/$$/auxv'`) work locally. However, these changes don't survive a docker daemon restart.

Doing what dustin suggested in comment 12 seems to work fine though:
```
cp /etc/apparmor.d/docker /etc/apparmor.d/docker-ptrace
patch /etc/apparmor.d/docker-ptrace <<EOF
6c6
< profile docker-default flags=(attach_disconnected,mediate_deleted) {
---
> profile docker-ptrace flags=(attach_disconnected,mediate_deleted) {
14a15
>   ptrace (trace,read) peer=docker-ptrace,
EOF
service apparmor restart
```

Running my test command with this apparmor profile works fine then: `docker run --rm -t -i --security-opt apparmor:docker-ptrace ubuntu:14.04 /bin/sh -c 'od -x /proc/$$/auxv'`.

Running a simple Breakpad test works as well, so I think this should fix the unittest problems.
Just as clarification (because I wasn't sure), Docker containers use PID namespaces for each container:
http://man7.org/linux/man-pages/man7/pid_namespaces.7.html

So processes running inside a container cannot inspect processes outside of that container, even with ptrace enabled. I think that enabling ptrace shouldn't actually weaken security.
So basically this new profile would allow processes in docker containers to ptrace one another (since they're all in the docker-ptrace profile), but due to PID namespaces they can practically only ptrace other processes in the same container.  That seems pretty reasonable.

:kang, is that about right?

I'll start in on a patch to accomplish this as a docker-worker feature, preferably controlled by workerType configuration, so we can enable it only for test jobs' workerTypes, where inter-container interference isn't such a big deal.
Flags: needinfo?(gdestuynder)
Oh, hey, we can isolate ptrace to within the single container by creating an apparmor profile per task..
Some disclaimer:
ptrac'ing a process is the equivalent of full control of the said process (both memory and execution).

The namespaces in kernel have been "patched" in over time so that all kernel code would be namespace aware and properly separate resources. However, that also means there's a lower confidence in how well separated these namespaces are / higher probability for bugs. Due to that, apparmor (or selinux or else) are used to ensure all potentially dangerous calls are forbidden (its much easier to audit the system call table, than what the code behind the calls actually do). It's a little safer that way and ptrace() is always among the calls to be denied when making such lists.

Likewise, there might be privileged processes inside a container that you don't want to be ptraced even as root - obviously, removing the limitation will also allow that.

With that in mind, it should otherwise work (i.e. can't ptrace the other PID namespace).
Basically, a test case would be:

unshare -f -p /bin/bash
ps aux #pick a process from parent - you can see them because you're not remounting proc.
gdb -p 999 #gdb does a ptrace() call on 999
<No such process 999> #yay!
Comment on attachment 8695305 [details] [review]
https://github.com/taskcluster/docker-worker/pull/192

After reviewing some failures with Dustin, he was able to get it running in his environment and we also tested on an actual deployed ami of the worker.

For some reason the test fails in the vagrant environment but since we tested by other means, we should let this PR merge and figure out the test later.
Attachment #8695305 - Flags: review?(garndt) → review+
Merged!  We'll deploy tomorrow, only to the desktop-test workerType.
\o/

What do I need to do to make use of it?
I will try it Tuesday evening.
Add

task:
  payload:
    features:
      allowPtrace: true
I've pushed with ptrace enabled (let's see what the results look like):
https://hg.mozilla.org/try/rev/c3cb854e7fa0
New amis were added for desktop-test.  There's some existing pending jobs that will keep the old machines around for a bit.  Once those complete and die off the new ones will rollout.  We can kill old instances to speed up with the process if you need it.

us-east-1: ami-2e397344
us-west-1: ami-461e7526
us-west-2: ami-c94f50a8
Argh, I lied -- we also need to add docker-worker:feature:allowPtrace to the scopes.
Looks promsing?
Bug 1221661: enable ptrace for desktop tests; r?jmaher
Attachment #8698518 - Flags: review?(jmaher)
Comment on attachment 8698518 [details]
MozReview Request: Bug 1221661: enable ptrace for desktop tests; r?jmaher

https://reviewboard.mozilla.org/r/28017/#review25095

awesome, thanks!
Attachment #8698518 - Flags: review?(jmaher) → review+
https://hg.mozilla.org/mozilla-central/rev/40a4725f311a
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: