Closed Bug 1162965 Opened 4 years ago Closed 4 years ago

./mach gtest GeckoMediaPlugins.* fails when run in a docker container

Categories

(Core :: Security: Process Sandboxing, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox39 --- unaffected
firefox40 --- fixed
firefox41 --- fixed
b2g-master --- unaffected

People

(Reporter: mrrrgn, Assigned: jld)

References

Details

User Story

Steps to reproduce:

    docker run -i -t ubuntu:14.04 /bin/bash
    apt-get -y update && apt-get install -y python
    hg clone https://hg.mozilla.org/mozilla-central && cd mozilla-central
    python/mozboot/bin/bootstrap.py --application-choice=desktop --no-interactive
    ./mach build
    ./mach gtest GeckoMediaPlugins.GMPStorageBasic

Attachments

(2 files, 1 obsolete file)

We're attempting to move jobs to docker containers (for task cluster), and unfortunately it seems that these tests do not work in that environment. Running ./mach bootstrap, ./mach build, and ./mach gtest works fine on a baremetal ubuntu 14:04 machine, but leads to the following errors within a ubuntu 14:04 docker container => https://pastebin.mozilla.org/8832881

Steps to reproduce:

    docker run -i -t ubuntu:14.04 /bin/bash
    hg clone https://hg.mozilla.org/mozilla-central && cd mozilla-central
    python/mozboot/bin/bootstrap.py --application-choice=desktop --no-interactive
    ./mach build
    ./mach gtest
Blocks: 1155749
The tests also fail in "privileged mode" docker -P
User Story: (updated)
The test is actually hanging forever, rather than outright failing

Running GTest tests...
Note: Google Test filter = GeckoMediaPlugins.GMPStorageBasic
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from GeckoMediaPlugins
[ RUN      ] GeckoMediaPlugins.GMPStorageBasic
Sandbox: chroot: Stale file handle
[10296] WARNING: pipe error (26): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv
Summary: ./mach gtest fails when run in a docker container → ./mach gtest GeckoMediaPlugins.GMPStorageBasic fails when run in a docker container
Sorry to bug you again, but given the extra info above, do you have any ideas about what's going on here?
Flags: needinfo?(cpearce)
User Story: (updated)
The next fails too: ./mach gtest -- -GeckoMediaPlugins.GMPStorageBasic

[ RUN      ] GeckoMediaPlugins.GMPStorageForgetThisSite
[10362] WARNING: pipe error (25): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[10362] WARNING: pipe error (26): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[10362] WARNING: pipe error (23): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[10362] WARNING: pipe error (28): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv


###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv

Sandbox: chroot: Stale file handle

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv

[10362] WARNING: pipe error (26): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
Summary: ./mach gtest GeckoMediaPlugins.GMPStorageBasic fails when run in a docker container → ./mach gtest GeckoMediaPlugins.* fails when run in a docker container
here's a copy of the pastebin for posterity
I suspect the most useful information here would be, what operating-system level features does this test use that might help us find the docker feature or bug making this fail?

From the comments in ipc/chromium/src/chrome/common/ipc_channel_posix.cc:

// channel ids as the pipe names.  Channels on POSIX use anonymous
// Unix domain sockets created via socketpair() as pipes.  These don't
// quite line up.

OK, socketpair() isn't rocket science, that should work in Docker.

// Case 1: normal running
//   The IPC server object will install a mapping in PipeMap from the
//   name which it was given to the client pipe. When forking the client, the
//   GetClientFileDescriptorMapping will ensure that the socket is installed in
//   the magic slot (@kClientChannelFd). The client will search for the
//   mapping, but it won't find any since we are in a new process. Thus the
//   magic fd number is returned. Once the client connects, the server will
//   close its copy of the client socket and remove the mapping.
//
// Case 2: unittests - client and server in the same process
//   The IPC server will install a mapping as before. The client will search
//   for a mapping and find out. It duplicates the file descriptor and
//   connects. Once the client connects, the server will close the original
//   copy of the client socket and remove the mapping. Thus, when the client
//   object closes, it will close the only remaining copy of the client socket
//   in the fd table and the server will see EOF on its side.

I'm guessing that case 2 only applies to unit tests for the IPC code itself; the GMP tests are probably using "normal running".  dup(2)'ing to a well-known fd before forking is a well-established technique that works everywhere (but on OpenBSD..), so that should be fine in Docker too.

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv
is generated by PrintErrorMessage in ipc/glue/MessageChannel.cpp; the first bit (`[Parent]`) indicates that this is the "parent" side of the channel.

I had a stare at dom/media/gtest/TestGMPCrossOrigin.cpp but it looks like the IPC is buried in the SUT so maybe a bit too deep a dive for right now.

Am I even on the right track?
where does 'Sandbox: chroot: Stale file handle' come from?
We're not sure, but that seems to appear for a lot of passing tests, too.
Stracing the parent thread from socketpair() through its error message:

[pid 23719] socketpair(PF_LOCAL, SOCK_STREAM, 0 <unfinished ...>
[pid 23719] <... socketpair resumed> , [24, 25]) = 0
[pid 23719] fcntl(24, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(25, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(24, F_GETFD <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(24, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(25, F_GETFD <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(25, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] dup(24 <unfinished ...>
[pid 23719] <... dup resumed> )         = 26
[pid 23719] dup(25 <unfinished ...>
[pid 23719] <... dup resumed> )         = 27
[pid 23719] close(24)                   = 0
[pid 23719] close(25 <unfinished ...>
[pid 23719] <... close resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0b2a46aec, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46a88, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0b2a4b90c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46c90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0c4bd0cfc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x7fb0c4bd0c98, 12 <unfinished ...>
[pid 23719] <... futex resumed> )       = 1
[pid 23719] futex(0x7fb0b2a4b80c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46ea0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0b2a46aec, FUTEX_WAIT_PRIVATE, 7, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46a88, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23719] futex(0x7fb0c4bd0cfc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x7fb0c4bd0c98, 14) = 0
[pid 23719] write(2, "\n###!!! [Parent][MessageChannel]"..., 119) = 119

Working backward, it looks like the socketpair() call is from https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#331.  I be the caller is https://dxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_posix.cpp#25, as that matches the dup()'s.  The close();s come from the Transport going out of scope, as described in the comment.  But there the trail runs cold -- I can't see what might have called that.  I don't see any unexpectedly nonzero exit statuses here that might lead to the "Channel error" being logged.

However, shortly after the dup's, I see thread 23699 using fd's 26 and 27

[pid 23699] sendmsg(23, {msg_name(0)=NULL, msg_iov(1)=[{"\20\0\0\0\377\377\377\177\372\377\0\0\1\0\0\0\1\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 48}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {27}}, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 48
[pid 23699] close(27 <unfinished ...>
[pid 23699] <... close resumed> )       = 0

Here it sends fd 27 (one end of the socketpair) over fd 23 to some other thread, and then closes its copy of the fd.  It then sends some messages on fd 26 but those don't include fd's.

[pid 23699] futex(0x7fb0b2a4b90c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fb0b2a4b908, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid 23699] <... futex resumed> )       = 1
[pid 23699] epoll_ctl(5, EPOLL_CTL_ADD, 26, {EPOLLIN, {u32=26, u64=26}} <unfinished ...>
[pid 23699] <... epoll_ctl resumed> )   = 0
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\0\0\0\200\377\377\0\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 36
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\377\377\377\177\3\0^\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 36
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\2\0\0\0\1\0`\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 32}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 32
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 705643790}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942238}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 705814081}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942421}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706144638}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942737}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706362235}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942963}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{",\0\0\0\2\0\0\0\4\0`\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 76}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 76
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706605251}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943233}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706840148}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943521}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707088944}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943687}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707271601}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943866}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707449368}) = 0
[pid 23699] gettimeofday({1431120136, 944054}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 707620397}) = 0
[pid 23699] gettimeofday({1431120136, 944191}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 707752233}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 944347}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707933673}) = 0
[pid 23699] gettimeofday({1431120136, 944548}, NULL) = 0
[pid 23699] read(8, "\0", 1)            = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 708132094}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 944719}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708301746}) = 0
[pid 23699] gettimeofday({1431120136, 944872}, NULL) = 0
[pid 23699] read(8, "\0", 1)            = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708392970}) = 0
[pid 23699] gettimeofday({1431120136, 944962}, NULL) = 0
[pid 23699] epoll_wait(5, {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708482559}) = 0
[pid 23699] gettimeofday({1431120136, 945052}, NULL) = 0
[pid 23699] read(8, "\0", 1)            = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708572434}) = 0
[pid 23699] gettimeofday({1431120136, 945142}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN|EPOLLHUP, {u32=23, u64=23}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 720257164}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 956828}, NULL) = 0
[pid 23699] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=31571, si_status=SIGSEGV, si_utime=1, si_stime=0} ---
[pid 23699] recvmsg(23, {msg_name(0)=NULL, msg_iov(1)=[{"", 4096}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 0
[pid 23699] epoll_ctl(5, EPOLL_CTL_DEL, 23, {EPOLLIN, {u32=23, u64=23}}) = 0
[pid 23699] close(23)                   = 0
[pid 23699] futex(0x7fb0b2a46aec, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x7fb0b2a46a88, 8) = 1
[pid 23699] futex(0x7fb0b2a46be0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 23699] kill(31571, SIGTERM <unfinished ...>
[pid 23699] <... kill resumed> )        = 0
[pid 23699] wait4(31571,  <unfinished ...>
[pid 23699] <... wait4 resumed> NULL, WNOHANG, NULL) = 31571
[pid 23699] futex(0x7fb0b2a4b9cc, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23699] <... futex resumed> )       = 0
[pid 23699] futex(0x7fb0b2a46f50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23699] <... futex resumed> )       = 0
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 722558252}) = 0
[pid 23699] gettimeofday({1431120136, 959187}, NULL) = 0
[pid 23699] epoll_wait(5, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=26, u64=26}}, {EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 2
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 722729198}) = 0
[pid 23699] gettimeofday({1431120136, 959299}, NULL) = 0

And here's our ECONNRESET!!  So presumably the close(26) that follows is responsible for the other error message.

[pid 23699] recvmsg(26, 0x7fb0b7a30b18, MSG_DONTWAIT) = -1 ECONNRESET (Connection reset by peer)
[pid 23699] write(2, "[31545] WARNING: pipe error (26)"..., 168) = 168
[pid 23699] epoll_ctl(5, EPOLL_CTL_DEL, 26, {EPOLLIN, {u32=26, u64=26}}) = 0
[pid 23699] close(26)                   = 0
[pid 23699] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=31571, si_status=SIGSEGV, si_utime=1, si_stime=0} ---

is probably responsible for the dropped connection.  I'm a little vague on pid and tid mappings in Linux, but 

[pid 23721] gettid()                    = 31571

so I'm going to assume that pid 23721 is the one we're looking for here.  Sure enough,

[pid 23721] set_robust_list(0x7fb0b7a319e0, 24) = 0
[pid 23721] dup2(24, 3)                 = 3
[pid 23721] dup2(19, 4)                 = 4
...
[pid 23721] execve("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/plugin-container", ["/shared/ubuntu-1404-test/mozilla"..., "/shared/ubuntu-1404-test/mozilla"..., "/
shared/ubuntu-1404-test/mozilla"..., "-appdir", "/shared/ubuntu-1404-test/mozilla"..., "31545", "true", "geckomediaplugin"], [/* 30 vars */]) = 0
...
[pid 23721] open("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/gmp-fake/1.0/fake.voucher", O_RDONLY <unfinished ...>
[pid 23721] <... open resumed> )        = 11
[pid 23721] lseek(11, 0, SEEK_CUR)      = 0
[pid 23721] lseek(11, 0, SEEK_END <unfinished ...>
[pid 23721] <... lseek resumed> )       = 28
[pid 23721] lseek(11, 0, SEEK_CUR <unfinished ...>
[pid 23721] <... lseek resumed> )       = 28
[pid 23721] lseek(11, 0, SEEK_SET <unfinished ...>
[pid 23721] <... lseek resumed> )       = 0
[pid 23721] read(11,  <unfinished ...>
[pid 23721] <... read resumed> "gmp-fake placeholder voucher", 8191) = 28
[pid 23721] close(11)                   = 0
[pid 23721] open("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/voucher.bin", O_RDONLY <unfinished ...>
[pid 23721] <... open resumed> )        = -1 ENOENT (No such file or directory)
[pid 23721] open("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/gmp-fake/1.0/libfake.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 23721] <... open resumed> )        = 11
[pid 23721] rt_sigaction(SIGSYS, {0x4262ee, [], SA_RESTORER|SA_NODEFER|SA_SIGINFO, 0x7f1b40e23340}, NULL, 8) = 0
[pid 23721] rt_sigprocmask(SIG_UNBLOCK, [SYS], NULL, 8) = 0
[pid 23721] gettid( <unfinished ...>
[pid 23721] <... gettid resumed> )      = 31571
[pid 23721] openat(AT_FDCWD, "/proc/self/task", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 13
[pid 23721] futex(0x7f1b32f680b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1b32f680b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 23721] futex(0x7f1b32eff9d0, FUTEX_WAIT, 31572, NULL <unfinished ...>
[pid 23721] +++ killed by SIGSEGV +++

so this is the plugin container, and sure enough it's getting its initial IPC socket dup'd to fd 3, then exec's, then does some stuff, then barfs.

How could we get some better debugging information from this plugin-container run?  Maybe a core dump?
:eihrul points out that it's not immediately obvious which is the bug here -- the IPC module is arguably handling ECONNRESET incorrectly, and should handle it as a simple EOF (like the bytes_read == 0 case).  It's possible that socket behaviors within the docker container are different from those outside (although it's the same kernel, so that seems pretty unlikely, but http://stackoverflow.com/questions/2974021/what-does-econnreset-mean-in-the-context-of-an-af-local-socket shows that it's not a very well-defined behavior), and the SIGSEGV is intentional or at least expected.

The other option is, of course, that the plugin-container's segfault is the bug, and we're just finding out about it via ECONNRESET.
I know jld has poked at the POSIX IPC code a lot, maybe he has some insight here.
I think the more likely explanation is that the segfault is the bug and the ipc ECONNRESET is just a symptom of the segfault.  So probably the most fruitful direction is to figure out how to debug the segfault.  The `strace` output didn't give much -- it seldom does for segfaults, since they're generally unrelated to a syscall.
(In reply to Morgan Phillips [:mrrrgn] from comment #2)
> Sandbox: chroot: Stale file handle

I think this is the important line.  That will be followed by a MOZ_CRASH, which segfaults the child process, which will close its fds and cause socket errors in the parent.

What I don't understand is why it's getting ESTALE.  It would help to know what filesystem type /tmp is, but it looks like I should be able to reproduce this locally without too much trouble...
Flags: needinfo?(cpearce)
More background: the sandboxing code is creating a temporary directory, rmdir'ing it, then doing fchdir+chroot to it to revoke its filesystem access.  The expected behavior is for the open fd to keep the directory inode alive enough for that; the result (when it works) is a root directory that's empty and can't have new entries created in it, even by its owner.
Assignee: nobody → jld
Blocks: 1151607
Component: Tools → Security: Process Sandboxing
Product: Release Engineering → Core
QA Contact: hwine
Version: unspecified → Trunk
(In reply to Jed Davis [:jld] {UTC-7} from comment #14)
> What I don't understand is why it's getting ESTALE.  It would help to know
> what filesystem type /tmp is, but it looks like I should be able to
> reproduce this locally without too much trouble...

aufs(5).  It seems that any lookup of "." in an unlinked directory fails with ESTALE; e.g., stat(".") also fails.
https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxChroot.cpp?from=security/sandbox/linux/SandboxChroot.cpp#75

I can replicate pretty easily in an Ubuntu docker instance:

root@f8a8286c97c3:/# python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import tempfile
>>> 
>>> tmpdir = tempfile.mkdtemp()
>>> fd = os.open(tmpdir, os.O_RDONLY | os.O_DIRECTORY)
>>> os.rmdir(tmpdir)
>>> os.fchdir(fd)
>>> os.chroot('.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 116] Stale file handle: '.'

Curiously, running that on the same host outside of docker (as a non-priviledged user) gives

dustin@euclid ~/tmp $ python ~dustin/tmp/sandbox.py 
Traceback (most recent call last):
  File "/home/dustin/tmp/sandbox.py", line 8, in <module>
    os.chroot('.')
OSError: [Errno 1] Operation not permitted: '.'

(running as root, the same script succeeds)

I do see some version of http://lwn.net/Articles/272684/ applied, which is really the only interesting thing I see about chroot and ESTALE.

Replacing 'os.chroot' with 'os.chdir' shows the same issue in docker:

>>> import os
>>> import tempfile
>>> 
>>> tmpdir = tempfile.mkdtemp()
>>> fd = os.open(tmpdir, os.O_RDONLY | os.O_DIRECTORY)
>>> os.rmdir(tmpdir)
>>> os.fchdir(fd)
>>> os.chdir('.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 116] Stale file handle: '.'

The chroot and chdir syscalls share the following code:

    struct path path;
    int error;
    unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
retry:
    error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
    if (error)
        goto out; 

    error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
    if (error)
        goto dput_and_out;

user_path_at calls filename_lookup, .. and then I saw your bug update :)
For the record, to try this out as an unprivileged user:

>>> import os,tempfile,ctypes
>>> ctypes.CDLL("libc.so.6", use_errno=True).unshare(1<<28)
0
>>> tmpdir = tempfile.mkdtemp()
>>> fd = os.open(tmpdir, os.O_RDONLY | os.O_DIRECTORY)
>>> os.rmdir(tmpdir)
>>> os.fchdir(fd)
>>> os.chroot('.')
>>> os.access("/", os.W_OK | os.X_OK)
True
>>> os.mkdir("/testdir")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory: '/testdir'

If the unshare call returns -1 (e.g., older kernel, kernel configured not to allow it, or chrooted), none of the rest of this will work — and the sandbox won't try to do it.
Docker has a tmpfs mounted at /dev/shm that should be usable, so if all else fails there's a reasonably quick fix for that case.

The general case is harder.  For example, Linux NFS is… weird.  The directory behaves normally immediately after being rmdir'ed, and can be chrooted to, but after a certain amount of time it starts returning ESTALE for everything.  I'm guessing there's some kind of cache involved, and I'm also guessing there's no guarantee the cache will live long enough that the chroot will actually work.

So there are other ways to do this — like doing the rmdir after the chroot, or something like Chromium's trick with procfs.
So I have two questions.

First, these are the GeckoMediaPlugins tests -- are they the right place to be testing the plugin sandboxing?  If not, what test suite should we be looking at, and how can we isolate GMP from the particulars of sandboxing?

Second, what would be the best way to ensure that these tests continue to give valuable results about the Gecko Media Plugins within Docker?  Secondarily, will users be able to run Firefox in Docker?  I don't think we need to be concerned about NFS in this bug, but it seems not unreasonable to default the mkdtemp to /dev/shm if it exists.  that would let Firefox, and the tests, run successfully under Docker while being unlikely to adversely affect linux systems without /dev/shm.
(In reply to Dustin J. Mitchell [:dustin] from comment #20)
> First, these are the GeckoMediaPlugins tests -- are they the right place to
> be testing the plugin sandboxing?

No, they aren't the right place.  But writing unit tests that the automated test infrastructure can't run yet hasn't been a high priority, and the sandboxing code has some unusual requirements (such as, for this code, needing a single-threaded process to unshare the user namespace and gain capabilities).

> Second, what would be the best way to ensure that these tests continue to
> give valuable results about the Gecko Media Plugins within Docker? 
> Secondarily, will users be able to run Firefox in Docker?

Ideally, Firefox shouldn't behave differently in Docker (or inside ≤31 layers of nested unprivileged LXC, or…) than otherwise.  This bug is a bug, and I have a patch…


If a temporary workaround is needed, setting MOZ_ASSUME_USER_NS=0 in the environment should be the best way.  (Setting MOZ_DISABLE_GMP_SANDBOX to anything will also work, but will disable other sandboxing features that aren't, as far as I know, at issue here.)
(In reply to Chris AtLee [:catlee] from comment #7)
> where does 'Sandbox: chroot: Stale file handle' come from?

(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> We're not sure, but that seems to appear for a lot of passing tests, too.

I missed these comments the first time I looked at this bug.  Does this mean there are GeckoMediaPlugin tests that pass despite the plugin container crashing with SIGSEGV?


Also not helping this bug be correctly diagnosed:

* MOZ_CRASH logs that it's a MOZ_CRASH on debug builds, but there were apparently non-debug?

* Our child process handling is such that we have no reliable way to find out how a child exited; we've known this since at least 2003 (bug 227246) but there are no current plans to fix it that I'm aware of.
Attached patch bug1162965-chroot-dir-hg0.diff (obsolete) — Splinter Review
Tested on Ubuntu 14.04 inside and outside of Docker.
Attachment #8604869 - Flags: review?(gdestuynder)
Comment on attachment 8604869 [details] [diff] [review]
bug1162965-chroot-dir-hg0.diff

Review of attachment 8604869 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/sandbox/linux/SandboxChroot.cpp
@@ +111,4 @@
>  }
>  
> +// This chroot()s and chdir()s to the /proc/<pid>/fdinfo directory of
> +// the calling thread.  When the thread exits the directory will

"calling" here is confusing -- it suggests the thread calling Invoke(), but that's not (I think?) what it means.  How about "the chrooting thread"?

@@ +124,5 @@
> +  char path[32]; // "/proc/2147483647/fdinfo" is 24 including '\0'.
> +  DebugOnly<size_t> len;
> +
> +  len = size_t(snprintf(path, sizeof(path), "/proc/%d/fdinfo", tid));
> +  MOZ_ASSERT(len < sizeof(path));

Using fdinfo is pretty neat.  I'm curious if this is a technique used elsewhere?  Is it fairly likely to continue to work in later Linux versions?
Comment on attachment 8604869 [details] [diff] [review]
bug1162965-chroot-dir-hg0.diff

Review of attachment 8604869 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/sandbox/linux/SandboxChroot.cpp
@@ +111,4 @@
>  }
>  
> +// This chroot()s and chdir()s to the /proc/<pid>/fdinfo directory of
> +// the calling thread.  When the thread exits the directory will

I'd meant it to indicate the caller of the function under the comment; I'll clarify.

@@ +123,5 @@
> +  int tid = syscall(__NR_gettid);
> +  char path[32]; // "/proc/2147483647/fdinfo" is 24 including '\0'.
> +  DebugOnly<size_t> len;
> +
> +  len = size_t(snprintf(path, sizeof(path), "/proc/%d/fdinfo", tid));

I just noticed that this will break with pid namespaces; see https://crbug.com/312380#c34.  It could still land now without breaking anything, but it will need to change for bug 1151624.

@@ +124,5 @@
> +  char path[32]; // "/proc/2147483647/fdinfo" is 24 including '\0'.
> +  DebugOnly<size_t> len;
> +
> +  len = size_t(snprintf(path, sizeof(path), "/proc/%d/fdinfo", tid));
> +  MOZ_ASSERT(len < sizeof(path));

Chromium has been using it since 2011 (https://crbug.com/76542).

Linux generally doesn't break userspace-exposed ABIs, ever, and I believe this counts.  The one thing I was worried about is that /proc/N/fdinfo might resolve to an object that reflects the file descriptor table itself and outlives the original task's destruction if another task shares it (e.g., another thread in the same process, absent a call to unshare(CLONE_FILES)).  I suspect that that won't happen (and it's definitely not the case up to now), but since this needs to change to use a separate process instead, that won't be a problem in any case.
Comment on attachment 8604869 [details] [diff] [review]
bug1162965-chroot-dir-hg0.diff

I'm going to try the /dev/shm approach to unblock releng.
Attachment #8604869 - Flags: review?(gdestuynder)
Comment on attachment 8605001 [details] [diff] [review]
Patch: try /dev/shm instead

Review of attachment 8605001 [details] [diff] [review]:
-----------------------------------------------------------------

This seems reasonable with the fallback.

Linux historically does not break /proc and is sometimes the only interface for some things anyway (among syscalls, netlink, dev entries, etc..)
Attachment #8605001 - Flags: review?(gdestuynder) → review+
Attachment #8604869 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/5f8235c2f2a3
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Aurora/40 is going to break the same way, so this will need uplift.  I'd prefer to let the patch spend a few days in the hands of Nightly users first, but I could do that sooner if releng needs it.
Flags: needinfo?(jld)
Comment on attachment 8605001 [details] [diff] [review]
Patch: try /dev/shm instead

Approval Request Comment
[Feature/regressing bug #]: Regression from bug 1151607
[User impact if declined]: Will break tests on TaskCluster; see comment #0
[Describe test coverage new/current, TreeHerder]: No tests yet; GeckoMediaPlugin tests run on TaskCluster will provide coverage, as do Linux Nightly users on new enough OS versions.
[Risks and why]: Nonzero but low; it's not technically impossible for a system to cause problems with this approach, but there's no reason to use such a configuration and we don't think anyone does so.
[String/UUID change made/needed]: None.
Flags: needinfo?(jld)
Attachment #8605001 - Flags: approval-mozilla-aurora?
(In reply to Jed Davis [:jld] {UTC-7} from comment #33)
> [Risks and why]: Nonzero but low; it's not technically impossible for a
> system to cause problems with this approach, but there's no reason to use
> such a configuration and we don't think anyone does so.

I should also add that risk is limited to breaking GeckoMediaPlugin support, which means OpenH264 and EME.
Attachment #8605001 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.