1162965 - ./mach gtest GeckoMediaPlugins.* fails when run in a docker container

Reporter

Description

•

9 years ago

We're attempting to move jobs to Docker containers (for TaskCluster), and unfortunately these tests do not work in that environment. Running ./mach bootstrap, ./mach build, and ./mach gtest works fine on a bare-metal Ubuntu 14.04 machine, but leads to the following errors within an Ubuntu 14.04 Docker container: https://pastebin.mozilla.org/8832881

Steps to reproduce:

    docker run -i -t ubuntu:14.04 /bin/bash
    hg clone https://hg.mozilla.org/mozilla-central && cd mozilla-central
    python/mozboot/bin/bootstrap.py --application-choice=desktop --no-interactive
    ./mach build
    ./mach gtest

Morgan Phillips [:mrrrgn]

Reporter

Updated

•

9 years ago

Blocks: 1155749

Morgan Phillips [:mrrrgn]

Reporter

Comment 1

•

9 years ago

The tests also fail in "privileged mode" docker -P

Morgan Phillips [:mrrrgn]

Reporter

Comment 2

•

9 years ago

The test is actually hanging forever, rather than outright failing

Running GTest tests...
Note: Google Test filter = GeckoMediaPlugins.GMPStorageBasic
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from GeckoMediaPlugins
[ RUN      ] GeckoMediaPlugins.GMPStorageBasic
Sandbox: chroot: Stale file handle
[10296] WARNING: pipe error (26): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv

Morgan Phillips [:mrrrgn]

Reporter

Updated

•

9 years ago

Summary: ./mach gtest fails when run in a docker container → ./mach gtest GeckoMediaPlugins.GMPStorageBasic fails when run in a docker container

Morgan Phillips [:mrrrgn]

Reporter

Comment 3

•

9 years ago

Sorry to bug you again, but given the extra info above, do you have any ideas about what's going on here?

Flags: needinfo?(cpearce)

Morgan Phillips [:mrrrgn]

Reporter

Comment 4

•

9 years ago

The next test fails too: ./mach gtest -- -GeckoMediaPlugins.GMPStorageBasic

[ RUN      ] GeckoMediaPlugins.GMPStorageForgetThisSite
[10362] WARNING: pipe error (25): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[10362] WARNING: pipe error (26): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[10362] WARNING: pipe error (23): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
[10362] WARNING: pipe error (28): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv


###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv

Sandbox: chroot: Stale file handle

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv

[10362] WARNING: pipe error (26): Connection reset by peer: file /shared/ubuntu-1404-test/mozilla-central/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

Morgan Phillips [:mrrrgn]

Reporter

Updated

•

9 years ago

Summary: ./mach gtest GeckoMediaPlugins.GMPStorageBasic fails when run in a docker container → ./mach gtest GeckoMediaPlugins.* fails when run in a docker container

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

9 years ago

Attached file: https://pastebin.mozilla.org/8832881

Here's a copy of the pastebin for posterity.

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

9 years ago

I suspect the most useful information here would be: what operating-system-level features does this test use? Knowing that might help us find the Docker feature or bug making this fail.

From the comments in ipc/chromium/src/chrome/common/ipc_channel_posix.cc:

// channel ids as the pipe names. Channels on POSIX use anonymous
// Unix domain sockets created via socketpair() as pipes. These don't
// quite line up.

OK, socketpair() isn't rocket science, that should work in Docker.
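For reference, here is a minimal sketch of the pattern that comment describes: a socketpair() shared between a parent and a forked child. It is not the Gecko code. The function name `ping_child` and the message contents are made up for illustration.

```python
import os
import socket


def ping_child() -> bytes:
    """Send one message to a forked child over a socketpair and return its reply."""
    parent_end, child_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: keep only its own end and echo the message back upper-cased.
        parent_end.close()
        data = child_end.recv(64)
        child_end.sendall(data.upper())
        child_end.close()
        os._exit(0)
    # Parent: drop the child's end so a dead child shows up as EOF or an error.
    child_end.close()
    parent_end.sendall(b"ping")
    reply = parent_end.recv(64)
    parent_end.close()
    os.waitpid(pid, 0)
    return reply
```

Running this inside and outside the container is a cheap way to confirm that plain socketpair() traffic behaves the same in both places.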

// Case 1: normal running
// The IPC server object will install a mapping in PipeMap from the
// name which it was given to the client pipe. When forking the client, the
// GetClientFileDescriptorMapping will ensure that the socket is installed in
// the magic slot (@kClientChannelFd). The client will search for the
// mapping, but it won't find any since we are in a new process. Thus the
// magic fd number is returned. Once the client connects, the server will
// close its copy of the client socket and remove the mapping.
//
// Case 2: unittests - client and server in the same process
// The IPC server will install a mapping as before. The client will search
// for a mapping and find out. It duplicates the file descriptor and
// connects. Once the client connects, the server will close the original
// copy of the client socket and remove the mapping. Thus, when the client
// object closes, it will close the only remaining copy of the client socket
// in the fd table and the server will see EOF on its side.

I'm guessing that case 2 only applies to unit tests for the IPC code itself; the GMP tests are probably using "normal running". dup(2)'ing to a well-known fd before forking is a well-established technique that works everywhere (except on OpenBSD), so that should be fine in Docker too.
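A rough sketch of the "magic slot" technique from case 1, assuming a slot number of 3 purely for illustration (the real value lives behind kClientChannelFd). `MAGIC_FD`, `CHILD_CODE` and `spawn_with_magic_fd` are hypothetical names, and the child here is just a Python one-liner rather than plugin-container.

```python
import os
import socket
import sys

MAGIC_FD = 3  # assumed slot number for this sketch only

# The exec'd child knows nothing except "my channel is on MAGIC_FD".
CHILD_CODE = (
    "import socket;"
    f"s = socket.socket(fileno={MAGIC_FD});"
    "s.sendall(b'hello from the magic fd')"
)


def spawn_with_magic_fd() -> bytes:
    """Fork, install one socketpair end on MAGIC_FD, exec, and read the child's greeting."""
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        try:
            parent_end.close()
            fd = child_end.fileno()
            if fd == MAGIC_FD:
                # Already in the slot: just make it survive the exec.
                os.set_inheritable(fd, True)
            else:
                # dup2 clears close-on-exec on the new descriptor.
                os.dup2(fd, MAGIC_FD, inheritable=True)
            os.execv(sys.executable, [sys.executable, "-c", CHILD_CODE])
        finally:
            os._exit(127)  # only reached if the exec failed
    child_end.close()
    chunks = []
    while True:
        data = parent_end.recv(64)
        if not data:  # EOF: the child closed its end
            break
        chunks.append(data)
    parent_end.close()
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

The original descriptor keeps its close-on-exec flag, so only the copy in the magic slot reaches the new program.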

###!!! [Parent][MessageChannel] Error: (msgtype=0x62000E,name=PGMP::Msg_CloseActive) Channel error: cannot send/recv
is generated by PrintErrorMessage in ipc/glue/MessageChannel.cpp; the first bit (`[Parent]`) indicates that this is the "parent" side of the channel.

I had a stare at dom/media/gtest/TestGMPCrossOrigin.cpp but it looks like the IPC is buried in the SUT so maybe a bit too deep a dive for right now.

Am I even on the right track?

Chris AtLee [:catlee]

Comment 7

•

9 years ago

Where does 'Sandbox: chroot: Stale file handle' come from?

Dustin J. Mitchell [:dustin] (he/him)

Comment 8

•

9 years ago

We're not sure, but that seems to appear for a lot of passing tests, too.

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

•

9 years ago

Stracing the parent thread from socketpair() through its error message:

[pid 23719] socketpair(PF_LOCAL, SOCK_STREAM, 0 <unfinished ...>
[pid 23719] <... socketpair resumed> , [24, 25]) = 0
[pid 23719] fcntl(24, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(25, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(24, F_GETFD <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(24, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(25, F_GETFD <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] fcntl(25, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 23719] <... fcntl resumed> )       = 0
[pid 23719] dup(24 <unfinished ...>
[pid 23719] <... dup resumed> )         = 26
[pid 23719] dup(25 <unfinished ...>
[pid 23719] <... dup resumed> )         = 27
[pid 23719] close(24)                   = 0
[pid 23719] close(25 <unfinished ...>
[pid 23719] <... close resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0b2a46aec, FUTEX_WAIT_PRIVATE, 5, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46a88, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0b2a4b90c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46c90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0c4bd0cfc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x7fb0c4bd0c98, 12 <unfinished ...>
[pid 23719] <... futex resumed> )       = 1
[pid 23719] futex(0x7fb0b2a4b80c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46ea0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] write(9, "\0", 1 <unfinished ...>
[pid 23719] <... write resumed> )       = 1
[pid 23719] futex(0x7fb0b2a46aec, FUTEX_WAIT_PRIVATE, 7, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46a88, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23719] <... futex resumed> )       = 0
[pid 23719] futex(0x7fb0b2a46be0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23719] futex(0x7fb0c4bd0cfc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x7fb0c4bd0c98, 14) = 0
[pid 23719] write(2, "\n###!!! [Parent][MessageChannel]"..., 119) = 119

Working backward, it looks like the socketpair() call is from https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#331. I bet the caller is https://dxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_posix.cpp#25, as that matches the dup()s. The close()s come from the Transport going out of scope, as described in the comment. But there the trail runs cold -- I can't see what might have called that. I don't see any unexpectedly nonzero exit statuses here that might lead to the "Channel error" being logged.

However, shortly after the dup()s, I see thread 23699 using fds 26 and 27:

[pid 23699] sendmsg(23, {msg_name(0)=NULL, msg_iov(1)=[{"\20\0\0\0\377\377\377\177\372\377\0\0\1\0\0\0\1\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 48}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {27}}, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 48
[pid 23699] close(27 <unfinished ...>
[pid 23699] <... close resumed> )       = 0

Here it sends fd 27 (one end of the socketpair) over fd 23 to some other thread, and then closes its copy of the fd.  It then sends some messages on fd 26 but those don't include fd's.
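To make the sendmsg() with SCM_RIGHTS above easier to read, here is a small single-process sketch of the same hand-off: one end of a socketpair is sent over a control socket, the sender closes its copy, and the two ends then talk directly. The helper names `send_fd`, `recv_fds` and `hand_off_demo` are made up, and the real trace crosses a process boundary, which this sketch does not.

```python
import array
import socket


def send_fd(sock: socket.socket, payload: bytes, fd: int) -> None:
    """Send payload plus one file descriptor as SCM_RIGHTS ancillary data."""
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])


def recv_fds(sock: socket.socket, msglen: int, maxfds: int):
    """Receive a message and any descriptors attached to it."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(msglen, socket.CMSG_LEN(maxfds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial int in the control data.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    return msg, list(fds)


def hand_off_demo() -> bytes:
    control_tx, control_rx = socket.socketpair()  # plays the role of fd 23
    keep, give = socket.socketpair()              # play the roles of fds 26 and 27
    send_fd(control_tx, b"fd", give.fileno())
    give.close()                                  # sender drops its copy, as in the trace
    _msg, fds = recv_fds(control_rx, 16, 1)
    received = socket.socket(fileno=fds[0])       # receiver's new descriptor
    keep.sendall(b"hello")                        # later traffic carries no descriptors
    result = received.recv(16)
    for s in (control_tx, control_rx, keep, received):
        s.close()
    return result
```

The receiver gets a fresh descriptor number for the same underlying socket, which is why the numbers in the child's trace need not match 27.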

[pid 23699] futex(0x7fb0b2a4b90c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fb0b2a4b908, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
[pid 23699] <... futex resumed> )       = 1
[pid 23699] epoll_ctl(5, EPOLL_CTL_ADD, 26, {EPOLLIN, {u32=26, u64=26}} <unfinished ...>
[pid 23699] <... epoll_ctl resumed> )   = 0
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\0\0\0\200\377\377\0\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 36
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{"\4\0\0\0\377\377\377\177\3\0^\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 36}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 36
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\2\0\0\0\1\0`\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 32}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 32
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 705643790}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942238}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 705814081}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942421}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706144638}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942737}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706362235}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 942963}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] sendmsg(26, {msg_name(0)=NULL, msg_iov(1)=[{",\0\0\0\2\0\0\0\4\0`\0\1\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0"..., 76}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT <unfinished ...>
[pid 23699] <... sendmsg resumed> )     = 76
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706605251}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943233}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 706840148}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943521}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707088944}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943687}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707271601}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 943866}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707449368}) = 0
[pid 23699] gettimeofday({1431120136, 944054}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 707620397}) = 0
[pid 23699] gettimeofday({1431120136, 944191}, NULL) = 0
[pid 23699] read(8,  <unfinished ...>
[pid 23699] <... read resumed> "\0", 1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 707752233}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 944347}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 707933673}) = 0
[pid 23699] gettimeofday({1431120136, 944548}, NULL) = 0
[pid 23699] read(8, "\0", 1)            = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 708132094}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 944719}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708301746}) = 0
[pid 23699] gettimeofday({1431120136, 944872}, NULL) = 0
[pid 23699] read(8, "\0", 1)            = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708392970}) = 0
[pid 23699] gettimeofday({1431120136, 944962}, NULL) = 0
[pid 23699] epoll_wait(5, {{EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708482559}) = 0
[pid 23699] gettimeofday({1431120136, 945052}, NULL) = 0
[pid 23699] read(8, "\0", 1)            = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 708572434}) = 0
[pid 23699] gettimeofday({1431120136, 945142}, NULL) = 0
[pid 23699] epoll_wait(5,  <unfinished ...>
[pid 23699] <... epoll_wait resumed> {{EPOLLIN|EPOLLHUP, {u32=23, u64=23}}}, 32, -1) = 1
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 720257164}) = 0
[pid 23699] gettimeofday( <unfinished ...>
[pid 23699] <... gettimeofday resumed> {1431120136, 956828}, NULL) = 0
[pid 23699] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=31571, si_status=SIGSEGV, si_utime=1, si_stime=0} ---
[pid 23699] recvmsg(23, {msg_name(0)=NULL, msg_iov(1)=[{"", 4096}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 0
[pid 23699] epoll_ctl(5, EPOLL_CTL_DEL, 23, {EPOLLIN, {u32=23, u64=23}}) = 0
[pid 23699] close(23)                   = 0
[pid 23699] futex(0x7fb0b2a46aec, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x7fb0b2a46a88, 8) = 1
[pid 23699] futex(0x7fb0b2a46be0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 23699] kill(31571, SIGTERM <unfinished ...>
[pid 23699] <... kill resumed> )        = 0
[pid 23699] wait4(31571,  <unfinished ...>
[pid 23699] <... wait4 resumed> NULL, WNOHANG, NULL) = 31571
[pid 23699] futex(0x7fb0b2a4b9cc, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23699] <... futex resumed> )       = 0
[pid 23699] futex(0x7fb0b2a46f50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23699] <... futex resumed> )       = 0
[pid 23699] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 23699] <... clock_gettime resumed> {233763, 722558252}) = 0
[pid 23699] gettimeofday({1431120136, 959187}, NULL) = 0
[pid 23699] epoll_wait(5, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=26, u64=26}}, {EPOLLIN, {u32=8, u64=8}}}, 32, -1) = 2
[pid 23699] clock_gettime(CLOCK_MONOTONIC, {233763, 722729198}) = 0
[pid 23699] gettimeofday({1431120136, 959299}, NULL) = 0

And here's our ECONNRESET!!  So presumably the close(26) that follows is responsible for the other error message.

[pid 23699] recvmsg(26, 0x7fb0b7a30b18, MSG_DONTWAIT) = -1 ECONNRESET (Connection reset by peer)
[pid 23699] write(2, "[31545] WARNING: pipe error (26)"..., 168) = 168
[pid 23699] epoll_ctl(5, EPOLL_CTL_DEL, 26, {EPOLLIN, {u32=26, u64=26}}) = 0
[pid 23699] close(26)                   = 0

Dustin J. Mitchell [:dustin] (he/him)

Comment 10

•

9 years ago

[pid 23699] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=31571, si_status=SIGSEGV, si_utime=1, si_stime=0} ---

is probably responsible for the dropped connection.  I'm a little vague on pid and tid mappings in Linux, but 

[pid 23721] gettid()                    = 31571

so I'm going to assume that pid 23721 is the one we're looking for here.  Sure enough,

[pid 23721] set_robust_list(0x7fb0b7a319e0, 24) = 0
[pid 23721] dup2(24, 3)                 = 3
[pid 23721] dup2(19, 4)                 = 4
...
[pid 23721] execve("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/plugin-container", ["/shared/ubuntu-1404-test/mozilla"..., "/shared/ubuntu-1404-test/mozilla"..., "/shared/ubuntu-1404-test/mozilla"..., "-appdir", "/shared/ubuntu-1404-test/mozilla"..., "31545", "true", "geckomediaplugin"], [/* 30 vars */]) = 0
...
[pid 23721] open("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/gmp-fake/1.0/fake.voucher", O_RDONLY <unfinished ...>
[pid 23721] <... open resumed> )        = 11
[pid 23721] lseek(11, 0, SEEK_CUR)      = 0
[pid 23721] lseek(11, 0, SEEK_END <unfinished ...>
[pid 23721] <... lseek resumed> )       = 28
[pid 23721] lseek(11, 0, SEEK_CUR <unfinished ...>
[pid 23721] <... lseek resumed> )       = 28
[pid 23721] lseek(11, 0, SEEK_SET <unfinished ...>
[pid 23721] <... lseek resumed> )       = 0
[pid 23721] read(11,  <unfinished ...>
[pid 23721] <... read resumed> "gmp-fake placeholder voucher", 8191) = 28
[pid 23721] close(11)                   = 0
[pid 23721] open("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/voucher.bin", O_RDONLY <unfinished ...>
[pid 23721] <... open resumed> )        = -1 ENOENT (No such file or directory)
[pid 23721] open("/shared/ubuntu-1404-test/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/gmp-fake/1.0/libfake.so", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 23721] <... open resumed> )        = 11
[pid 23721] rt_sigaction(SIGSYS, {0x4262ee, [], SA_RESTORER|SA_NODEFER|SA_SIGINFO, 0x7f1b40e23340}, NULL, 8) = 0
[pid 23721] rt_sigprocmask(SIG_UNBLOCK, [SYS], NULL, 8) = 0
[pid 23721] gettid( <unfinished ...>
[pid 23721] <... gettid resumed> )      = 31571
[pid 23721] openat(AT_FDCWD, "/proc/self/task", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 13
[pid 23721] futex(0x7f1b32f680b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f1b32f680b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 23721] futex(0x7f1b32eff9d0, FUTEX_WAIT, 31572, NULL <unfinished ...>
[pid 23721] +++ killed by SIGSEGV +++

so this is the plugin container, and sure enough it's getting its initial IPC socket dup'd to fd 3, then exec's, then does some stuff, then barfs.

How could we get some better debugging information from this plugin-container run?  Maybe a core dump?
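One possible way to chase the core dump idea, sketched under assumptions: launch the crashing program from a small wrapper that raises the soft core-size limit before exec and reports which signal killed it. `allow_core_dumps` and `run_and_describe` are hypothetical names, this is not how the GMP service launches plugin-container, and whether a core file actually appears also depends on the kernel's core_pattern setting.

```python
import resource
import signal
import subprocess
import sys


def allow_core_dumps() -> None:
    """Runs in the child before exec: raise the soft core limit to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_and_describe(argv, core_dumps: bool = True) -> str:
    """Run argv and report whether it exited normally or was killed by a signal."""
    proc = subprocess.run(argv, preexec_fn=allow_core_dumps if core_dumps else None)
    if proc.returncode < 0:
        # subprocess reports death by signal N as a return code of -N.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with status {proc.returncode}"


# Stand-in for a crashing child: a Python process that sends itself SIGSEGV.
CRASHER = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
```

Calling `run_and_describe(CRASHER)` should report the SIGSEGV in the same way the SIGCHLD line in the trace does, and pointing gdb at any resulting core file would give a stack for the crash.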

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

9 years ago

:eihrul points out that it's not immediately obvious which is the bug here. The IPC module is arguably handling ECONNRESET incorrectly, and should treat it as a simple EOF (like the bytes_read == 0 case). It's possible that socket behavior within the Docker container differs from that outside, and that the SIGSEGV is intentional or at least expected. That seems pretty unlikely given that it's the same kernel, but http://stackoverflow.com/questions/2974021/what-does-econnreset-mean-in-the-context-of-an-af-local-socket shows that ECONNRESET on an AF_LOCAL socket is not very well-defined behavior.

The other option is, of course, that the plugin-container's segfault is the bug, and we're just finding out about it via ECONNRESET.
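A small sketch of the ECONNRESET-versus-EOF distinction, for anyone who wants to compare behavior inside and outside the container. My understanding is that Linux reports ECONNRESET on a Unix stream socket when the peer closes with unread data still queued, but treat that as an assumption, which is why the probe reports what it sees instead of asserting it. `read_after_peer_close` and `recv_or_eof` are made-up names; the second shows the "treat reset as EOF" handling suggested above, not the actual IPC code.

```python
import socket


def read_after_peer_close(leave_unread: bool) -> str:
    """Close the peer, optionally with unread data queued for it, then try to read."""
    ours, peer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        if leave_unread:
            ours.sendall(b"never read")  # stays in the peer's receive queue
        peer.close()
        try:
            data = ours.recv(4096)
        except ConnectionResetError:
            return "reset"
        return "eof" if data == b"" else "data"
    finally:
        ours.close()


def recv_or_eof(sock: socket.socket, size: int = 4096) -> bytes:
    """Read from sock, folding ECONNRESET into an ordinary EOF (empty bytes)."""
    try:
        return sock.recv(size)
    except ConnectionResetError:
        return b""
```

If `read_after_peer_close(True)` gives the same answer on bare metal and in Docker, the socket layer is probably not where the two environments differ.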

(not currently active) Ted Mielczarek

Comment 12

•

9 years ago

I know jld has poked at the POSIX IPC code a lot, maybe he has some insight here.

Dustin J. Mitchell [:dustin] (he/him)

Comment 13

•

9 years ago

I think the more likely explanation is that the segfault is the bug and the ipc ECONNRESET is just a symptom of the segfault.  So probably the most fruitful direction is to figure out how to debug the segfault.  The `strace` output didn't give much -- it seldom does for segfaults, since they're generally unrelated to a syscall.

https://pastebin.mozilla.org/8832881 (9 years ago, Dustin J. Mitchell [:dustin] (he/him), 2.31 KB, text/plain)
bug1162965-chroot-dir-hg0.diff (9 years ago, Jed Davis [:jld], 7.99 KB, patch; jld: review-)
Patch: try /dev/shm instead (9 years ago, Jed Davis [:jld], 1.56 KB, patch; kang: review+, Sylvestre: approval-mozilla-aurora+)