Closed Bug 868919 Opened 11 years ago Closed 11 years ago

Abort with 'constructor for actor failed' in PCompositorChild.cpp

Categories

(Core :: IPC, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: mattwoodrow, Assigned: dvander)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

This crash occurs when running the /dom/browser-element/mochitest/test_browserElement_oop_AppWindowNamespace.html mochitest with OMTC enabled (pref - layers.offmainthreadcomposition.enabled=true).

On Mac at least, it happens in every configuration *except* 10.6 and 10.8 debug.

https://tbpl.mozilla.org/?tree=Try&rev=7156b2a18035

https://tbpl.mozilla.org/php/getParsedLog.php?id=22494283&tree=Try&full=1#error0

STR:
- Create a Mac build with --enable-optimize and --enable-debug (this is the configuration used for 'opt' builds on TBPL)
- From the objdir run 'TEST_PATH=dom/browser-element/mochitest/test_browserElement_oop_AppWindowNamespace.html EXTRA_TEST_ARGS='--setpref=layers.offmainthreadcomposition.enabled=true' make mochitest-plain'


This bug may also reproduce on linux (with the BasicCompositor), I haven't tested there.
The real error is ###!!! [Child][SyncChannel] Error: Channel error: cannot send/recv

Which means that the other side has crashed, or somehow the channel got disconnected. This seems plausible since immediately before this we see XPCOM shutdown goop:

02:12:37     INFO -   => mAllocCount:           9118

So it appears that one side is shutting down but the other side isn't aware of this and is still sending messages. IIRC the compositor protocol is "backwards" (child is the chrome process and parent is the content process).

I doubt this is IPC; almost certainly graphics or something.
I've tried running with debuggers attached to both processes, and the first sign of trouble is the abort mentioned in the bug title.

The child process crashes with that abort, the main process continues to run fine.
Also 'child' is the tab-content process, and 'parent' is the UI/chrome process.

The browser itself continues to run after this abort, we just fail tests because the content process died.
You need to figure out which constructor is failing, I guess. It still seems that some process is shutting down early, based on the logging information.
Windows 7 32-bit try opt test reftest-ipc on 2013-05-15 17:18:09 PDT for push af1d22cc25e8
slave: t-w732-ix-061
https://tbpl.mozilla.org/php/getParsedLog.php?id=23004291&tree=Try#error0
This bug appears to be with this code:

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/Transport_posix.cpp#47

I haven't got my head around the exact details, but switching those two booleans to false fixes the bug.

It appears that we create a socket (and a pair of fd's) on the main-process-main-thread.

We dup both of them in that function, and pass 'true' for the boolean meaning that we can free them once they've been sent to the other process through the socket.

The one we send to the child process gets dup'd, so closing it after the message is fine. The one that gets sent to the compositor thread doesn't get dup'd, so closing it kills the socket.

It's not obvious why this only happens some of the time.
Thanks Matt for doing a lot of the debugging here. Chromium ran into a similar issue, but landed on a workaround: http://code.google.com/p/chromium/issues/detail?id=26754#c87

The question is whether this is a race condition in our code, or a race condition in the kernel. It looks like neither, if I'm reading the XNU source right. In unp_gc() in bsd/kern/uipc_usrreq.c, 

>    /*
>     * If all refs are from msgs, and it's not marked accessible
>     * then it must be referenced from some unreachable cycle
>     * of (shut-down) FDs, so include it in our
>     * list of FDs to remove
>     */
>    if (fg->fg_count == fg->fg_msgcount && !(fg->fg_flag & FMARK)) {

So, if the only active reference to a socket is sitting in a process's message queue, it is considered dead and will be closed.

BSD 4.4 lists this under "BUGS" (http://www.unix.com/man-page/FreeBSD/2/sendmsg/):

>     Because sendmsg() does not necessarily block until the data has been
>     transferred, it is possible to transfer an open file descriptor across an
>     AF_UNIX domain socket (see recv(2)), then close() it before it has actu-
>     ally been sent, the result being that the receiver gets a closed file
>     descriptor.  It is left to the application to implement an acknowledgment
>     mechanism to prevent this from happening.

So this should be easy to fix by just holding the parent fds alive until the owning channel is closed.
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Whoops, forgot to add some context: the bug is in the chromium IPC code:
 
 http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#664

The file descriptor is immediately closed after being sent, so if it sits in the message queue too long it will be forcefully closed.
Attached patch fix for reported issue (obsolete) — Splinter Review
Update: I have had a patch for this, but I still get intermittent test failures due to a severed pipe. Unfortunately printfs make the problem much harder to reproduce, and any sort of dtrace attachment makes the problem go away completely.

After a lot of fruitless debugging, I built a custom OS X kernel to log when pipes are disconnected due to sitting in a message queue. I can confirm that this is the reason we're failing to construct a PCompositor as reported in comment #0. The attached patch works around this issue by making the receiving channel ACK, so the parent knows when to close. I can also confirm that this addresses the kernel issue.

The severed pipe problem that appears even with this patch, seems to be different. I get a much more intermittent, but still reproducible failure where the actual initial process pipe is being disconnected. I'm not sure how, as I see no close() calls to the other end of the pipe. At this point I'm suspecting either a timing bug in our code, another kernel issue, or maybe an errant close() or dup2() call that I'm missing. It looks like this second issue has been causing lots intermittent orange as of recently.

Unfortunately I've run out of time to debug this at the moment, so I'm putting the first patch up for review in case anyone wants to play with it. When I get back my plan is to keep playing with custom kernel builds to narrow this down
Attachment #760163 - Flags: review?(cjones.bugs)
Another data point is that removing the CommitAll() call completely in ipc_channel_posix.cc makes all of the issues go away. It's easy to see why this would fix the sendmsg() bug, but not the process pipe one, unless it just changes timing or something.
Attached patch fix for reported issue (obsolete) — Splinter Review
correct patch
Attachment #760163 - Attachment is obsolete: true
Attachment #760163 - Flags: review?(cjones.bugs)
Attachment #760283 - Flags: review?(cjones.bugs)
Comment on attachment 760283 [details] [diff] [review]
fix for reported issue

Is anyone else able to review this?

This is blocking us from being able to enable off-main-thread compositing on mac, an important goal for the gfx team.
Attachment #760283 - Flags: review?(bent.mozilla)
Attachment #760283 - Flags: review?(benjamin)
Comment on attachment 760283 [details] [diff] [review]
fix for reported issue

bent is the right person for this.
Attachment #760283 - Flags: review?(cjones.bugs)
Attachment #760283 - Flags: review?(benjamin)
Comment on attachment 760283 [details] [diff] [review]
fix for reported issue

Review of attachment 760283 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good but I think we need some more sanity checks:

::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ +566,5 @@
>  
> +#if defined(OS_MACOSX)
> +          // Send a message to the other side, indicating that we are now
> +          // responsible for closing the descriptor.
> +          Message *msg = new Message(MSG_ROUTING_NONE,

Nit: Can you name this something less likely to collide? fdAck maybe?

@@ +569,5 @@
> +          // responsible for closing the descriptor.
> +          Message *msg = new Message(MSG_ROUTING_NONE,
> +                                     RECEIVED_FDS_MESSAGE_TYPE,
> +                                     IPC::Message::PRIORITY_NORMAL);
> +          msg->set_fd_cookie(m.fd_cookie());

Can you DCHECK that this is non-zero?

@@ +588,5 @@
>            listener_->OnChannelConnected(MessageIterator(m).NextInt());
> +#if defined(OS_MACOSX)
> +        } else if (m.routing_id() == MSG_ROUTING_NONE &&
> +                   m.type() == RECEIVED_FDS_MESSAGE_TYPE) {
> +          CloseDescriptors(m.fd_cookie());

DCHECK here too.

@@ +670,5 @@
>        msgh.msg_controllen = cmsg->cmsg_len;
>  
>        msg->header()->num_fds = num_fds;
> +#if defined(OS_MACOSX)
> +      msg->set_fd_cookie(++last_pending_fd_id_);

Can you make sure to set this to 0 if we're not sending fds? We could only do it in debug builds I guess.

@@ +689,2 @@
>      if (bytes_written > 0)
>        msg->file_descriptor_set()->CommitAll();

Nit: Comment here saying that fds get committed in response to the RECEIVED_FDS_MESSAGE_TYPE special message later.

@@ +805,5 @@
>    }
>  }
>  
> +#if defined(OS_MACOSX)
> +void Channel::ChannelImpl::CloseDescriptors(uint32_t pending_fd_id)

DCHECK pending_fd_id here

@@ +807,5 @@
>  
> +#if defined(OS_MACOSX)
> +void Channel::ChannelImpl::CloseDescriptors(uint32_t pending_fd_id)
> +{
> +  for (std::list<PendingDescriptors>::iterator i = pending_fds_.begin();

We should really have some kind of assertion that we always find the id we're looking for. Looping through and not closing anything should be an error.

::: ipc/chromium/src/chrome/common/ipc_channel_posix.h
@@ +52,5 @@
>    // MessageLoopForIO::Watcher implementation.
>    virtual void OnFileCanReadWithoutBlocking(int fd);
>    virtual void OnFileCanWriteWithoutBlocking(int fd);
>  
> +  void CloseDescriptors(uint32_t pending_fd_id);

This needs an #ifdef guard, it's only implemented on OSX

::: ipc/chromium/src/chrome/common/ipc_message.h
@@ +291,5 @@
>      msgid_t type;   // specifies the user-defined message type
>      uint32_t flags;   // specifies control flags for the message
>  #if defined(OS_POSIX)
>      uint32_t num_fds; // the number of descriptors included with this message
> +# if defined(OS_POSIX)

You're already inside an OS_POSIX block here... Did you mean to do OS_MACOSX here (and above for the cookie accessors)?
Attachment #760283 - Flags: review?(bent.mozilla)
Comment on attachment 760283 [details] [diff] [review]
fix for reported issue

Review of attachment 760283 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, also, we need to handle the case where the channel goes down before we receive the RECEIVED_FDS_MESSAGE_TYPE message... Otherwise we'll leak fds.
(In reply to ben turner [:bent] from comment #15)
> Comment on attachment 760283 [details] [diff] [review]
> fix for reported issue
> 
> Review of attachment 760283 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Oh, also, we need to handle the case where the channel goes down before we
> receive the RECEIVED_FDS_MESSAGE_TYPE message... Otherwise we'll leak fds.

If the channel goes down, the list should destruct which will close all the fd sets, so that should be okay.
Attached patch fix v2Splinter Review
Attachment #760283 - Attachment is obsolete: true
Attachment #764521 - Flags: review?(bent.mozilla)
(In reply to David Anderson [:dvander] from comment #16)

> 
> If the channel goes down, the list should destruct which will close all the
> fd sets, so that should be okay.

Looks like it will hit a warning in the FileDescriptorSet destructor though, so I made this explicit.
Comment on attachment 764521 [details] [diff] [review]
fix v2

Review of attachment 764521 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with this addressed:

::: ipc/chromium/src/chrome/common/ipc_channel.h
@@ +128,2 @@
>    enum {
> +#if defined(OS_POSIX)

This should be OS_MACOSX right?

::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ +718,5 @@
>        message_send_bytes_written_ = 0;
>  
> +#if defined(OS_MACOSX)
> +      if (!msg->file_descriptor_set()->empty())
> +        pending_fds_.push_back(PendingDescriptors(msg->fd_cookie(), msg->file_descriptor_set()));

Nit: Please wrap this a little.

@@ +821,5 @@
> +      pending_fds_.erase(i);
> +      return;
> +    }
> +  }
> +  NOTREACHED();

This would be better:

  DCHECK(false) << "pending_fd_id not in our list!";

::: ipc/chromium/src/chrome/common/ipc_message.cc
@@ +42,5 @@
>    header()->rpc_local_stack_depth = static_cast<uint32_t>(-1);
>    header()->seqno = 0;
> +# if defined(OS_MACOSX)
> +  header()->cookie = 0;
> +# endif

Nit: Here and in ipc_message.h please remove the spaces after the #. That isn't really a common style here.

::: ipc/glue/ProtocolUtils.h
@@ +33,5 @@
> +    SHMEM_DESTROYED_MESSAGE_TYPE = kuint16max - 6,
> +    UNBLOCK_CHILD_MESSAGE_TYPE = kuint16max - 5,
> +    BLOCK_CHILD_MESSAGE_TYPE   = kuint16max - 4,
> +    SHMEM_CREATED_MESSAGE_TYPE = kuint16max - 3,
> +    GOODBYE_MESSAGE_TYPE       = kuint16max - 2

Nit: I liked the old comment about where to find |kuint16max - 1|.
Attachment #764521 - Flags: review?(bent.mozilla) → review+
Sorry for review lag, I was on vacation.  Thanks for reviewing, bent.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #20)
> Sorry for review lag, I was on vacation.  Thanks for reviewing, bent.

JEALOUS ;)
https://hg.mozilla.org/mozilla-central/rev/79d6fa05cff4
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
Depends on: 886892
No longer depends on: 886892
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: