Gecko IPC hangs during boot on jb-gonk

RESOLVED FIXED in Firefox 28

Status

()

Core
Widget: Gonk
P1
blocker
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: tkundu, Assigned: Sushil)

Tracking

unspecified
1.3 Sprint 5 - 11/22
ARM
Gonk (Firefox OS)
Points:
---
Bug Flags:
in-moztrap -

Firefox Tracking Flags

(blocking-b2g:1.3+, firefox28 fixed)

Details

Attachments

(5 attachments, 2 obsolete attachments)

Created attachment 8335099 [details]
It shows that homescreen process (pid 995) is stuck on IPDL::PBrowser::SendGetDPI ()

I observed that multicore device with FFOS 1.3 freezes just after bootup. It displays homescreen but without any icons and also we cannot pan homescreen.

Further debugging shows that Homescreen sends synchronous IPC reuest to b2g process and waits for acknowledgement from b2g . But B2G process never replies back to Homescreen process.

I also observed that Gecko_IOThread (used by IPC) is consuming 50% CPU continuously.

This issue happens randomly and it is easier to reproduce with |adb shell setprop debug.composition.type dyn |

Please note that you may need to reboot multiple times to see this issue on multicore device (JB port of FFOS 1.3).

I attached logcat for IPC transaction during boot.

I also see that device has following process running during this hangs.

adb shell b2g-ps
APPLICATION      USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              root      307   1     214792 71636 ffffffff 00000000 S /system/b2g/b2g
Homescreen       u0_a995   995   307   70904  25316 ffffffff 00000000 S /system/b2g/plugin-container
plugin-containe  root      1061  307   48784  11436 ffffffff 00000000 S /system/b2g/plugin-container
Created attachment 8335102 [details]
cpu_usage during this problem

it shows that gecko IO thread's cpu usage is very high when this happens
I can also attach the patch which I used to generate this logs. Please let me know if it helps anybody
Severity: normal → blocker
blocking-b2g: --- → 1.3?
Priority: -- → P1
blocking-b2g: 1.3? → 1.3+
Target Milestone: --- → 1.3 Sprint 5 - 11/22
I also saw that multiple compositor thread gets created randomly in FFOS 1.3 JB port. But only one compositor thread is doing composition. I am not sure whether this is related to this bug or not. 

For Ex: I can see following output in FFOS 1.3 JB port (running on multicore device). But only compositor thread id 871 is doing composition.


  $ adb shell top -t | grep Compositor
  307   871  3   0% S 182112K  87192K     root     Compositor      /system/b2g/b2g
  307   871  3   0% S 189660K  97632K     root     Compositor      /system/b2g/b2g
  307   871  0   3% S 227736K  77704K     root     Compositor      /system/b2g/b2g
  307  1042  0   0% S 227736K  77704K     root     Compositor      /system/b2g/b2g
  307   871  0   6% S 228080K  74104K     root     Compositor      /system/b2g/b2g
  307   871  0   5% S 227056K  72176K     root     Compositor      /system/b2g/b2g
Hey Ben, you know something about IPC right?  Could you please help aim Tapas in the right direction here if possible?
Flags: needinfo?(bent.mozilla)
I don't know what to make of these logs really. There should definitely only be one compositor thread in the parent process but many other threads (including many threads from other processes) can send layers transactions to it. It looks to me like the IO thread is just being flooded by way too many layers requests.

Tentatively moving this over to GFX in the hopes that these logs will mean more to the layers team.
Component: General → Graphics: Layers
Flags: needinfo?(bent.mozilla)
Product: Firefox OS → Core
(In reply to ben turner [:bent] (needinfo? encouraged) from comment #5)
> I don't know what to make of these logs really. There should definitely only
> be one compositor thread in the parent process but many other threads
> (including many threads from other processes) can send layers transactions
> to it. It looks to me like the IO thread is just being flooded by way too
> many layers requests.
> 
> Tentatively moving this over to GFX in the hopes that these logs will mean
> more to the layers team.

NI :milan here to see if he has any ideas or can help with an assignee to investigate this.
Flags: needinfo?(milan)
What kind of device is this?  Do we have access to it?  CJ, is there one in Taipei, can somebody like Peter take a look?
Flags: needinfo?(milan) → needinfo?(cku)
Can you please try with any multicore device ? You may need to reboot multiple times to produce this issue

Steps to produce:

1) Flash any multicore device with JB port of FFOS 1.3
2) Reboot it . Then try to see whether homescreen is displaying all icons or it is panning or not.
3) If you see that homescreen is showing all icons in step-2 then reboot it again and follow step-2

If you can produce this issue in step-2 then please run |adb shell top -t| and you will see Gecko_IOThread is taking 50% CPU usage . You call also see that homescreen is hanging here : inside MessageChannel::SendAndWait() 

http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#487 

Bsaically, homescreen is sending some IPDL request to b2g and waiting forever for reply in IPC request.

Comment 9

5 years ago
Peter, please cooperate with performance team on this issue
Flags: needinfo?(pchang)
Flags: needinfo?(cyu)

Updated

5 years ago
Flags: needinfo?(cku)
I am trying to debug it further myself. I am seeing that homescreen synchronous IPDL request are blocked at 

MessageChannel::SendAndWait(Message* aMsg, Message* aReply)[1] and it waits on function WaitForSyncNotify()[2] forever. It happens randomly only between homescreen and b2g process during boot on multicore device

[1] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#487
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1176
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#908

I saw for a successful synchronous IPDL request, following thing happens

1) homescreen process calls [1] . 
2) b2g process receives this synchronous message in MessageChannel::DispatchSyncMessage(const Message& aMsg) [3] and calls appropriate message handler based on message type 

Can anyone tell me what happens between above two steps. I am curious to know implementation details which may help me to debug possible race condition .
Flags: needinfo?(bent.mozilla)
I tried to reproduce this on Nexus 4 with JB gonk and m-c gecko but didn't see the problem.

From your log, it seems the b2g process fails to deliver the IPC request to the actor on the main thread. There is also another process sending a request but never getting the response:

01-01 02:10:48.289  1061  1061 I Gecko   : IPDL::PContent::SendPCrashReporterConstructor Entered

Even the composition thread is flooding, we should not see the consistent high cpu on the Gecko_IOThread. It looks like that some file descriptor the IO thread is monitoring is causing the spin.

Could you turn on IPC logging with your logging instrumentation. We need to see whether the IO threads in the 2 processes really send/recv the message over the wire. Thanks.
Flags: needinfo?(cyu)
And also, reproducing the problem with the profiler will be very helpful.
Yeah, I don't know what to recommend yet. I'd try what Cervantes suggests, the environment variable you need to set is MOZ_IPC_MESSAGE_LOG.

Could it be that the main thread is blocking while trying to do a synchronous layer transaction and that the compositor has a bunch of other layer transactions to process before it gets there?
Flags: needinfo?(bent.mozilla)
Created attachment 8337162 [details]
IPC logs

I saw that homescreen is stuck in following IPC request:
[time:22687463344][940][PBrowserChild] Sending Msg_GetDefaultScale([TODO]) 

Can anyone suggest for my comment in #comments 10?
Flags: needinfo?(cyu)
(In reply to Tapas Kumar Kundu from comment #10)
1. mozilla::ipc::MessageChannel::Send() calls mLink->SendMessage(). For IPC, mLink is of class mozilla::ipc::ProcessLink.
2. mozilla::ipc::ProcessLink::SendMessage() post a task to the IPC thread to run Transport::Send() (alias of IPC::Channel::Send()).
3. IPC::Channel::Send() calls IPC::Channel::ChannelImpl::Send().
4. IPC::Channel::ChannelImpl::Send() calls its OutputQueuePush() to enqueue the message and then ProcessOutgoingMessages() to flush the message.
5. ProcessOutgoingMessages() calls sendmsg() to send the message to over the socketpair file descriptor.
--------------------------- process boundary ---------------------------------
6. The IPC thread message loop calls IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking().
7. IPC::Channel::ChannelImpl::OnFileCanReadWithoutBlocking() calls its ProcessIncomingMessages(), which calls recvmsg() to read the message from the socketpair file descriptor.
8. IPC::Channel::ChannelImpl::ProcessIncomingMessages() calls mozilla::ipc::ProcessLink::OnMessageReceived() of its listener_.
9. mozilla::ipc::ProcessLink::OnMessageReceived() calls mozilla::ipc::MessageChannel::OnMessageReceived() of its mChan.
10. mozilla::ipc::MessageChannel::OnMessageReceived() posts a mozilla::ipc::MessageChannel::DequeueTask to the main thread's message loop.
11. mozilla::ipc::MessageChannel::DequeueTask() calls mozilla::ipc::MessageChannel::RefCountedTask::Run(), which calls an instantiation of RunnableFunction or RunnableMethod template, which calls in instantiation of DispatchToMethod template.
12. DispatchToMethod calls mozilla::ipc::MessageChannel::OnMaybeDequeueOne(), which calls mozilla::ipc::MessageChannel::DispatchMessage()
13. mozilla::ipc::MessageChannel::DispatchMessage() calls DispatchSyncMessage(), which calls mListener->OnMessageReceived(). mListener is the protocol actor
Flags: needinfo?(cyu)
(In reply to Tapas Kumar Kundu from comment #14)
> Created attachment 8337162 [details]
> IPC logs
> 
> I saw that homescreen is stuck in following IPC request:
> [time:22687463344][940][PBrowserChild] Sending Msg_GetDefaultScale([TODO]) 
> 
> Can anyone suggest for my comment in #comments 10?

Sorry, my memory is blurred. This IPC log is also from the actors, not on the IPC thread. So we still don't know what's going on on the IPC thread. Maybe you can output the errno and dump the IPC message headers before sendmsg() and after recvmsg().
Cervantes,

Since you have the most knowledge here, can you please be the owner of the same?
Flags: needinfo?(cyu)
Tapas, on which device are you seeing this problem?
If this is device is not available in Taipei, can I have remote access to your device to investigate? You may set up a tunnel and email me the access and I can take a look.
Assignee: nobody → cyu
Flags: needinfo?(cyu) → needinfo?(tkundu)

Updated

5 years ago
Flags: needinfo?(pchang)
Hi Cervantes,

I sent you email for joining meeting with me online . I think that this is the best way to resolve it quickly.  Thanks a lot for your help :)
Flags: needinfo?(cyu)
Flags: needinfo?(tkundu)
Summary: Gecko IPC hangs during boot for multicore device → Gecko IPC hangs during boot on jb-gonk
Created attachment 8339189 [details]
The stack trace of epoll_wait failure
Flags: needinfo?(cyu)
In the debug session, we kept seeing backtrace of in the attachment on b2g's IO thread (5 out of 6 samples). This looks like quite suspicious.

It looks like we are having a spinning because of epoll_wait() failure. The backtrace is about libevent generating a warning of epoll_failure, see
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/third_party/libevent/epoll.c#413

Since the error is "Invalid argument", I think this is the result of strerror(EINVAL). From epoll_wait() man page:

"EINVAL epfd is not an epoll file descriptor, or maxevents is less than or equal to zero."

Please check the parameters in calling epoll_wait() in
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/third_party/libevent/epoll.c#407
to see why we have epoll_wait() failure.
To be more specific, please check that the file descriptor is a valid epollfd like:
lrwx------ root     root              2013-11-27 06:00 36 -> anon_inode:[eventpoll]
under /proc/13849/fd, where 13849 is b2g's pid, and 36 is the epollfd.

If the failure is due to invalid file descriptor, you should see that the argument passed to epoll_wait pointing to a nonexisting one, or not "anon_inode:[eventpoll]"

If the failure is due to invalid maxevents, you should see it from gdb.
Hi Cervantes,

I shared logs with you by email . Please let me know when you are comfortable/free . I would like to see it together again. 

Thanks a lot for your help :)
Flags: needinfo?(cyu)
From the log, there are something interesting:
under /proc/<b2g pid>/fd:
lr-x------ root     root              1970-01-01 04:43 0 -> anon_inode:sync_fence
It looks like the graphics driver is using file descriptor 0.


and the following is from instrumentation to check for epoll_wait() arguments:
01-01 04:38:29.699   303   677 I Gecko   : ALERTB2G EPOLLB2G WARN epoll_dispatch L:414 : epoll_wait: fd: 0, no_of_events=1, maxevents 32, timeout -1

So the IO thread is using epoll_wait() with a file descriptor that is not an epoll fd. And no one bails out from this error. We just keep getting EINVAL when we epoll_wait() this fd.

There are 2 problems with this:
1. file descriptor 0 is used as anon_inode:sync_fence. This file descriptor should always be /dev/null if b2g is launched by init, or points to stdin if it is launced from the shell like this
lrwx------ root     root              2013-11-27 23:09 0 -> /dev/pts/0

I think you need to go back to use strace or QView to check why file descriptor 0 is closed and then used by the driver.

2. libevent's epollfd is changed to 0. We can know why the epollfd used by libevent is 0 right away with hardware watchpoint in gdb. Does the platform support this?
Flags: needinfo?(cyu)
Thanks a lot for your help . I found that stdin is closed by  a bug in MDP HAL.  We will upstream a fix for that soon. I already asked my colleague for that .

I also verified that IPC hang issue is not coming anymore ( I tried 20 times) after fixing stdin closing bug.
Good to hear that this is fixed. But I'll be interested in how a bug in MDP HAL would ever result in the MessagePumpLibevent using 0 in epoll_wait.
 Cervantes Yu ,

MDP HAL closes fd 0 atleast two times without opening it. It happens during boot. when MDP HAL closes it 1st time , epoll_wait starts polling on fd 0 and it hangs when MDP closes it 2nd time. MDP HAL is running as a part of b2g process and Gecko_IOThread is also running part of b2g process. It seems to me that this is causing high CPI usage by Gecko_IOThread because it keeps polling on fd 0 continuously. 

Thanks a lot for your help :)
(Assignee)

Comment 28

5 years ago
Created attachment 8341380 [details] [diff] [review]
Initialize acquire and release fence fds of HWC_BACKGROUND layer.

Thanks Tapas and Cervantes Yu for pointing out fd 0 issue.

HAL relies on framework to set proper acquire fence fds on all HWC layers. So acquire and release fence fds of HWC_BACKGROUND layer need to be initialized as well. Otherwise fd 0 gets closed in closeAcquireFds() during hwc set call which is a valid fd and it leads to bad state if fd 0 is being used as stdin. There is 1 more path which can also lead to closing of fd 0 in HAL. We are also up-streaming the HAL fix.
Assignee: cyu → sushilchauhan
(Assignee)

Updated

5 years ago
Component: Graphics: Layers → General
Product: Core → Firefox OS
(Assignee)

Updated

5 years ago
Attachment #8341380 - Flags: review?(dwilson)
Comment on attachment 8341380 [details] [diff] [review]
Initialize acquire and release fence fds of HWC_BACKGROUND layer.

Review of attachment 8341380 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8341380 - Flags: review?(dwilson) → review+
Please add the bug number to the patch description
Component: General → Widget: Gonk
Product: Firefox OS → Core
(Assignee)

Comment 31

5 years ago
Created attachment 8341419 [details] [diff] [review]
Bug 940851 - Gecko IPC hangs during boot on jb-gonk. r=dwilson

Uploading HG friendly patch.
Attachment #8341380 - Attachment is obsolete: true
Attachment #8341419 - Flags: review+
(Assignee)

Updated

5 years ago
Keywords: checkin-needed
nit: would be nice if the commit message described what the patch was doing rather than just regurgitating the title of this bug.
(Assignee)

Comment 33

5 years ago
Created attachment 8341461 [details] [diff] [review]
Bug 940851 - Initialize acquire & release-fence fds of HWC_BACKGROUND layer. r=dwilson

Initialize acquire and release fence fds of HWC_BACKGROUND layer. Otherwise fd 0 gets closed during hwc set call which is a valid fd and it leads to bad state if fd 0 is being used as stdin. HAL relies on framework to set proper acquire fence fds on all HWC layers.
Attachment #8341419 - Attachment is obsolete: true
Attachment #8341461 - Flags: review+
(Assignee)

Updated

5 years ago
Status: NEW → ASSIGNED
https://hg.mozilla.org/mozilla-central/rev/4d198cb36af0
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
status-firefox28: --- → fixed

Updated

4 years ago
Flags: in-moztrap-
You need to log in before you can comment on or make changes to this bug.