Closed Bug 1051633 Opened 10 years ago Closed 10 years ago

If we print to logcat too early the homescreen fails to start

Categories

(Core :: DOM: Content Processes, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla35
blocking-b2g 2.1+
Tracking Status
firefox33 --- wontfix
firefox34 --- fixed
firefox35 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: khuey, Assigned: cyu)

References

(Blocks 1 open bug)

Details

(Whiteboard: [caf priority: p2][CR 726359])

Attachments

(7 files, 7 obsolete files)

232.67 KB, text/x-log
Details
960.00 KB, application/x-compressed-tar
Details
999 bytes, patch
Details | Diff | Splinter Review
16.74 KB, patch
khuey
: review+
Details | Diff | Splinter Review
1.37 KB, text/plain
Details
16.73 KB, patch
Details | Diff | Splinter Review
16.34 KB, patch
Details | Diff | Splinter Review
Attached file Logcat
I built flame-kk using the provided base image from b2g-internal and ./config.sh flame-kk.  Note that flame-kk does not have a 2.0 configuration, so I had to build 2.1.  When I try to boot the device after flashing 2.1 the homescreen always fails to start.  The vendor image comes with some version of 1.4 which works fine.

I have made the following observations:

1. Setting dom.ipc.prelaunchProcess.enabled to false makes the homescreen start.
2. The main thread of the Nuwa process is blocked in a different place.

On my Nexus 5 (working)
(gdb) bt
#0  __futex_syscall3 () at bionic/libc/arch-arm/bionic/futex_arm.S:39
#1  0xb6ec5ec8 in __pthread_cond_timedwait_relative (cond=cond@entry=0xbe83d874, mutex=mutex@entry=0xbe83d864, reltime=0xf0)
    at bionic/libc/bionic/pthread.c:1117
#2  0xb6ec5f28 in __pthread_cond_timedwait (cond=0xbe83d874, mutex=0xbe83d864, abstime=<optimized out>, 
    clock=<optimized out>) at bionic/libc/bionic/pthread.c:1140
#3  0xb6f42e68 in __wrap_pthread_cond_wait (cond=0xbe83d874, mtx=0xbe83d864) at ../../../mozglue/build/Nuwa.cpp:1048
#4  0xb463ee5c in ConditionVariable::Wait (this=this@entry=0xbe83d874)
    at ../../../ipc/chromium/src/base/condition_variable_posix.cc:40
#5  0xb4640c3a in base::WaitableEvent::TimedWait (this=this@entry=0xb6a3f08c, max_time=...)
    at ../../../ipc/chromium/src/base/waitable_event_posix.cc:195
#6  0xb4640c5c in base::WaitableEvent::Wait (this=this@entry=0xb6a3f08c)
    at ../../../ipc/chromium/src/base/waitable_event_posix.cc:201
#7  0xb4646140 in base::MessagePumpDefault::Run (this=0xb6a3f080, delegate=0xbe83d960)
    at ../../../ipc/chromium/src/base/message_pump_default.cc:60
#8  0xb464625e in MessageLoop::RunInternal (this=this@entry=0xbe83d960)
    at ../../../ipc/chromium/src/base/message_loop.cc:229
#9  0xb4646276 in RunHandler (this=0xbe83d960) at ../../../ipc/chromium/src/base/message_loop.cc:222
#10 MessageLoop::Run (this=this@entry=0xbe83d960) at ../../../ipc/chromium/src/base/message_loop.cc:196
#11 0xb4653e16 in ProcLoaderServiceRun (aArgv=<optimized out>, aArgc=1, aFd=<optimized out>, aPeerPid=186)
    at ../../../ipc/glue/ProcessUtils_linux.cpp:528
#12 XRE_ProcLoaderServiceRun (aPeerPid=186, aFd=<optimized out>, aArgc=1, aArgv=<optimized out>)
    at ../../../ipc/glue/ProcessUtils_linux.cpp:568
#13 0x0000b178 in RunProcesses (argv=0xbe83dac4, argc=1) at ../../../b2g/app/B2GLoader.cpp:215
#14 main (argc=1, argv=0xbe83dac4) at ../../../b2g/app/B2GLoader.cpp:247

On flame-kk (not working):
(gdb) bt
#0  __futex_wait () at bionic/libc/arch-arm/bionic/futex_arm.S:51
#1  0xb6f2f018 in __pthread_cond_timedwait_relative (cond=0xb6a04654, mutex=0xb6acc290, reltime=0xf0)
    at bionic/libc/bionic/pthread.c:1117
#2  0xb6f2f078 in __pthread_cond_timedwait (cond=0xb6a04654, mutex=0xb6acc290, abstime=<optimized out>, 
    clock=<optimized out>) at bionic/libc/bionic/pthread.c:1140
#3  0xb6ed3bfc in __wrap_pthread_cond_wait (cond=0xb6a04654, mtx=0xb6acc290) at ../../../gecko/mozglue/build/Nuwa.cpp:1048
#4  0xb69a507c in PR_WaitCondVar (cvar=0xb6a04650, timeout=4294967295)
    at ../../../../../gecko/nsprpub/pr/src/pthreads/ptsynch.c:385
#5  0xb4e2a09e in Wait (aInterval=4294967295, this=0xb6a04644) at ../../dist/include/mozilla/CondVar.h:79
#6  mozilla::Monitor::Wait (this=0xb6a04640, aInterval=aInterval@entry=4294967295)
    at ../../dist/include/mozilla/Monitor.h:40
#7  0xb4f73322 in mozilla::ipc::MessageChannel::WaitForSyncNotify (this=this@entry=0xb6a54448)
    at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/MessageChannel.cpp:1420
#8  0xb4f77b34 in mozilla::ipc::MessageChannel::SendAndWait (this=this@entry=0xb6a54448, aMsg=aMsg@entry=0xb6a3fd60, 
    aReply=aReply@entry=0xbef2de24) at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/MessageChannel.cpp:712
#9  0xb4f782da in mozilla::ipc::MessageChannel::Send (this=this@entry=0xb6a54448, aMsg=aMsg@entry=0xb6a3fd60, 
    aReply=aReply@entry=0xbef2de24) at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/MessageChannel.cpp:619
#10 0xb4fb8a5c in mozilla::dom::PContentChild::SendPCrashReporterConstructor (this=this@entry=0xb6a54418, actor=0xb6a3fd60, 
    tid=@0xbef2de70: 523, processType=@0xbef2de74: 2)
    at /home/khuey/dev/B2G-2.0-flame/objdir-gecko/ipc/ipdl/PContentChild.cpp:931
#11 0xb4fb8afa in mozilla::dom::PContentChild::SendPCrashReporterConstructor (this=this@entry=0xb6a54418, 
    tid=@0xb6756e20: 1, processType=@0x1: <error reading variable>)
    at /home/khuey/dev/B2G-2.0-flame/objdir-gecko/ipc/ipdl/PContentChild.cpp:894
#12 0xb55b9d0a in mozilla::dom::ContentChild::Init (this=this@entry=0xb6a54418, aIOLoop=aIOLoop@entry=0xb3fb1ba0, 
    aParentHandle=aParentHandle@entry=202, aChannel=0xb6a2b1d0)
    at /home/khuey/dev/B2G-2.0-flame/gecko/dom/ipc/ContentChild.cpp:581
#13 0xb55ba874 in mozilla::dom::ContentProcess::Init (this=0xb6a54400)
    at /home/khuey/dev/B2G-2.0-flame/gecko/dom/ipc/ContentProcess.cpp:27
#14 0xb5ab4218 in XRE_InitChildProcess (aArgc=<optimized out>, aArgv=<optimized out>)
    at /home/khuey/dev/B2G-2.0-flame/gecko/toolkit/xre/nsEmbedFunctions.cpp:532
#15 0xb506e38a in content_process_main (argc=8, argc@entry=9, argv=argv@entry=0xb6a01af0)
    at ../../../gecko/ipc/contentproc/plugin-container.cpp:148
#16 0xb4f75386 in mozilla::ipc::ProcLoaderLoadRunner::DoWork (this=<optimized out>)
    at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/ProcessUtils_linux.cpp:381
#17 0xb4f73ae2 in ProcLoaderServiceRun (aArgv=<optimized out>, aArgc=1, aFd=<optimized out>, aPeerPid=202)
    at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/ProcessUtils_linux.cpp:540
#18 XRE_ProcLoaderServiceRun (aPeerPid=202, aFd=<optimized out>, aArgc=1, aArgv=<optimized out>)
    at /home/khuey/dev/B2G-2.0-flame/gecko/ipc/glue/ProcessUtils_linux.cpp:568
#19 0x0000af14 in RunProcesses (argv=0xbef2ea04, argc=1) at ../../../gecko/b2g/app/B2GLoader.cpp:215
#20 main (argc=1, argv=0xbef2ea04) at ../../../gecko/b2g/app/B2GLoader.cpp:247

3. The b2g-ps output contains two b2g processes (the Nuwa thing is bug 1050122 which I see on my Nexus 5, so I don't think it matters (this is not a DEBUG build)) and a preallocated process that did not change its name.

khuey@minbar:~/dev/B2G-2.0-flame$ adb shell b2g-ps
APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              0 root      201   1     204672 74156 ffffffff b6eb58a4 S /system/b2g/b2g
b2g              0 root      618   201   50500  3640  ffffffff b6eb5a70 S /system/b2g/b2g
(Preallocated a  2 u0_a1029  1029  201   71348  26096 ffffffff b6eee8a4 S /system/b2g/plugin-container

4. The preallocated process's PPID is the main b2g process, not the Nuwa process.

5. Consistent with not changing its name, the app process never received LoadURL.

6. The about:memory report shows a ContentParent in the main process for the homescreen that is attached to the right PID.

I think the first thing we need to do is rule this out as the cause of the various 2.0 related things we see, which means we need to run B2G 2.0 on flame-kk.  mwu, how soon will we be able to do that?
Flags: needinfo?(mwu)
Attachment #8470560 - Attachment description: flame-kk.tgz → about memory report
Attached file Patch to reproduce (obsolete) —
This is the patch I used to debug bug 1048011. When I sleep 30 seconds in the child process before running the PProcLoaderChild actor, the Nuwa process never goes up and the process is stuck in PContentChild::SendPCrashReporterConstructor().
(In reply to Cervantes Yu from comment #2)
> Created attachment 8470606 [details]
> Patch to reproduce

It doesn't reproduce the case for me.
(In reply to Cervantes Yu from comment #2)
> Created attachment 8470606 [details]
> Patch to reproduce

I did not run printf_stderr() line for my previous comment.  After adding printf_stderr(), it is broken for our implementation is backed by logcat.  So, it is broken for different reason.  For it would be broken, there are lines to set the value of gDisableAndroidLog in ProcLoaderServiceRun().
Is this only on flame-kk, not for ICS?
Comment on attachment 8470606 [details]
Patch to reproduce

Scratch the patch since printf_stderr() is not supposed to be called in this place.
Attachment #8470606 - Attachment is obsolete: true
Using changeset 198832:a9b43778f0c2, we initially encountered a similar problem, but resetting gaia makes the problem go away. I still don't know why.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #0)
> I think the first thing we need to do is rule this out as the cause of the
> various 2.0 related things we see, which means we need to run B2G 2.0 on
> flame-kk.  mwu, how soon will we be able to do that?

For your purposes, switching gecko and gaia to the 2.0 branches should be enough.
Flags: needinfo?(mwu)
I could reproduce the issue with following script after one hour.

while true; do adb shell stop b2g; sleep 1; adb shell start b2g; while true; do if adb shell b2g-ps|grep Homescreen  > /dev/null; then date; break; fi; sleep 1; done; done

It is not easy to reproduce with my owned build.
(In reply to Thinker Li [:sinker] from comment #9)
> I could reproduce the issue with following script after one hour.

After studied into my case, I believe it is another bug.  See bug 1052363.

So, I still could not reproduce the bug here.
I have a gut feeling that this might be that somewhere __android_log_print() or the like is called before it's safe to call it. I am making this guess because the problematic stack is exactly the same as what I saw when I called __android_log_print() to add debug information where it's unsafe to do so. See https://bugzilla.mozilla.org/show_bug.cgi?id=977026#c83 for why we android logcat can't be used before we shuffle IPC file descriptors.

An easy way to verify: list the open file descriptors of of the process where PProcLoaderChild runs (the 2nd b2g process from b2g-ps). If it's stuck because of android logcat, you should see the following, where FDs open to /dev/log/main where overwritten by ShuffleFileDescriptors().

> lrwx------ root     root              2014-08-12 19:32 0 -> /dev/null
> lrwx------ root     root              2014-08-12 19:32 1 -> /dev/null
> l-wx------ root     root              2014-08-12 19:33 10 -> pipe:[845193]
> lrwx------ root     root              2014-08-12 19:33 11 -> socket:[842374]
> lrwx------ root     root              2014-08-12 19:33 12 -> socket:[842376]
> lrwx------ root     root              2014-08-12 19:33 13 -> socket:[842378]
> lr-x------ root     root              2014-08-12 19:32 14 -> pipe:[845194]
> l-wx------ root     root              2014-08-12 19:32 15 -> pipe:[845194]
> lrwx------ root     root              2014-08-12 19:32 2 -> /dev/null
> lrwx------ root     root              2014-08-12 19:32 3 -> socket:[843135]
> lrwx------ root     root              2014-08-12 19:32 4 -> socket:[842370]
> lr-x------ root     root              2014-08-12 19:32 5 -> /dev/__properties__
> lrwx------ root     root              2014-08-12 19:32 6 -> anon_inode:[eventpoll]
> lrwx------ root     root              2014-08-12 19:32 7 -> socket:[845191]
> lrwx------ root     root              2014-08-12 19:32 8 -> socket:[845192]
> lr-x------ root     root              2014-08-12 19:32 9 -> pipe:[845193]

On a healthy process, you should see open file descriptors like:

> l-wx------ u0_a11806 u0_a11806          2014-08-12 19:30 21 -> /dev/log/main
> l-wx------ u0_a11806 u0_a11806          2014-08-12 19:30 22 -> /dev/log/radio
> l-wx------ u0_a11806 u0_a11806          2014-08-12 19:30 23 -> /dev/log/events
> l-wx------ u0_a11806 u0_a11806          2014-08-12 19:30 24 -> /dev/log/system

One possible use of android logcat is IPC message logging, but I am not sure if this is the root cause of your problem. Kyle would you make a quick check whether the android logcat file descriptors are overwritten.
Flags: needinfo?(khuey)
If it's confirmed that the root cause is android logcat accidentially called, we could avoid the problem by pre-reserving (e.g. open to /dev/null) file descriptors for ShuffleFileDescriptors so it won't be taken by accident.
khuey@minbar:~/dev/B2G-2.0-flame$ adb shell b2g-ps
APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              0 root      199   1     192660 71496 ffffffff b6ed18a4 S /system/b2g/b2g
b2g              0 root      583   199   50500  3604  ffffffff b6ed1a70 S /system/b2g/b2g
(Preallocated a  2 u0_a1020  1020  199   69300  26064 ffffffff b6ead8a4 S /system/b2g/plugin-container
khuey@minbar:~/dev/B2G-2.0-flame$ adb shell ls -l /proc/583/fd
lrwx------ root     root              2014-08-10 16:18 0 -> /dev/null
lrwx------ root     root              2014-08-10 16:18 1 -> /dev/null
l-wx------ root     root              2014-08-10 16:20 10 -> pipe:[7676]
lrwx------ root     root              2014-08-10 16:20 11 -> socket:[6083]
lrwx------ root     root              2014-08-10 16:20 12 -> socket:[6085]
lrwx------ root     root              2014-08-10 16:20 13 -> socket:[6087]
lrwx------ root     root              2014-08-10 16:18 2 -> /dev/null
lrwx------ root     root              2014-08-10 16:18 3 -> socket:[6081]
lrwx------ root     root              2014-08-10 16:18 4 -> socket:[6077]
lr-x------ root     root              2014-08-10 16:18 5 -> /dev/__properties__
lrwx------ root     root              2014-08-10 16:18 6 -> anon_inode:[eventpoll]
lrwx------ root     root              2014-08-10 16:18 7 -> socket:[7674]
lrwx------ root     root              2014-08-10 16:18 8 -> socket:[7675]
lr-x------ root     root              2014-08-10 16:20 9 -> pipe:[7676]
Flags: needinfo?(khuey)
We should remove the |gDisableAndroidLog| flag once we preserving fds for shuffling.
I am working on a patch to reserve the magic file descriptors.
Currently we have 3 for child side IPC channel, 4 for crash reporter and 5 for android system prop.
Assignee: nobody → cyu
This preoccupies the magic file descriptors ({3, 4, 5}) with /dev/null so other operations such as __android_log_print() won't create file descriptors in the "danger zone".

Try push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=07f9fa4f56bb
Attachment #8472933 - Flags: review?(khuey)
The less error-prone patch that won't depend on the magic list {3, 4, 5}. This won't break after a new magic FD (such as 6) is added, unless the number of magic FDs grows up to 10.

Try push log: https://tbpl.mozilla.org/?tree=Try&rev=0e76f02cde77
Attachment #8472933 - Attachment is obsolete: true
Attachment #8472933 - Flags: review?(khuey)
Attachment #8473582 - Flags: review?(khuey)
I'm seeing a similar issue as khuey on JB. I locally applied attachment 8473582 [details] [diff] [review], but did not see an improvement. This is blocking me from doing any B2G dev unfortunately.

For reference my output:

    erahm-25043:mozilla-central ericrahm$ adb shell b2g-ps
    APPLICATION      USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
    b2g              root      2627  1     167152 78772 ffffffff b6f7163c S /system/b2g/b2g
    (Nuwa)           root      2630  2627  51412  7328  ffffffff b6f71808 S /system/b2g/b2g
    (Preallocated a  u0_a2791  2791  2627  70188  27180 ffffffff b6ed763c S /system/b2g/plugin-container
    erahm-25043:mozilla-central ericrahm$ adb shell ls -l /proc/2630/fd
    lrwx------ root     root              2014-08-21 14:53 0 -> /dev/null
    lrwx------ root     root              2014-08-21 14:53 1 -> /dev/null
    lrwx------ root     root              2014-08-21 15:00 10 -> socket:[13227]
    lrwx------ root     root              2014-08-21 15:00 11 -> socket:[13229]
    lrwx------ root     root              2014-08-21 15:00 12 -> socket:[13231]
    lr-x------ root     root              2014-08-21 14:53 13 -> pipe:[7998]
    l-wx------ root     root              2014-08-21 15:00 14 -> pipe:[7998]
    lr-x------ root     root              2014-08-21 15:00 19 -> /dev/__properties__
    lrwx------ root     root              2014-08-21 14:53 2 -> /dev/null
    lrwx------ root     root              2014-08-21 15:00 20 -> socket:[7994]
    lrwx------ root     root              2014-08-21 14:53 3 -> socket:[7994]
    lrwx------ root     root              2014-08-21 14:53 4 -> anon_inode:[eventpoll]
    lr-x------ root     root              2014-08-21 14:53 5 -> /dev/__properties__
    lrwx------ root     root              2014-08-21 14:53 6 -> socket:[7995]
    lrwx------ root     root              2014-08-21 14:53 7 -> socket:[7996]
    lr-x------ root     root              2014-08-21 14:53 8 -> pipe:[7997]
    l-wx------ root     root              2014-08-21 15:00 9 -> pipe:[7997]
    erahm-25043:mozilla-central ericrahm$
(In reply to Eric Rahm [:erahm] from comment #18)
> I'm seeing a similar issue as khuey on JB. I locally applied attachment
> 8473582 [details] [diff] [review], but did not see an improvement. This is
> blocking me from doing any B2G dev unfortunately.
> 
Obviously android logcat file descriptors are still missing in the Nuwa process so it could be android logcat messing with IPC channels. The patch preoccupies 10 file descriptors right after fork(). The only reason that I can think of to make it fail is android_log_print() being called before fork(). I will need your help to debug this.
To use it, launch b2g with MOZ_DEBUG_CHILD_PROCESS env variable:

> root@android:/ # export MOZ_DEBUG_CHILD_PROCESS=1
> root@android:/ # b2g.sh
> B2GLoader CHILDCHILDCHILDCHILD
>  debug me @ 17769

Then you will have 30 seconds to list the file descriptors. This is the output you should see, no android logcat FDs.
> root@android:/ # ls -l /proc/17769/fd
> lrwx------ root     root              2014-08-21 23:19 0 -> /dev/pts/2
> lrwx------ root     root              2014-08-21 23:19 1 -> /dev/pts/2
> lrwx------ root     root              2014-08-21 23:19 2 -> /dev/pts/2
> lrwx------ root     root              2014-08-21 23:19 4 -> socket:[952013]
> lr-x------ root     root              2014-08-21 23:19 8 -> /dev/__properties__ (deleted)
>root@android:/ #

Eric, could you run this test? Thanks.
Flags: needinfo?(erahm)
Fix the build break for the debug code.
Attachment #8477171 - Attachment is obsolete: true
I encountered the same on emulator debug build. The b2g loader already performed __android_log_print() even before entering main(). I sleep(30) in the beginning of main() and there are already many lines of

> I/Gecko   (   44): [44] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /mnt/SSD/data/hg/mozilla-central/xpcom/base/nsTraceRefcnt.cpp, line 148

in adb logcat, and the process already has adb logcat initialized:

> 1|root@android:/ # ls -l /proc/44/fd                                           
> lrwx------ root     root              2014-08-22 11:33 0 -> /dev/null
> lrwx------ root     root              2014-08-22 11:33 1 -> /dev/null
> lrwx------ root     root              2014-08-22 11:33 2 -> /dev/null
> l-wx------ root     root              2014-08-22 11:33 4 -> /dev/log/main
> l-wx------ root     root              2014-08-22 11:33 5 -> /dev/log/radio
> l-wx------ root     root              2014-08-22 11:33 6 -> /dev/log/events
> lr-x------ root     root              2014-08-22 11:33 8 -> /dev/__properties__ (deleted)
> root@android:/ # 

I think we still need to disable android logcat before we fork the child process.
(In reply to Cervantes Yu from comment #22)
> I encountered the same on emulator debug build. The b2g loader already
> performed __android_log_print() even before entering main(). I sleep(30) in
> the beginning of main() and there are already many lines of
> 
Scratch the above statement. The following warning are generated in loading libxul. We still can use the trick of preoccupying file descriptors, but this has to be done in the beginning of main(), and we need to close the unused ones both in parent and child processes. I'll update a patch for this.

> > I/Gecko   (   44): [44] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /mnt/SSD/data/hg/mozilla-central/xpcom/base/nsTraceRefcnt.cpp, line 148
Summary: When building flame-kk homescreen fails to start → If we print to logcat too early the homescreen fails to start
Blocks: 1052720
[Blocking Requested - why for this release]:

Blocks a CAF blocker.
blocking-b2g: --- → 2.1?
(In reply to Cervantes Yu from comment #20)
> Eric, could you run this test? Thanks.

I'll check this out now.
Note: this is reproduces 100% if you enable DMD.
I got three child processes, here's the output:

erahm-25043:~ ericrahm$ adb shell ls -l /proc/16377/fd
lrwx------ u0_a16377 u0_a16377          2014-08-22 18:55 0 -> /dev/pts/1
lrwx------ u0_a16377 u0_a16377          2014-08-22 18:55 1 -> /dev/pts/1
l-wx------ u0_a16377 u0_a16377          2014-08-22 18:55 10 -> /dev/cpuctl/apps/tasks
l-wx------ u0_a16377 u0_a16377          2014-08-22 18:55 11 -> /dev/cpuctl/apps/bg_non_interactive/tasks
lrwx------ u0_a16377 u0_a16377          2014-08-22 18:55 2 -> /dev/pts/1
lrwx------ u0_a16377 u0_a16377          2014-08-22 18:55 3 -> socket:[182717]
l-wx------ u0_a16377 u0_a16377          2014-08-22 18:55 4 -> /dev/log/main
lr-x------ u0_a16377 u0_a16377          2014-08-22 18:55 5 -> /dev/__properties__
l-wx------ u0_a16377 u0_a16377          2014-08-22 18:55 6 -> /dev/log/radio
l-wx------ u0_a16377 u0_a16377          2014-08-22 18:55 7 -> /dev/log/events
l-wx------ u0_a16377 u0_a16377          2014-08-22 18:55 8 -> /dev/log/system
lrwx------ u0_a16377 u0_a16377          2014-08-22 18:55 9 -> /dev/binder

erahm-25043:~ ericrahm$ adb shell ls -l /proc/16381/fd
lrwx------ u0_a16381 u0_a16381          2014-08-22 18:55 0 -> /dev/pts/1
lrwx------ u0_a16381 u0_a16381          2014-08-22 18:55 1 -> /dev/pts/1
l-wx------ u0_a16381 u0_a16381          2014-08-22 18:55 10 -> /dev/cpuctl/apps/tasks
l-wx------ u0_a16381 u0_a16381          2014-08-22 18:55 11 -> /dev/cpuctl/apps/bg_non_interactive/tasks
lrwx------ u0_a16381 u0_a16381          2014-08-22 18:55 2 -> /dev/pts/1
lrwx------ u0_a16381 u0_a16381          2014-08-22 18:55 3 -> socket:[183930]
l-wx------ u0_a16381 u0_a16381          2014-08-22 18:55 4 -> /dev/log/main
lr-x------ u0_a16381 u0_a16381          2014-08-22 18:55 5 -> /dev/__properties__
l-wx------ u0_a16381 u0_a16381          2014-08-22 18:55 6 -> /dev/log/radio
l-wx------ u0_a16381 u0_a16381          2014-08-22 18:55 7 -> /dev/log/events
l-wx------ u0_a16381 u0_a16381          2014-08-22 18:55 8 -> /dev/log/system
lrwx------ u0_a16381 u0_a16381          2014-08-22 18:55 9 -> /dev/binder

erahm-25043:~ ericrahm$ adb shell ls -l /proc/16158/fd
lrwx------ root     root              2014-08-22 18:55 0 -> /dev/pts/1
lrwx------ root     root              2014-08-22 18:55 1 -> /dev/pts/1
lrwx------ root     root              2014-08-22 18:55 2 -> /dev/pts/1
lrwx------ root     root              2014-08-22 18:55 3 -> socket:[183637]
lr-x------ root     root              2014-08-22 18:55 5 -> /dev/__properties__
Flags: needinfo?(erahm)
(In reply to Eric Rahm [:erahm] from comment #26)
> Note: this is reproduces 100% if you enable DMD.

Yes, because DMD call __android_log_vprint() even before we enter main(), and because of this, preoccupying file descriptors on entering main() will not work. We might need to preoccupy FDs in libmozglue, the 1st library to load if we want to completely avoid this problem.
Update the patch to address the problem when DMD is enabled.

Eric, could you test this patch to see if it works? Thanks.
Attachment #8473582 - Attachment is obsolete: true
Attachment #8473582 - Flags: review?(khuey)
Attachment #8478977 - Flags: feedback?(erahm)
Comment on attachment 8478977 [details] [diff] [review]
Make sure the magic file descriptors won't be taken for other uses (v3)

Review of attachment 8478977 [details] [diff] [review]:
-----------------------------------------------------------------

This does fix my issue, although I think we'd prefer to be able to print out those statements indicating that DMD is enabled.

If this is the only option, it might make more sense to just remove printing of those lines [1] from DMD on B2G rather than adding an env var.

[1] http://dxr.mozilla.org/mozilla-central/source/memory/replace/dmd/DMD.cpp?from=DMD.cpp&case=true#1604,1607,1614
Attachment #8478977 - Flags: feedback?(erahm) → feedback+
Changes:
* Don't use env variable in DMD.
* Only reserve FD 3-7 in the child process and child will abort if any is taken.
Attachment #8478977 - Attachment is obsolete: true
Attachment #8479776 - Flags: review?(khuey)
(In reply to Eric Rahm [:erahm] from comment #30)
> Comment on attachment 8478977 [details] [diff] [review]
> Make sure the magic file descriptors won't be taken for other uses (v3)
> 
> Review of attachment 8478977 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This does fix my issue, although I think we'd prefer to be able to print out
> those statements indicating that DMD is enabled.
> 
> If this is the only option, it might make more sense to just remove printing
> of those lines [1] from DMD on B2G rather than adding an env var.
> 
We need to disable printing these messages to android logcat on startup. I only disable the output in Init() on b2g.

I considered some alternatives like delaying the output, sending to stderr, or fork() to print, none of which is clean so I just silence the output.
NI :kyle for help with review as this is needed for 1052720.
Flags: needinfo?(khuey)
Comment on attachment 8479776 [details] [diff] [review]
Make sure the magic file descriptors won't be taken for other uses (v4)

Review of attachment 8479776 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't get a chance to test this today, but it looks good.

::: memory/replace/dmd/DMD.cpp
@@ +209,1 @@
>    __android_log_vprint(ANDROID_LOG_INFO, "DMD", aFmt, ap);

indent this two more spaces please.
Attachment #8479776 - Flags: review?(khuey) → review+
Flags: needinfo?(khuey)
Emulator runs fail because of

Hit MOZ_CRASH(ProcLoader error: a magic file descriptor is occupied.) at ../../../gecko/ipc/glue/ProcessUtils_linux.cpp:194

I need to figure out why someone still occupies a magic file descriptors.
It turns out that on debug build, nsTraceRefcnt.cpp will fopen() /data/local/tests/log/runtests_leaks.log with fileno = 4. The patch aborts because if finds FD 4 is taken.
blocking-b2g: 2.1? → 2.1+
V5 WIP to address the issue that nsTraceRefcnt.cpp fopen()s before we fork().
Also the fact that it fopen()s is not fork-safe. That's why I need to add NS_LogReinit() (and also call it after Nuwa forks).

I might need to separate the parts regarding nsTraceRefcnt into another part or track the issue with it in another bug.
How much more do we have to do here?  Not being able to run DMD is painful.
Flags: needinfo?(cyu)
I will update the patch later today and open a new bug to handle the issue with XPCOM leak logs.
Flags: needinfo?(cyu)
Try push log:
https://tbpl.mozilla.org/?tree=Try&rev=f50b037fc7fa

Request for review again because there is design change in the timing of FD reservation. It's done right after entering main(). I do this to avoid the problem with android logcat calls coming from where, and we don't need to disallow logcat calls before fork.

Because of this I don't call base::CloseSuperfluousFds() after fork. The assumption under this is that we fork pretty early in the startup process, the file descriptors before fork should be necessary in the child process.
Attachment #8479776 - Attachment is obsolete: true
Attachment #8487175 - Attachment is obsolete: true
Attachment #8491290 - Flags: review?(khuey)
Blocks: 1069191
(In reply to Cervantes Yu from comment #41)
> Created attachment 8491290 [details] [diff] [review]
> Make sure the magic file descriptors won't be taken for other uses (v5)
> 
> Try push log:
> https://tbpl.mozilla.org/?tree=Try&rev=f50b037fc7fa
> 
> Request for review again because there is design change in the timing of FD
> reservation. It's done right after entering main(). I do this to avoid the
> problem with android logcat calls coming from where, and we don't need to
> disallow logcat calls before fork.
> 
> Because of this I don't call base::CloseSuperfluousFds() after fork. The
> assumption under this is that we fork pretty early in the startup process,
> the file descriptors before fork should be necessary in the child process.

We are seeing some unknown memleak issue in FFOS 2.1 . Could you please give us a fix for v2.1 asap ? It is blocking us to debug memleak issue in FFOS v2.1
Flags: needinfo?(erahm)
Flags: needinfo?(cyu)
This is a temporary workaround for just DMD. It is untested, but based on previous attempts to fix this bug it should work.
Tapas, attachment 8491786 [details] should help workaround your DMD issues. It should apply cleanly to the 2.1 branch.
Flags: needinfo?(erahm)
(In reply to Eric Rahm [:erahm] from comment #45)
> Tapas, attachment 8491786 [details] should help workaround your DMD issues.
> It should apply cleanly to the 2.1 branch.

it seems like your patch is fixing 'website white-screen issue' in bug 1069542 comment 0 . We are landing that patch in our build for memleak debugging but we will still wait for Cervantes to land a real fix for this problem in v2.0.
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #46)

> We are landing that patch in our build for memleak debugging but we will still wait for Cervantes to land  a real fix for this problem in v2.0.

We need fix in 2.1 (not in 2.0)


 Sorry for the mistake
Whiteboard: [CR 726359] → [caf priority: p2][CR 726359]
Attached patch Patch for 2.1Splinter Review
Flags: needinfo?(cyu)
So this still doesn't make logcat work, right?  Is that possible?  If not, do we need the rest of this patch if we just block DMD from using logcat?
Flags: needinfo?(cyu)
This is supposed to make logcat work unless DMD should call StatusMsg() after it finishes initialization
( https://mxr.mozilla.org/mozilla-central/source/memory/replace/dmd/DMD.cpp#1672 )
and before entering main(). With this patch DMD is the only place that we suppress __android_log_print() calls. Once entering main(), we can safely use logcat.

Doesn't this patch work in the presense of DMD?
Flags: needinfo?(cyu)
DMD works, unless I remove the patch to DMD, of course.  I was asking if we needed the rest of the patch if we just do the DMD bit.
Comment on attachment 8491290 [details] [diff] [review]
Make sure the magic file descriptors won't be taken for other uses (v5)

Review of attachment 8491290 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

::: memory/replace/dmd/DMD.cpp
@@ +205,5 @@
> +  // Don't call __android_log_vprint() during initialization, or the magic file
> +  // descriptors will be occupied by android logcat.
> +  if (gIsDMDRunning)
> +#endif
> +    __android_log_vprint(ANDROID_LOG_INFO, "DMD", aFmt, ap);

I guess this is still useful if someone runs DMD on Firefox for Android.
Attachment #8491290 - Flags: review?(khuey) → review+
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #51)
> DMD works, unless I remove the patch to DMD, of course.  I was asking if we
> needed the rest of the patch if we just do the DMD bit.

We need to because even DMD doesn't write android logcat, XPCOM initialization (__android_log_print() in nsDebugImpl.cpp) and refcount tracing (fopen()) still creates file descriptors. They are both before we fork the child process. So we need the rest of the patch.
This was merged to m-c but the bug wasn't marked for some reason.

https://hg.mozilla.org/mozilla-central/rev/1e25cd3e8219

Please request Aurora approval on the v2.1 patch when you get a chance.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(cyu)
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
This was backed out in https://hg.mozilla.org/mozilla-central/rev/14639e630d21

Later retest shows this patch to be non-offending: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=eb441928c187

I will reland it.
Treeherder link appears to be broken. The TPBL link: https://tbpl.mozilla.org/?tree=Try&rev=eb441928c187
Comment on attachment 8492050 [details] [diff] [review]
Patch for 2.1

Approval Request Comment
[Feature/regressing bug #]: 977026
[User impact if declined]: The device fails to boot on engineering debug build or DMD is turned on.
[Describe test coverage new/current, TBPL]: Manual verification, TPBL.
[Risks and why]: Low to median. This is not a trivial patch, but we try to make it generic and prevent similar problems in the future.

The patch changes file descriptor handling in the early stage of b2g startup. The approach of file descriptor reservation should be safe and resistant to future additions of file descriptor operations.
Attachment #8492050 - Flags: approval-mozilla-aurora?
Attachment #8492050 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Backed out for bustage. Yes, this was the 2.1 patch. Looks like you never rolled the non-unified build fixes into it...
https://hg.mozilla.org/releases/mozilla-aurora/rev/e10f2bc2dd00

https://tbpl.mozilla.org/php/getParsedLog.php?id=49358071&tree=Mozilla-Aurora
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #64)
> Backed out for bustage. Yes, this was the 2.1 patch. Looks like you never
> rolled the non-unified build fixes into it...
> https://hg.mozilla.org/releases/mozilla-aurora/rev/e10f2bc2dd00
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=49358071&tree=Mozilla-Aurora

Oops, my bad. Will update the patch with the fix.
Flags: needinfo?(cyu)
Attached patch Patch for 2.1Splinter Review
Merged the fix for nonunified build.
This is causing a startup crash when I try to start with gdb

Program received signal SIGSEGV, Segmentation fault.
0x0000b616 in ReserveFileDescriptors (aReservedFds=...) at ../../../b2g/app/B2GLoader.cpp:239
239	      MOZ_CRASH("ProcLoader error: a magic file descriptor is occupied.");
(gdb) bt
#0  0x0000b616 in ReserveFileDescriptors (aReservedFds=...) at ../../../b2g/app/B2GLoader.cpp:239
#1  main (argc=1, argv=0xbee3ea24) at ../../../b2g/app/B2GLoader.cpp:279
removing this line leads to the next moz_crash at line 255. I think we have to back this out :(
Flags: needinfo?(cyu)
Did you try to restart gdb inside a gdb session? Starting the gdb session from scratch works, but if you try to kill the b2g process and restart the gdb session you will see this crash because the b2g process is started with a new socket.

This is when the crash happens:

APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              0 root      552   32681 5056   1840  c019c3a4 0000b056 t /system/b2g/b2g

And the file descriptors of b2g and gdbserver:

root@flame:/proc/32681/fd # ls -l
lrwx------ root     root              2014-10-06 11:29 0 -> /dev/pts/0
lrwx------ root     root              2014-10-06 11:29 1 -> /dev/pts/0
lrwx------ root     root              2014-10-06 11:29 2 -> /dev/pts/0
lrwx------ root     root              2014-10-06 11:29 4 -> socket:[10196674]
lr-x------ root     root              2014-10-06 11:29 8 -> /dev/__properties__
lrwx------ root     root              2014-10-06 11:29 9 -> socket:[5657]
root@flame:/proc/32681/fd # cd /proc/552/fd                                    
root@flame:/proc/552/fd # ls -l
lrwx------ root     root              2014-10-06 10:36 0 -> /dev/pts/0
lrwx------ root     root              2014-10-06 10:36 1 -> /dev/pts/0
lrwx------ root     root              2014-10-06 10:36 2 -> /dev/pts/0
lrwx------ root     root              2014-10-06 10:36 3 -> /dev/null
lrwx------ root     root              2014-10-06 10:36 4 -> socket:[10196674]
lr-x------ root     root              2014-10-06 10:36 8 -> /dev/__properties__
lrwx------ root     root              2014-10-06 10:36 9 -> socket:[5657]

Note that file descriptor 4 triggers the guard and crashes. It is leaked from gdbserver to the inferior. This leak is from the beginning of exec() and we can't code to avoid this. gdbserver needs an upgrade.
Flags: needinfo?(cyu)
It appears that we encountered this bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=15604

I tried to upgrade gdb/gdbserver. The gdb/gdbserver from ndk-10b still has the leak (and more more than the gdbserver we are using now). If we want to totally get rid of this problem, we need to upgrade to gdb/gdbserver 7.7.

The short-term solution is 'call close(4)' within gdb. This closes the socket leaked from gdbserver and you won't encounter the crash. It should be pretty safe in gdb.

The long-term solution could be that we remove the use of magic file descriptors. Then we can remove the trick of reserving file descriptors in this patch.

Greg, did you see this crash right after ./run-gdb.sh or after you killed and restarted b2g within gdb?

Now that this happens in gdb only and we can work this around, do you think we still need to back this out?
Flags: needinfo?(anygregor)
Depends on: 1078383
Flags: needinfo?(anygregor)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: