Last Comment Bug 817730 - Gaia does not come up reliably on the Pandaboard
: Gaia does not come up reliably on the Pandaboard
Status: RESOLVED FIXED
: regression
Product: Core
Classification: Components
Component: Widget: Gonk (show other bugs)
: unspecified
: x86_64 Linux
: P1 blocker (vote)
: mozilla20
Assigned To: Thomas Zimmermann [:tzimmermann] [:tdz]
:
Mentors:
Depends on:
Blocks: 802317
  Show dependency treegraph
 
Reported: 2012-12-03 11:23 PST by William Lachance (:wlach)
Modified: 2013-01-25 17:04 PST (History)
16 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
fixed
fixed
fixed


Attachments
Logcat of dead gaia loading (11.65 KB, text/plain)
2012-12-04 17:57 PST, cmtalbert
no flags Details
another logcat from this problem (11.74 KB, text/plain)
2012-12-04 18:02 PST, cmtalbert
no flags Details
Kernel from 11/27 (3.72 MB, application/octet-stream)
2012-12-05 07:57 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
no flags Details
Orangutan module for the old kernel (221.81 KB, application/octet-stream)
2012-12-05 09:13 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
no flags Details
Logcat of no b2g (19.27 KB, text/plain)
2012-12-05 14:33 PST, William Lachance (:wlach)
no flags Details
Gaia failed during third Eideticker run (243.56 KB, text/plain)
2012-12-06 02:32 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
no flags Details
Logcat with TransactionInactiveError (238.71 KB, text/plain)
2012-12-07 04:41 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
no flags Details
Don't block when rilproxy is not available (1.84 KB, patch)
2012-12-11 08:01 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
kyle: review+
Details | Diff | Splinter Review
Keep display turned on (3.95 KB, patch)
2012-12-12 04:24 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
no flags Details | Diff | Splinter Review
Don't block when rilproxy is not available (1.84 KB, patch)
2012-12-13 04:01 PST, Thomas Zimmermann [:tzimmermann] [:tdz]
tzimmermann: review+
Details | Diff | Splinter Review

Description William Lachance (:wlach) 2012-12-03 11:23:49 PST
Using the recent B2G builds and flashing it onto my Pandaboard, I've only been intermittently seeing the Gaia interface via HDMI output after startup. The B2G process itself starts just fine, and you briefly see the "Mozilla" logo on the HDMI screen, so I know things are at least partially working.

If I run eideticker tests which restart B2G and perform a few other operations via marionette, occasionally (maybe 20% of the time) things will come up as I expect. It is not clear to me why this is so.

There is nothing suspicious in the default logcat. I am currently trying to track down underlying causes by instrumenting the gonk layer which interacts with the screen, to see if anything weird is going on.

Possibly related: bug 783651.
Comment 1 Michael Wu [:mwu] 2012-12-03 11:26:40 PST
Is this a regression?
Comment 2 William Lachance (:wlach) 2012-12-03 11:27:39 PST
(In reply to Michael Wu [:mwu] from comment #1)
> Is this a regression?

Yes, a few weeks ago Gaia came up reliably so long as the Panda booted correctly.
Comment 3 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-03 11:33:07 PST
Hey,

Here are some tickets about problems somehow related to the interface on the PandaBoard. Is it one of those? What exactly is happening?

https://bugzilla.mozilla.org/show_bug.cgi?id=801658
https://bugzilla.mozilla.org/show_bug.cgi?id=815586
Comment 4 William Lachance (:wlach) 2012-12-03 13:53:45 PST
(In reply to Thomas Zimmermann from comment #3)
> Hey,
> 
> Here are some tickets about problems somehow related to the interface on the
> PandaBoard. Is it one of those? What exactly is happening?
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=801658
> https://bugzilla.mozilla.org/show_bug.cgi?id=815586

After applying the patch in bug 815586, Gaia does seem to start coming up at least as far as the lockscreen, though interaction with the user interface via marionette still seems to consistently fail. Need more investigation.

Applying the patch in bug 801658 doesn't seem to have a discernable effect.
Comment 5 Lawrence Mandel [:lmandel] (use needinfo) 2012-12-04 11:53:24 PST
Thomas - Can you take this one?
Comment 6 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-04 12:12:28 PST
(In reply to Lawrence Mandel [:lmandel] from comment #5)
> Thomas - Can you take this one?

We'll see. :/

William, what do I need to do to reproduce the problem?
Comment 7 William Lachance (:wlach) 2012-12-04 13:08:45 PST
(In reply to Thomas Zimmermann from comment #6)
> (In reply to Lawrence Mandel [:lmandel] from comment #5)
> > Thomas - Can you take this one?
> 
> We'll see. :/
> 
> William, what do I need to do to reproduce the problem?

This happened reliably on a fresh build flashed according to:

Even a basic marionette interaction like the following did not work properly:

https://github.com/mozilla/eideticker/blob/master/src/tests/b2g/appswitching/appswitching.py

(note it works fine on an Unagi)

Apparently Clint has had more luck with an image flashed according to alternate instructions. Trying that now, will get back to you.
Comment 8 cmtalbert 2012-12-04 17:57:37 PST
Created attachment 688541 [details]
Logcat of dead gaia loading

I had used the instructions at https://mana.mozilla.org/wiki/display/IT/Building+and+Installing+B2G+to+a+Panda+Board to flash the pandaboard in the way we are doing so in the full automated test framework.  I did that with a build from friday or saturday (I can get the manifest from that build when I get home and have access to that machine) and gaia worked on it.

However, I built a fresh build today, and *MOVED GECKO and Gaia to the device* and it stopped working. In other words, I did a repo sync today, built, and then did ./flash.sh gecko && ./flash.sh gaia and then I got this attached logcat.

Visually, I see the panda boot to the "mozilla" screen but it never comes off the mozilla screen. and never fully loads gaia. In one boot, I could just see the edge of the 'm' in mozilla and the character was very large, as though it were being drawn at an incorrect scale or dpi. Other boots, it booted where you could see "mozilla" in the normal size but it never loaded gaia from there.

I will try flashing the full build onto this device and see what happens with that.
Comment 9 cmtalbert 2012-12-04 17:59:23 PST
This blocks all gaia smoketest and performance work on pandaboards.

To repo: Simply pull and build, flash your panda, and you should see this.
Comment 10 cmtalbert 2012-12-04 18:02:37 PST
Created attachment 688545 [details]
another logcat from this problem
Comment 11 William Lachance (:wlach) 2012-12-04 18:54:14 PST
(In reply to William Lachance (:wlach) from comment #7)

> Apparently Clint has had more luck with an image flashed according to
> alternate instructions. Trying that now, will get back to you.

Didn't have any more luck with that (in fact, the instructions did not seem to yield a bootable image for me). :(
Comment 12 William Lachance (:wlach) 2012-12-04 18:55:18 PST
(In reply to Clint Talbert ( :ctalbert ) from comment #8)
> Created attachment 688541 [details]
> Logcat of dead gaia loading
> 
> I had used the instructions at
> https://mana.mozilla.org/wiki/display/IT/
> Building+and+Installing+B2G+to+a+Panda+Board to flash the pandaboard in the
> way we are doing so in the full automated test framework.  I did that with a
> build from friday or saturday (I can get the manifest from that build when I
> get home and have access to that machine) and gaia worked on it.
> 
> However, I built a fresh build today, and *MOVED GECKO and Gaia to the
> device* and it stopped working. In other words, I did a repo sync today,
> built, and then did ./flash.sh gecko && ./flash.sh gaia and then I got this
> attached logcat.
> 
> Visually, I see the panda boot to the "mozilla" screen but it never comes
> off the mozilla screen. and never fully loads gaia. In one boot, I could
> just see the edge of the 'm' in mozilla and the character was very large, as
> though it were being drawn at an incorrect scale or dpi. Other boots, it
> booted where you could see "mozilla" in the normal size but it never loaded
> gaia from there.
> 
> I will try flashing the full build onto this device and see what happens
> with that.

You could also try tzimmerman's patch in bug 815586, which seems to make it so that the device at least makes it to the homescreen.
Comment 13 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-05 03:09:31 PST
William, Clint, could you  add

  export B2G_DEBUG=1

to the beginning of your .userconfig and do a clean rebuild and flash? This should add a lot more debugging information to logcat. Could you attach the output of logcat to this bug?

Clint, how long did you wait before reporting this bug? I know how stupid this question sounds, but booting can take quite some time (minutes), especially after flashing. Did you press GPIO_113 after booting stopped? The PandaBoard disables its screen if no input happens, and GPIO_113 wakes it up again.
Comment 14 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-05 03:14:45 PST
(In reply to William Lachance (:wlach) from comment #2)
> (In reply to Michael Wu [:mwu] from comment #1)
> > Is this a regression?
> 
> Yes, a few weeks ago Gaia came up reliably so long as the Panda booted
> correctly.

Did it work last week? We replaced the libc last week with something closer to what we have on the phones. We could roll-back that change and see if it works again. Although I'd be surprised.
Comment 15 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-05 07:57:32 PST
Created attachment 688789 [details]
Kernel from 11/27

Clint,

Last week, I replaced the PandaBoard kernel with a newer version that adds several features for profiling. But trying this kernel within the Linaro environment, I run into lots of problems. Could you do a clean rebuild (preferably with B2G_DEBUG) and than replace the current kernel the older one I just attached to this ticket. Does this improve the booting process for you?

Let me know if you need any info or help.
Comment 16 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-05 08:58:05 PST
Here is how I replace the kernel binary.

- Run config.sh and build.sh as regular,
- rm out/target/product/panda/kernel,
- copy the kernel binary to device/ti/panda/kernel, and
- _only_ run build.sh. (Dont run config.sh)

This should give you a build with the replaced kernel binary. Running cat /proc/version from the adb shell should print

> Linux version 3.0.8-g5fe57cb (tdz@host-6-53.mv.mozilla.com) (gcc version 4.4.3 (GCC) ) #5 SMP PREEMPT Thu Nov 8 12:03:09 PST 2012
Comment 17 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-05 09:13:46 PST
Created attachment 688823 [details]
Orangutan module for the old kernel
Comment 18 Alex Keybl [:akeybl] 2012-12-05 12:44:23 PST
Mass Modify: All un-milestoned, unresolved blocking-basecamp+ bugs are being moved into the C3 milestone. Note that the target milestone does not mean that these bugs can't be resolved prior to 12/10, rather C2 bugs should be prioritized ahead of C3 bugs.
Comment 19 William Lachance (:wlach) 2012-12-05 14:33:25 PST
Created attachment 688966 [details]
Logcat of no b2g

Unfortunately those instructions do not seem to give a useable system for me. Not entirely clear why. As of now, it seems like the b2g executable does not start for whatever reason.

I had a problem earlier with this because I was using a modified init.omap4pandaboard.rc, but the issues still persist with that reverted back to the default.

As far as I know I'm using the stock images (as of this morning), with Thomas's custom kernel, the patches in bug 815586 and bug 815586 applied.

Attached is the system's logcat.
Comment 20 cmtalbert 2012-12-05 17:34:24 PST
Jgriffin's panda board went on the fritz so I just fedex'd him mine (plus serial usb debug cable, and everything else). He has more consecutive hours in the day to work on this.  So, he can follow up here -- jgriffin can you try a complete rebuild and the instructions in comment 15 and see if they help?

I'm currently panda-less at the moment, so I can help direct things but not much else.

I can say that my build from last friday/saturday was booting gaia just fine. So yes, it does seem to be quite a recent change.  I let the panda sit for several minutes after I'd flashed it.  It didn't seem to be doing anything and had stopped writing to logcat.

Can you not repro this tzimmerman? What method are you using to flash the panda?
Comment 21 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-06 01:15:53 PST
> As far as I know I'm using the stock images (as of this morning), with
> Thomas's custom kernel, the patches in bug 815586 and bug 815586 applied.
> 
> Attached is the system's logcat.

Hmm, looks like ServiceManager has a problem. :/ You might try to clean up and rebuild from scratch again.
Comment 22 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-06 01:27:56 PST
> I can say that my build from last friday/saturday was booting gaia just
> fine. So yes, it does seem to be quite a recent change.  I let the panda sit
> for several minutes after I'd flashed it.  It didn't seem to be doing
> anything and had stopped writing to logcat.
> 
> Can you not repro this tzimmerman? What method are you using to flash the
> panda?

Normally I use the  AOSP-based setup (i.e., the regular flash.sh script with some modifications). I never had any problems with the latest kernel. When I booted the Linaro setup (i.e., the one for testing) to reproduce this bug, I ran into massive problems with even booting the kernel. I should probably use Linaro as my default. And if it's not the kernel, there was also an update to the  libc that could cause problems.

There might be two separate issues here: yours and the one William reported. The current state of this bug is that I couldn't reproduce William's problem when we tried running eideticker yesterday. The tests did exactly what they were supposed to do. But maybe we can fix your bug at least.
Comment 23 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-06 02:32:45 PST
Created attachment 689143 [details]
Gaia failed during third Eideticker run

This is the logcat of a test where Gaia failed. I ran Eideticker three times, and on the third test, Gaia failed at the lock screen. It was still usable, but no tests were visible. Instead I could have unlocked the system. Any related output in the logcat is near the end of the file.
Comment 24 Jonathan Griffin (:jgriffin) 2012-12-06 13:34:29 PST
(In reply to Thomas Zimmermann from comment #23)
> Created attachment 689143 [details]
> Gaia failed during third Eideticker run
> 
> This is the logcat of a test where Gaia failed. I ran Eideticker three
> times, and on the third test, Gaia failed at the lock screen. It was still
> usable, but no tests were visible. Instead I could have unlocked the system.
> Any related output in the logcat is near the end of the file.

Where is the test that produced this?  Towards the end of the file, the only thing I see is Marionette launching the music app, going to cards view, and then returning to the top-level frame.  I don't see anything that looks like an attempt  to unlock the lock screen.  However, B2G is restarted towards the very end, after which Marionette doesn't come up...it looks like B2G isn't starting properly, unless the log was interrupted before the boot process was complete.
Comment 25 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-07 01:31:26 PST
(In reply to Jonathan Griffin (:jgriffin) from comment #24)
> (In reply to Thomas Zimmermann from comment #23)
> > Created attachment 689143 [details]
> > Gaia failed during third Eideticker run
> > 
> > This is the logcat of a test where Gaia failed. I ran Eideticker three
> > times, and on the third test, Gaia failed at the lock screen. It was still
> > usable, but no tests were visible. Instead I could have unlocked the system.
> > Any related output in the logcat is near the end of the file.
> 
> Where is the test that produced this?  Towards the end of the file, the only
> thing I see is Marionette launching the music app, going to cards view, and
> then returning to the top-level frame.  I don't see anything that looks like
> an attempt  to unlock the lock screen.  However, B2G is restarted towards
> the very end, after which Marionette doesn't come up...it looks like B2G
> isn't starting properly, unless the log was interrupted before the boot
> process was complete.

The test is the Python script that is mentioned in comment 7. As far as I can see, it opens 3 apps and switches between them. And that was what happened on the screen. Wlach helped me with setting up Eideticker and running the test.

Does it make a difference on your PandaBoard if you use the kernel that is attached to this bug instead of the one from the repository?
Comment 26 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-07 01:44:42 PST
> The test is the Python script that is mentioned in comment 7. As far as I
> can see, it opens 3 apps and switches between them. And that was what
> happened on the screen.

I should say, that was what usually happened. In this particular run, the test got stuck on the lock screen.
Comment 27 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-07 04:41:22 PST
Created attachment 689682 [details]
Logcat with TransactionInactiveError

After two successful runs, the test script returned an error during the third test.

> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> trying 2828 127.0.0.1
> Creating webserver...
> appswitching.py; src/tests/b2g/appswitching
> Setting up device
> Setting up DHCP...
> attempts left: 5
> Traceback (most recent call last):
>   File "bin/runtest.py", line 163, in <module>
>     main()
>   File "bin/runtest.py", line 151, in main
>     tempdir = EIDETICKER_TEMP_DIR)
>   File "/home/tdz/Projects/mozilla/tools/eideticker/src/eideticker/eideticker/test.py", line 50, in get_test
>     test = module.Test(**kwargs)
>   File "/home/tdz/Projects/mozilla/tools/eideticker/src/eideticker/eideticker/test.py", line 319, in __init__
>     self.device.unlock()
>   File "/home/tdz/Projects/mozilla/tools/eideticker/src/mozbase/mozb2g/mozb2g/b2gmixin.py", line 196, in unlock
>     """)
>   File "/home/tdz/Projects/mozilla/tools/eideticker/lib/python2.7/site-packages/marionette/marionette.py", line 458, in execute_async_script
>     specialPowers=special_powers)
>   File "/home/tdz/Projects/mozilla/tools/eideticker/lib/python2.7/site-packages/marionette/marionette.py", line 211, in _send_message
>     self._handle_error(response)
>   File "/home/tdz/Projects/mozilla/tools/eideticker/lib/python2.7/site-packages/marionette/marionette.py", line 246, in _handle_error
>     raise JavascriptException(message=message, status=status, stacktrace=stacktrace)
> marionette.errors.JavascriptException: TransactionInactiveError: A request was placed against a transaction which is currently not active, or which is finished.

The TransactionInactiveError also shows up in the  logcat.

> E/GeckoConsole( 2377): [JavaScript Warning: "ReferenceError: reference to undefined property message.json.command_id" {file: "chrome://marionette/content/marionette-actors.js" line: 1763}]
> I/Gecko   ( 2377): 946696274128 Marionette      INFO    sendToClient: {"from":"conn1.marionette1","error":{"message":"TransactionInactiveError: A request was placed against a transaction which is currently not active, or which is finished.","status":17,"stacktrace":null}}, undefined, {81038f70-394e-4cc0-b49f-64c4ef942dbe}
> I/Gecko   ( 2377): 946696274647 Marionette      WARN    got a response with no command_id
Comment 28 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-07 04:54:58 PST
> > E/GeckoConsole( 2377): [JavaScript Warning: "ReferenceError: reference to undefined property message.json.command_id" {file: "chrome://marionette/content/marionette-actors.js" line: 1763}]

Is that something to worry about?
Comment 29 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-07 09:50:11 PST
Don't know if that is related, but there is a backtrace of a deadlock I've seen during debugging this problem.


Thread 1:

gdb> 
#0  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
#1  0x40114758 in __pthread_cond_timedwait_relative (cond=0x449b1814, mutex=0x459a5510, reltime=0x0) at bionic/libc/bionic/pthread.c:1476
#2  0x4011480c in __pthread_cond_timedwait (cond=0x449b1814, mutex=0x459a5510, abstime=0x0, clock=0x0) at bionic/libc/bionic/pthread.c:1499
#3  0x4039304e in PR_WaitCondVar (cvar=0x449b1810, timeout=0xffffffff) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/nsprpub/pr/src/pthreads/ptsynch.c:385
#4  0x4176c58e in mozilla::CondVar::Wait (this=0xbed96574, interval=0xffffffff) at /home/tdz/Projects/mozilla/src/B2G-panda/objdir-gecko-debug/xpcom/build/BlockingResourceBase.cpp:340
#5  0x416ef512 in mozilla::Monitor::Wait (aDevice=mozilla::hal::SWITCH_USB) at ../dist/include/mozilla/Monitor.h:47
#6  mozilla::MonitorAutoLock::Wait (aDevice=mozilla::hal::SWITCH_USB) at ../dist/include/mozilla/Monitor.h:102
#7  mozilla::hal_impl::EnableSwitchNotifications (aDevice=mozilla::hal::SWITCH_USB) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/hal/gonk/GonkSwitch.cpp:391
#8  0x416e6ef4 in mozilla::hal::EnableSwitchNotifications (aDevice=mozilla::hal::SWITCH_USB) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/hal/Hal.cpp:725
#9  0x416e7ac8 in mozilla::hal::RegisterSwitchObserver (aDevice=mozilla::hal::SWITCH_USB, aObserver=0x4040afa8) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/hal/Hal.cpp:772
#10 0x411c9ee6 in UsbCableObserver () at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/dom/system/gonk/AutoMounter.cpp:510
#11 mozilla::system::InitAutoMounter () at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/dom/system/gonk/AutoMounter.cpp:544
#12 0x411c8194 in mozilla::dom::gonk::SystemWorkerManager::Init (this=0x44a2a180) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/dom/system/gonk/SystemWorkerManager.cpp:381
#13 0x411c8314 in mozilla::dom::gonk::SystemWorkerManager::FactoryCreate () at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/dom/system/gonk/SystemWorkerManager.cpp:447
#14 0x40b9b910 in SystemWorkerManagerConstructor (aOuter=0x449b1814, aIID=..., aResult=0x0) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/layout/build/nsLayoutModule.cpp:288
#15 0x4176e766 in mozilla::GenericFactory::CreateInstance (this=<value optimized out>, aOuter=<value optimized out>, aIID=<value optimized out>, aResult=0x0) at /home/tdz/Projects/mozilla/src/B2G-panda/objdir-gecko-debug/xpcom/build/GenericFactory.cpp:16
#16 0x4179ea92 in nsComponentManagerImpl::CreateInstanceByContractID (this=<value optimized out>, aContractID=0x44a424c0 "@mozilla.org/telephony/system-worker-manager;1", aDelegate=0x0, aIID=..., aResult=0xbed966c0) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/xpcom/components/nsComponentManager.cpp:1035
#17 0x4179fdcc in nsComponentManagerImpl::GetServiceByContractID (this=0x40449200, aContractID=<value optimized out>, aIID=<value optimized out>, result=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/xpcom/components/nsComponentManager.cpp:1427
#18 0x41767f9c in CallGetService (aContractID=0x44a424c0 "@mozilla.org/telephony/system-worker-manager;1", aIID=..., aResult=0x0) at /home/tdz/Projects/mozilla/src/B2G-panda/objdir-gecko-debug/xpcom/build/nsComponentManagerUtils.cpp:62
#19 0x41767fdc in nsGetServiceByContractID::operator() (this=<value optimized out>, aIID=..., aInstancePtr=0x0) at /home/tdz/Projects/mozilla/src/B2G-panda/objdir-gecko-debug/xpcom/build/nsComponentManagerUtils.cpp:246
#20 0x41767562 in nsCOMPtr_base::assign_from_gs_contractid (this=0xbed967c0, gs=..., iid=<value optimized out>) at /home/tdz/Projects/mozilla/src/B2G-panda/objdir-gecko-debug/xpcom/build/nsCOMPtr.cpp:92
#21 0x4179ae1a in nsCOMPtr (category=0x41f13da9 "profile-after-change", origin=<value optimized out>, observerTopic=<value optimized out>) at ../../dist/include/nsCOMPtr.h:912
#22 NS_CreateServicesFromCategory (category=0x41f13da9 "profile-after-change", origin=<value optimized out>, observerTopic=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/xpcom/components/nsCategoryManager.cpp:837
#23 0x40a2cf28 in nsXREDirProvider::DoStartup (this=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/toolkit/xre/nsXREDirProvider.cpp:816
#24 0x40a27170 in XREMain::XRE_mainRun (this=0xbed96a14) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/toolkit/xre/nsAppRunner.cpp:3718
#25 0x40a2a0b8 in XREMain::XRE_main (this=0xbed96a14, argc=<value optimized out>, argv=0xbed98c14, aAppData=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/toolkit/xre/nsAppRunner.cpp:3891
#26 0x40a2a26a in XRE_main (argc=0x1, argv=0xbed98c14, aAppData=0x21184, aFlags=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/toolkit/xre/nsAppRunner.cpp:4089
#27 0x0000a2a8 in do_main (argc=0x1, argv=0xbed98c14) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/b2g/app/nsBrowserApp.cpp:164
#28 __dl___vfprintf (argc=0x1, argv=0xbed98c14) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/b2g/app/nsBrowserApp.cpp:253
gdb> 

Thread 2:

gdb> bt
#0  connect () at bionic/libc/arch-arm/syscalls/connect.S:9
#1  0x41761f0e in mozilla::ipc::RilClient::OpenSocket (this=0x4499f2b0) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/ril/Ril.cpp:207
#2  0x41761fee in mozilla::ipc::RilReconnectTask::Run (this=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/ril/Ril.cpp:136
#3  0x417d904e in MessageLoop::RunTask (this=0x100ffdc8, task=0x44a9cbe0) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_loop.cc:333
#4  0x417d9878 in MessageLoop::DeferOrRunPendingTask (this=0x1d, pending_task=<value optimized out>) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_loop.cc:341
#5  0x417da5ca in MessageLoop::DoWork (this=0x100ffdc8) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_loop.cc:441
#6  0x417f0a16 in base::MessagePumpLibevent::Run (this=0x40402460, delegate=0x100ffdc8) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_pump_libevent.cc:311
#7  0x417d9602 in MessageLoop::RunInternal (this=0x100ffdc8) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_loop.cc:215
#8  0x417d9662 in MessageLoop::RunHandler (this=0x100ffdc8) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_loop.cc:208
#9  MessageLoop::Run (this=0x100ffdc8) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/message_loop.cc:182
#10 0x417e35c4 in base::Thread::ThreadMain (this=0x404073c0) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/thread.cc:156
#11 0x417f118a in ThreadFunc (closure=0x1d) at /home/tdz/Projects/mozilla/src/mozilla-inbound/bug-758103/ipc/chromium/src/base/platform_thread_posix.cc:39
#12 0x401150d0 in __thread_entry (func=0x417f1181 <ThreadFunc>, arg=0x404073c0, tls=0x100fff00) at bionic/libc/bionic/pthread.c:201
#13 0x40114b8c in pthread_create (thread_out=<value optimized out>, attr=0xbed968a0, start_routine=0x417f1181 <ThreadFunc>, arg=0x404073c0) at bionic/libc/bionic/pthread.c:337
#14 0x00000000 in ?? ()
Comment 30 Dave Hylands [:dhylands] 2012-12-07 11:01:14 PST
This looks like thread 2 (I'm assuming the I/O theread) is blocked on the OpenSocket call.

This is, in turn blocking thread 1, which is waiting for an item scheduled on the IOThread to complete.
Comment 31 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-10 03:48:30 PST
I just triggered this problem again. This might be the actual problem that William reported: I get a running b2g process, but nothing is shown on the screen because of the deadlock.

Here is what the other threads do:

  20 Thread 2307.2330  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  19 Thread 2307.2329  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  18 Thread 2307.2328  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  17 Thread 2307.2327  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  16 Thread 2307.2326  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  15 Thread 2307.2325  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  14 Thread 2307.2324  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:183
  13 Thread 2307.2323  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  12 Thread 2307.2322  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  11 Thread 2307.2321  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  10 Thread 2307.2320  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  9 Thread 2307.2319  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  8 Thread 2307.2318  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  7 Thread 2307.2317  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  6 Thread 2307.2316  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  5 Thread 2307.2315  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  4 Thread 2307.2314  poll () at bionic/libc/arch-arm/syscalls/poll.S:10
  3 Thread 2307.2313  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  2 Thread 2307.2312  connect () at bionic/libc/arch-arm/syscalls/connect.S:9
* 1 Thread 2307.2307  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
Comment 32 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-10 06:38:25 PST
Thread 2 tries to connect to /dev/socket/rilproxy. The other end of the connection should be rilproxy, but this process sleeps. Here is the state and backtrace.

>root@android:/ # cat /proc/1259/stat                                           
> 1259 (rilproxy) S 1 1259 0 0 -1 4194560 135 0 1 0 64 271 0 0 20 0 1 0 1207 847872 83 4294967295 32768 36764 3197844560 3197843176 1074781332 0 0 0 38120 3221918604 0 0 17 1 0 0 0 0 0

> ./run-gdb.sh attach 1259
> nanosleep () at bionic/libc/arch-arm/syscalls/nanosleep.S:10
> 10	    ldmfd   sp!, {r4, r7}
> gdb> bt
> #0  nanosleep () at bionic/libc/arch-arm/syscalls/nanosleep.S:10
> #1  0x4010a364 in sleep (seconds=<value optimized out>) at bionic/libc/unistd/sleep.c:45
> #2  0x00008b22 in main (argc=<value optimized out>, argv=<value optimized out>) at rilproxy/src/rilproxy.c:186

Rilproxy should only sleep for one second and then open the socket file, but it never awakes from the sleep. The next step is to find out why it doesn't wake up.

CC'ing qdot.
Comment 33 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-10 10:58:39 PST
When I disable hrtimers in the kernel, sleep works correctly, but the deadlock still persists. Thread 4 (poll) gets invalid after some time. Unfortunately, I cannot switch to thread 4 to get a trace of the stack.

>  4 Thread 2841.2848  0xffff0520 in ?? ()
Comment 34 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-10 11:12:18 PST
Wlach was able to get a stack trace of the problem he sees.

wlach@eideticker:~/src/B2G$ adb shell ps | grep b2g
root      1261  1     198636 96776 ffffffff 400744e0 S /system/b2g/b2g
app_0     1426  1261  78028  34028 ffffffff 401244e0 S /system/b2g/plugin-container
app_0     1475  1261  70528  27612 ffffffff 400654e0 S /system/b2g/plugin-container
wlach@eideticker:~/src/B2G$ ./run-gdb.sh attach 1261
Attached; pid = 1261
Listening on port 12262
prebuilt/linux-x86/toolchain/arm-linux-androideabi-4.4.x/bin/arm-linux-androideabi-gdb -x /tmp/b2g.gdbinit.wlach /home/wlach/src/B2G/objdir-gecko/dist/bin/b2g
GNU gdb (GDB) 7.1-android-gg2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=i686-linux-gnu --target=arm-elf-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/wlach/src/B2G/objdir-gecko/dist/bin/b2g...done.
Remote debugging from host 127.0.0.1
warning: .dynamic section for "out/target/product/panda/symbols/system/bin/linker" is not at the expected address (wrong library or version mismatch?)

syscall () at bionic/libc/arch-arm/bionic/syscall.S:50
50	    ldmfd   sp!, {r4, r5, r6, r7}
(gdb) 
(gdb) bt
#0  syscall () at bionic/libc/arch-arm/bionic/syscall.S:50
#1  0x417cc776 in epoll_wait (epfd=23, events=0xbea2c5a8, maxevents=16, 
    timeout=<value optimized out>)
    at /home/wlach/src/B2G/gecko/ipc/chromium/src/third_party/libevent/epoll_sub.c:51
#2  0x414f2872 in nsAppShell::ProcessNextNativeEvent (this=0x4383c700, 
    mayWait=true) at /home/wlach/src/B2G/gecko/widget/gonk/nsAppShell.cpp:717
#3  0x415144dc in nsBaseAppShell::DoProcessNextNativeEvent (this=0xfffffffc, 
    mayWait=168, recursionDepth=16)
    at /home/wlach/src/B2G/gecko/widget/xpwidgets/nsBaseAppShell.cpp:139
#4  0x415146fc in nsBaseAppShell::OnProcessNextEvent (this=0x4383c700, 
    thr=0x40404320, mayWait=168, recursionDepth=0)
    at /home/wlach/src/B2G/gecko/widget/xpwidgets/nsBaseAppShell.cpp:298
#5  0x4179ef78 in nsThread::ProcessNextEvent (this=0x40404320, mayWait=true, 
    result=0xbea2c77f)
    at /home/wlach/src/B2G/gecko/xpcom/threads/nsThread.cpp:600
#6  0x41766524 in NS_ProcessNextEvent_P (thread=0xfffffffc, mayWait=true)
    at /home/wlach/src/B2G/objdir-gecko/xpcom/build/nsThreadUtils.cpp:221
#7  0x415d25ec in mozilla::ipc::MessagePump::Run (this=0x40402430, 
    aDelegate=0x4042e1a0)
    at /home/wlach/src/B2G/gecko/ipc/glue/MessagePump.cpp:117
---Type <return> to continue, or q <return> to quit---
#8  0x417d4caa in MessageLoop::RunInternal (this=0x4042e1a0)
    at /home/wlach/src/B2G/gecko/ipc/chromium/src/base/message_loop.cc:215
#9  0x417d4d0a in MessageLoop::RunHandler (this=0x4042e1a0)
    at /home/wlach/src/B2G/gecko/ipc/chromium/src/base/message_loop.cc:208
#10 MessageLoop::Run (this=0x4042e1a0)
    at /home/wlach/src/B2G/gecko/ipc/chromium/src/base/message_loop.cc:182
#11 0x41513d62 in nsBaseAppShell::Run (this=0x4383c700)
    at /home/wlach/src/B2G/gecko/widget/xpwidgets/nsBaseAppShell.cpp:163
#12 0x41403a06 in nsAppStartup::Run (this=0x4383e340)
    at /home/wlach/src/B2G/gecko/toolkit/components/startup/nsAppStartup.cpp:291
#13 0x40a2692e in XREMain::XRE_mainRun (this=0xbea2ca14)
    at /home/wlach/src/B2G/gecko/toolkit/xre/nsAppRunner.cpp:3824
#14 0x40a29538 in XREMain::XRE_main (this=0xbea2ca14, 
    argc=<value optimized out>, argv=0xbea2ec14, aAppData=<value optimized out>)
    at /home/wlach/src/B2G/gecko/toolkit/xre/nsAppRunner.cpp:3891
#15 0x40a296ea in XRE_main (argc=1, argv=0xbea2ec14, aAppData=0x21184, 
    aFlags=<value optimized out>)
    at /home/wlach/src/B2G/gecko/toolkit/xre/nsAppRunner.cpp:4089
#16 0x0000a2a8 in do_main (argc=1, argv=0xbea2ec14)
    at /home/wlach/src/B2G/gecko/b2g/app/nsBrowserApp.cpp:164
#17 __dl___vfprintf (argc=1, argv=0xbea2ec14)
---Type <return> to continue, or q <return> to quit---
    at /home/wlach/src/B2G/gecko/b2g/app/nsBrowserApp.cpp:253
(gdb) thread
[Current thread is 1 (Thread 1261.1261)]
Comment 35 Dave Hylands [:dhylands] 2012-12-10 11:38:39 PST
So this shows that the main thread isn't deadlocked. It's just waiting for the next event to show up.
Comment 36 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 02:10:28 PST
Pull requests for reverting the recent changes to the kernel and libc are pending on github.

https://github.com/mozilla-b2g/android-device-panda/pull/15
https://github.com/mozilla-b2g/b2g-manifest/pull/33
Comment 37 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 08:01:19 PST
Created attachment 690885 [details] [diff] [review]
Don't block when rilproxy is not available

Dave is correct: there is no deadlock, just a blocked I/O thread. This is a patch for the problem.

Rilproxy first connects with b2g and then loops for ever while trying to connect to rild. On the PandaBoard no RIL is available, thus rild just sleeps. If b2g gets killed (e.g., by the automated tests), it gets restarted by init and tries to connect to rilproxy again. This blocks, because rilproxy cannot accept this connection while it's trying to connect to rild. The fix in b2g is to make the ril socket non-blocking before trying to connect. The connect call will now fail if rilproxy is not available, and the I/O task will get returned into the queue for a later retry. You can see when this happens if you get a lot of 'Cannot open socket for RIL' messages in logcat.

The problem should also be patched in rilproxy. OTOH, it's not a good idea to depend on external processes for making progress, so this patch should go in as well.
Comment 38 Dave Hylands [:dhylands] 2012-12-11 08:31:29 PST
So should we also change other socket users, like VolumeManager <--> vold and the tethering code <--> netd so that they use non-blocking opens?

I know for the VolumeManager, if the open fails, it will retry once per second until it gets connected. But if the open is going hang for some reason, thats probably bad.
Comment 39 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 08:43:01 PST
I haven't seen this problem with other tasks and it really only happens because rild and rilproxy don't behave well on PandaBoards. OTOH it might make sense to review other, similar tasks for this kind of problem.
Comment 40 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 08:43:39 PST
William, could you apply the attached patch and check if it fixes the problem for you? If it doesn't fix the problem, could you enter the debugger again and enter 'threads' (with s) to get a list of all threads. I think I told you the wrong command yesterday, because 'threads' doesn't work for me. Ping me on IRC if you need help.
Comment 41 Dave Hylands [:dhylands] 2012-12-11 08:45:53 PST
I thought the gdb command was 'info threads'
Comment 42 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 09:21:02 PST
(In reply to Dave Hylands [:dhylands] from comment #41)
> I thought the gdb command was 'info threads'

Yeah, I think you're right. I remember reading 'threads' on some website, but it didn't work for me. Instead 'thread' printed a list of all threads, but had problems switching among threads. I have gdb 7.4.
Comment 43 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 09:45:35 PST
More backtraces from wlach:

(gdb) info threads
[New Thread 2093.2098]
[New Thread 2093.2099]
[New Thread 2093.2100]
[New Thread 2093.2101]
[New Thread 2093.2102]
[New Thread 2093.2103]
[New Thread 2093.2104]
[New Thread 2093.2105]
[New Thread 2093.2106]
[New Thread 2093.2107]
[New Thread 2093.2108]
[New Thread 2093.2109]
[New Thread 2093.2110]
[New Thread 2093.2111]
[New Thread 2093.2112]
[New Thread 2093.2113]
[New Thread 2093.2114]
[New Thread 2093.2115]
[New Thread 2093.2116]
[New Thread 2093.2117]
[New Thread 2093.2118]
[New Thread 2093.2119]
[New Thread 2093.2120]
[New Thread 2093.2121]
[New Thread 2093.2122]
[New Thread 2093.2123]
[New Thread 2093.2126]
[New Thread 2093.2129]
[New Thread 2093.2130]
[New Thread 2093.2132]
  31 Thread 2093.2132  poll () at bionic/libc/arch-arm/syscalls/poll.S:10
  30 Thread 2093.2130  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  29 Thread 2093.2129  __ioctl () at bionic/libc/arch-arm/syscalls/__ioctl.S:9
  28 Thread 2093.2126  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  27 Thread 2093.2123  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  26 Thread 2093.2122  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  25 Thread 2093.2121  syscall () at bionic/libc/arch-arm/bionic/syscall.S:50
  24 Thread 2093.2120  read () at bionic/libc/arch-arm/syscalls/read.S:9
  23 Thread 2093.2119  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  22 Thread 2093.2118  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  21 Thread 2093.2117  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  20 Thread 2093.2116  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  19 Thread 2093.2115  __futex_syscall3 ()
---Type <return> to continue, or q <return> to quit---
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  18 Thread 2093.2114  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  17 Thread 2093.2113  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  16 Thread 2093.2112  poll () at bionic/libc/arch-arm/syscalls/poll.S:10
  15 Thread 2093.2111  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  14 Thread 2093.2110  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:183
  13 Thread 2093.2109  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  12 Thread 2093.2108  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  11 Thread 2093.2107  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  10 Thread 2093.2106  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  9 Thread 2093.2105  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  8 Thread 2093.2104  __futex_syscall3 ()
---Type <return> to continue, or q <return> to quit---
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  7 Thread 2093.2103  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  6 Thread 2093.2102  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  5 Thread 2093.2101  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  4 Thread 2093.2100  poll () at bionic/libc/arch-arm/syscalls/poll.S:10
  3 Thread 2093.2099  __futex_syscall3 ()
    at bionic/libc/arch-arm/bionic/atomics_arm.S:182
  2 Thread 2093.2098  syscall () at bionic/libc/arch-arm/bionic/syscall.S:50
* 1 Thread 2093.2093  syscall () at bionic/libc/arch-arm/bionic/syscall.S:50


(gdb) thread 24
[Switching to thread 24 (Thread 2093.2120)]#0  read ()
    at bionic/libc/arch-arm/syscalls/read.S:9
9	    swi     #0
(gdb) bt
#0  read () at bionic/libc/arch-arm/syscalls/read.S:9
#1  0x414e42de in frameBufferWatcher ()
    at /home/wlach/src/B2G/gecko/widget/gonk/nsWindow.cpp:132
#2  0x400ce0d0 in __thread_entry (func=0x414e41f9 <frameBufferWatcher>, arg=0x0, 
    tls=0x47771f00) at bionic/libc/bionic/pthread.c:201
#3  0x400cdb8c in pthread_create (thread_out=<value optimized out>, 
    attr=0x400f535c, start_routine=0x414e41f9 <frameBufferWatcher>, arg=0x0)
    at bionic/libc/bionic/pthread.c:337
#4  0x00000000 in ?? ()

(gdb) thread 4
[Switching to thread 4 (Thread 2093.2100)]#0  0xffff0520 in ?? ()
(gdb) bt
#0  0xffff0520 in ?? ()
Cannot access memory at address 0x3e7fc18
#1  0x405d34e6 in _pr_poll_with_poll (pds=0x43913020, npds=2, timeout=65535000)
    at /home/wlach/src/B2G/gecko/nsprpub/pr/src/pthreads/ptio.c:3920
#2  0x43913020 in ?? ()
Cannot access memory at address 0x2
#3  0x43913020 in ?? ()
Cannot access memory at address 0x2
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 2093.2093]
syscall () at bionic/libc/arch-arm/bionic/syscall.S:50
50	    ldmfd   sp!, {r4, r5, r6, r7}
(gdb) thread 4
[Switching to thread 4 (Thread 2093.2100)]#0  0xffff0520 in ?? ()
(gdb) bt
#0  0xffff0520 in ?? ()
Cannot access memory at address 0x3e7fc18
#1  0x405d34e6 in _pr_poll_with_poll (pds=0x43913020, npds=2, timeout=65535000)
    at /home/wlach/src/B2G/gecko/nsprpub/pr/src/pthreads/ptio.c:3920
warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

#2  0x00000002 in ?? (warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

) at bionic/libc/bionic/logd_write.c:196
warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

#3  0x00000002 in ?? (warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

) at bionic/libc/bionic/logd_write.c:196
warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)

Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Comment 44 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-11 09:50:00 PST
(gdb) thread 25
[Switching to thread 25 (Thread 2093.2121)]#0  syscall ()
    at bionic/libc/arch-arm/bionic/syscall.S:50
50	    ldmfd   sp!, {r4, r5, r6, r7}
(gdb) bt
#0  syscall () at bionic/libc/arch-arm/bionic/syscall.S:50
#1  0x417bb776 in epoll_wait (epfd=32, events=0x43ab1c68, maxevents=16, 
    timeout=<value optimized out>)
    at /home/wlach/src/B2G/gecko/ipc/chromium/src/third_party/libevent/epoll_sub.c:51
#2  0x414e811c in android::EventHub::getEvents (this=0x43ab1c10, 
    timeoutMillis=<value optimized out>, buffer=<value optimized out>, 
    bufferSize=<value optimized out>)
    at /home/wlach/src/B2G/gecko/widget/gonk/libui/EventHub.cpp:753
#3  0x414fb52a in android::InputReader::loopOnce (this=0x46b93000)
    at /home/wlach/src/B2G/gecko/widget/gonk/libui/InputReader.cpp:277
#4  0x414f1618 in android::InputReaderThread::threadLoop (
    this=<value optimized out>)
    at /home/wlach/src/B2G/gecko/widget/gonk/libui/InputReader.cpp:838
#5  0x401b1198 in android::Thread::_threadLoop (user=<value optimized out>)
    at frameworks/base/libs/utils/Threads.cpp:834
#6  0x401b17de in thread_data_t::trampoline (t=<value optimized out>)
    at frameworks/base/libs/utils/Threads.cpp:127
#7  0x400ce0d0 in __thread_entry (
    func=0x401b1749 <thread_data_t::trampoline(thread_data_t const*)>, 
    arg=0x46b96040, tls=0x479fff00) at bionic/libc/bionic/pthread.c:201
---Type <return> to continue, or q <return> to quit---
#8  0x400cdb8c in pthread_create (thread_out=<value optimized out>, 
    attr=0xbe9de354, 
    start_routine=0x401b1749 <thread_data_t::trampoline(thread_data_t const*)>, 
    arg=0x46b96040) at bionic/libc/bionic/pthread.c:337
#9  0x00000000 in ?? ()
Comment 45 Dave Hylands [:dhylands] 2012-12-11 10:38:06 PST
Another tip: You can get more insight into what the threads actually are by using:

adb shell ps -t PID

ps -t will show the name that's been assigned to the thread.
info threads shows you the PID.TID
Comment 46 Kyle Machulis [:kmachulis] [:qdot] 2012-12-11 10:41:05 PST
Comment on attachment 690885 [details] [diff] [review]
Don't block when rilproxy is not available

Review of attachment 690885 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, yeah, we do this in the new UnixSocket IPC stuff but always set it late here. Thanks!
Comment 47 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-12 04:24:42 PST
Created attachment 691282 [details] [diff] [review]
Keep display turned on

The read operation in thread no. 24 waits for the display to be turned on. This doesn't happen, and thus no screen-on event gets generated. This problem seems related to bug 819431.

Does the attached workaround at least mitigate the problem for you? The screen should still turn black, but Gaia hopefully starts reliably. Have a look at bug 818506 if you what to turn on the display by hand.

It looks like we need to grab a wake look to keep the display turned on.
Comment 48 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-12 04:26:12 PST
s/look/lock
Comment 49 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-12 04:28:30 PST
William, what you could do is to apply the patch in bug 818506. When b2g blocks, press the GPIO button and check if b2g wakes up and continues. The patch is at https://github.com/tdz/android-device-panda/tree/bug-818506.
Comment 50 William Lachance (:wlach) 2012-12-12 09:27:33 PST
(In reply to Kyle Machulis [:kmachulis] [:qdot] from comment #46)
> Comment on attachment 690885 [details] [diff] [review]
> Don't block when rilproxy is not available
> 
> Review of attachment 690885 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Ah, yeah, we do this in the new UnixSocket IPC stuff but always set it late
> here. Thanks!

One weird side effect of this patch is that the logcat (when debugging mode is enabled), gets absolutely filled with this message, which gets emitted every 5 seconds or so:

I/Gonk    ( 6737): Cannot open socket for RIL!

Could we tone that down a bit?
Comment 51 William Lachance (:wlach) 2012-12-12 10:57:06 PST
(In reply to Thomas Zimmermann from comment #49)
> William, what you could do is to apply the patch in bug 818506. When b2g
> blocks, press the GPIO button and check if b2g wakes up and continues. The
> patch is at https://github.com/tdz/android-device-panda/tree/bug-818506.

Hmm, I can't reproduce this problem again. Ran the test 10 times and each time b2g came up as expected.
Comment 52 William Lachance (:wlach) 2012-12-12 10:59:33 PST
Comment on attachment 691282 [details] [diff] [review]
Keep display turned on

This actually doesn't seem to be necessary, at least with the tests I've tried so far, the HDMI output stays on for the duration of the test after gecko is started.

I'm guessing that you didn't want feedback on the patch itself, so I'm going to unset the flag.
Comment 53 William Lachance (:wlach) 2012-12-12 11:04:25 PST
So with the RIL patch and the new kernel, it seems like b2g now *does* start gaia reliably (at least with a somewhat older build of gaia from last week, need to try the latest as well).

My vote would be to make sure that those things get applied and mark this one as fixed. With that out of the way there's still:

* Bug 819431 on the HDMI output is still an issue, though now that gecko starts reliably it looks like it's not blocking tests anymore. IMO this is still worth fixing as a high priority as it makes eideticker development very difficult (as well as potentially being an issue with test reliability).
* Bug 819563 on orangutan is preventing any of these tests from running properly (even though they do start). I can do some of the initial troubleshooting here though.
Comment 54 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-12 11:25:41 PST
I'll look into this again tomorrow, and when I can't find anything and it doesn't show up again, we may at least remove the blocker status.
Comment 55 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-13 04:01:41 PST
Created attachment 691768 [details] [diff] [review]
Don't block when rilproxy is not available
Comment 56 Thomas Zimmermann [:tzimmermann] [:tdz] 2012-12-14 06:58:40 PST
(In reply to William Lachance (:wlach) from comment #50)
> (In reply to Kyle Machulis [:kmachulis] [:qdot] from comment #46)
> > Comment on attachment 690885 [details] [diff] [review]
> > Don't block when rilproxy is not available
> > 
> > Review of attachment 690885 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Ah, yeah, we do this in the new UnixSocket IPC stuff but always set it late
> > here. Thanks!
> 
> One weird side effect of this patch is that the logcat (when debugging mode
> is enabled), gets absolutely filled with this message, which gets emitted
> every 5 seconds or so:
> 
> I/Gonk    ( 6737): Cannot open socket for RIL!
> 
> Could we tone that down a bit?

I have a fix for rilproxy pending on github: https://github.com/mozilla-b2g/rilproxy/pull/24. With it the warning should not appear that often. Kyle, could you review the patch and merge it if appropriate. Thanks.
Comment 57 Kyle Machulis [:kmachulis] [:qdot] 2012-12-14 11:07:17 PST
Yeah, that looks good. I'll try to bring in both the ril.cpp and rilproxy patches today.
Comment 58 Kyle Machulis [:kmachulis] [:qdot] 2012-12-14 14:22:40 PST
Ok, I brought in the rilproxy patch. Just realized I don't have all of the repos to bring in the ril.cpp patch. I could bring it into m-i at least, but I'm not sure how it needs to go to aurora/b2g18.
Comment 59 Kyle Machulis [:kmachulis] [:qdot] 2012-12-14 15:41:55 PST
Ok, didn't realize it was bb+, I'm bringing it into m-i as soon as m-i reopens.
Comment 60 Kyle Machulis [:kmachulis] [:qdot] 2012-12-14 21:59:06 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6ba8f732257
Comment 61 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-12-15 08:24:46 PST
I assume I don't need to update the b2g snapshot for a gecko related change; is this correct?
https://hg.mozilla.org/mozilla-central/file/default/b2g/config/panda/sources.xml

Do we need to uplift this change?
Comment 62 Ryan VanderMeulen [:RyanVM] 2012-12-15 13:28:45 PST
https://hg.mozilla.org/mozilla-central/rev/a6ba8f732257
Comment 64 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-12-17 06:38:47 PST
I merged m-c to cedar to get a hold of this change:
https://hg.mozilla.org/projects/cedar/rev/a6ba8f732257

I will re-image all panda boards as soon as I have a build.
Comment 65 Andreas Gal :gal 2012-12-19 23:56:43 PST
Armen, any progress here?
Comment 66 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-12-20 07:46:37 PST
(In reply to Andreas Gal :gal from comment #65)
> Armen, any progress here?

The boards got re-imaged. Nevertheless, we re-image on every check-in with the latest code.
TBH, there was no real reason to say that I was going to re-image them besides for my own personal testing.
Sorry for the confusion I created.

Note You need to log in before you can comment on or make changes to this bug.