Bug 821379 - Pandaboard will become unresponsive after idling
Status: RESOLVED WONTFIX
Product: Core
Classification: Components
Component: Widget: Gonk
Version: Trunk
Platform: x86 Gonk (Firefox OS)
Importance: -- normal
Assigned To: Nobody; OK to take it and work on it
Reported: 2012-12-13 10:26 PST by Malini Das [:mdas] - Away, not checking bugmail
Modified: 2016-03-21 08:47 PDT

Description Malini Das [:mdas] - Away, not checking bugmail 2012-12-13 10:26:35 PST
If you let the pandaboard idle, it will eventually become unresponsive: 'adb devices' just hangs, and you have to reboot the board to get it working again.
I find that this problem happens intermittently, after a random period of idle time.
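
Because 'adb devices' hangs rather than failing, any automated check for this state needs a timeout. A minimal Python sketch follows, assuming `adb` is on the PATH; the 10-second default and the function names are illustrative, not part of any existing tool.

```python
import subprocess


def parse_adb_devices(output):
    """Return {serial: state} from the text printed by `adb devices`."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header line, and adb daemon chatter ("* daemon ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def probe_adb(timeout_s=10):
    """Run `adb devices`; return None if it hangs longer than timeout_s.

    The timeout value is an assumption chosen for illustration.
    """
    try:
        result = subprocess.run(
            ["adb", "devices"], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return parse_adb_devices(result.stdout)
```

A `None` result from `probe_adb()` would correspond to the hang described above; an empty dict would mean adb answered but listed no board.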
Comment 1 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-12-13 10:30:24 PST
This can be seen on the releng side by a nagios check declaring the board as ping down.

This could easily be fixed with a script: if the mozpool status is free and ping is down, we can just reboot the device without asking any further questions.
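
A rough Python sketch of that check is below. The mozpool lookup and the reboot action are passed in as callables because their real interfaces are not shown in this bug; the state string "free", the `ping` flags (Linux iputils) and all function names are assumptions.

```python
import subprocess


def ping_ok(host, timeout_s=2):
    """True if a single ICMP echo to host succeeds (Linux `ping` flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_reboot(mozpool_state, ping_up):
    """Reboot only when the board is unallocated and not answering ping."""
    return mozpool_state == "free" and not ping_up


def check_board(host, get_state, reboot, ping=ping_ok):
    """get_state and reboot are hypothetical hooks supplied by the caller."""
    if should_reboot(get_state(host), ping(host)):
        reboot(host)
        return True
    return False
```

Keeping the decision in `should_reboot` separate from the side effects makes the rule easy to test without touching a real board.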
Comment 2 Malini Das [:mdas] - Away, not checking bugmail 2012-12-13 11:19:15 PST
Hmm, after building and flashing today's build, I haven't seen this problem yet. I'll keep this open for a while to make sure it's not a fluke.
Comment 3 Malini Das [:mdas] - Away, not checking bugmail 2012-12-13 11:31:17 PST
New problem!

It's now powered on, but not listed in adb devices. Weird.
Comment 4 Malini Das [:mdas] - Away, not checking bugmail 2012-12-13 11:31:57 PST
This was after running a few gaia smoketests, waiting an hour or so, then running them again. Midtest, it went into this state.
Comment 5 Malini Das [:mdas] - Away, not checking bugmail 2012-12-13 11:39:51 PST
and now it just came back up all by itself. Hmm.
Comment 6 Malini Das [:mdas] - Away, not checking bugmail 2012-12-13 12:00:26 PST
After coming back online, it is unresponsive to adb shell and logcat. It is listed in adb devices, and lsusb, but there isn't much I can do other than that.
Comment 7 Jonathan Griffin (:jgriffin) 2012-12-17 13:55:14 PST
This just happened now.  Logcat only shows lots of:

W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22 (Invalid argument)

(which happens before the freeze, too).  The serial port shows only:

[   32.051666] omapdss HDMI: ENTER hdmi_display_enable
[   32.158691] omapdss DISPC error: timeout waiting for EVSYNC
[   33.508697] misc dsscomp: [eceb4c00] ignoring set failure -22
[  177.840911] adb_release
[  177.841766] android_work: sent uevent USB_STATE=DISCONNECTED
[  177.856170] adb_open
[  177.877624] android_work: sent uevent USB_STATE=CONNECTED
[  177.887542] android_work: sent uevent USB_STATE=DISCONNECTED
[  177.945343] android_work: sent uevent USB_STATE=CONNECTED
[  178.496856] android_usb gadget: high speed config #1: android
[  178.505981] android_work: sent uevent USB_STATE=CONFIGURED
Comment 8 Jonathan Griffin (:jgriffin) 2012-12-17 13:55:49 PST
Note that the display is also frozen, not blanked.  I see the Gaia lock screen.
Comment 9 William Lachance (:wlach) 2012-12-17 15:08:54 PST
Here's what I see on serial:

[ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
[ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
[ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
[ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
[ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
[ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
[ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
[ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling

Not sure whether this means we have two issues, or that one or both sets of serial console output don't offer a clue to the problem.

In case these messages are relevant: My panda seems to be relatively warm to the touch, although the room it's in isn't hot by any means (I'd estimate maybe 22 or 23 degrees centigrade).
Comment 10 Jonathan Griffin (:jgriffin) 2012-12-17 15:12:42 PST
I just had this problem recur; I got no output on either serial port or logcat when it happened. :(
Comment 11 Thomas Zimmermann [:tzimmermann] [:tdz] 2013-01-02 07:25:31 PST
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> This just happened now.  Log cat only shows lots of:
> 
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22 (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22 (Invalid argument)
> 
> (which happens before the freeze, too).  The serial port shows only:

This has already been reported in bug 801658.


(In reply to William Lachance (:wlach) from comment #9)
> Here's what I see on serial:
> 
> [ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
> [ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
> [ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
> [ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
> [ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
> [ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
> [ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
> [ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling

I've seen this too, but it doesn't seem critical. The value is reported in /sys/bus/platform/drivers/omap_temp_sensor/omap_temp_sensor.0/temperature. Normally my board reads between 50000 and 55000. Throttling the CPU is just a safety measure.
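
For anyone who wants to log that value while waiting for a freeze, here is a small Python sketch. It assumes the sysfs file holds a single integer in millidegrees Celsius (so 66200 would be 66.2 °C); the unit is my reading of the numbers above, not something the driver output states. It would have to run on the board itself, or be fed the output of `adb shell cat <path>`.

```python
TEMP_PATH = "/sys/bus/platform/drivers/omap_temp_sensor/omap_temp_sensor.0/temperature"


def parse_temperature(raw):
    """Convert the raw sysfs text to degrees Celsius.

    Assumes the value is in millidegrees Celsius (an assumption, see above).
    """
    return int(raw.strip()) / 1000.0


def read_temperature(path=TEMP_PATH):
    """Read and convert the sensor value from the given sysfs path."""
    with open(path) as f:
        return parse_temperature(f.read())
```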
Comment 12 Thomas Zimmermann [:tzimmermann] [:tdz] 2013-01-02 07:28:09 PST
I just managed to reproduce the problem and got this at the serial console:

> [  290.180847] hub 1-1:1.0: port 1 disabled by hub (EMI?), re-enabling...
> [  290.188720] usb 1-1.1: USB disconnect, device number 3
> [  290.195404] smsc95xx 1-1.1:1.0: eth0: unregister 'smsc95xx' usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet
> [  290.409790] init: untracked pid 1413 exited
> [  295.366882] hub 1-1:1.0: hub_port_status failed (err = -110)
> [  295.375610] hub 1-1:1.0: connect-debounce failed, port 1 disabled

It looks like the USB port fails after some time.

I checked the reported temperature, but it was only ~45000.
Comment 13 Thomas Zimmermann [:tzimmermann] [:tdz] 2013-01-02 08:16:03 PST
Pid 1413 is the DHCP client, errno number 110 is ETIMEDOUT.
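
For reference, the negative numbers in these logs are negated Linux errno values. A tiny Python lookup for the three codes that appear in this bug is sketched below; the table is hard-coded with the Linux values (they differ on other systems) and the helper name is made up for illustration.

```python
# Linux errno values for the codes seen in this bug's logs.
LINUX_ERRNO = {
    22: "EINVAL",      # "failed -22 (Invalid argument)"
    71: "EPROTO",      # "can't reset device ... status -71"
    110: "ETIMEDOUT",  # "hub_port_status failed (err = -110)"
}


def errno_name(code):
    """Map a kernel-style negative error code to its symbolic name."""
    return LINUX_ERRNO.get(abs(code), "unknown")
```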

USB suspending is enabled in the kernel. Maybe we'll just need to disable it...

An EMI problem is reported here:

  http://softsolder.com/2009/01/10/mysterious-usb-disconnects/

and solved here

  http://softsolder.com/2009/01/28/usb-disconnects-nobody-moves-nobody-gets-hurt/

/me is wondering if we need to ground the PandaBoards or put them into metal boxes...
Comment 14 Thomas Zimmermann [:tzimmermann] [:tdz] 2013-01-02 09:33:20 PST
(In reply to Thomas Zimmermann [:tzimmermann] from comment #13)
> USB suspending is enabled in the kernel. Maybe we'll just need to disable
> it...

Nope, didn't help.
Comment 15 Thomas Zimmermann [:tzimmermann] [:tdz] 2013-01-03 10:53:22 PST
I looked deeper into this today and it really seems to be a problem in the USB chipset. After the USB port failed, I got a number of debugging messages like the ones below.

> [  523.298522] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 1
> [  523.306427] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 2
> [  523.314239] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 3
> [  523.322052] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 4
> [  523.329681] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 5
> [  523.337402] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 1
> [  523.344696] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 6
> [  523.352447] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 7
> [  523.359985] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 8
> [  523.367736] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 9
> [  523.375427] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 2
> [  523.382751] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 10
> [  523.390563] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 11
> [  523.398162] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 12
> [  523.406005] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 13
> [  523.413848] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 3
> [  523.421112] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 14
> [  523.428924] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 15
> [  523.430114] hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002
> [  523.442962] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 16
> [  524.452880] usb 1-1: khubd timed out on ep0in len=0/4
> [  525.468566] usb 1-1: khubd timed out on ep0in len=4/4
> [  526.484100] usb 1-1: khubd timed out on ep0in len=4/4
> [  527.500488] usb 1-1: khubd timed out on ep0in len=4/4
> [  528.516143] usb 1-1: khubd timed out on ep0in len=4/4
> [  528.522857] hub 1-1:1.0: hub_port_status failed (err = -110)

> [  875.439697] ehci-omap ehci-omap.0: detected XactErr len 0/8 retry 31
> [  875.440368] ehci-omap ehci-omap.0: devpath 1.2 ep0out 3strikes
> [  875.440368] usb 1-1: clear tt buffer port 2, a4 ep0 t00080248
> [  875.441040] ehci-omap ehci-omap.0: reused qh e42f2d00 schedule
> [  875.441101] usb 1-1.2: link qh8-0e01/e42f2d00 start 3 [1/2 us]
> [  875.441101] generic-usb 0003:046D:C03E.0001: can't reset device, ehci-omap.0-1.2/input0, status -71

It's not predictable when this happens, but it is always reproducible. I tried various changes to the kernel config, but none made a difference.
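
Since the timing is unpredictable, it may help to scan a captured serial log for the failure signatures quoted in this bug instead of watching the console. The Python sketch below counts a few of them; the signature names and the choice of patterns are mine, taken only from the lines pasted above, and are not an exhaustive list.

```python
import re
from collections import Counter

# Patterns taken from the serial console lines quoted in this bug.
SIGNATURES = {
    "port_disabled": re.compile(r"hub \S+ port \d+ disabled by hub"),
    "hub_status_failed": re.compile(r"hub_port_status failed \(err = -?\d+\)"),
    "debounce_failed": re.compile(r"connect-debounce failed"),
    "xact_err": re.compile(r"detected XactErr"),
}


def scan_serial_log(lines):
    """Count how many lines match each known USB failure signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing an open file object as `lines` works too, since the function only iterates over it.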
Comment 16 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-09 12:27:42 PST
I'm removing bug 802317 from the blocking list.

Regardless of the state that a panda gets into, our automation will force a re-image (thanks to mozpool) before assigning a job and running tests. This bug does not block releng's setup.

If you believe I'm missing something, please re-add the bug and let me know what I missed.
Comment 17 Justin Wood (:Callek) 2016-03-21 08:47:21 PDT
No longer using pandas at Mozilla.
