Closed Bug 821379 Opened 11 years ago Closed 8 years ago

Pandaboard will become unresponsive after idling

Categories: Core Graveyard :: Widget: Gonk
Type: defect
Platform: x86, Gonk (Firefox OS)
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: mdas
Assignee: Unassigned

If you let the PandaBoard idle, it will eventually become unresponsive. Running 'adb devices' just hangs, and you have to reboot the board to get it working again.
The problem happens intermittently, after a random period of idle time.
On the releng side, this shows up as a nagios check declaring the board as ping down.

This can be easily worked around with a script: if a device's mozpool status is free and its ping is down, we can just reboot the device without asking any further questions.
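
A minimal sketch of that check, in Python. The function names are hypothetical, and so is the "busy" status in the usage note; only the "free" status and the ping check come from this bug. How to query mozpool and how to power-cycle a board are not described here, so both are left to the caller. The `ping` flags assume a Linux host.

```python
import subprocess


def ping_is_up(host, timeout_s=2):
    """Return True if a single ICMP ping to `host` succeeds.

    Assumes the Linux `ping` command (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_reboot(mozpool_status, ping_up):
    """Reboot only when the device is not in use AND is not answering pings."""
    return mozpool_status == "free" and not ping_up
```

A caller would fetch the status from mozpool, call ping_is_up() for the board's address, and trigger a power cycle only when should_reboot() returns True. A board that is in use (for example a hypothetical "busy" status) is never touched, even if its ping is down.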
Hmm, after building and flashing today's build, I haven't seen this problem yet. I'll keep this open for a while to make sure it's not a fluke.
New problem!

It's now powered on, but not listed in adb devices. Weird.
This was after running a few Gaia smoketests, waiting an hour or so, then running them again. Mid-test, it went into this state.
And now it just came back up all by itself. Hmm.
After coming back online, it is unresponsive to adb shell and logcat. It is listed in adb devices, and lsusb, but there isn't much I can do other than that.
This just happened now. Logcat only shows lots of:

W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22 (Invalid argument)
W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22 (Invalid argument)

(which happens before the freeze, too).  The serial port shows only:

[   32.051666] omapdss HDMI: ENTER hdmi_display_enable                          
[   32.158691] omapdss DISPC error: timeout waiting for EVSYNC                  
[   33.508697] misc dsscomp: [eceb4c00] ignoring set failure -22                
[  177.840911] adb_release                                                      
[  177.841766] android_work: sent uevent USB_STATE=DISCONNECTED                 
[  177.856170] adb_open                                                         
[  177.877624] android_work: sent uevent USB_STATE=CONNECTED                    
[  177.887542] android_work: sent uevent USB_STATE=DISCONNECTED                 
[  177.945343] android_work: sent uevent USB_STATE=CONNECTED                    
[  178.496856] android_usb gadget: high speed config #1: android                
[  178.505981] android_work: sent uevent USB_STATE=CONFIGURED
Note that the display is also frozen, not blanked.  I see the Gaia lock screen.
Here's what I see on serial:

[ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable                            
[ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC                    
[ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
[ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22                  
[ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
[ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
[ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
[ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling

Not sure whether this means we have two issues, or that one or both sets of serial console output don't offer a clue to the problem.

In case these messages are relevant: My panda seems to be relatively warm to the touch, although the room it's in isn't hot by any means (I'd estimate maybe 22 or 23 degrees centigrade).
I just had this problem recur; I got no output on either serial port or logcat when it happened. :(
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> This just happened now.  Log cat only shows lots of:
> 
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1880) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ec1e80) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecf7c0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecfd80) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ecffc0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3780) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ed3cc0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef61c0) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6800) failed -22
> (Invalid argument)
> W/GraphicBufferMapper( 1267): unregisterBuffer(0x46ef6e80) failed -22
> (Invalid argument)
> 
> (which happens before the freeze, too).  The serial port shows only:

This has already been reported in bug 801658.


(In reply to William Lachance (:wlach) from comment #9)
> Here's what I see on serial:
> 
> [ 6338.648010] omapdss HDMI: ENTER hdmi_display_enable
> [ 6338.751373] omapdss DISPC error: timeout waiting for EVSYNC
> [ 6338.759490] omap_thermal_unthrottle: temperature reduced, ending cpu throttling
> [ 6338.940246] misc dsscomp: [ecbd3800] ignoring set failure -22
> [ 6346.758758] omap_thermal_throttle: temperature too high, cpu throttle at max 90
> [ 6347.766448] throttle_delayed_work_fn: OMAP temp read 66200 exceeds the threshod
> [ 6347.782135] omap_thermal_throttle: temperature too high, cpu throttle at max 70
> [ 6358.759307] omap_thermal_unthrottle: temperature reduced, ending cpu throttling

I've seen this too, but it doesn't seem critical. The value is reported in /sys/bus/platform/drivers/omap_temp_sensor/omap_temp_sensor.0/temperature. Normally my board runs between 50000 and 55000. Throttling the CPU is just a safety measure.
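
For anyone who wants to watch that value while reproducing the hang, here is a small Python sketch that reads the sysfs file. It ASSUMES the raw number is in millidegrees Celsius (so 66200 would be 66.2 degrees), which is plausible for a warm board but is not confirmed anywhere in this bug. The function names are made up for illustration.

```python
TEMP_PATH = (
    "/sys/bus/platform/drivers/omap_temp_sensor/"
    "omap_temp_sensor.0/temperature"
)


def to_celsius(raw):
    """Convert a raw sysfs reading (text, possibly with a trailing newline).

    ASSUMPTION: the driver reports millidegrees Celsius.
    """
    return int(raw.strip()) / 1000.0


def read_board_temperature(path=TEMP_PATH):
    """Read the sensor file on the board and return degrees Celsius."""
    with open(path) as f:
        return to_celsius(f.read())
```

On the board itself, read_board_temperature() could be polled from a shell loop or a cron job and logged next to the serial output, to see whether temperature correlates with the freeze.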
I just managed to reproduce the problem and got this at the serial console:

> [  290.180847] hub 1-1:1.0: port 1 disabled by hub (EMI?), re-enabling...
> [  290.188720] usb 1-1.1: USB disconnect, device number 3
> [  290.195404] smsc95xx 1-1.1:1.0: eth0: unregister 'smsc95xx' usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet
> [  290.409790] init: untracked pid 1413 exited
> [  295.366882] hub 1-1:1.0: hub_port_status failed (err = -110)
> [  295.375610] hub 1-1:1.0: connect-debounce failed, port 1 disabled

It looks like the USB port fails after some time.

I checked the reported temperature, but it was only ~45000.
Pid 1413 is the DHCP client, errno number 110 is ETIMEDOUT.
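
The negative numbers in these logs are Linux errno values. A small Python sketch for decoding them is below; the table is hardcoded with the Linux values for the three codes that appear in this bug, so the result does not depend on the machine running the script. The helper name and the regular expression are illustrative only.

```python
import re

# Linux errno values for the codes seen in this bug's logs.
LINUX_ERRNO = {22: "EINVAL", 71: "EPROTO", 110: "ETIMEDOUT"}

# Matches "failed -22", "failure -22", "err = -110" and "status -71".
_ERR_RE = re.compile(r"(?:failed|failure|err =|status) -(\d+)")


def decode_errno(line):
    """Return the symbolic errno name found in a log line, or None."""
    match = _ERR_RE.search(line)
    if match is None:
        return None
    return LINUX_ERRNO.get(int(match.group(1)))
```

For example, the hub_port_status line above decodes to ETIMEDOUT, and the GraphicBufferMapper warnings decode to EINVAL.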

USB suspending is enabled in the kernel. Maybe we'll just need to disable it...

An EMI problem is reported here:

  http://softsolder.com/2009/01/10/mysterious-usb-disconnects/

and solved here

  http://softsolder.com/2009/01/28/usb-disconnects-nobody-moves-nobody-gets-hurt/

/me is wondering if we need to ground the PandaBoards or put them into metal boxes...
(In reply to Thomas Zimmermann [:tzimmermann] from comment #13)
> USB suspending is enabled in the kernel. Maybe we'll just need to disable
> it...

Nope, didn't help.
I looked deeper into this today and it really seems to be a problem in the USB chipset. After the USB port failed, I got a number of debugging messages like the ones below.

> [  523.298522] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 1
> [  523.306427] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 2
> [  523.314239] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 3
> [  523.322052] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 4
> [  523.329681] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 5
> [  523.337402] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 1
> [  523.344696] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 6
> [  523.352447] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 7
> [  523.359985] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 8
> [  523.367736] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 9
> [  523.375427] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 2
> [  523.382751] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 10
> [  523.390563] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 11
> [  523.398162] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 12
> [  523.406005] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 13
> [  523.413848] ehci-omap ehci-omap.0: detected XactErr len 0/16 retry 3
> [  523.421112] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 14
> [  523.428924] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 15
> [  523.430114] hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002
> [  523.442962] ehci-omap ehci-omap.0: detected XactErr len 0/18944 retry 16
> [  524.452880] usb 1-1: khubd timed out on ep0in len=0/4
> [  525.468566] usb 1-1: khubd timed out on ep0in len=4/4
> [  526.484100] usb 1-1: khubd timed out on ep0in len=4/4
> [  527.500488] usb 1-1: khubd timed out on ep0in len=4/4
> [  528.516143] usb 1-1: khubd timed out on ep0in len=4/4
> [  528.522857] hub 1-1:1.0: hub_port_status failed (err = -110)

> [  875.439697] ehci-omap ehci-omap.0: detected XactErr len 0/8 retry 31
> [  875.440368] ehci-omap ehci-omap.0: devpath 1.2 ep0out 3strikes
> [  875.440368] usb 1-1: clear tt buffer port 2, a4 ep0 t00080248
> [  875.441040] ehci-omap ehci-omap.0: reused qh e42f2d00 schedule
> [  875.441101] usb 1-1.2: link qh8-0e01/e42f2d00 start 3 [1/2 us]
> [  875.441101] generic-usb 0003:046D:C03E.0001: can't reset device, ehci-omap.0-1.2/input0, status -71

It's not predictable when this happens, but it is always reproducible. I tried various changes to the kernel config, but none made a difference.
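
Since the failure is intermittent, it may help to scan captured serial logs for the messages shown above. The following Python sketch counts a few of those strings; the marker list is taken from the log excerpts in this bug, and the function names are hypothetical. It is not a complete list of USB failure signatures.

```python
# Substrings taken from the serial console excerpts in this bug.
USB_FAILURE_MARKERS = (
    "port 1 disabled by hub",
    "hub_port_status failed",
    "detected XactErr",
    "khubd timed out",
)


def strip_timestamp(line):
    """Drop quote prefixes and a leading "[  523.298522]" kernel timestamp."""
    line = line.lstrip("> ").strip()
    if line.startswith("[") and "]" in line:
        return line.split("]", 1)[1].strip()
    return line


def count_usb_failures(lines):
    """Count how many lines contain each known USB failure marker."""
    counts = {marker: 0 for marker in USB_FAILURE_MARKERS}
    for line in lines:
        text = strip_timestamp(line)
        for marker in USB_FAILURE_MARKERS:
            if marker in text:
                counts[marker] += 1
    return counts
```

Running count_usb_failures() over a saved serial capture (for example, the lines of a log file opened in text mode) gives a quick indication of whether a given hang matches the USB failure described here or is a different problem.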
I'm removing bug 802317 from the blocking list.

Regardless of the state that a panda gets into, our automation will force a re-image (thanks to mozpool) before assigning a job and running tests. This bug does not block releng's setup.

If you believe I'm missing something, please re-add the bug and let me know what I missed.
No longer blocks: 802317
We are no longer using pandas at Mozilla.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard