panda boards are not pulled out of production or rebooted when there is a failure to connect

RESOLVED FIXED

Status

Release Engineering
General Automation
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: jmaher, Assigned: Callek)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
In verify.py, we fail to connect to a panda board.  This is correct as the panda is not even pingable.

In the buildbot steps we attempt to reboot the device, but we cannot:
http://dev-master01.build.scl1.mozilla.com:8036/builders/Android%20Panda%20mozilla-central%20opt%20test%20mochitest-8/builds/1807/steps/reboot%20device/logs/stdio

python -u /builds/sut_tools/reboot.py 10.12.52.130
 in dir /builds/panda-0029/test/. (timeout 1800 secs)
 watching logfiles {}
 argv: ['python', '-u', '/builds/sut_tools/reboot.py', '10.12.52.130']
 environment:
  HOME=/home/cltbld
  PATH=/tools/buildbot-0.8.4-pre-moz2/bin:/usr/local/bin:/usr/local/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/cltbld/bin
  PWD=/builds/panda-0029/test
  SUT_IP=10.12.52.130
  SUT_NAME=panda-0029
 using PTY: False
connecting to: 10.12.52.130
reconnecting socket
Could not connect; sleeping for 5 seconds.
reconnecting socket
Could not connect; sleeping for 10 seconds.
reconnecting socket
Could not connect; sleeping for 15 seconds.
reconnecting socket
Could not connect; sleeping for 20 seconds.
reconnecting socket
Traceback (most recent call last):
  File "/builds/sut_tools/reboot.py", line 57, in <module>
    dm = devicemanager.DeviceManagerSUT(deviceIP)
  File "/builds/tools/sut_tools/mozdevice/devicemanagerSUT.py", line 53, in __init__
    raise BaseException("Failed to connect to SUT Agent and retrieve the device root.")
BaseException: Failed to connect to SUT Agent and retrieve the device root.
program finished with exit code 1
elapsedTime=65.122836

If reboot.py exits with a nonzero exit code, we need to yank the device out of production and give it some hands-on attention.
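
A minimal sketch of that idea, assuming a wrapper that runs the reboot command and marks the device when the command exits nonzero. The names `run_reboot` and `flag_for_attention`, and the `error.flg` marker file, are hypothetical and used only for illustration; they are not taken from the sut_tools code in this bug.

```python
import os
import subprocess


def flag_for_attention(device_dir, reason):
    """Write a marker file (hypothetical convention) so the device can be skipped."""
    path = os.path.join(device_dir, "error.flg")
    with open(path, "w") as handle:
        handle.write(reason + "\n")
    return path


def run_reboot(command, device_dir):
    """Run the reboot command; flag the device if it exits nonzero."""
    returncode = subprocess.call(command)
    if returncode != 0:
        flag_for_attention(
            device_dir, "reboot failed with exit code %d" % returncode)
        return False
    return True
```

A caller could then check for the marker file before scheduling more work on the device, for example `run_reboot(['python', '-u', '/builds/sut_tools/reboot.py', '10.12.52.130'], '/builds/panda-0029')`.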
(Assignee)

Comment 1

5 years ago
So, a few issues here:

verify.py (contrary to what I told you) does indeed kill buildbot if the device can't connect (it has checks and balances there)

It also calls out to cleanup, but if cleanup itself raises an exception it ignores it :(. Cleanup however does a good job of handling errors, except the one I see:

11/13/2012 12:52:06: INFO: devroot None
11/13/2012 12:54:06: ERROR: Remote Device Error: waiting for device timed out.

Which is likely http://hg.mozilla.org/build/tools/file/b77d3e229983/sut_tools/cleanup.py#l56

We also don't have logging to output what we are uninstalling, which would give a better idea of exactly when we fail.

Secondly, reboot.py does indeed abort early if it can't do a logcat. We can't kill buildbot here, since if the job is otherwise good we don't want to report a bad job and have to retry, so the idea is to always do the reboot.
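
A rough sketch of the "always reboot" idea, under the assumption that the logcat step and the reboot step can be passed in as plain callables. The function name `collect_diagnostics_then_reboot` and both parameters are hypothetical and not part of reboot.py.

```python
import logging

log = logging.getLogger("reboot_sketch")


def collect_diagnostics_then_reboot(get_logcat, do_reboot):
    """Try to grab diagnostics, but reboot regardless of the outcome."""
    try:
        get_logcat()
    except Exception as exc:
        # A diagnostics failure is only logged; it must not skip the reboot.
        log.warning("could not collect logcat: %s", exc)
    # Reached whether or not get_logcat() raised an Exception.
    return do_reboot()
```

The point of the design is that a failed logcat is reported but never turns an otherwise good job into a failed one.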

The current patch I'm doing will *not* force PDU reboot for tegras yet, but is still better than nothing.
(Assignee)

Comment 2

5 years ago
Created attachment 681379 [details] [diff] [review]
[tools] v1

I did not yet stage this anywhere. See previous comment for more details/info.
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #681379 - Flags: review?(jmaher)
(Reporter)

Comment 3

5 years ago
Comment on attachment 681379 [details] [diff] [review]
[tools] v1

Review of attachment 681379 [details] [diff] [review]:
-----------------------------------------------------------------

::: sut_tools/reboot.py
@@ +25,5 @@
>              log.info("Unable to find a proper devicename, will attempt to reboot device")
>  
> +    if dm is not None:
> +        try:
> +            dm.getInfo('process')

We should put some safety checks around dm.getInfo. We recently added timeout=10 to the logcat command, and I would like to do the same here.

@@ +61,5 @@
> +        log.info("connecting to: %s" % deviceIP)
> +        dm = devicemanager.DeviceManagerSUT(deviceIP)
> +        dm.debug = 5
> +    except:
> +        pass

We need to log a warning here and find some way to tell reboot() that we are unable to connect to devicemanager.
Attachment #681379 - Flags: review?(jmaher) → review-
(Assignee)

Comment 4

5 years ago
Comment on attachment 681379 [details] [diff] [review]
[tools] v1

Review of attachment 681379 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, re-requesting review, after our IRC convo today.

I had felt like I was needing to make a change, but in preparing to do so, it feels like I can't easily... see my notes inline below.

::: sut_tools/reboot.py
@@ +25,5 @@
>              log.info("Unable to find a proper devicename, will attempt to reboot device")
>  
> +    if dm is not None:
> +        try:
> +            dm.getInfo('process')

I had planned to add the timeout to the dm.getInfo stuff as we discussed, however dm.getInfo() doesn't accept a timeout param, and I would *really* like to not have local sut mods to mozdevice, and instead get that upstreamed and landed here, then take advantage of it.

I'll note that dm.getInfo in devicemanagerSUT doesn't specify a timeout of its own, so our default_timeout, which is 300, will take effect. One thing I could do now, if you feel it's better for this code, is to manually change dm.default_timeout.

@@ +61,5 @@
> +        log.info("connecting to: %s" % deviceIP)
> +        dm = devicemanager.DeviceManagerSUT(deviceIP)
> +        dm.debug = 5
> +    except:
> +        pass

As discussed on IRC, what happens here is we fail to get a devicemanager connection, end up with dm=None, then we enter reboot() where we have a log about being unable to connect, and proceed right to rebooting [if possible]
Attachment #681379 - Flags: review- → review?(jmaher)
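
The flow described in this comment could look roughly like the sketch below. It assumes the device manager constructor and the power-cycle step are passed in as callables; `connect`, `reboot`, `factory` and `power_cycle` are illustrative names, not the actual reboot.py functions. Only `getInfo('process')` comes from the patch under review.

```python
import logging

log = logging.getLogger("reboot_sketch")


def connect(device_ip, factory):
    """Return a device manager, or None if the connection fails."""
    try:
        log.info("connecting to: %s", device_ip)
        return factory(device_ip)
    except (KeyboardInterrupt, SystemExit):
        raise
    except BaseException as exc:
        # The traceback in the description shows a bare BaseException being
        # raised, which "except Exception" would not catch.
        log.warning("unable to connect to %s: %s", device_ip, exc)
        return None


def reboot(dm, power_cycle):
    """Query the device when connected; power-cycle either way."""
    if dm is not None:
        try:
            dm.getInfo('process')
        except Exception as exc:
            log.warning("getInfo failed: %s", exc)
    else:
        log.info("no devicemanager connection, proceeding to reboot")
    return power_cycle()
```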
(Reporter)

Comment 5

5 years ago
Comment on attachment 681379 [details] [diff] [review]
[tools] v1

Review of attachment 681379 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, as per our IRC conversation.

::: sut_tools/reboot.py
@@ +25,5 @@
>              log.info("Unable to find a proper devicename, will attempt to reboot device")
>  
> +    if dm is not None:
> +        try:
> +            dm.getInfo('process')

Since this is scoped to reboot.py only, we could set dm.default_timeout to be 30.
Attachment #681379 - Flags: review?(jmaher) → review+
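
A small sketch of what scoping the timeout to one script might look like. `StubDeviceManager` is a stand-in written for this example, and the assumption that `default_timeout` is a plain attribute that can be overridden per instance is mine; the real mozdevice class may differ.

```python
class StubDeviceManager(object):
    """Stand-in for a device manager; only default_timeout mirrors the discussion."""
    default_timeout = 300

    def getInfo(self, directive):
        # A real implementation would query the device; this stub only
        # reports which timeout would be in effect for the call.
        return {"directive": directive, "timeout": self.default_timeout}


def shorten_timeout(dm, seconds=30):
    """Override the timeout on this instance only, leaving the class default alone."""
    dm.default_timeout = seconds
    return dm
```

Because the override is set on the instance, other scripts that create their own device manager would still see the 300 default.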
(Assignee)

Comment 6

5 years ago
http://hg.mozilla.org/build/tools/rev/596962be713b with default_timeout changed.

Updating staging tegra and all panda foopies now
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering