Closed
Bug 808398
Opened 13 years ago
Closed 12 years ago
add support into sut_tools to verify and retry after a pdu reboot
Categories: Release Engineering :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jmaher, Assigned: jmaher
Attachments (2 files)
2.04 KB, patch | Callek: review+
3.27 KB, patch | Callek: review+
I found that after a PDU power cycle, the board sometimes still does not respond to a ping after a minute. The board is usually up within 30 seconds, so this is a problem.
We need to build into our tools a way to retry a PDU cycle (up to 3 times). I see something like this:
reboot(try=1):
    if try > 3:
        exit and put device in remediation program
    pdu cycle
    sleep 60 seconds
    if not verify ping:
        reboot(++try)
    if not verify sut:
        reboot(++try)
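The flow proposed above could be written iteratively in Python, roughly as follows. This is only a sketch: `pdu_cycle`, `verify_ping` and `verify_sut` are hypothetical callables supplied by the caller, not functions from sut_tools, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def reboot_with_retries(pdu_cycle, verify_ping, verify_sut,
                        max_tries=3, wait_seconds=60, sleep=time.sleep):
    """Power-cycle a device until it answers both the ping and the SUT check.

    All three callables are hypothetical stand-ins provided by the caller.
    Returns the number of the try that succeeded, or None if every try
    failed (the caller would then put the device into remediation).
    """
    for attempt in range(1, max_tries + 1):
        pdu_cycle()              # cut and restore power via the PDU
        sleep(wait_seconds)      # give the board time to boot
        if verify_ping() and verify_sut():
            return attempt
    return None
```

A loop avoids the recursion in the pseudocode and makes the "give up after 3 tries" exit explicit in one place.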
Comment 1 • 13 years ago
Pandas, tegras or both -- and is this a problem *during* actual jobs?
Also, do we know the underlying cause of this "not up properly" state yet? I fear we might simply be working around a race condition that can be hit at any time, and that the workaround will just arbitrarily lengthen the end-to-end time of jobs.
Comment 2 • 13 years ago (Assignee)
This would be for pandas. If the device doesn't come up in a pingable state, there isn't much we can do. We might be able to hack the network drivers or the Android OS to make it work better, but that will not get us going right now. We do know that the Ethernet port on the Panda boards is just a USB port hacked on, in other words an afterthought.
Comment 3 • 13 years ago (Assignee)
A new function to replace soft_reboot/waitForDevice. It sleeps for a bit, then tries a few times to connect; if that fails, it cycles the power, up to 5 times. This works well for panda boards in the chassis: I have run over 400 iterations of the core verify, cleanup, install, reboot cycle using this patch.
Attachment #685650 - Flags: review?(bugspam.Callek)
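For readers without the attachment at hand, the behaviour described in the comment above (sleep, try a few connections, then power-cycle up to 5 times) might look roughly like the sketch below. It is not the attached patch: `reboot`, `try_connect` and `power_cycle` are hypothetical callables, and `connect_attempts=3` is an assumed value for "a few times".

```python
import time


def reboot_and_verify(reboot, try_connect, power_cycle,
                      wait_time=60, connect_attempts=3, max_attempts=5,
                      sleep=time.sleep):
    """Return True once the device accepts a connection, else False.

    reboot, try_connect and power_cycle are hypothetical callables; the
    real sut_lib code is not reproduced here.
    """
    reboot()
    # One initial wait-and-connect round, plus one round per power cycle.
    for cycle in range(max_attempts + 1):
        sleep(wait_time)                 # let the device boot
        for _ in range(connect_attempts):
            if try_connect():
                return True
            sleep(1)                     # short pause between connection tries
        if cycle < max_attempts:
            power_cycle()                # device still down: cycle the power
    return False
```

Returning a boolean rather than exiting lets each calling script decide how to flag the failure.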
Comment 4 • 13 years ago (Assignee)
Attachment #685651 - Flags: review?(bugspam.Callek)
Comment 5 • 13 years ago
Comment on attachment 685650 [details] [diff] [review]
add logic to reboot to retry if device doesn't come online (1.0)
Review of attachment 685650 [details] [diff] [review]:
-----------------------------------------------------------------
r+ with nits. (feel free to argue alternate opinions)
::: sut_tools/reboot.py
@@ +39,4 @@
> log.info(status)
> except:
> log.info("Failure while rebooting device")
>
Nit: test the status here and, on failure, do a setFlag and return 1. Otherwise it will presume complete success.
::: sut_tools/sut_lib.py
@@ +501,5 @@
> time.sleep(waitTime)
> if not deviceIsBack:
> log.error("Remote Device Error: waiting for device timed out.")
> + return False
> + return True
Comment-Only: I never thought about waitForDevice doing a sys.exit(1) - This may explain why some of the devices [tegras] have constantly cycled with a down device, and never set an error flag.
We should plan to revisit and remove sys.exit(1) from the lib at some point.
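To illustrate the point about sys.exit(1), here is a minimal sketch of a wait helper that reports failure through its return value and leaves the error flag and exit code to the caller. The names `wait_for_device`, `is_device_back`, `set_error_flag` and `main` are hypothetical, as is the `retries` parameter; this is not the code in sut_lib.py.

```python
import logging
import time

log = logging.getLogger("sut_sketch")


def wait_for_device(is_device_back, wait_time=120, retries=3, sleep=time.sleep):
    """Poll until the device is back; report the outcome instead of exiting."""
    for _ in range(retries):
        if is_device_back():
            return True
        sleep(wait_time)
    log.error("Remote Device Error: waiting for device timed out.")
    return False


def main(is_device_back, set_error_flag, sleep=time.sleep):
    """Hypothetical caller: it, not the library, sets the flag and exit code."""
    if not wait_for_device(is_device_back, sleep=sleep):
        set_error_flag("device did not come back after reboot")
        return 1
    return 0
```

With a library-level sys.exit(1), the caller never reaches the point where it could set an error flag, which matches the symptom described above.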
@@ +503,5 @@
> log.error("Remote Device Error: waiting for device timed out.")
> + return False
> + return True
> +
> +def soft_reboot_and_verify(device, dm, waitTime=60, max_attempts=5, *args, **kwargs):
waitForDevice has a default waitTime of 120 seconds; I suggest we use that here, especially since with pandas we now cause a PDU reboot *again* after 60 seconds if the device hasn't come up. I think the return time [for tegras especially] is typically around 90 seconds.
Attachment #685650 - Flags: review?(bugspam.Callek) → review+
Comment 6 • 13 years ago
Comment on attachment 685650 [details] [diff] [review]
add logic to reboot to retry if device doesn't come online (1.0)
Review of attachment 685650 [details] [diff] [review]:
-----------------------------------------------------------------
r+ with nits. (feel free to argue alternate opinions)
::: sut_tools/reboot.py
@@ -35,1 @@
>
Also need to change the import directive in this file.
Comment 7 • 13 years ago
Comment on attachment 685651 [details] [diff] [review]
use soft_reboot_and_verify in the rest of sut_tools (1.0)
Review of attachment 685651 [details] [diff] [review]:
-----------------------------------------------------------------
An alternative to testing the status everywhere could be to `raise` an exception when it can't reconnect, but that might need more thought so that we catch it everywhere we want to.
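A rough sketch of that exception-based alternative, under stated assumptions: `DeviceUnreachableError`, `reboot_or_raise` and `run_step` are hypothetical names, and `reboot_and_verify` is passed in as a callable returning True or False rather than being the real sut_lib function.

```python
class DeviceUnreachableError(Exception):
    """Hypothetical exception for a device that never came back."""


def reboot_or_raise(device, reboot_and_verify):
    """Raise instead of returning a status the caller might forget to test."""
    if not reboot_and_verify(device):
        raise DeviceUnreachableError("%s did not come back after reboot" % device)


def run_step(device, reboot_and_verify, set_flag):
    """Hypothetical script entry point that catches the error in one place."""
    try:
        reboot_or_raise(device, reboot_and_verify)
    except DeviceUnreachableError as exc:
        set_flag(str(exc))   # record the failure for the device
        return 1
    return 0
```

The trade-off noted in the review still applies: every script that reboots needs a handler like the one in `run_step`, or the exception will surface as an unhandled traceback.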
r+ with these nits
::: sut_tools/cleanup.py
@@ +52,4 @@
> return RETCODE_ERROR
>
> if reboot_needed:
> + soft_reboot_and_verify(device_name, dm)
Nit: to maintain parity for now, can you test the status and do a return or sys.exit here, and possibly improve on it by doing a setFlag as well?
::: sut_tools/config.py
@@ +109,4 @@
> # adjust resolution up if we are part of a reftest run
> if (testname == 'reftest') and width < refWidth:
> if dm.adjustResolution(width=refWidth, height=refHeight, type='crt'):
> + soft_reboot_and_verify(dm=dm, device=deviceName, ipAddr=proxyIP, port=proxyPort)
same here
::: sut_tools/installApp.py
@@ +136,4 @@
> if (width == 1600 or height == 1200):
> dm.adjustResolution(1024, 768, 'crt')
> log.info('forcing device reboot')
> + soft_reboot_and_verify(device=deviceName, dm=dm, ipAddr=proxyIP, port=proxyPort)
same here
::: sut_tools/reboot.py
@@ +7,4 @@
> import random
> import time
> from sut_lib import getOurIP, calculatePort, clearFlag, setFlag, waitForDevice, \
> + log, soft_reboot_and_verify
This should have been in the last patch.
Attachment #685651 - Flags: review?(bugspam.Callek) → review+
Comment 8 • 13 years ago (Assignee)
why is this still open?
Comment 9 • 12 years ago
(In reply to Joel Maher (:jmaher) from comment #8)
> why is this still open?
bug neglect :(
it was yours but not assigned to you, patches landed. we can close.
Assignee: nobody → jmaher
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 12 years ago
Product: mozilla.org → Release Engineering
Updated • 7 years ago
Component: General Automation → General