Closed
Bug 811641
Opened 12 years ago
Closed 12 years ago
add support into sut_tools to retry after a relay board socket connection failure
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: Callek)
References
Details
Attachments
(1 file, 1 obsolete file)
1.14 KB,
patch
|
jmaher
:
review+
|
Details | Diff | Splinter Review |
+++ This bug was initially created as a clone of Bug #808398 +++ I found after doing a pdu power cycle, that we could sometimes fail if done fast enough (the relay board would only accept one socket connection at a time). I filed Bug 806152 about that issue. And based on https://bugzilla.mozilla.org/show_bug.cgi?id=806152#c9 it seems like we will not be able to easily work around that issue at the lowest level (the relay board itself) Therefore I am doing this to work around it in sut_tools for now, pending our eventual integration with mozpool. Feel free to nitpick on my python style here, I chose was seemed the most logical to me, but I'm open to bikeshedding if you can cite at least one reason why you feel your color is better. :-)
Attachment #681382 -
Flags: review?(jmaher)
Comment 1•12 years ago
|
||
Comment on attachment 681382 [details] [diff] [review] [tools] v1 Review of attachment 681382 [details] [diff] [review]: ----------------------------------------------------------------- if we return false, verify.py, reboot.py, installApp.py and config.py will do nothing with the return code. Oddly enough cleanup.py import soft_reboot, but never calls it. ::: sut_tools/sut_lib.py @@ +518,5 @@ > bank, relay = map(int, pandas[device]['relayid'].split(":")) > log.info("Calling PDU powercycle for %s, %s:%s:%s" % (device, relay_host, bank, relay)) > + maxTries = 5 > + curTry = 0 > + while not relayModule.powercycle(relay_host, bank, relay): lets log a warning to indicate we failed to powercycle: log.info("Was not able to powercycle, attempt %s of %s" % (curTry, maxTries))
Attachment #681382 -
Flags: review?(jmaher) → review-
Comment 2•12 years ago
|
||
There are two weaknesses I've identified on the relay boards. Keep in mind these are basically Arduino's bit-banging ethernet, so don't expect too much from their TCP stacks! 1. Only one TCP connection at a time - others will be refused 2. A new connection made immediately after the previous is closed will be refused (I don't know "immediately" in milliseconds - BMM is using 1 second) I think this code has a good likelihood of working around both, although it is a probabilistic solution, so there's still room for continued failure. In particular, if more than 5 instances of this code try to attack the same relay board, and only one succeeds each second, then only 5 will succeed before all time out. Given that there are 14 relays on a board, I'd recommend setting the retries to 15 or higher.
Assignee | ||
Comment 3•12 years ago
|
||
Ok, so after talking in IRC we think we'll move forward with this, as-is and in the cases where we feel we need/want to completely fail a job at the reboot failing to issue, we'll properly do that in a followup. We also agreed that this is still much better of a choice than we have right now, and to properly add extra logging of each failed try, and to increase max to 15 [though it might be larger than needed, but still useful]
Attachment #681382 -
Attachment is obsolete: true
Attachment #681858 -
Flags: review?(jmaher)
Comment 4•12 years ago
|
||
Comment on attachment 681858 [details] [diff] [review] [tools] v2 Review of attachment 681858 [details] [diff] [review]: ----------------------------------------------------------------- looks as we discussed on irc, thanks!
Attachment #681858 -
Flags: review?(jmaher) → review+
Assignee | ||
Comment 5•12 years ago
|
||
http://hg.mozilla.org/build/tools/rev/8493ef87d9c2 And updating staging tegra and all panda foopies now
Assignee | ||
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•