Closed Bug 811641 Opened 12 years ago Closed 12 years ago

add support into sut_tools to retry after a relay board socket connection failure

Categories

(Release Engineering :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

References

Details

Attachments

(1 file, 1 obsolete file)

Attached patch [tools] v1 (obsolete) — Splinter Review
+++ This bug was initially created as a clone of Bug #808398 +++

I found after doing a pdu power cycle, that we could sometimes fail if done fast enough (the relay board would only accept one socket connection at a time). I filed Bug 806152 about that issue.

And based on https://bugzilla.mozilla.org/show_bug.cgi?id=806152#c9 it seems like we will not be able to easily work around that issue at the lowest level (the relay board itself) Therefore I am doing this to work around it in sut_tools for now, pending our eventual integration with mozpool.

Feel free to nitpick on my python style here, I chose was seemed the most logical to me, but I'm open to bikeshedding if you can cite at least one reason why you feel your color is better. :-)
Attachment #681382 - Flags: review?(jmaher)
Comment on attachment 681382 [details] [diff] [review]
[tools] v1

Review of attachment 681382 [details] [diff] [review]:
-----------------------------------------------------------------

if we return false, verify.py, reboot.py, installApp.py and config.py will do nothing with the return code.  Oddly enough cleanup.py import soft_reboot, but never calls it.

::: sut_tools/sut_lib.py
@@ +518,5 @@
>          bank, relay = map(int, pandas[device]['relayid'].split(":"))
>          log.info("Calling PDU powercycle for %s, %s:%s:%s" % (device, relay_host, bank, relay))
> +        maxTries = 5
> +        curTry = 0
> +        while not relayModule.powercycle(relay_host, bank, relay):

lets log a warning to indicate we failed to powercycle:
log.info("Was not able to powercycle, attempt %s of %s" % (curTry, maxTries))
Attachment #681382 - Flags: review?(jmaher) → review-
There are two weaknesses I've identified on the relay boards.  Keep in mind these are basically Arduino's bit-banging ethernet, so don't expect too much from their TCP stacks!

 1. Only one TCP connection at a time - others will be refused
 2. A new connection made immediately after the previous is closed will be refused
    (I don't know "immediately" in milliseconds - BMM is using 1 second)

I think this code has a good likelihood of working around both, although it is a probabilistic solution, so there's still room for continued failure.  In particular, if more than 5 instances of this code try to attack the same relay board, and only one succeeds each second, then only 5 will succeed before all time out.  Given that there are 14 relays on a board, I'd recommend setting the retries to 15 or higher.
Attached patch [tools] v2Splinter Review
Ok, so after talking in IRC we think we'll move forward with this, as-is and in the cases where we feel we need/want to completely fail a job at the reboot failing to issue, we'll properly do that in a followup.

We also agreed that this is still much better of a choice than we have right now, and to properly add extra logging of each failed try, and to increase max to 15 [though it might be larger than needed, but still useful]
Attachment #681382 - Attachment is obsolete: true
Attachment #681858 - Flags: review?(jmaher)
Comment on attachment 681858 [details] [diff] [review]
[tools] v2

Review of attachment 681858 [details] [diff] [review]:
-----------------------------------------------------------------

looks as we discussed on irc, thanks!
Attachment #681858 - Flags: review?(jmaher) → review+
http://hg.mozilla.org/build/tools/rev/8493ef87d9c2

And updating staging tegra and all panda foopies now
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: