So, while trying to help recover some panda boards and testing code, I noticed we can get "connection refused" from the relay boards (log from a manual Python shell below).

Do we have rate limiting here? If so, note that multiple connections to a relay board at *once* are entirely possible in production, and will be needed.

-----------------
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/builds/sut_tools/sut_lib.py", line 520, in reboot_relay
    return relayModule.powercycle(relay_host, bank, relay)
  File "/builds/sut_tools/relay.py", line 109, in powercycle
    with connected_socket(relay_hostname, PORT) as sock:
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/builds/sut_tools/relay.py", line 53, in connected_socket
    sock.connect((hostname, port))
  File "<string>", line 1, in connect
socket.error: [Errno 111] Connection refused

>>> for x in sut_lib.pandas:
...     print x
...     try:
...         sut_lib.reboot_relay(x)
...     except:
...         pass
...
panda-0048
10/27/2012 17:55:56: INFO: Calling PDU powercycle for panda-0048, panda-relay-004.build.scl1.mozilla.com:1:3
True
panda-0028
10/27/2012 17:55:57: INFO: Calling PDU powercycle for panda-0028, panda-relay-002.build.scl1.mozilla.com:2:3
True
panda-0032
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0032, panda-relay-002.build.scl1.mozilla.com:2:7
panda-0033
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0033, panda-relay-002.build.scl1.mozilla.com:2:8
panda-0030
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0030, panda-relay-002.build.scl1.mozilla.com:2:5
panda-0031
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0031, panda-relay-002.build.scl1.mozilla.com:2:6
panda-0054
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0054, panda-relay-004.build.scl1.mozilla.com:2:5
True
panda-0055
10/27/2012 17:55:59: INFO: Calling PDU powercycle for panda-0055, panda-relay-004.build.scl1.mozilla.com:2:6
panda-0056
10/27/2012 17:55:59: INFO: Calling PDU powercycle for panda-0056, panda-relay-004.build.scl1.mozilla.com:2:7
......

Note how the successful ones return True; the calls that hit "connection refused" are swallowed by the bare except and print nothing.
Adding in a time.sleep(20) works so far, so either we need a 2-10 second delay+retry when it fails to connect here in the relay script, or there might be a relay config thing that could/should be tweaked.
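For illustration, the delay+retry idea could look roughly like the sketch below. This is not the code in relay.py: the function name connect_with_retry, its parameters, and the default attempt count and delay are all hypothetical, and the only assumption taken from the traceback is that a busy board shows up as ECONNREFUSED (Errno 111) on connect.

```python
import errno
import socket
import time


def connect_with_retry(hostname, port, attempts=5, delay=2.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying only when the peer refuses it.

    Hypothetical helper: the name, parameters, and defaults are
    illustrative and are not taken from relay.py.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((hostname, port))
        except socket.error as exc:
            # Anything other than "connection refused" is not the
            # busy-board case, so let it propagate immediately.
            if exc.errno != errno.ECONNREFUSED:
                raise
            last_error = exc
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error
```

The connect and sleep arguments exist only so the helper can be exercised without a real relay board; a caller would normally leave them at their defaults and tune attempts and delay to whatever the boards tolerate.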
I don't actually know much about how these boards are configured; maybe Jake knows. It wouldn't entirely surprise me to find that they only support one simultaneous connection.
Agreed. That should be fairly easy to add to BMM. Presumably you could do the same in foopy code?
I filed bug 806337 for BMM.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: arich
Is BMM production-ready right now? If so, we should plan on using it. Otherwise we need a retry loop to ensure we can handle the load until we have BMM or some API that queues up the requests.
It is, but we decided last week not to use it until B2G is ready to go. In the interim, DCOps will be using it for reboots and reimages.
Marking needinfo?jake: can the relay boards be configured to accept multiple simultaneous connections? If not, we'll need to bake some retry magic into our sut code [and BMM].
I'm fairly certain the boards themselves can't be configured that way - they're not particularly sophisticated. I haven't verified that the boards actually *don't* accept multiple connections, but it sounds like you have. In that case, bake away (and otherwise, verify then bake).
I highly doubt the relay board supports multiple connections; it is essentially a TCP-socket-to-serial-port bridge, and I don't believe the firmware holds a serial command queue. Calls to reboot a panda board should ultimately go through a central API (probably mozpool/lifeguard/bmm) so that overlapping requests can be prevented and locks and queues can be applied.
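As a rough sketch of the locking idea, and assuming all callers live in a single process, one lock per relay host is enough to keep two powercycle calls from overlapping on the same board. The names lock_for and serialized_powercycle are hypothetical, and the powercycle argument stands in for whatever function actually talks to the board. This is not how mozpool/lifeguard/bmm do it, and it does nothing for callers on different machines, which is the case a central API would have to cover.

```python
import threading

# Protects the lock registry itself, so two threads asking for the same
# host at the same moment still end up sharing one lock.
_registry_guard = threading.Lock()
_host_locks = {}


def lock_for(relay_host):
    """Return the single lock associated with a relay host (hypothetical helper)."""
    with _registry_guard:
        return _host_locks.setdefault(relay_host, threading.Lock())


def serialized_powercycle(relay_host, bank, relay, powercycle):
    """Run powercycle(relay_host, bank, relay) with at most one caller per host.

    `powercycle` is passed in by the caller; this sketch does not assume
    anything about its implementation beyond the three arguments.
    """
    with lock_for(relay_host):
        return powercycle(relay_host, bank, relay)
```

Requests for different relay hosts still run in parallel; only requests that target the same board queue up behind each other.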
Ok, WONTFIX in favor of short-term 811641 and long term MozPool/BMM supporting retries/being our interface.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering