Closed Bug 806152 Opened 12 years ago Closed 12 years ago

Panda relay boards issuing connection refused....

Categories

(Release Engineering :: General, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Callek, Unassigned)

Sooo, in an attempt to help recover some panda boards, and while testing code, I notice we can get connection refused on the relay boards (log from a manual py shell below)

Do we have rate limiting here? If so, that's a problem: getting multiple connections to the relay board at *once* is entirely possible in production, and will be needed.

-----------------
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/builds/sut_tools/sut_lib.py", line 520, in reboot_relay
    return relayModule.powercycle(relay_host, bank, relay)
  File "/builds/sut_tools/relay.py", line 109, in powercycle
    with connected_socket(relay_hostname, PORT) as sock:
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/builds/sut_tools/relay.py", line 53, in connected_socket
    sock.connect((hostname, port))
  File "<string>", line 1, in connect
socket.error: [Errno 111] Connection refused
>>> for x in sut_lib.pandas:
...   print x
...   try:
...     sut_lib.reboot_relay(x)
...   except:
...     pass
...
panda-0048
10/27/2012 17:55:56: INFO: Calling PDU powercycle for panda-0048, panda-relay-004.build.scl1.mozilla.com:1:3
True
panda-0028
10/27/2012 17:55:57: INFO: Calling PDU powercycle for panda-0028, panda-relay-002.build.scl1.mozilla.com:2:3
True
panda-0032
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0032, panda-relay-002.build.scl1.mozilla.com:2:7
panda-0033
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0033, panda-relay-002.build.scl1.mozilla.com:2:8
panda-0030
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0030, panda-relay-002.build.scl1.mozilla.com:2:5
panda-0031
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0031, panda-relay-002.build.scl1.mozilla.com:2:6
panda-0054
10/27/2012 17:55:58: INFO: Calling PDU powercycle for panda-0054, panda-relay-004.build.scl1.mozilla.com:2:5
True
panda-0055
10/27/2012 17:55:59: INFO: Calling PDU powercycle for panda-0055, panda-relay-004.build.scl1.mozilla.com:2:6
panda-0056
10/27/2012 17:55:59: INFO: Calling PDU powercycle for panda-0056, panda-relay-004.build.scl1.mozilla.com:2:7
......

Note how the successful ones return True
Adding in a time.sleep(20) works so far, so either we need a 2-10 second delay+retry when it fails to connect here in the relay script, or there might be a relay config thing that could/should be tweaked.
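A minimal sketch of the delay+retry idea, written for Python 3 (the traceback above is from Python 2.6). The helper `retry_on_refused`, its parameters, and its defaults are hypothetical and are not part of sut_lib.py or relay.py. It only retries on ECONNREFUSED, on the assumption that a refused connection means the board is busy with another client.

```python
import errno
import socket
import time


def retry_on_refused(action, attempts=5, delay=2.0, sleep=time.sleep):
    """Call action(); if it fails with ECONNREFUSED, wait and try again.

    Hypothetical helper: the name, defaults, and signature are
    illustrative and do not come from sut_lib.py or relay.py.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except socket.error as exc:
            # Only "connection refused" is treated as transient; any other
            # socket error propagates immediately. The last refused attempt
            # is re-raised so the caller still sees the failure.
            if exc.errno != errno.ECONNREFUSED or attempt == attempts:
                raise
            sleep(delay)
```

In the shell loop from the report this might be used as `retry_on_refused(lambda: sut_lib.reboot_relay(x))`, assuming `reboot_relay` lets the `socket.error` propagate as the traceback suggests. The `sleep` parameter exists only so the wait can be stubbed out when testing.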
I don't actually know much about how these boards are configured, maybe Jake knows. It wouldn't entirely surprise me to find that they only support one simultaneous connection.
Agreed.  That should be fairly easy to add to BMM.  Presumably you could do the same in foopy code?
I filed bug 806337 for BMM.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: arich
Is BMM production ready right now? If so, we should plan on using it.

Otherwise we need to have a retry loop to ensure we can handle the load until we have BMM or some api that queues up the requests.
It is, but we decided last week not to use it until B2G is ready to go.  In the interim, DCOps will be using it for reboots and reimages.
Marking needinfo?jake on whether the relay boards can be configured to accept multiple simultaneous connections; otherwise we'll need to bake some retry magic into our sut code [and bmm].
Flags: needinfo?(jwatkins)
I'm fairly certain the boards themselves can't be configured that way - they're not particularly sophisticated.  I haven't verified that the boards actually *don't* accept multiple connections, but it sounds like you have.  In that case, bake away (and otherwise, verify then bake).
Flags: needinfo?(jwatkins)
I highly doubt the relay board supports multiple connections; it is essentially a TCP-socket-to-serial-port bridge. And I don't believe the firmware holds a serial command queue.

Calls to reboot a panda board should ultimately go through a central API (probably mozpool/lifeguard/bmm), so that overlapping requests can be prevented and locks and queues become possible.
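To illustrate the locking idea, here is a rough sketch that allows only one in-flight command per relay host. The class `RelayGate` and its `run` method are hypothetical names, not anything in mozpool, lifeguard, or BMM. It only serializes callers inside a single Python process; coordinating several foopies would still need the central service described above.

```python
import threading
from collections import defaultdict


class RelayGate(object):
    """Allow at most one in-flight command per relay host.

    Hypothetical sketch: works within one process only and is not a
    substitute for a central queueing API.
    """

    def __init__(self):
        # Protects the dictionary of per-host locks.
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def run(self, relay_host, action):
        # Look up (or create) the lock for this host under the guard,
        # then hold only the per-host lock while the action runs, so
        # commands to different relay boards can still overlap.
        with self._guard:
            lock = self._locks[relay_host]
        with lock:
            return action()
```

A caller would wrap each power cycle, for example `gate.run(relay_host, lambda: powercycle(relay_host, bank, relay))`, where `powercycle` stands in for whatever function actually talks to the board.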
Depends on: 811641
Ok, WONTFIX in favor of short-term bug 811641 and, long-term, MozPool/BMM supporting retries and being our interface.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering