Closed Bug 806337 Opened 12 years ago Closed 12 years ago

BMM: Lock relay boards and add retries

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

I need to test whether the relays support concurrent access.  If not, access to them needs to be locked.  In either case, commands need to be retried and the status verified to ensure that a "reboot" actually reboots the board.
Assignee: server-ops-releng → dustin
As noted at https://bugzilla.mozilla.org/show_bug.cgi?id=806152#c9 , we should assume the relay boards do not handle multiple connections, so locks and retries should be implemented in bmm as suggested.

BMM/lifeguard/mozpool should also be the only supported way to request a panda board reboot, whether by a human or by code, so that state is tracked correctly.
I'm working on this right this very instant.  The relay code already checks the status after power-off and again after power-on, so with a bit of locking this should be good to go.  I'm also going to add some short socket timeouts so we're not stuck waiting to talk to a relay that's not there.
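For illustration, the check-after-off, check-after-on pattern with retries could look roughly like the sketch below. This is not the code in relay.py: `set_power`, `get_power`, `retries`, and `delay` are hypothetical names, and the two callables stand in for whatever actually talks to the relay board.

```python
import time


def powercycle(set_power, get_power, retries=3, delay=1.0, sleep=time.sleep):
    """Power a board off and back on, verifying the status after each step.

    set_power(on) and get_power() are hypothetical callables standing in for
    the real relay commands. Returns True once one full off/on cycle has been
    verified, or False if every attempt failed.
    """
    def switch(target):
        # Send the command, then read the status back to confirm it took.
        set_power(target)
        return get_power() == target

    for _ in range(retries):
        try:
            if switch(False):
                sleep(delay)  # hold the board off briefly before restoring power
                if switch(True):
                    return True
        except OSError:
            # Socket errors and timeouts count as a failed attempt; try again.
            pass
    return False
```

Returning a boolean rather than raising leaves it to the caller to decide what a failed reboot means for the device's state.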
We might also want to add a TCP connect timeout to the server side.
Timeouts are done:
  http://hg.mozilla.org/build/bmm/file/b032b833c3cf/mozpool/bmm/relay.py

I used asyncore so that all socket operations are async, and we can time out while waiting for them.  So connection delays, socket delays, and so on are all handled appropriately, and we can pretty much guarantee that as long as the local CPU isn't tied up, the relay functions will return within their timeout.
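As a rough sketch of the same guarantee (connect, send, and receive all bounded by one overall timeout), something like the following would work. Note that this swaps in plain blocking sockets with a shared deadline instead of asyncore, which has been removed from the standard library in recent Python versions. The function name `relay_exchange` and its parameters are hypothetical and do not reflect the relay board's actual protocol.

```python
import socket
import time


def relay_exchange(host, port, request, reply_len, timeout=5.0):
    """Send request and read reply_len bytes, all within timeout seconds.

    Hypothetical helper: one deadline covers the connect, the send, and
    every recv, so a missing or stalled relay cannot hold the caller for
    much longer than the timeout.
    """
    deadline = time.monotonic() + timeout

    def remaining():
        # Time left before the overall deadline; raise once it has passed.
        left = deadline - time.monotonic()
        if left <= 0:
            raise socket.timeout("relay operation timed out")
        return left

    with socket.create_connection((host, port), timeout=remaining()) as sock:
        sock.settimeout(remaining())
        sock.sendall(request)
        reply = b""
        while len(reply) < reply_len:
            sock.settimeout(remaining())
            chunk = sock.recv(reply_len - len(reply))
            if not chunk:
                raise ConnectionError("relay closed the connection early")
            reply += chunk
        return reply
```

Resetting the socket timeout from the remaining time before each call is what keeps several slow steps from adding up to more than the overall limit.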

Locking is in there, too, but much simpler:
  http://hg.mozilla.org/build/bmm/file/b032b833c3cf/mozpool/bmm/relay.py#l65
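
One simple way to serialize access is a lock per relay board, so that only one connection to a given board is open at a time while different boards can still be used in parallel. The sketch below assumes per-host locking, which may not match what relay.py does at the linked line. `board_lock` and `with_board_locked` are hypothetical names.

```python
import threading
from collections import defaultdict

# Guards the lock table itself, so two threads cannot create two
# different locks for the same relay host.
_locks_guard = threading.Lock()
_board_locks = defaultdict(threading.Lock)


def board_lock(host):
    """Return the lock for a relay board host, creating it on first use."""
    with _locks_guard:
        return _board_locks[host]


def with_board_locked(host, operation):
    """Run operation() while holding the lock for the given relay host."""
    with board_lock(host):
        return operation()
```

A lock like this only protects callers inside one process, which is one more reason to route every reboot request through the same service.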
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations