Closed
Bug 800387
Opened 13 years ago
Closed 13 years ago
Build a state machine for mobile processes
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Details
The current BMM implementation is pretty optimistic: it sets the PXE config (a symlink), reboots the board (via the relay), and assumes that will work.
We'll need a more substantial process. I think we should model this as a state machine.
States will be represented as short strings in the DB.
Events will be identified by short strings, too, and will include:
- explicit event-notification API calls from the boards
- timeouts since the last state change
Events will explicitly NOT include syslog lines, as syslog is unreliable.
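A rough sketch of that idea, not a proposed implementation: states and events are plain short strings, a table maps (state, event) pairs to the next state, and a timeout is just another event derived from the time of the last state change. The `Board` class is an in-memory stand-in for a DB row, and the table contents, the `power-cycle-ok` and `timeout` event names, and the timeout values are all assumptions chosen for illustration.

```python
import time

# Hypothetical transition table: (state, event) -> next state.
TRANSITIONS = {
    ("Ready", "rq-reboot"): "RebootRebooting",
    ("RebootRebooting", "power-cycle-ok"): "RebootComplete",
    ("RebootRebooting", "timeout"): "RebootRebooting",
    ("RebootComplete", "image-running"): "Ready",
    ("RebootComplete", "timeout"): "RebootRebooting",
}

# Hypothetical per-state timeouts, in seconds since the last state change.
TIMEOUTS = {"RebootRebooting": 60, "RebootComplete": 600}


class Board:
    """In-memory stand-in for one DB row: state string plus last-change time."""

    def __init__(self, state, now=None):
        self.state = state
        self.state_changed = time.time() if now is None else now

    def handle_event(self, event, now=None):
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return False  # event is not valid in the current state; ignore it
        self.state = next_state
        self.state_changed = time.time() if now is None else now
        return True

    def check_timeout(self, now=None):
        # Turn "too long since the last state change" into a 'timeout' event.
        now = time.time() if now is None else now
        limit = TIMEOUTS.get(self.state)
        if limit is not None and now - self.state_changed >= limit:
            return self.handle_event("timeout", now=now)
        return False
```

Because a self-transition also updates the last-change time, a state that times out into itself restarts its own timer, which is what a retry loop needs.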
Comment 1•13 years ago
Here's how I'm thinking of specifying this state machine. Using classes gives us ample opportunity to override things and customize the behavior. My only worry is that this might be a little verbose. Thoughts? Comments? (This is just a few states for now, until I'm happy with the specification style)
----
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
class State(object):
    pass

####
# Initial and steady states

class New(State):

    # TODO: these two methods should be in a mixin, most likely, as they'll
    # appear in many states
    @State.apiEvent('rq-reboot')
    def onRebootRequest(self):
        self.gotoState('RebootRebooting')

    @State.apiEvent('rq-reimage')
    def onReimageRequest(self):
        self.gotoState('ReimageRebooting')

class Ready(State):

    TIMEOUT = 300

    def onEntry(self):
        self.clearCounters()
        self.startPolling()

    def gotoState(self, state):
        # always stop polling before moving to another state
        self.stopPolling()
        State.gotoState(self, state)

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        self.gotoState('Ready')

    @State.apiEvent('rq-reboot')
    def onRebootRequest(self):
        self.gotoState('RebootRebooting')

    @State.apiEvent('rq-reimage')
    def onReimageRequest(self):
        self.gotoState('ReimageRebooting')

    def onPollFailure(self):
        self.gotoState('RebootRebooting')

####
# Rebooting

class RebootRebooting(State):

    # wait for 60 seconds for a power cycle to succeed, and do this a bunch of
    # times; failures here are likely a problem with the network or relay board,
    # so we want to retry until that's available.
    TIMEOUT = 60
    PERMANENT_FAILURE_COUNT = 200

    def onEntry(self):
        self.removeSymlink()
        self.startPowerCycle()

    def gotoState(self, state):
        # always stop the power-cycling process before moving to another state
        self.stopPowerCycle()
        State.gotoState(self, state)

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        if self.countFailure('RebootRebooting') > self.PERMANENT_FAILURE_COUNT:
            self.gotoState('FailedRebootRebooting')
        else:
            self.gotoState('RebootRebooting')

    def onPowerCycleOk(self):
        self.gotoState('RebootComplete')

    def onPowerCycleFail(self):
        pass  # just wait for our timeout to expire

class RebootComplete(State):

    # give the image ample time to come up and tell us that it's running, but if
    # that doesn't work after a few reboots, the image itself is probably bad
    TIMEOUT = 600
    PERMANENT_FAILURE_COUNT = 10

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        if self.countFailure('RebootComplete') > self.PERMANENT_FAILURE_COUNT:
            self.gotoState('FailedRebootComplete')
        else:
            self.gotoState('RebootRebooting')

    @State.apiEvent('image-running')
    def onImageRunning(self):
        self.gotoState('Ready')
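
The spec above leaves `State` empty, so the decorators and helpers it uses are not yet defined. As a sketch only, here is one way they might be backed: the decorators just tag handler methods, and a dispatcher looks the tags up at runtime. Everything beyond the names used in the spec is an assumption made for illustration (`Machine`, `dispatch`, `handleApiEvent`, `handleTimeout`, the in-memory failure counters), and the states are cut down with the power cycling and polling stubbed out.

```python
class State(object):
    """Hypothetical base class; `machine` is an assumed owner object."""

    def __init__(self, machine):
        self.machine = machine

    @staticmethod
    def apiEvent(name):
        # Tag the handler with the API event name it responds to.
        def wrap(func):
            func.api_event = name
            return func
        return wrap

    @staticmethod
    def timeout(seconds):
        # Tag the handler with the timeout (in seconds) that triggers it.
        def wrap(func):
            func.timeout_seconds = seconds
            return func
        return wrap

    def dispatch(self, tag, value=None):
        # Call the first method carrying `tag` (and matching `value`, if given).
        for attr in dir(self):
            method = getattr(self, attr)
            if hasattr(method, tag) and value in (None, getattr(method, tag)):
                method()
                return True
        return False

    def handleApiEvent(self, name):
        return self.dispatch('api_event', name)

    def handleTimeout(self):
        return self.dispatch('timeout_seconds')

    def onEntry(self):
        pass

    def gotoState(self, name):
        self.machine.gotoState(name)

    def countFailure(self, name):
        failures = self.machine.failures
        failures[name] = failures.get(name, 0) + 1
        return failures[name]

    def clearCounters(self):
        self.machine.failures.clear()


class Machine(object):
    """Hypothetical driver holding the current state and failure counters."""

    def __init__(self, states, initial):
        self.states = dict((cls.__name__, cls) for cls in states)
        self.failures = {}
        self.gotoState(initial)

    def gotoState(self, name):
        self.state = self.states[name](self)
        self.state.onEntry()


class Ready(State):
    def onEntry(self):
        self.clearCounters()

    @State.apiEvent('rq-reboot')
    def onRebootRequest(self):
        self.gotoState('RebootRebooting')


class RebootRebooting(State):
    def onPowerCycleOk(self):  # the power cycling itself is stubbed out here
        self.gotoState('RebootComplete')


class RebootComplete(State):
    TIMEOUT = 600
    PERMANENT_FAILURE_COUNT = 10

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        if self.countFailure('RebootComplete') > self.PERMANENT_FAILURE_COUNT:
            self.gotoState('FailedRebootComplete')
        else:
            self.gotoState('RebootRebooting')

    @State.apiEvent('image-running')
    def onImageRunning(self):
        self.gotoState('Ready')


class FailedRebootComplete(State):
    pass  # terminal failure state in this sketch
```

With this shape, a caller would build `Machine([Ready, RebootRebooting, RebootComplete, FailedRebootComplete], 'Ready')` and feed it `handleApiEvent('rq-reboot')` and friends. Tagging functions instead of registering them keeps the per-state classes as short as the spec, at the cost of a `dir()` scan on every event.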
Updated•13 years ago
Assignee: server-ops-releng → dustin
Comment 2•13 years ago
Work is suspended here while mcote and I figure out where this fits.
Comment 3•13 years ago
This has since become the lifeguard state-machine.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations