Closed Bug 800387 Opened 13 years ago Closed 13 years ago

Build a state machine for mobile processes

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

The current BMM implementation is pretty optimistic - set the PXE config (symlink), reboot the board (relay), and assume it will work. We'll need a more substantial process. I think we should model this as a state machine.

States will be represented as short strings in the DB. Events will be identified by short strings, too, and will include:
- explicit event-notification API calls from the boards
- timeouts since the last state change

(and explicitly NOT include syslog lines, as syslog is unreliable)
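To make the two event sources concrete, here is a minimal sketch, assuming a board's DB row boils down to a state string plus the time of the last state change. `BoardRecord`, `pending_event`, and the `"timeout"` event name are hypothetical names for illustration, not part of BMM.

```python
import time


class BoardRecord:
    """Hypothetical in-memory stand-in for one board's row in the DB."""

    def __init__(self, name, state, now=None):
        self.name = name
        self.state = state  # short string, as it would be stored in the DB
        self.state_changed_at = time.time() if now is None else now

    def set_state(self, state, now=None):
        # Every state change resets the clock that timeouts are measured from.
        self.state = state
        self.state_changed_at = time.time() if now is None else now


def pending_event(board, api_event=None, timeout=None, now=None):
    """Return the short-string event to handle next, or None.

    An explicit API call from the board takes precedence; otherwise a
    "timeout" event fires once `timeout` seconds have passed since the
    last state change. Syslog is deliberately not consulted.
    """
    if api_event is not None:
        return api_event
    now = time.time() if now is None else now
    if timeout is not None and now - board.state_changed_at >= timeout:
        return "timeout"
    return None
```

Passing `now` explicitly keeps the timeout check deterministic, which makes it easy to exercise without waiting on a real clock.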
Here's how I'm thinking of specifying this state machine. Using classes gives us ample opportunity to override things and customize the behavior. My only worry is that this might be a little verbose. Thoughts? Comments?

(This is just a few states for now, until I'm happy with the specification style)

----
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

class State(object):
    pass

####
# Initial and steady states

class New(State):

    # TODO: these two methods should be in a mixin, most likely, as they'll
    # appear in many states

    @State.apiEvent('rq-reboot')
    def onRebootRequest(self):
        self.gotoState('RebootRebooting')

    @State.apiEvent('rq-reimage')
    def onReimageRequest(self):
        self.gotoState('ReimageRebooting')


class Ready(State):

    TIMEOUT = 300

    def onEntry(self):
        self.clearCounters()
        self.startPolling()

    def gotoState(self, state):
        # always stop polling before moving to another state
        self.stopPolling()
        State.gotoState(self, state)

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        self.gotoState('Ready')

    @State.apiEvent('rq-reboot')
    def onRebootRequest(self):
        self.gotoState('RebootRebooting')

    @State.apiEvent('rq-reimage')
    def onReimageRequest(self):
        self.gotoState('ReimageRebooting')

    def onPollFailure(self):
        self.gotoState('RebootRebooting')

####
# Rebooting

class RebootRebooting(State):

    # wait for 60 seconds for a power cycle to succeed, and do this a bunch of
    # times; failures here are likely a problem with the network or relay board,
    # so we want to retry until that's available.
    TIMEOUT = 60
    PERMANENT_FAILURE_COUNT = 200

    def onEntry(self):
        self.removeSymlink()
        self.startPowerCycle()

    def gotoState(self, state):
        # always stop the power-cycling process before moving to another state
        self.stopPowerCycle()
        State.gotoState(self, state)

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        if self.countFailure('RebootRebooting') > self.PERMANENT_FAILURE_COUNT:
            self.gotoState('FailedRebootRebooting')
        else:
            self.gotoState('RebootRebooting')

    def onPowerCycleOk(self):
        self.gotoState('RebootComplete')

    def onPowerCycleFail(self):
        pass  # just wait for our timeout to expire


class RebootComplete(State):

    # give the image ample time to come up and tell us that it's running, but if
    # that doesn't work after a few reboots, the image itself is probably bad
    TIMEOUT = 600
    PERMANENT_FAILURE_COUNT = 10

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        if self.countFailure('RebootComplete') > self.PERMANENT_FAILURE_COUNT:
            self.gotoState('FailedRebootComplete')
        else:
            self.gotoState('RebootRebooting')

    @State.apiEvent('image-running')
    def onImageRunning(self):
        self.gotoState('Ready')
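The specification above leaves `State.apiEvent`, `State.timeout`, and `State.gotoState` undefined. One way they might be wired up is sketched below. This is only an illustration under assumptions, not the implementation that became lifeguard: the `Machine` class, the handler registry, and the use of Python 3's `__init_subclass__` are all hypothetical choices made here.

```python
class State:
    """Base class; subclasses are registered under their class name."""

    states = {}  # state name -> State subclass (hypothetical registry)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.api_handlers = {}       # event name -> handler method name
        cls.timeout_handler = None  # name of the timeout handler, if any
        cls.timeout_seconds = None
        # dir() also sees inherited methods, so handlers from mixins count.
        for name in dir(cls):
            attr = getattr(cls, name, None)
            event = getattr(attr, "_api_event", None)
            if event is not None:
                cls.api_handlers[event] = name
            seconds = getattr(attr, "_timeout", None)
            if seconds is not None:
                cls.timeout_handler = name
                cls.timeout_seconds = seconds
        State.states[cls.__name__] = cls

    def __init__(self, machine):
        self.machine = machine

    @staticmethod
    def apiEvent(event):
        # Decorator: tag a method as the handler for one API event.
        def mark(fn):
            fn._api_event = event
            return fn
        return mark

    @staticmethod
    def timeout(seconds):
        # Decorator: tag a method as the handler for this state's timeout.
        def mark(fn):
            fn._timeout = seconds
            return fn
        return mark

    def onEntry(self):
        pass

    def gotoState(self, name):
        self.machine.goto(name)


class Machine:
    """Hypothetical driver that holds the current state for one board."""

    def __init__(self, initial):
        self.goto(initial)

    def goto(self, name):
        self.state_name = name  # the short string that would go in the DB
        self.state = State.states[name](self)
        self.state.onEntry()

    def handle_api_event(self, event):
        handler = self.state.api_handlers.get(event)
        if handler is None:
            return False  # event is not meaningful in this state
        getattr(self.state, handler)()
        return True

    def handle_timeout(self):
        handler = self.state.timeout_handler
        if handler is None:
            return False
        getattr(self.state, handler)()
        return True


# Two cut-down states in the specification style, to show the wiring.
class New(State):
    @State.apiEvent("rq-reboot")
    def onRebootRequest(self):
        self.gotoState("RebootRebooting")


class RebootRebooting(State):
    TIMEOUT = 60

    @State.timeout(TIMEOUT)
    def onTimeout(self):
        self.gotoState("RebootRebooting")
```

With this wiring, `Machine("New").handle_api_event("rq-reboot")` moves the machine to `RebootRebooting`, and events with no handler in the current state are reported as unhandled rather than raising.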
Assignee: server-ops-releng → dustin
Work is suspended here while mcote and I figure out where this fits.
This has since become the lifeguard state-machine.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations