The BMM state machine is designed with the following goals:
- if at all possible, get a board into the desired state (ready), using lots of retries and timeouts
- do not keep state in memory, so processes and servers can be restarted with no impact
- accurately reflect the state of failed boards
The implementation is a set of states, each identified by a short string. Each state has zero or more actions that are taken when a board enters that state, and one or more events that can occur while a board is in that state. Most states have a 'timeout' event, which occurs when no other event arrives within a configured time. Most other events are reported to the system by an HTTP API call from the board itself. Two operations (rebooting and polling) are performed by the reimaging service and generate additional, internal events (reboot-ok, poll-failure).
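As a rough illustration of that structure, here is a minimal sketch of a table-driven state machine. It is not the BMM code: the "rebooting" state name, the transitions chosen, the timeout values, and the helpers start_polling, start_reboot, and handle_event are all assumptions made up for the example. Only "ready", "timeout", "reboot-ok", and "poll-failure" come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    name: str                                   # short string identifying the state
    on_entry: List[Callable[[str], None]] = field(default_factory=list)  # entry actions
    transitions: Dict[str, str] = field(default_factory=dict)  # event -> next state name
    timeout_seconds: Optional[int] = None       # None means the state has no 'timeout' event


# Stand-ins for the real operations; they only record what was requested.
log = []


def start_polling(board):
    log.append(("poll", board))


def start_reboot(board):
    log.append(("reboot", board))


# Hypothetical table: state names other than "ready" and all timings are invented.
STATES = {
    s.name: s
    for s in [
        State("ready", [start_polling],
              {"timeout": "ready", "poll-failure": "rebooting"}, 300),
        State("rebooting", [start_reboot],
              {"reboot-ok": "ready", "timeout": "rebooting"}, 60),
    ]
}


def handle_event(current, event, board):
    """Return the next state name for `event`, running that state's entry actions."""
    state = STATES[current]
    if event not in state.transitions:
        return current  # event is not defined for this state: ignore it
    next_name = state.transitions[event]
    for action in STATES[next_name].on_entry:
        action(board)
    return next_name
```

In this sketch the caller is expected to persist the returned state name itself; nothing about the board is held in the table.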
Aside from the named states, a set of named counters is also stored in the database. Whenever a timeout occurs, the corresponding counter is incremented, and if it passes a configured value, the board goes into a terminal failure state. This is the release valve that stops the service from retrying actions that will never succeed without human intervention, e.g., rebooting a board with no SD card. These failure states can be monitored and used to trigger human intervention.
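The counter mechanism might look something like the following sketch. It uses an in-memory SQLite database purely as a stand-in for the real database, and the table layout, the function names open_db and record_timeout, the counter name, the limit, and the "failed-reboot" state name are all hypothetical. The point is only that both the state and the counters are read from and written back to the database on every timeout, so nothing has to survive in process memory.

```python
import sqlite3


def open_db():
    """Create a throwaway database with an invented schema for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE boards (name TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE counters (board TEXT, counter TEXT, value INTEGER NOT NULL, "
        "PRIMARY KEY (board, counter))")
    return conn


def record_timeout(conn, board, counter, limit, failure_state):
    """Increment `counter` for `board`; move the board to `failure_state`
    once the counter passes `limit`. Returns the board's resulting state."""
    row = conn.execute(
        "SELECT value FROM counters WHERE board = ? AND counter = ?",
        (board, counter)).fetchone()
    value = (row[0] if row else 0) + 1
    conn.execute(
        "INSERT OR REPLACE INTO counters (board, counter, value) VALUES (?, ?, ?)",
        (board, counter, value))
    if value > limit:
        conn.execute(
            "UPDATE boards SET state = ? WHERE name = ?", (failure_state, board))
    conn.commit()
    return conn.execute(
        "SELECT state FROM boards WHERE name = ?", (board,)).fetchone()[0]
```

With a limit of 3, the first three timeouts leave the board where it is and the fourth moves it to the failure state, where a monitoring check could pick it up.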
What follows is a rough draft of the states. I don't promise to keep this up to date with the code, but it gives an idea of the structure.
A few additional notes:
- The state machine is only concerned with two exits from the "ready" state: rebooting or reimaging. Exactly what sort of reimage occurs is dictated by the PXE configuration put in place for the board. That configuration determines which second-stage script runs, and the second-stage script decides whether to go to the reimage-2ndstage state or the maintenance state.
- The ready state has a timeout that just leads back to the ready state. The idea is that if no other state changes occur on a board, the state machine will re-enter the ready state and re-initiate the status polling operation. If that operation fails, the board is rebooted automatically.
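The self-looping timeout on the ready state can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the "rebooting" state name, the function names, and the idea of passing the poll operation in as a callable are all invented here, and a failed or erroring poll is simply mapped to the poll-failure event.

```python
def on_ready_entry(board, poll):
    """Entry action for 'ready': poll the board's status.

    Returns the internal event the poll produced, or None if the poll succeeded.
    """
    try:
        ok = poll(board)
    except OSError:
        ok = False  # an unreachable board counts as a failed poll
    return None if ok else "poll-failure"


def next_state(state, event):
    """Transitions out of 'ready' that matter for this example."""
    if state == "ready" and event == "timeout":
        return "ready"       # re-entering 'ready' re-runs the entry action
    if state == "ready" and event == "poll-failure":
        return "rebooting"   # hypothetical name for the state that reboots the board
    return state


def run_ready_cycle(board, poll):
    """Simulate one 'timeout' of the ready state and return the resulting state."""
    state = next_state("ready", "timeout")
    event = on_ready_entry(board, poll)
    if event is not None:
        state = next_state(state, event)
    return state
```

A healthy board keeps cycling through ready on each timeout, while a board whose poll fails leaves ready for the reboot path without anyone having to notice.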