Designing State
I was recently developing an application that runs a long-running hardware test. It worked, but the code had developed a very definite "smell": changes kept introducing bugs, and there was too much surface area.
Here are some loose thoughts on how to approach the problem.
Try to divide the system into these four layers:
- Raw State (truths of the system)
  - is_display_on
  - is_faulted
- Derived capabilities, conditions, GUI selectors, Clock/Task Guards (not stored)
  - can_start
  - can_stop
  - can_poll_meter
- Actions/Tasks (requires...)
- UI Actions (GUI selectors)
  - start button enabled (can_start)
  - stop button enabled (can_stop)
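As a rough sketch of how the layers might fit together: the fields `is_running` and `meter_connected`, the rules inside each selector, and the `view_model` helper are assumptions made for illustration, not part of the original application.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RawState:
    """Layer 1: stored facts only."""
    is_display_on: bool = False
    is_faulted: bool = False
    is_running: bool = False       # assumed fact, for illustration
    meter_connected: bool = False  # assumed fact, for illustration


# Layer 2: derived capabilities. Pure functions of state, never stored.
def can_start(s: RawState) -> bool:
    return not s.is_running and not s.is_faulted


def can_stop(s: RawState) -> bool:
    return s.is_running


def can_poll_meter(s: RawState) -> bool:
    # Layer 3 would use this as the guard of a meter-polling task.
    return s.meter_connected and s.is_running and not s.is_faulted


# Layer 4: the GUI reads selectors instead of re-implementing the rules.
def view_model(s: RawState) -> dict[str, bool]:
    return {
        "start_button_enabled": can_start(s),
        "stop_button_enabled": can_stop(s),
    }
```

Because only the first layer is stored, a rule change touches one selector and every consumer (task guard or button) picks it up.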
Other Ideas
- Consider adding a "guard" callable to each task's execution condition. A task whose guard returns false would never become due (is_due stays false).
- Try to consolidate state into a "store" or single state object
- Limit state changes to named events
- Selectors: pure functions that derive conditions from state (literally the state object)
- Effects: code that talks to hardware, threads, files, etc.
- Views: GUI elements that reflect selectors and dispatch events
- use "transition states" if operations are not instantaneous (starting, stopping, initializing...)
- It should be impossible to represent an invalid state
- handlers should have minimal defensive checks
- Separate phase, cause, and capabilities
- Supervisors, controllers are good things. Make sure you have one!
- Don't create more runtime phases/states to represent why stopped/faulted. Instead change phase/state and provide structured reason.
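Several of these ideas (a single state object, named events, transition states, and a structured reason instead of extra phases) can be sketched together as one pure transition function. This is a simplified illustration: the `Phase` members, the `reduce` name, and the string reasons are assumptions, not the application's actual design.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Optional


class Phase(Enum):
    IDLE = auto()
    STARTING = auto()  # transition state: startup is not instantaneous
    RUNNING = auto()
    STOPPING = auto()  # transition state: shutdown is not instantaneous
    STOPPED = auto()


class Event(Enum):
    START_REQUESTED = auto()
    STARTUP_SUCCEEDED = auto()
    STOP_REQUESTED = auto()
    DEVICE_FAULTED = auto()
    SHUTDOWN_COMPLETE = auto()


@dataclass(frozen=True)
class State:
    phase: Phase = Phase.IDLE
    stop_reason: Optional[str] = None  # the cause lives here, not in a new phase


def reduce(state: State, event: Event) -> State:
    """Return the next state; events that do not apply leave it unchanged."""
    p = state.phase
    if event is Event.START_REQUESTED and p in (Phase.IDLE, Phase.STOPPED):
        return State(phase=Phase.STARTING)  # a fresh start clears the old reason
    if event is Event.STARTUP_SUCCEEDED and p is Phase.STARTING:
        return replace(state, phase=Phase.RUNNING)
    if event is Event.STOP_REQUESTED and p in (Phase.STARTING, Phase.RUNNING):
        return State(phase=Phase.STOPPING, stop_reason="operator")
    if event is Event.DEVICE_FAULTED and p in (Phase.STARTING, Phase.RUNNING):
        return State(phase=Phase.STOPPING, stop_reason="device_fault")
    if event is Event.SHUTDOWN_COMPLETE and p is Phase.STOPPING:
        return replace(state, phase=Phase.STOPPED)
    return state
```

An operator stop and a device fault both end in `STOPPED`; only `stop_reason` differs, so the phase count does not grow with the number of causes.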
Define Lifecycle
- Session Life Cycle (persistence and restoring state)
- Runtime life cycle (idle, initializing, running, stopping, paused)
- infrastructure/device life cycle (connecting, connected, initializing, initialized, operational, shutting down, closed)
- Execution life cycle (clock advancement, advancing, halted)
- Fault interruption (operator, device, task, unrecoverable/recoverable)
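One possible reading of these life cycles is that each is its own axis, and that a condition such as "the clock may advance" is derived from the axes rather than stored as another mode. The sketch below assumes that reading; the enum members follow the lists above, while the `fault` parameter and the rule itself are illustrative.

```python
from __future__ import annotations

from enum import Enum, auto
from typing import Optional


class RuntimePhase(Enum):
    IDLE = auto()
    INITIALIZING = auto()
    RUNNING = auto()
    STOPPING = auto()
    PAUSED = auto()


class InfraPhase(Enum):
    CONNECTING = auto()
    CONNECTED = auto()
    INITIALIZING = auto()
    INITIALIZED = auto()
    OPERATIONAL = auto()
    SHUTTING_DOWN = auto()
    CLOSED = auto()


def should_advance_clock(
    runtime: RuntimePhase, infra: InfraPhase, fault: Optional[str]
) -> bool:
    """Derived from the axes; assumed rule, not the application's own."""
    return (
        runtime is RuntimePhase.RUNNING
        and infra is InfraPhase.OPERATIONAL
        and fault is None
    )
```

Two enums with 5 and 7 members describe 35 combinations without naming each one as a separate mode.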
Refactor Guide
- Inventory the conditions of the system
  - For each one, record: concern (thing being controlled), rule, rule location, future rule location, and whether it is raw or derived state
- Reduce mode count, maybe split into axes
- Centralize guards/selectors/conditions
  - Centralize first, before rewriting anything else
- Make tasks declare conditions (e.g., name, interval, guard, handler, other)
- make GUI bind to selectors
- Define transition events
For each thing, identify if it is a fact, a command, a capability or a policy
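A task that declares its own conditions might look like the sketch below. The field names `interval_s` and `last_run_s`, the `run_due_tasks` loop, and passing the state object to the guard are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


def always_allowed(state: Any) -> bool:
    """Default guard: the task is never blocked."""
    return True


@dataclass
class TaskSpec:
    name: str
    interval_s: float
    handler: Callable[[], None]
    guard: Callable[[Any], bool] = always_allowed
    last_run_s: float = float("-inf")  # never run yet

    def is_due(self, now_s: float, state: Any) -> bool:
        # A false guard keeps the task from becoming due at all.
        return self.guard(state) and (now_s - self.last_run_s) >= self.interval_s


def run_due_tasks(tasks: list[TaskSpec], now_s: float, state: Any) -> list[str]:
    """Run every due task once and return the names that ran."""
    ran = []
    for task in tasks:
        if task.is_due(now_s, state):
            task.handler()
            task.last_run_s = now_s
            ran.append(task.name)
    return ran
```

With the guard declared on the spec, the handler itself needs few defensive checks, because it only runs when the guard has already passed.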
Mental Models
- Operator Model
  - Can start/resume when stopped
  - Can stop while starting/running
  - Sees a clear reason if stopped due to fault
- Internal Model
  - Idle
  - Starting
  - Running
  - Stopping
  - Paused
  - Faulted
  - Completed/Aborted
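The two models can be connected by a transition table: the internal model lists which phase changes are legal, and the operator-facing capabilities fall out of it. The specific transitions in `ALLOWED` are an assumption for illustration (for instance, every fault is treated as resumable here), not a statement of the application's real rules.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    STARTING = auto()
    RUNNING = auto()
    STOPPING = auto()
    PAUSED = auto()
    FAULTED = auto()
    COMPLETED = auto()
    ABORTED = auto()


# Assumed transition table: phase -> phases reachable from it.
ALLOWED = {
    Phase.IDLE: {Phase.STARTING},
    Phase.STARTING: {Phase.RUNNING, Phase.STOPPING, Phase.FAULTED},
    Phase.RUNNING: {Phase.STOPPING, Phase.FAULTED, Phase.COMPLETED},
    Phase.STOPPING: {Phase.PAUSED, Phase.FAULTED, Phase.ABORTED},
    Phase.PAUSED: {Phase.STARTING},
    Phase.FAULTED: {Phase.STARTING, Phase.ABORTED},
    Phase.COMPLETED: set(),  # terminal
    Phase.ABORTED: set(),    # terminal
}


def transition(current: Phase, target: Phase) -> Phase:
    """Return the target phase, or raise if the change is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def can_start(phase: Phase) -> bool:
    # Operator model: start/resume is offered whenever STARTING is reachable.
    return Phase.STARTING in ALLOWED[phase]


def can_stop(phase: Phase) -> bool:
    # Operator model: stop is offered whenever STOPPING is reachable.
    return Phase.STOPPING in ALLOWED[phase]
```

Under this table, `can_start` holds for idle, paused, and faulted phases, and `can_stop` holds while starting or running, which matches the operator model above.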
Project Structure
- model.py: Enums and persistent dataclasses.
- selectors.py: Pure functions:
  - can_start
  - can_stop
  - show_fault_banner
  - execution_enabled
  - should_advance_clock
  - should_checkpoint_periodically
- controller.py: Owns state and handles events:
  - create session
  - resume session
  - start requested
  - stop requested
  - device fault
  - task abort
  - startup complete
  - shutdown complete
- scheduler.py: Generic task engine.
  - task specs
  - clock advancement
  - retry/abort/defer behavior
  - emits events upward
- persistence.py: Save/load JSON snapshots and dataclass restore.
- gui_adapter.py: Maps selectors to widgets and routes user actions to supervisor events.
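As one example of what persistence.py might contain, here is a minimal sketch of a JSON round trip for a dataclass that holds an enum. The `Snapshot` class and the `dump_snapshot`/`load_snapshot` names are hypothetical; storing the enum by member name is a choice made here so that `auto()` values can change without breaking saved files.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from enum import Enum, auto


class RuntimePhase(Enum):
    IDLE = auto()
    RUNNING = auto()
    PAUSED = auto()


@dataclass
class Snapshot:
    runtime_phase: RuntimePhase = RuntimePhase.IDLE
    test_state: dict = field(default_factory=dict)


def dump_snapshot(snapshot: Snapshot) -> str:
    data = asdict(snapshot)
    # Enum members are not JSON serializable; store the member name instead.
    data["runtime_phase"] = snapshot.runtime_phase.name
    return json.dumps(data)


def load_snapshot(text: str) -> Snapshot:
    data = json.loads(text)
    return Snapshot(
        runtime_phase=RuntimePhase[data["runtime_phase"]],  # look up by name
        test_state=data.get("test_state", {}),
    )
```

A resumed session would call `load_snapshot` and hand the result to the controller, which decides what phase the restored session should actually enter.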
Examples
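# Selector: a pure function of state. The meter may be polled only while
# it is connected, no e-stop is active, and a run or calibration is in progress.
def can_poll_meter(s: AppState) -> bool:
    return (
        s.meter_connected
        and not s.estop
        and (
            s.run_phase is RunPhase.RUNNING
            or s.calibration_active
        )
    )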

# Events, phases, and the state object
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class Event(Enum):
    CREATE_NEW_SESSION = auto()
    LOAD_EXISTING_SESSION = auto()
    START_REQUESTED = auto()
    STARTUP_SUCCEEDED = auto()
    STARTUP_FAILED = auto()
    STOP_REQUESTED = auto()
    DEVICE_FAULTED = auto()
    TASK_ABORTED = auto()
    SHUTDOWN_COMPLETE = auto()
    TEST_COMPLETED = auto()


class SessionOrigin(Enum):
    NEW = auto()
    RESUMED = auto()


class RuntimePhase(Enum):
    IDLE = auto()       # GUI open, session prepared, waiting for start
    STARTING = auto()   # connecting/init/orchestration in progress
    RUNNING = auto()    # clocks advancing
    STOPPING = auto()   # orderly shutdown in progress
    PAUSED = auto()     # stopped but resumable
    FAULTED = auto()    # stopped due to fault; may or may not be resumable
    COMPLETED = auto()  # terminal successful end
    ABORTED = auto()    # terminal failed/unrecoverable end


class StopReason(Enum):
    NONE = auto()
    OPERATOR = auto()
    DEVICE_FAULT = auto()
    TASK_ABORT = auto()
    STARTUP_FAILURE = auto()
    COMPLETION = auto()
    INTERNAL_ERROR = auto()


class InfraPhase(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()
    INITIALIZING = auto()
    READY = auto()
    SHUTTING_DOWN = auto()
    CLOSED = auto()


@dataclass
class FaultInfo:
    code: str
    message: str
    resumable: bool = True
    source: Optional[str] = None


@dataclass
class AppState:
    session_origin: Optional[SessionOrigin] = None
    runtime_phase: RuntimePhase = RuntimePhase.IDLE
    infra_phase: InfraPhase = InfraPhase.DISCONNECTED
    stop_reason: StopReason = StopReason.NONE
    active_fault: Optional[FaultInfo] = None

    # persisted test state
    test_state: dict = field(default_factory=dict)
    config: dict = field(default_factory=dict)

    # bookkeeping
    checkpoint_dirty: bool = False
    shutdown_requested: bool = False