Designing State

I was recently developing an application to conduct a specific long-running hardware test. The application worked, but it reached a point where the code had a very definite "smell": changes kept introducing bugs, and there was too much surface area.

Here are some loose thoughts on how to approach the problem.

Try to divide the system into these four layers:

Raw State (truths of the system)
- is_display_on
- is_faulted
Derived capabilities, conditions, GUI selectors, Clock/Task Guards (not stored)
- can_start
- can_stop
- can_poll_meter
Actions/Tasks (requires...)
UI Actions (GUI selectors)
- start button enabled (can_start)
- stop button enabled (can_stop)
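
A minimal sketch of how the layers might fit together, assuming a small frozen dataclass for raw state. The names RawState, is_running, and button_states are hypothetical and only here for illustration; the actions/tasks layer is left out to keep it short.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RawState:
    # Layer 1: stored facts only.
    is_display_on: bool = False
    is_faulted: bool = False
    is_running: bool = False  # hypothetical extra fact, for illustration


# Layer 2: derived capabilities, recomputed on demand and never stored.
def can_start(s: RawState) -> bool:
    return not s.is_running and not s.is_faulted


def can_stop(s: RawState) -> bool:
    return s.is_running


# Layer 4: the GUI asks selectors what to enable and holds no rules itself.
def button_states(s: RawState) -> dict[str, bool]:
    return {"start": can_start(s), "stop": can_stop(s)}
```

Because the capabilities are plain functions of state, the rule for "start button enabled" lives in exactly one place and can be tested without a GUI.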

Other Ideas

Consider adding a "guard" callable to each task's execution condition. The guard prevents the task from entering is_due while the guard is false.
Try to break the system down into a "store", or single state object.
Limit state changes to named events.
Selectors: pure functions that derive conditions from state (literally the state object).
Effects: code that talks to hardware, threads, files, etc.
Views: GUI elements that are driven by (or reflect) selectors and dispatch events.
Use "transition states" if operations are not instantaneous (starting, stopping, initializing...).
It should be impossible to represent an invalid state.
Handlers should need only minimal defensive checks.
Separate phase, cause, and capabilities.
Supervisors and controllers are good things. Make sure you have one!
Don't create more runtime phases/states to represent why the system stopped or faulted. Instead, change the phase/state and provide a structured reason.
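
Here is a rough sketch of the store, named-event, and transition-state ideas together. The class and function names (Phase, State, Store, reduce_state) are my own placeholders, and the transition rules are assumptions rather than the rules of any real system.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    STARTING = auto()   # transition state: startup is not instantaneous
    RUNNING = auto()
    STOPPING = auto()   # transition state: shutdown is not instantaneous


class Event(Enum):
    START_REQUESTED = auto()
    STARTUP_SUCCEEDED = auto()
    STOP_REQUESTED = auto()
    SHUTDOWN_COMPLETE = auto()


@dataclass(frozen=True)
class State:
    phase: Phase = Phase.IDLE


def reduce_state(state: State, event: Event) -> State:
    """Return the next state; events that do not apply leave state unchanged."""
    p = state.phase
    if event is Event.START_REQUESTED and p is Phase.IDLE:
        return replace(state, phase=Phase.STARTING)
    if event is Event.STARTUP_SUCCEEDED and p is Phase.STARTING:
        return replace(state, phase=Phase.RUNNING)
    if event is Event.STOP_REQUESTED and p in (Phase.STARTING, Phase.RUNNING):
        return replace(state, phase=Phase.STOPPING)
    if event is Event.SHUTDOWN_COMPLETE and p is Phase.STOPPING:
        return replace(state, phase=Phase.IDLE)
    return state


class Store:
    """Single owner of state; dispatch() is the only way to change it."""

    def __init__(self) -> None:
        self.state = State()

    def dispatch(self, event: Event) -> None:
        self.state = reduce_state(self.state, event)
```

With every change funneled through dispatch(), a stray STOP_REQUESTED while idle is simply ignored in one place, so individual handlers do not each need their own defensive check.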

Define Lifecycle

Session lifecycle (persistence and restoring state)
Runtime lifecycle (idle, initializing, running, stopping, paused)
Infrastructure/device lifecycle (connecting, connected, initializing, initialized, operational, shutting down, closed)
Execution lifecycle (clock advancement, advancing, halted)
Fault interruption (operator, device, task, unrecoverable/recoverable)

Refactor Guide

Inventory the conditions of the system
- concern (thing being controlled), rule, rule location, future rule location, raw state or derived state
Reduce the mode count, maybe by splitting modes into axes
Centralize guards/selectors/conditions
- centralize first before rewriting anything else
Make tasks declare their conditions (e.g., name, interval, guard, handler, other)
Make the GUI bind to selectors
Define transition events

For each thing, identify if it is a fact, a command, a capability or a policy
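
As a sketch of what "tasks declare their conditions" could look like, here is a task spec with a name, interval, guard, and handler. The field names interval_s and last_run_s and the helper run_due_tasks are hypothetical, and time is passed in explicitly so the example does not depend on a real clock.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskSpec:
    name: str
    interval_s: float                          # seconds between runs
    handler: Callable[[], None]                # the work itself (an effect)
    guard: Callable[[], bool] = lambda: True   # condition that must hold
    last_run_s: float = float("-inf")          # sentinel: never run yet

    def is_due(self, now_s: float) -> bool:
        # A false guard keeps the task out of is_due entirely.
        return self.guard() and (now_s - self.last_run_s) >= self.interval_s


def run_due_tasks(tasks: list[TaskSpec], now_s: float) -> list[str]:
    """Run every due task once and return the names that ran."""
    ran = []
    for task in tasks:
        if task.is_due(now_s):
            task.handler()
            task.last_run_s = now_s
            ran.append(task.name)
    return ran
```

In practice the guard would usually be a selector bound to the state object (something like can_poll_meter), which keeps the rule out of the handler body.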

Mental Models

Operator Model
- Can start/resume when stopped
- Can stop while starting/running
- Sees a clear reason if stopped due to fault
Internal Model
- Idle
- Starting
- Running
- Stopping
- Paused
- Faulted
- Completed/Aborted
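
One way to keep the internal model honest is to write down which phase changes are legal and reject everything else. The table below is only my guess at reasonable transitions, and the names ALLOWED and transition are hypothetical; the real rules depend on the system.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    STARTING = auto()
    RUNNING = auto()
    STOPPING = auto()
    PAUSED = auto()
    FAULTED = auto()
    COMPLETED = auto()
    ABORTED = auto()


# Assumed transition table; adjust to the real system's rules.
ALLOWED = {
    Phase.IDLE: {Phase.STARTING},
    Phase.STARTING: {Phase.RUNNING, Phase.STOPPING, Phase.FAULTED},
    Phase.RUNNING: {Phase.STOPPING, Phase.FAULTED, Phase.COMPLETED},
    Phase.STOPPING: {Phase.PAUSED, Phase.FAULTED},
    Phase.PAUSED: {Phase.STARTING},
    Phase.FAULTED: {Phase.STARTING, Phase.ABORTED},
    Phase.COMPLETED: set(),   # terminal
    Phase.ABORTED: set(),     # terminal
}


def transition(current: Phase, target: Phase) -> Phase:
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table as data makes it easy to review against the operator model: "can stop while starting/running" corresponds to STOPPING appearing in exactly those two rows.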

Project Structure

model.py
- Enums and persistent dataclasses.
selectors.py contains pure functions:
- can_start
- can_stop
- show_fault_banner
- execution_enabled
- should_advance_clock
- should_checkpoint_periodically
controller.py owns state and handles events:
- create session
- resume session
- start requested
- stop requested
- device fault
- task abort
- startup complete
- shutdown complete
scheduler.py is a generic task engine:
- task specs
- clock advancement
- retry/abort/defer behavior
- emits events upward
persistence.py saves/loads JSON snapshots and restores dataclasses.
gui_adapter.py maps selectors to widgets and routes user actions to supervisor events.
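
For the persistence piece, a minimal sketch of a JSON snapshot round trip, assuming enums are stored by member name so that auto() values never end up on disk. Snapshot, dumps, and loads are hypothetical names, and the two-field dataclass stands in for whatever the real persisted state is.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum, auto


class RuntimePhase(Enum):
    IDLE = auto()
    PAUSED = auto()


@dataclass
class Snapshot:
    runtime_phase: RuntimePhase = RuntimePhase.IDLE
    test_state: dict = field(default_factory=dict)


def dumps(snap: Snapshot) -> str:
    data = asdict(snap)
    # Enums are not JSON-serializable, so store the member name instead.
    data["runtime_phase"] = snap.runtime_phase.name
    return json.dumps(data)


def loads(text: str) -> Snapshot:
    data = json.loads(text)
    return Snapshot(
        runtime_phase=RuntimePhase[data["runtime_phase"]],  # lookup by name
        test_state=data["test_state"],
    )
```

Storing names rather than numeric values means reordering the enum members later should not silently change what an old snapshot restores to.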

Examples

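# Selector: a pure function of state. meter_connected, estop, and
# calibration_active are raw-state fields assumed to exist on AppState.
def can_poll_meter(s: AppState) -> bool:
    return (
        s.meter_connected
        and not s.estop
        and (
            s.runtime_phase is RuntimePhase.RUNNING
            or s.calibration_active
        )
    )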


from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Event(Enum):
    CREATE_NEW_SESSION = auto()
    LOAD_EXISTING_SESSION = auto()
    START_REQUESTED = auto()
    STARTUP_SUCCEEDED = auto()
    STARTUP_FAILED = auto()
    STOP_REQUESTED = auto()
    DEVICE_FAULTED = auto()
    TASK_ABORTED = auto()
    SHUTDOWN_COMPLETE = auto()
    TEST_COMPLETED = auto()

class SessionOrigin(Enum):
    NEW = auto()
    RESUMED = auto()


class RuntimePhase(Enum):
    IDLE = auto()          # GUI open, session prepared, waiting for start
    STARTING = auto()      # connecting/init/orchestration in progress
    RUNNING = auto()       # clocks advancing
    STOPPING = auto()      # orderly shutdown in progress
    PAUSED = auto()        # stopped but resumable
    FAULTED = auto()       # stopped due to fault; may or may not be resumable
    COMPLETED = auto()     # terminal successful end
    ABORTED = auto()       # terminal failed/unrecoverable end


class StopReason(Enum):
    NONE = auto()
    OPERATOR = auto()
    DEVICE_FAULT = auto()
    TASK_ABORT = auto()
    STARTUP_FAILURE = auto()
    COMPLETION = auto()
    INTERNAL_ERROR = auto()


class InfraPhase(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()
    INITIALIZING = auto()
    READY = auto()
    SHUTTING_DOWN = auto()
    CLOSED = auto()


@dataclass
class FaultInfo:
    code: str
    message: str
    resumable: bool = True
    source: Optional[str] = None


@dataclass
class AppState:
    session_origin: Optional[SessionOrigin] = None
    runtime_phase: RuntimePhase = RuntimePhase.IDLE
    infra_phase: InfraPhase = InfraPhase.DISCONNECTED
    stop_reason: StopReason = StopReason.NONE
    active_fault: Optional[FaultInfo] = None

    # persisted test state
    test_state: dict = field(default_factory=dict)
    config: dict = field(default_factory=dict)

    # bookkeeping
    checkpoint_dirty: bool = False
    shutdown_requested: bool = False