# Architecture

`claude-failover` is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared `State` struct guarded by a
`sync.RWMutex`. A single-writer discipline is enforced: only the
**account-switcher** may mutate the active-account field.

> This document enriches the high-level design with concrete Go
> implementation notes distilled from the current shell-based
> `agent-orchestrator` (see `graceful-switch.sh`, `launch-agent.sh`,
> `setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`,
> `checkpoint-daemon.sh`) and the companion TUI written in Go
> (`agent-orchestrator/tui/*.go`). The daemon is the consolidation of
> those scripts into a single, testable process.

---

## 1. Process model

- Single static binary: `claude-failover` (package main in
  `cmd/claude-failover`).
- One OS process, many goroutines. No child daemons.
- Runs under systemd (`User=ubuntu`, `Restart=always`).
- Persists state at `/var/lib/claude-failover/state.json` (atomic write).
- Listens locally on `127.0.0.1:8765` for MCP HTTP control.
- Single YAML config (`/etc/claude-failover/config.yaml`), see
  [`config.example.yaml`](../config.example.yaml).

---

## 2. Goroutines

Each goroutine is started from `main` with a `context.Context` derived from
the root context. All goroutines obey cancellation and drain their pending
work before returning.

### 2.1 dispatcher

- Watches every registered project's `.agent-queue/inbox/` directory via
  [`fsnotify`](https://pkg.go.dev/github.com/fsnotify/fsnotify).
- For each new `*.md` (ignoring `*.dispatched` and `*.dispatch-meta`),
  parses YAML frontmatter (title, priority, tags, `needs_claude_code`,
  `capabilities`).
- Selects an idle session from the pool based on:
  - per-project priority,
  - agent capability tags,
  - `needs_claude_code: true` bypass (direct to Claude Code, skipping the
    GPU triage path inherited from the current dispatcher),
  - dispatcher-level cooldowns to avoid flooding a freshly-launched session.
- On successful assignment, renames `.md` to `.md.dispatched` atomically and
  writes a pointer into the target session's tmux prompt via
  `TmuxAdapter.SendKeys`.
- Emits a `TaskDispatched` event on the notifier channel.

Migration note: Phase 1 wraps the existing `smart-dispatcher.sh` behind
`exec.Command` and only takes over the fsnotify layer; Phase 2 re-implements
dispatch selection natively in Go.

### 2.2 quota-monitor

- Polls every account's usage counters on a configurable interval
  (`quota.poll_interval`, default `30s`).
- Two data sources, merged:
  1. **Tmux pane scrape** - runs `tmux capture-pane -p -t ` and parses the
     "Claude usage" footer lines emitted by Claude Code. Implementation
     reuses the tmux wrapper from
     [`claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux).
  2. **Local telemetry** - scans `~/.claude/statsig/` and
     `~/.claude/projects/*.jsonl` for message timestamps, computes sliding
     windows (5h, 1 week).
- Also reads `SessionStart` / `SessionEnd` / `PreToolUse` hook events
  written by Claude Code into `~/.claude/hooks-audit.log` (opt-in via
  `settings.json`) for precise counters.
- Emits `QuotaWarning` and `SwapRequested` events when thresholds in the
  config are crossed.

### 2.3 session-watcher

- Maintains an in-memory table keyed by tmux session name (`ccl-*`).
- For each session tracks:
  - process liveness via `tmux has-session -t ` (exit code),
  - heartbeat timestamp parsed from the project's
    `.agent-queue/status.json`,
  - current `state` field (`idle` / `working` / `shell`),
  - last assigned task id and deadline.
- Sessions whose heartbeat is older than `watchdog.stall_after` while
  `state == "working"` raise a `SessionStalled` alert and become candidates
  for a forced restart (via `account-switcher` or `launch-agent` re-run).
- Replaces `watchdog.sh` entirely once Phase 2 ships.

### 2.4 checkpoint

- On `checkpoint.interval` (default `30m`, i.e. 10 checkpoints per Anthropic
  5h window), serializes per-session context:
  - current task id,
  - last recorded tool call (name + truncated args),
  - cwd as reported by the session,
  - the last N lines of the session's scrollback
    (`tmux capture-pane -pS -`).
- Persists to SQLite at `checkpoint.db` (default
  `/var/lib/claude-failover/checkpoints.db`) with retention pruning based on
  the 30-minute rolling window.
- Schema: `checkpoints(id INTEGER PK, session TEXT, ts INTEGER, task_id TEXT,
  cwd TEXT, tool_calls JSON, scrollback TEXT)`.
- Indexed on `(session, ts DESC)`; old rows pruned by `janitor`.
- Also exports a JSON snapshot per session to `checkpoint.dir//.json` for
  out-of-band tooling.

### 2.5 janitor

Periodic housekeeping (runs every `janitor.interval`, default `5m`):

- removes stale `.md.dispatched` markers whose source task is gone,
- archives `done/` entries older than `janitor.done_horizon`,
- prunes checkpoint rows beyond `checkpoint.keep`,
- rotates the daemon's own log file when it exceeds `log.max_size_mb`.

### 2.6 notifier

- Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
  `QuotaWarning`, `TaskDispatched`) to configured sinks:
  - Telegram bot (alerts channel),
  - MCP control-plane push (HTTP callback),
  - structured log aggregator (stdout JSON).
- Each sink is a goroutine with its own bounded buffered channel.
  Back-pressure is handled by dropping low-priority events
  (`TaskDispatched`) and logging a `NotifierSaturated` warning.

### 2.7 account-switcher (state machine)

The switcher is the only goroutine allowed to mutate `State.ActiveAccount`.
It consumes `SwapRequested` events from a single-consumer channel and
transitions through an atomic state machine:

```
+--------+  SwapRequested   +--------+
| normal | ---------------> | saving |
+--------+                  +--------+
    ^                           |
    |                           v  (checkpoint all sessions)
    |                     +-----------+
    |                     | switching |
    |                     +-----------+
    |                           |
    |                           v  (relaunch tmux, flip HOME)
    |   ResumeComplete      +----------+
    +---------------------- | resuming |
                            +----------+
```

- `normal -> saving`: halt dispatcher intake, request an immediate
  checkpoint for every session, wait for ack with a deadline.
- `saving -> switching`: kill tmux sessions in reverse launch order, repoint
  the `~/.claude` symlink (or per-session `HOME`) to the target account's
  home directory.
- `switching -> resuming`: relaunch sessions via `TmuxAdapter.NewSession`;
  each session replays its most recent checkpoint (task pointer + cwd +
  truncated scrollback) so Claude Code reopens at the same task.
- `resuming -> normal`: unblock dispatcher, start cooldown timer on the old
  account, emit `SwapFired`.

Guarantees:

- Only one swap in flight at a time (the channel has capacity 1, extra
  requests coalesce).
- Swap is cancellable via `context.Context`; on cancel the switcher rolls
  back to the pre-swap `ActiveAccount`.
- All FS writes go through `FSAdapter` and are `*.tmp + rename` to keep
  `state.json` recoverable.

See [`session-switch-analysis.md`](./session-switch-analysis.md) for why the
shared-symlink approach is required (Claude Code bug #16103) and how
resuming interactive sessions differs from resuming autonomous ones.

---

## 3. Shared state

```go
type State struct {
	mu sync.RWMutex

	ActiveAccount string
	Accounts      map[string]*AccountState
	Sessions      map[string]*SessionState
	Pool          PoolState
	Dispatcher    DispatcherState
	LastSwap      time.Time
	PendingSwap   bool
	StartedAt     time.Time
}

type AccountState struct {
	Name        string
	Home        string
	Priority    int
	Usage5h     float64 // 0.0 - 1.0
	UsageWeek   float64
	CooldownEnd time.Time
	LastSwapIn  time.Time
	LastSwapOut time.Time
}

type SessionState struct {
	Name       string
	Kind       string // "autonomous" | "dedicated" | "interactive" | "orchestrator"
	Project    string
	State      string // "idle" | "working" | "shell" | "stalled"
	Task       string
	AssignedAt time.Time
	Heartbeat  time.Time
}
```

- Readers take `RLock`; the account-switcher takes `Lock` for the duration
  of a swap.
- All other writers enqueue on a single-writer channel
  (`stateOps chan func(*State)`) consumed by a dedicated goroutine. This
  mirrors the actor pattern and eliminates lock-ordering issues.
- Every mutation triggers a debounced flush of `state.json`
  (`persist.debounce`, default `1s`) written atomically via `FSAdapter`.

---

## 4. Event system

Internal events are plain Go structs on typed channels. No reflection.

| Channel           | Producer(s)           | Consumer          | Buffer |
|-------------------|-----------------------|-------------------|--------|
| `swapReq`         | quota-monitor, HTTP   | account-switcher  | 1      |
| `sessionEvt`      | session-watcher       | notifier, state   | 32     |
| `taskEvt`         | dispatcher            | notifier, state   | 64     |
| `checkpointEvt`   | checkpoint            | state             | 8      |
| `notifierSink[i]` | notifier              | sink i            | 128    |

Flow for a typical swap:

```
quota-monitor --SwapRequested--> swapReq --> account-switcher
      |                                            |
      |                                            +--> state (lock)
      |                                            +--> TmuxAdapter (kill/launch)
      |                                            +--> FSAdapter (symlink flip)
      |                                            +--> sessionEvt (SwapFired)
      v
  notifier --> telegram/mcp/log
```

---

## 5. Adapters (mockable interfaces)

### Tmux

Thin wrapper around `exec.Command("tmux", ...)`, mockable via interface:

```go
type TmuxAdapter interface {
	HasSession(name string) (bool, error)
	NewSession(name, cwd string, env []string) error
	KillSession(name string) error
	SendKeys(name string, keys string) error
	CapturePane(name string, lines int) (string, error)
	ListSessions() ([]string, error)
}
```

- Default implementation: `ExecTmux` shelling out to the `tmux` binary.
- Test implementation: `FakeTmux` backed by an in-memory map.
- Inspired by
  [`smtg-ai/claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux),
  which already battle-tests pane capture and send-keys escaping.

### Filesystem & Claude CLI

```go
type FSAdapter interface {
	ReadFile(path string) ([]byte, error)
	WriteFileAtomic(path string, data []byte, perm os.FileMode) error
	Rename(oldp, newp string) error
	Stat(path string) (os.FileInfo, error)
	Watch(path string) (<-chan fsnotify.Event, error)
}

type ClaudeCLIAdapter interface {
	ResumeSession(sessionID, cwd string) error
	InjectPrompt(sessionID, prompt string) error
	Quota() (Usage, error)
}
```

All three interfaces have `Fake*` implementations under `internal/testutil/`
so every goroutine can be unit-tested without a real tmux, filesystem, or
Claude CLI.

---

## 6. HTTP control plane (MCP)

Bound to `127.0.0.1:8765` (config `mcp_http.listen`). Never exposed
publicly - the SecuAAS MCP gateway proxies from the host.

| Method | Path                          | Purpose                          |
|--------|-------------------------------|----------------------------------|
| GET    | `/api/v1/session_status`      | Full state snapshot              |
| GET    | `/api/v1/accounts`            | Account usage + limits           |
| GET    | `/api/v1/sessions`            | Session table                    |
| POST   | `/api/v1/trigger_dispatch`    | Force inbox scan                 |
| POST   | `/api/v1/trigger/swap`        | Force failover (bearer required) |
| GET    | `/healthz`                    | Liveness probe                   |

- All mutating routes require the bearer token from
  `mcp_http.bearer_token_env`.
- Read routes are unauthenticated but only reachable on loopback.
- Handlers are thin: they publish on the same internal channels used by
  goroutines, then wait on a reply channel with a timeout.

---

## 7. Graceful shutdown

- `main` installs a `signal.Notify` for `SIGTERM`, `SIGINT`, `SIGHUP`.
- On `SIGTERM` or `SIGINT`, the root `context.Context` is cancelled
  (`SIGHUP` only triggers the config reload described in section 8).
- Goroutines drain in a defined order:
  1. HTTP server (`Shutdown(ctx)` with 5s deadline),
  2. dispatcher (stops accepting new tasks, finishes in-flight assignment),
  3. quota-monitor (cancels poll),
  4. session-watcher (flushes last state),
  5. checkpoint (forces one final checkpoint for every working session),
  6. account-switcher (refuses new swaps, completes an in-flight swap up to
     `shutdown.swap_deadline`),
  7. notifier (flushes bounded buffers),
  8. janitor (exits immediately).
- `state.json` is flushed one last time before exit.
- Exit code `0` on clean drain, `1` on deadline timeout (systemd will
  restart).

---

## 8. Configuration

Single YAML file. See [`config.example.yaml`](../config.example.yaml) for
the authoritative schema. Key sections:

- `accounts[]` - ordered list of Anthropic accounts (name, home, limits,
  priority).
- `pool` - dedicated sessions, autonomous pool prefix and min/max, shared
  projects dir.
- `quota` - poll interval, 5h/weekly thresholds.
- `checkpoint` - interval, dir, db, keep count.
- `janitor` - interval, horizons.
- `mcp_http` - listen addr, bearer token env var.
- `notifier` - telegram, mcp callback, log sinks.

The config is loaded once at startup. `SIGHUP` triggers a re-parse; only
safe fields (thresholds, intervals, notifier sinks) are hot-reloaded.
Structural changes (accounts, pool) require a restart.

---

## 9. Migration plan

The existing orchestrator is a set of bash scripts under
`agent-orchestrator/` (`graceful-switch.sh`, `launch-agent.sh`,
`setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`,
`checkpoint-daemon.sh`, `smart-dispatcher.sh`). Cutover is staged to
minimize risk.
### Phase 1 - Daemon wrapper (weeks 1-2)

- Build `claude-failover` as a supervisor that `exec.Command`s the existing
  `.sh` scripts.
- Goroutines are skeletons that call scripts:
  - dispatcher -> `smart-dispatcher.sh`,
  - account-switcher -> `graceful-switch.sh `,
  - session-watcher -> parses `watchdog.sh --json` output,
  - checkpoint -> calls `checkpoint-daemon.sh --once`.
- Shared state, event system, HTTP control plane, and config are already
  native Go.
- Systemd unit replaces the existing per-script cron/tmux launchers.

### Phase 2 - Native rewrite (weeks 3-6)

- Replace each shell wrapper with a native Go implementation one goroutine
  at a time, in this order: session-watcher -> checkpoint -> dispatcher ->
  account-switcher.
- Shell scripts remain as a fallback until each native path has run a full
  week without regression.

### Phase 3 - Removal (week 7+)

- Delete `agent-orchestrator/*.sh` once Phase 2 is green.
- Keep the TUI (`agent-orchestrator/tui/`) as a read-only client of the
  daemon's HTTP API.

---

## 10. Testing strategy

- Adapters (`TmuxAdapter`, `FSAdapter`, `ClaudeCLIAdapter`) are interfaces
  with in-memory fakes under `internal/testutil/`.
- Unit tests cover every goroutine in isolation:
  - dispatcher: feed fake fsnotify events, assert tmux `SendKeys` calls.
  - quota-monitor: feed canned pane captures, assert `SwapRequested`.
  - account-switcher: drive the state machine through a full swap,
    including cancellation and rollback.
- Integration tests boot the full daemon with fake adapters and an
  `httptest.Server` on the MCP port. A golden-file test compares
  `state.json` after a scripted sequence.
- Race detector mandatory in CI (`go test -race ./...`).
- An `e2e/` suite runs against a real tmux + real Claude Code CLI in a
  container, executed nightly.

---

## 11. Systemd unit

```ini
# /etc/systemd/system/claude-failover.service
[Unit]
Description=Claude Code failover & orchestration daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu
Environment=CLAUDE_FAILOVER_CONFIG=/etc/claude-failover/config.yaml
ExecStart=/usr/local/bin/claude-failover --config ${CLAUDE_FAILOVER_CONFIG}
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
TimeoutStopSec=30
LimitNOFILE=65536

# State & logs
StateDirectory=claude-failover
LogsDirectory=claude-failover

# Hardening (tmux needs TTY access, keep minimal)
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/claude-failover /var/log/claude-failover /home/ubuntu/.claude /home/ubuntu/projects

[Install]
WantedBy=multi-user.target
```

`Type=notify` requires the daemon to call `sd_notify(READY=1)` once all
goroutines have registered and the HTTP server is listening. Use
[`github.com/coreos/go-systemd/v22/daemon`](https://pkg.go.dev/github.com/coreos/go-systemd/v22/daemon).