claude-failover/docs/architecture.md

17 KiB

Architecture

claude-failover is a single Go binary structured as a set of cooperating goroutines. Each goroutine owns a narrow responsibility and communicates through typed channels and a shared State struct guarded by a sync.RWMutex. A single-writer discipline is enforced: only the account-switcher may mutate the active-account field.

This document enriches the high-level design with concrete Go implementation notes distilled from the current shell-based agent-orchestrator (see graceful-switch.sh, launch-agent.sh, setup-tmux.sh, start-dedicated-agents.sh, watchdog.sh, checkpoint-daemon.sh) and the companion TUI written in Go (agent-orchestrator/tui/*.go). The daemon is the consolidation of those scripts into a single, testable process.


1. Process model

  • Single static binary: claude-failover (package main in cmd/claude-failover).
  • One OS process, many goroutines. No child daemons.
  • Runs under systemd (User=ubuntu, Restart=always).
  • Persists state at /var/lib/claude-failover/state.json (atomic write).
  • Listens locally on 127.0.0.1:8765 for MCP HTTP control.
  • Single YAML config (/etc/claude-failover/config.yaml), see config.example.yaml.

2. Goroutines

Each goroutine is started from main with a context.Context derived from the root context. All goroutines obey cancellation and drain their pending work before returning.

2.1 dispatcher

  • Watches every registered project's .agent-queue/inbox/ directory via fsnotify.
  • For each new *.md (ignoring *.dispatched and *.dispatch-meta), parses YAML frontmatter (title, priority, tags, needs_claude_code, capabilities).
  • Selects an idle session from the pool based on:
    • per-project priority,
    • agent capability tags,
    • needs_claude_code: true bypass (direct to Claude Code, skipping the GPU triage path inherited from the current dispatcher),
    • dispatcher-level cooldowns to avoid flooding a freshly-launched session.
  • On successful assignment, renames <task>.md to <task>.md.dispatched atomically and writes a pointer into the target session's tmux prompt via TmuxAdapter.SendKeys.
  • Emits a TaskDispatched event on the notifier channel.

Migration note: Phase 1 wraps the existing smart-dispatcher.sh behind exec.Command and only takes over the fsnotify layer; Phase 2 re-implements dispatch selection natively in Go.

2.2 quota-monitor

  • Polls every account's usage counters on a configurable interval (quota.poll_interval, default 30s).
  • Two data sources, merged:
    1. Tmux pane scrape - runs tmux capture-pane -p -t <session> and parses the "Claude usage" footer lines emitted by Claude Code. Implementation reuses the tmux wrapper from claude-squad/session/tmux.
    2. Local telemetry - scans ~/.claude/statsig/ and ~/.claude/projects/*.jsonl for message timestamps, computes sliding windows (5h, 1 week).
  • Also reads SessionStart / SessionEnd / PreToolUse hook events written by Claude Code into ~/.claude/hooks-audit.log (opt-in via settings.json) for precise counters.
  • Emits QuotaWarning and SwapRequested events when thresholds in the config are crossed.

2.3 session-watcher

  • Maintains an in-memory table keyed by tmux session name (ccl-*).
  • For each session tracks:
    • process liveness via tmux has-session -t <name> (exit code),
    • heartbeat timestamp parsed from the project's .agent-queue/status.json,
    • current state field (idle / working / shell),
    • last assigned task id and deadline.
  • Sessions whose heartbeat is older than watchdog.stall_after while state == "working" raise a SessionStalled alert and become candidates for a forced restart (via account-switcher or launch-agent re-run).
  • Replaces watchdog.sh entirely once Phase 2 ships.

2.4 checkpoint

  • On checkpoint.interval (default 30m, a.k.a. the Anthropic 5h window / 10 checkpoints), serializes per-session context:
    • current task id,
    • last recorded tool call (name + truncated args),
    • cwd as reported by the session,
    • the last N lines of the session's scrollback (tmux capture-pane -pS -<N>).
  • Persists to SQLite at checkpoint.db (default /var/lib/claude-failover/checkpoints.db) with retention pruning based on the 30-minute rolling window.
    • Schema: checkpoints(id INTEGER PK, session TEXT, ts INTEGER, task_id TEXT, cwd TEXT, tool_calls JSON, scrollback TEXT).
    • Indexed on (session, ts DESC); old rows pruned by janitor.
  • Also exports a JSON snapshot per session to checkpoint.dir/<session>/<timestamp>.json for out-of-band tooling.

2.5 janitor

Periodic housekeeping (runs every janitor.interval, default 5m):

  • removes stale .md.dispatched markers whose source task is gone,
  • archives done/ entries older than janitor.done_horizon,
  • prunes checkpoint rows beyond checkpoint.keep,
  • rotates the daemon's own log file when it exceeds log.max_size_mb.

2.6 notifier

  • Fan-out of typed events (SwapFired, SessionStalled, TaskFailed, QuotaWarning, TaskDispatched) to configured sinks:
    • Telegram bot (alerts channel),
    • MCP control-plane push (HTTP callback),
    • structured log aggregator (stdout JSON).
  • Each sink is a goroutine with its own bounded buffered channel. Back-pressure is handled by dropping low-priority events (TaskDispatched) and logging a NotifierSaturated warning.

2.7 account-switcher (state machine)

The switcher is the only goroutine allowed to mutate State.ActiveAccount. It consumes SwapRequested events from a single-consumer channel and transitions through an atomic state machine:

          +--------+  SwapRequested   +--------+
          | normal | ---------------> | saving |
          +--------+                  +--------+
              ^                           |
              |                           v  (checkpoint all sessions)
              |                       +-----------+
              |                       | switching |
              |                       +-----------+
              |                           |
              |                           v  (relaunch tmux, flip HOME)
              |    ResumeComplete     +----------+
              +---------------------- | resuming |
                                      +----------+
  • normal -> saving: halt dispatcher intake, request an immediate checkpoint for every session, wait for ack with a deadline.
  • saving -> switching: kill tmux sessions in reverse launch order, repoint the ~/.claude symlink (or per-session HOME) to the target account's home directory.
  • switching -> resuming: relaunch sessions via TmuxAdapter.NewSession; each session replays its most recent checkpoint (task pointer + cwd + truncated scrollback) so Claude Code reopens at the same task.
  • resuming -> normal: unblock dispatcher, start cooldown timer on the old account, emit SwapFired.

Guarantees:

  • Only one swap in flight at a time (the channel has capacity 1, extra requests coalesce).
  • Swap is cancellable via context.Context; on cancel the switcher rolls back to the pre-swap ActiveAccount.
  • All FS writes go through FSAdapter and are *.tmp + rename to keep state.json recoverable.

See session-switch-analysis.md for why the shared-symlink approach is required (Claude Code bug #16103) and how resuming interactive sessions differs from resuming autonomous ones.


3. Shared state

type State struct {
    mu            sync.RWMutex
    ActiveAccount string
    Accounts      map[string]*AccountState
    Sessions      map[string]*SessionState
    Pool          PoolState
    Dispatcher    DispatcherState
    LastSwap      time.Time
    PendingSwap   bool
    StartedAt     time.Time
}

type AccountState struct {
    Name         string
    Home         string
    Priority     int
    Usage5h      float64 // 0.0 - 1.0
    UsageWeek    float64
    CooldownEnd  time.Time
    LastSwapIn   time.Time
    LastSwapOut  time.Time
}

type SessionState struct {
    Name       string
    Kind       string // "autonomous" | "dedicated" | "interactive" | "orchestrator"
    Project    string
    State      string // "idle" | "working" | "shell" | "stalled"
    Task       string
    AssignedAt time.Time
    Heartbeat  time.Time
}
  • Readers take RLock; the account-switcher takes Lock for the duration of a swap.
  • All other writers enqueue on a single-writer channel (stateOps chan func(*State)) consumed by a dedicated goroutine. This mirrors the actor pattern and eliminates lock-ordering issues.
  • Every mutation triggers a debounced flush of state.json (persist.debounce, default 1s) written atomically via FSAdapter.

4. Event system

Internal events are plain Go structs on typed channels. No reflection.

Channel Producer(s) Consumer Buffer
swapReq quota-monitor, HTTP account-switcher 1
sessionEvt session-watcher notifier, state 32
taskEvt dispatcher notifier, state 64
checkpointEvt checkpoint state 8
notifierSink[i] notifier sink i 128

Flow for a typical swap:

quota-monitor --SwapRequested--> swapReq --> account-switcher
       |                                            |
       |                                            +--> state (lock)
       |                                            +--> TmuxAdapter (kill/launch)
       |                                            +--> FSAdapter   (symlink flip)
       |                                            +--> sessionEvt  (SwapFired)
       v
   notifier --> telegram/mcp/log

5. Adapters (mockable interfaces)

Tmux

Thin wrapper around exec.Command("tmux", ...), mockable via interface:

type TmuxAdapter interface {
    HasSession(name string) (bool, error)
    NewSession(name, cwd string, env []string) error
    KillSession(name string) error
    SendKeys(name string, keys string) error
    CapturePane(name string, lines int) (string, error)
    ListSessions() ([]string, error)
}
  • Default implementation: ExecTmux shelling out to the tmux binary.
  • Test implementation: FakeTmux backed by an in-memory map.
  • Inspired by smtg-ai/claude-squad/session/tmux, which already battle-tests pane capture and send-keys escaping.

Filesystem & Claude CLI

type FSAdapter interface {
    ReadFile(path string) ([]byte, error)
    WriteFileAtomic(path string, data []byte, perm os.FileMode) error
    Rename(oldp, newp string) error
    Stat(path string) (os.FileInfo, error)
    Watch(path string) (<-chan fsnotify.Event, error)
}

type ClaudeCLIAdapter interface {
    ResumeSession(sessionID, cwd string) error
    InjectPrompt(sessionID, prompt string) error
    Quota() (Usage, error)
}

All three interfaces have Fake* implementations under internal/testutil/ so every goroutine can be unit-tested without a real tmux, filesystem, or Claude CLI.


6. HTTP control plane (MCP)

Bound to 127.0.0.1:8765 (config mcp_http.listen). Never exposed publicly - the SecuAAS MCP gateway proxies from the host.

Method Path Purpose
GET /api/v1/session_status Full state snapshot
GET /api/v1/accounts Account usage + limits
GET /api/v1/sessions Session table
POST /api/v1/trigger_dispatch Force inbox scan
POST /api/v1/trigger/swap Force failover (bearer required)
GET /healthz Liveness probe
  • All mutating routes require the bearer token from mcp_http.bearer_token_env.
  • Read routes are unauthenticated but only reachable on loopback.
  • Handlers are thin: they publish on the same internal channels used by goroutines, then wait on a reply channel with a timeout.

7. Graceful shutdown

  • main installs a signal.Notify for SIGTERM, SIGINT, SIGHUP.
  • On signal, the root context.Context is cancelled.
  • Goroutines drain in a defined order:
    1. HTTP server (Shutdown(ctx) with 5s deadline),
    2. dispatcher (stops accepting new tasks, finishes in-flight assignment),
    3. quota-monitor (cancels poll),
    4. session-watcher (flushes last state),
    5. checkpoint (forces one final checkpoint for every working session),
    6. account-switcher (refuses new swaps, completes an in-flight swap up to shutdown.swap_deadline),
    7. notifier (flushes bounded buffers),
    8. janitor (exits immediately).
  • state.json is flushed one last time before exit.
  • Exit code 0 on clean drain, 1 on deadline timeout (systemd will restart).

8. Configuration

Single YAML file. See config.example.yaml for the authoritative schema. Key sections:

  • accounts[] - ordered list of Anthropic accounts (name, home, limits, priority).
  • pool - dedicated sessions, autonomous pool prefix and min/max, shared projects dir.
  • quota - poll interval, 5h/weekly thresholds.
  • checkpoint - interval, dir, db, keep count.
  • janitor - interval, horizons.
  • mcp_http - listen addr, bearer token env var.
  • notifier - telegram, mcp callback, log sinks.

The config is loaded once at startup. SIGHUP triggers a re-parse; only safe fields (thresholds, intervals, notifier sinks) are hot-reloaded. Structural changes (accounts, pool) require a restart.


9. Migration plan

The existing orchestrator is a set of bash scripts under agent-orchestrator/ (graceful-switch.sh, launch-agent.sh, setup-tmux.sh, start-dedicated-agents.sh, watchdog.sh, checkpoint-daemon.sh, smart-dispatcher.sh). Cutover is staged to minimize risk.

Phase 1 - Daemon wrapper (weeks 1-2)

  • Build claude-failover as a supervisor that exec.Commands the existing .sh scripts.
  • Goroutines are skeletons that call scripts:
    • dispatcher -> smart-dispatcher.sh,
    • account-switcher -> graceful-switch.sh <target>,
    • session-watcher -> parses watchdog.sh --json output,
    • checkpoint -> calls checkpoint-daemon.sh --once.
  • Shared state, event system, HTTP control plane, and config are already native Go.
  • Systemd unit replaces the existing per-script cron/tmux launchers.

Phase 2 - Native rewrite (weeks 3-6)

  • Replace each shell wrapper with a native Go implementation one goroutine at a time, in this order: session-watcher -> checkpoint -> dispatcher -> account-switcher.
  • Shell scripts remain as a fallback until each native path has run a full week without regression.

Phase 3 - Removal (week 7+)

  • Delete agent-orchestrator/*.sh once Phase 2 is green.
  • Keep the TUI (agent-orchestrator/tui/) as a read-only client of the daemon's HTTP API.

10. Testing strategy

  • Adapters (TmuxAdapter, FSAdapter, ClaudeCLIAdapter) are interfaces with in-memory fakes under internal/testutil/.
  • Unit tests cover every goroutine in isolation:
    • dispatcher: feed fake fsnotify events, assert tmux SendKeys calls.
    • quota-monitor: feed canned pane captures, assert SwapRequested.
    • account-switcher: drive the state machine through a full swap, including cancellation and rollback.
  • Integration tests boot the full daemon with fake adapters and a httptest.Server on the MCP port. A golden-file test compares state.json after a scripted sequence.
  • Race detector mandatory in CI (go test -race ./...).
  • An e2e/ suite runs against a real tmux + real Claude Code CLI in a container, executed nightly.

11. Systemd unit

# /etc/systemd/system/claude-failover.service
[Unit]
Description=Claude Code failover & orchestration daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu
Environment=CLAUDE_FAILOVER_CONFIG=/etc/claude-failover/config.yaml
ExecStart=/usr/local/bin/claude-failover --config ${CLAUDE_FAILOVER_CONFIG}
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
TimeoutStopSec=30
LimitNOFILE=65536
# State & logs
StateDirectory=claude-failover
LogsDirectory=claude-failover
# Hardening (tmux needs TTY access, keep minimal)
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/claude-failover /var/log/claude-failover /home/ubuntu/.claude /home/ubuntu/projects

[Install]
WantedBy=multi-user.target

Type=notify requires the daemon to call sd_notify(READY=1) once all goroutines have registered and the HTTP server is listening. Use github.com/coreos/go-systemd/v22/daemon.