Architecture

claude-failover is a single Go binary structured as a set of cooperating goroutines. Each goroutine owns a narrow responsibility and communicates through typed channels and a shared State struct guarded by a sync.RWMutex. A single-writer discipline is enforced: only the account-switcher may mutate the active-account field.

Goroutines

dispatcher

Watches .agent-queue/inbox/ for every registered project (inotify on Linux) and pairs each incoming task with an idle session from the pool. It respects:

  • per-project priority
  • agent capability tags declared in the task frontmatter
  • the needs_claude_code: true bypass flag
  • dispatcher-level cooldowns to avoid flooding a freshly-launched session

On successful assignment it renames <task>.md to <task>.md.dispatched and writes a pointer into the target session's tmux prompt.
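The pairing rules above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: the `Task` and `Session` types, the `assign` and `dispatchedName` helpers, and the "higher number wins" priority ordering are all assumptions, and the `needs_claude_code` bypass and cooldowns are left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// Task and Session are hypothetical shapes; the real daemon's types may differ.
type Task struct {
	File     string   // e.g. "urgent.md"
	Priority int      // assumed: higher values are dispatched first
	Tags     []string // capability tags from the task frontmatter
}

type Session struct {
	Name string
	Idle bool
	Tags map[string]bool // capabilities this session offers
}

// canRun reports whether the session offers every tag the task asks for.
func canRun(s Session, t Task) bool {
	for _, tag := range t.Tags {
		if !s.Tags[tag] {
			return false
		}
	}
	return true
}

// assign pairs tasks with idle sessions, highest priority first.
// It returns task file -> session name; unmatched tasks are left out.
func assign(tasks []Task, sessions []Session) map[string]string {
	sorted := append([]Task(nil), tasks...) // copy so the caller's slice keeps its order
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Priority > sorted[j].Priority })
	out := make(map[string]string)
	used := make(map[string]bool)
	for _, t := range sorted {
		for _, s := range sessions {
			if s.Idle && !used[s.Name] && canRun(s, t) {
				out[t.File] = s.Name
				used[s.Name] = true
				break
			}
		}
	}
	return out
}

// dispatchedName is the marker name a task file is renamed to after assignment.
// A real dispatcher would pass both names to os.Rename.
func dispatchedName(file string) string { return file + ".dispatched" }

// Example data, invented for illustration.
var exampleTasks = []Task{
	{File: "low.md", Priority: 1, Tags: []string{"go"}},
	{File: "urgent.md", Priority: 9, Tags: []string{"go"}},
	{File: "gpu.md", Priority: 5, Tags: []string{"cuda"}},
}

var exampleSessions = []Session{
	{Name: "ccl-a", Idle: true, Tags: map[string]bool{"go": true}},
	{Name: "ccl-b", Idle: false, Tags: map[string]bool{"go": true, "cuda": true}},
}

func main() {
	for file, sess := range assign(exampleTasks, exampleSessions) {
		fmt.Printf("%s -> %s (rename to %s)\n", file, sess, dispatchedName(file))
	}
}
```

With this sample data only the highest-priority task is assigned, because the single idle session is consumed by it and the other session is busy.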

quota-monitor

Polls Anthropic usage counters for every configured account. Sources:

  1. Claude Code's local telemetry files under ~/.claude/statsig/ and ~/.claude/projects/*.jsonl (message timestamps).
  2. Optional: a reverse-engineered /api/quota endpoint if available.

It computes two sliding windows (5h, 1 week) and emits a swap-requested event once thresholds in the config are crossed.
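As a rough sketch of the windowing logic, assuming usage is approximated by counting message timestamps and that thresholds are plain message counts (the `Thresholds` type and both helper names are hypothetical, not taken from the daemon):

```go
package main

import (
	"fmt"
	"time"
)

const (
	fiveHours = 5 * time.Hour
	oneWeek   = 7 * 24 * time.Hour
)

// Thresholds is a hypothetical config shape: message counts per window.
type Thresholds struct {
	FiveHour int
	Week     int
}

// countSince counts timestamps that fall in the window (now-span, now].
func countSince(stamps []time.Time, now time.Time, span time.Duration) int {
	cutoff := now.Add(-span)
	n := 0
	for _, ts := range stamps {
		if ts.After(cutoff) && !ts.After(now) {
			n++
		}
	}
	return n
}

// swapRequested reports whether either window has reached its threshold.
func swapRequested(stamps []time.Time, now time.Time, th Thresholds) bool {
	return countSince(stamps, now, fiveHours) >= th.FiveHour ||
		countSince(stamps, now, oneWeek) >= th.Week
}

// Example data, invented for illustration.
var exampleNow = time.Date(2025, 1, 8, 12, 0, 0, 0, time.UTC)

var exampleStamps = []time.Time{
	exampleNow.Add(-1 * time.Hour),
	exampleNow.Add(-4 * time.Hour),
	exampleNow.Add(-30 * time.Hour),
	exampleNow.Add(-8 * 24 * time.Hour), // older than a week, ignored by both windows
}

func main() {
	fmt.Println("5h window:", countSince(exampleStamps, exampleNow, fiveHours))
	fmt.Println("1w window:", countSince(exampleStamps, exampleNow, oneWeek))
	fmt.Println("swap:", swapRequested(exampleStamps, exampleNow, Thresholds{FiveHour: 2, Week: 100}))
}
```

Recomputing both counts from the raw timestamps on every poll keeps the sketch stateless; a long-running monitor might instead drop expired entries incrementally.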

session-watcher

Keeps a table of tmux sessions (ccl-*). For each one it tracks:

  • process liveness (via tmux has-session)
  • heartbeat timestamp from .agent-queue/status.json
  • current state field (idle / working / stalled)

Stalled sessions (heartbeat older than N minutes while state=working) raise an alert on the notifier channel and become candidates for a forced restart.
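The stall rule can be expressed in a few lines. The sketch below assumes a `SessionStatus` shape and helper names that are purely illustrative; only the `tmux has-session -t <name>` invocation is a real tmux command, which exits with status 0 when the session exists.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// SessionStatus is a hypothetical view of one row in the session table.
type SessionStatus struct {
	State     string    // "idle", "working", or "stalled"
	Heartbeat time.Time // last heartbeat read from .agent-queue/status.json
}

// alive shells out to tmux; a nil error means the command exited 0,
// i.e. the session exists. It returns false if tmux is not installed.
func alive(name string) bool {
	return exec.Command("tmux", "has-session", "-t", name).Run() == nil
}

// isStalled applies the rule from the text: the session claims to be
// working, but its heartbeat is older than maxAge.
func isStalled(s SessionStatus, now time.Time, maxAge time.Duration) bool {
	return s.State == "working" && now.Sub(s.Heartbeat) > maxAge
}

// Example data, invented for illustration.
var (
	exampleNow      = time.Date(2025, 1, 8, 12, 0, 0, 0, time.UTC)
	exampleMaxAge   = 10 * time.Minute
	exampleSessions = map[string]SessionStatus{
		"ccl-a": {State: "working", Heartbeat: exampleNow.Add(-25 * time.Minute)},
		"ccl-b": {State: "working", Heartbeat: exampleNow.Add(-2 * time.Minute)},
		"ccl-c": {State: "idle", Heartbeat: exampleNow.Add(-3 * time.Hour)},
	}
)

func main() {
	for name, st := range exampleSessions {
		fmt.Printf("%s stalled=%v\n", name, isStalled(st, exampleNow, exampleMaxAge))
	}
}
```

Note that an idle session with an old heartbeat is not treated as stalled; only the combination of `working` and a stale heartbeat qualifies.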

checkpoint

Every checkpoint.interval, serializes per-session context:

  • current task id
  • last recorded tool call (name + truncated args)
  • cwd as reported by the session
  • the last N lines of the session's scrollback

Files are written atomically (*.tmp + rename) to checkpoint.dir/<session>/<timestamp>.json and pruned to checkpoint.keep entries.
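A minimal sketch of the write-then-rename and pruning steps is shown below. The `Checkpoint` struct, its JSON field names, the timestamp format, and the helper names are assumptions made for illustration; the pruning step relies on timestamps that sort lexicographically in chronological order.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// Checkpoint mirrors the fields listed above; the JSON names are assumptions.
type Checkpoint struct {
	TaskID     string   `json:"task_id"`
	LastTool   string   `json:"last_tool"`
	Cwd        string   `json:"cwd"`
	Scrollback []string `json:"scrollback"`
}

// writeAtomic writes cp to dir/<stamp>.json via a temp file plus rename,
// so a reader never observes a half-written checkpoint.
func writeAtomic(dir, stamp string, cp Checkpoint) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	final := filepath.Join(dir, stamp+".json")
	tmp := final + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, final)
}

// prune deletes the oldest checkpoints until at most keep remain.
func prune(dir string, keep int) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.json"))
	if err != nil {
		return err
	}
	sort.Strings(files) // oldest first, given sortable timestamps
	for len(files) > keep {
		if err := os.Remove(files[0]); err != nil {
			return err
		}
		files = files[1:]
	}
	return nil
}

// demo writes three checkpoints into a temp dir, prunes to two,
// and returns the surviving file names in sorted order.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "ckpt")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	dir := filepath.Join(root, "ccl-a")
	for _, stamp := range []string{"20250101T000000", "20250101T000500", "20250101T001000"} {
		if err := writeAtomic(dir, stamp, Checkpoint{TaskID: "t1", Cwd: "/tmp"}); err != nil {
			return nil, err
		}
	}
	if err := prune(dir, 2); err != nil {
		return nil, err
	}
	left, err := filepath.Glob(filepath.Join(dir, "*"))
	if err != nil {
		return nil, err
	}
	sort.Strings(left)
	names := make([]string, 0, len(left))
	for _, p := range left {
		names = append(names, filepath.Base(p))
	}
	return names, nil
}

func main() {
	names, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```

The rename is only atomic when the temp file and the final file live on the same filesystem, which is why the sketch creates the temp file next to its destination.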

janitor

Periodic housekeeping:

  • removes stale .md.dispatched markers whose source task is gone
  • archives done/ older than a configurable horizon
  • prunes expired checkpoints
  • rotates the daemon's own log file when it exceeds a size threshold

notifier

Fan-out of typed events (SwapFired, SessionStalled, TaskFailed, QuotaWarning) to configured sinks:

  • Telegram bot (alerts channel)
  • MCP control-plane push
  • stdout / structured log aggregator
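One way to picture the fan-out is a small `Sink` interface with one implementation per destination. The interface, the `memorySink` and `failingSink` stand-ins, and the choice to keep delivering after one sink fails are assumptions for illustration, not a description of the daemon's code.

```go
package main

import (
	"errors"
	"fmt"
)

// EventKind enumerates the typed events named in the text.
type EventKind string

const (
	SwapFired      EventKind = "SwapFired"
	SessionStalled EventKind = "SessionStalled"
	TaskFailed     EventKind = "TaskFailed"
	QuotaWarning   EventKind = "QuotaWarning"
)

// Event is a hypothetical payload shape.
type Event struct {
	Kind   EventKind
	Detail string
}

// Sink is a hypothetical interface; Telegram, MCP push, and log sinks
// would each implement it.
type Sink interface {
	Send(Event) error
}

// memorySink records events in memory; it stands in for a real sink.
type memorySink struct{ got []Event }

func (m *memorySink) Send(e Event) error {
	m.got = append(m.got, e)
	return nil
}

// failingSink always fails, to show that one bad sink does not block the rest.
type failingSink struct{}

func (failingSink) Send(Event) error { return errors.New("sink unavailable") }

// fanOut delivers e to every sink and collects errors instead of
// stopping at the first failure.
func fanOut(e Event, sinks []Sink) []error {
	var errs []error
	for _, s := range sinks {
		if err := s.Send(e); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	a, b := &memorySink{}, &memorySink{}
	errs := fanOut(Event{Kind: QuotaWarning, Detail: "5h window at 90%"}, []Sink{a, failingSink{}, b})
	fmt.Println("delivered:", len(a.got)+len(b.got), "errors:", len(errs))
}
```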

account-switcher

Serializes all account swaps behind a single mutex. Swap protocol:

  1. mark active account as draining
  2. tell each session to flush its current tool call and checkpoint
  3. stop tmux sessions in reverse launch order
  4. repoint the ~/.claude symlink (or equivalent per-session HOME) to the target account's home directory
  5. relaunch sessions; replay the latest checkpoint so each session reopens the same project and task pointer
  6. mark the new account active, start the cooldown timer on the old one

See session-switch-analysis.md for why the shared-symlink approach is required (Claude Code bug #16103).
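Step 4 of the protocol, repointing the symlink, could be done without a window in which the link is missing by creating a temporary symlink and renaming it over the old one. The sketch below assumes a Unix-like system; the `repoint` helper and the `acct-a` / `acct-b` directory names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// repoint makes link point at target. It creates a temporary symlink
// and renames it over link, so link is replaced in a single step.
func repoint(link, target string) error {
	tmp := link + ".tmp"
	_ = os.Remove(tmp) // clear leftovers from an interrupted swap, if any
	if err := os.Symlink(target, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, link)
}

// demo builds two fake account homes in a temp dir, points a ".claude"
// link at the first, swaps it to the second, and returns the base name
// of the directory the link ends up pointing at.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "swap")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	for _, acct := range []string{"acct-a", "acct-b"} {
		if err := os.Mkdir(filepath.Join(root, acct), 0o755); err != nil {
			return "", err
		}
	}
	link := filepath.Join(root, ".claude")
	if err := repoint(link, filepath.Join(root, "acct-a")); err != nil {
		return "", err
	}
	if err := repoint(link, filepath.Join(root, "acct-b")); err != nil {
		return "", err
	}
	dest, err := os.Readlink(link)
	if err != nil {
		return "", err
	}
	return filepath.Base(dest), nil
}

func main() {
	active, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("link now points at:", active)
}
```

Because the sessions are already stopped in step 3, the atomic rename is a safety margin rather than a strict requirement; it mainly protects against a crash between removing and recreating the link.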

Shared state

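// State is shared by all goroutines. It uses sync and time from the standard library.
type State struct {
    mu            sync.RWMutex // guards every field below
    ActiveAccount string       // mutated only by the account-switcher
    Accounts      map[string]*AccountState
    Sessions      map[string]*SessionState
    LastSwap      time.Time
    PendingSwap   bool
}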

Readers take RLock; the account-switcher takes Lock for the duration of a swap. Every other write is sent over a channel that the switcher alone drains, so no mutation can interleave with a swap in progress.
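The locking discipline might look like the following sketch. The trimmed-down `State`, the `update` function type, and `runWriter` are illustrative assumptions; the point is only that reads use `RLock` while a single goroutine applies all writes in arrival order.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a trimmed-down version of the shared struct, for illustration.
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	PendingSwap   bool
}

// Active returns the active account under a read lock.
func (s *State) Active() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.ActiveAccount
}

// update is a hypothetical message type: a mutation to apply to the state.
type update func(*State)

// runWriter is the only goroutine that takes the write lock. It applies
// updates in arrival order until ch is closed, then closes done.
func runWriter(s *State, ch <-chan update, done chan<- struct{}) {
	for fn := range ch {
		s.mu.Lock()
		fn(s)
		s.mu.Unlock()
	}
	close(done)
}

// demo sends two updates through the writer and returns the final
// active account as seen by a reader.
func demo() string {
	s := &State{ActiveAccount: "acct-a"}
	ch := make(chan update)
	done := make(chan struct{})
	go runWriter(s, ch, done)

	ch <- func(st *State) { st.PendingSwap = true }
	ch <- func(st *State) {
		st.ActiveAccount = "acct-b"
		st.PendingSwap = false
	}
	close(ch)
	<-done // wait until every queued update has been applied
	return s.Active()
}

func main() {
	fmt.Println("active account:", demo())
}
```

Funnelling writes through one goroutine means ordering is decided by the channel, not by lock contention, which makes the sequence of mutations easier to reason about.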

HTTP control plane

The daemon exposes a small HTTP server (mcp_http.listen) consumed by the SecuAAS MCP gateway. Routes:

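| Method | Path              | Purpose                          |
|--------|-------------------|----------------------------------|
| GET    | /status           | Full state snapshot              |
| GET    | /accounts         | Account usage + limits           |
| GET    | /sessions         | Session table                    |
| POST   | /trigger/swap     | Force failover (requires bearer) |
| POST   | /trigger/dispatch | Force inbox scan                 |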

All routes require the bearer token read from the environment variable named by mcp_http.bearer_token_env.
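A bearer check of this kind is commonly implemented as middleware around the router. The sketch below is one possible shape, not the daemon's code: `requireBearer`, `newMux`, `probe`, the placeholder `/status` body, and the hard-coded token are all assumptions (a real daemon would presumably load the token from its configured environment variable).

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireBearer wraps next and rejects any request whose Authorization
// header is not exactly "Bearer <token>". An empty token rejects everything.
func requireBearer(token string, next http.Handler) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		// Constant-time comparison avoids leaking the token through timing.
		if token == "" || subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newMux registers a placeholder /status handler behind the bearer check.
func newMux(token string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"active_account":"acct-a"}`)
	})
	return requireBearer(token, mux)
}

// probe issues an in-memory GET /status and returns the HTTP status code.
func probe(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, "/status", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newMux("s3cret")
	fmt.Println("with token:   ", probe(h, "Bearer s3cret"))
	fmt.Println("without token:", probe(h, ""))
}
```

Wrapping the whole mux, rather than individual handlers, keeps the "all routes require the token" rule in one place so a newly added route cannot accidentally skip it.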