Architecture
claude-failover is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared State struct guarded by a
sync.RWMutex. A single-writer discipline is enforced: only the
account-switcher may mutate the active-account field.
Goroutines
dispatcher
Watches .agent-queue/inbox/ for every registered project (inotify on
Linux) and pairs each incoming task with an idle session from the pool.
It respects:
- per-project priority
- agent capability tags declared in the task frontmatter
- the needs_claude_code: true bypass flag
- dispatcher-level cooldowns to avoid flooding a freshly launched session
On successful assignment it renames <task>.md to <task>.md.dispatched
and writes a pointer into the target session's tmux prompt.
quota-monitor
Polls Anthropic usage counters for every configured account. Sources:
- Claude Code's local telemetry files under ~/.claude/statsig/ and ~/.claude/projects/*.jsonl (message timestamps).
- Optional: a reverse-engineered /api/quota endpoint if available.
It computes two sliding windows (5h, 1 week) and emits a swap-requested
event once thresholds in the config are crossed.
session-watcher
Keeps a table of tmux sessions (ccl-*). For each one it tracks:
- process liveness (via tmux has-session)
- heartbeat timestamp from .agent-queue/status.json
- current state field (idle / working / stalled)
Stalled sessions (heartbeat older than N minutes while state=working)
raise an alert on the notifier channel and become candidates for a
forced restart.
checkpoint
Every checkpoint.interval, serializes per-session context:
- current task id
- last recorded tool call (name + truncated args)
- cwd as reported by the session
- the last N lines of the session's scrollback
Files are written atomically (*.tmp + rename) to
checkpoint.dir/<session>/<timestamp>.json and pruned to
checkpoint.keep entries.
janitor
Periodic housekeeping:
- removes stale .md.dispatched markers whose source task is gone
- archives done/ older than a configurable horizon
- prunes expired checkpoints
- rotates the daemon's own log file when it exceeds a size threshold
notifier
Fan-out of typed events (SwapFired, SessionStalled, TaskFailed,
QuotaWarning) to configured sinks:
- Telegram bot (alerts channel)
- MCP control-plane push
- stdout / structured log aggregator
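The fan-out can be pictured as one loop over an event channel that hands each event to every sink. The event names come from the list above. The Event struct, the Sink interface, and memorySink are hypothetical stand-ins for the real Telegram, MCP, and log sinks.

```go
package main

// EventKind enumerates the typed events named in the text.
type EventKind int

const (
	SwapFired EventKind = iota
	SessionStalled
	TaskFailed
	QuotaWarning
)

// Event is a hypothetical envelope for a notifier event.
type Event struct {
	Kind    EventKind
	Message string
}

// Sink is a hypothetical interface that each destination implements.
type Sink interface {
	Send(Event) error
}

// memorySink records events in memory; it stands in for a real sink.
type memorySink struct{ got []Event }

func (m *memorySink) Send(ev Event) error {
	m.got = append(m.got, ev)
	return nil
}

// fanOut delivers every event to every sink until the channel is
// closed, and returns the number of failed deliveries. A failing
// sink does not prevent delivery to the remaining sinks.
func fanOut(events <-chan Event, sinks []Sink) int {
	failed := 0
	for ev := range events {
		for _, s := range sinks {
			if err := s.Send(ev); err != nil {
				failed++
			}
		}
	}
	return failed
}
```

In this sketch delivery is sequential, so a slow sink delays the others. Whether the daemon buffers or parallelizes per sink is not stated in this document.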
account-switcher
Serializes all account swaps behind a single mutex. Swap protocol:
- mark active account as draining
- tell each session to flush its current tool call and checkpoint
- stop tmux sessions in reverse launch order
- repoint the ~/.claude symlink (or equivalent per-session HOME) to the target account's home directory
- relaunch sessions; replay the latest checkpoint so each session reopens the same project and task pointer
- mark the new account active, start the cooldown timer on the old one
See session-switch-analysis.md for why
the shared-symlink approach is required (Claude Code bug #16103).
Shared state
    type State struct {
        mu            sync.RWMutex
        ActiveAccount string
        Accounts      map[string]*AccountState
        Sessions      map[string]*SessionState
        LastSwap      time.Time
        PendingSwap   bool
    }
Readers take RLock; the account-switcher takes Lock for the duration
of a swap. All other writers go through a single-writer channel owned by
the switcher, which guarantees swap atomicity.
HTTP control plane
The daemon exposes a small HTTP server (mcp_http.listen) consumed by
the SecuAAS MCP gateway. Routes:
| Method | Path | Purpose |
|---|---|---|
| GET | /status | Full state snapshot |
| GET | /accounts | Account usage + limits |
| GET | /sessions | Session table |
| POST | /trigger/swap | Force failover (requires bearer) |
| POST | /trigger/dispatch | Force inbox scan |
All routes require the bearer token from mcp_http.bearer_token_env.