# Architecture

`claude-failover` is a single Go binary structured as a set of cooperating goroutines. Each goroutine owns a narrow responsibility and communicates through typed channels and a shared `State` struct guarded by a `sync.RWMutex`. A single-writer discipline is enforced: only the **account-switcher** may mutate the active-account field.

## Goroutines

### dispatcher

Watches `.agent-queue/inbox/` for every registered project (inotify on Linux) and pairs each incoming task with an idle session from the pool. It respects:

- per-project priority
- agent capability tags declared in the task frontmatter
- the `needs_claude_code: true` bypass flag
- dispatcher-level cooldowns to avoid flooding a freshly-launched session

On successful assignment it renames `.md` to `.md.dispatched` and writes a pointer into the target session's tmux prompt.

### quota-monitor

Polls Anthropic usage counters for every configured account. Sources:

1. Claude Code's local telemetry files under `~/.claude/statsig/` and `~/.claude/projects/*.jsonl` (message timestamps).
2. Optional: a reverse-engineered `/api/quota` endpoint if available.

It computes two sliding windows (5h, 1 week) and emits a `swap-requested` event once thresholds in the config are crossed.

### session-watcher

Keeps a table of tmux sessions (`ccl-*`). For each one it tracks:

- process liveness (via `tmux has-session`)
- heartbeat timestamp from `.agent-queue/status.json`
- current `state` field (idle / working / stalled)

Stalled sessions (heartbeat older than N minutes while `state=working`) raise an alert on the notifier channel and become candidates for a forced restart.

### checkpoint

Every `checkpoint.interval`, serializes per-session context:

- current task id
- last recorded tool call (name + truncated args)
- cwd as reported by the session
- the last N lines of the session's scrollback

Files are written atomically (`*.tmp` + rename) to `checkpoint.dir//.json` and pruned to `checkpoint.keep` entries.
### janitor

Periodic housekeeping:

- removes stale `.md.dispatched` markers whose source task is gone
- archives `done/` older than a configurable horizon
- prunes expired checkpoints
- rotates the daemon's own log file when it exceeds a size threshold

### notifier

Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`, `QuotaWarning`) to configured sinks:

- Telegram bot (alerts channel)
- MCP control-plane push
- stdout / structured log aggregator

### account-switcher

Serializes all account swaps behind a single mutex. Swap protocol:

1. mark active account as `draining`
2. tell each session to flush its current tool call and checkpoint
3. stop tmux sessions in reverse launch order
4. repoint the `~/.claude` symlink (or equivalent per-session HOME) to the target account's home directory
5. relaunch sessions; replay the latest checkpoint so each session reopens the same project and task pointer
6. mark the new account `active`, start the cooldown timer on the old one

See [`session-switch-analysis.md`](./session-switch-analysis.md) for why the shared-symlink approach is required (Claude Code bug #16103).

## Shared state

```go
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	Accounts      map[string]*AccountState
	Sessions      map[string]*SessionState
	LastSwap      time.Time
	PendingSwap   bool
}
```

Readers take `RLock`; the account-switcher takes `Lock` for the duration of a swap. All other writers go through a single-writer channel owned by the switcher, which guarantees swap atomicity.

## HTTP control plane

The daemon exposes a small HTTP server (`mcp_http.listen`) consumed by the SecuAAS MCP gateway.
Routes:

| Method | Path                | Purpose                          |
|--------|---------------------|----------------------------------|
| GET    | `/status`           | Full state snapshot              |
| GET    | `/accounts`         | Account usage + limits           |
| GET    | `/sessions`         | Session table                    |
| POST   | `/trigger/swap`     | Force failover (requires bearer) |
| POST   | `/trigger/dispatch` | Force inbox scan                 |

All routes require the bearer token from `mcp_http.bearer_token_env`.