Architecture
claude-failover is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared State struct guarded by a
sync.RWMutex. A single-writer discipline is enforced: only the
account-switcher may mutate the active-account field.
This document enriches the high-level design with concrete Go implementation notes distilled from the current shell-based
agent-orchestrator (see graceful-switch.sh, launch-agent.sh, setup-tmux.sh, start-dedicated-agents.sh, watchdog.sh, checkpoint-daemon.sh) and the companion TUI written in Go (agent-orchestrator/tui/*.go). The daemon is the consolidation of those scripts into a single, testable process.
1. Process model
- Single static binary: claude-failover (package main in cmd/claude-failover).
- One OS process, many goroutines. No child daemons.
- Runs under systemd (User=ubuntu, Restart=always).
- Persists state at /var/lib/claude-failover/state.json (atomic write).
- Listens locally on 127.0.0.1:8765 for MCP HTTP control.
- Single YAML config (/etc/claude-failover/config.yaml), see config.example.yaml.
2. Goroutines
Each goroutine is started from main with a context.Context derived
from the root context. All goroutines obey cancellation and drain their
pending work before returning.
2.1 dispatcher
- Watches every registered project's .agent-queue/inbox/ directory via fsnotify.
- For each new *.md (ignoring *.dispatched and *.dispatch-meta), parses YAML frontmatter (title, priority, tags, needs_claude_code, capabilities).
- Selects an idle session from the pool based on:
  - per-project priority,
  - agent capability tags,
  - needs_claude_code: true bypass (direct to Claude Code, skipping the GPU triage path inherited from the current dispatcher),
  - dispatcher-level cooldowns to avoid flooding a freshly-launched session.
- On successful assignment, renames <task>.md to <task>.md.dispatched atomically and writes a pointer into the target session's tmux prompt via TmuxAdapter.SendKeys.
- Emits a TaskDispatched event on the notifier channel.
Migration note: Phase 1 wraps the existing smart-dispatcher.sh
behind exec.Command and only takes over the fsnotify layer; Phase 2
re-implements dispatch selection natively in Go.
2.2 quota-monitor
- Polls every account's usage counters on a configurable interval (quota.poll_interval, default 30s).
- Two data sources, merged:
  - Tmux pane scrape - runs tmux capture-pane -p -t <session> and parses the "Claude usage" footer lines emitted by Claude Code. Implementation reuses the tmux wrapper from claude-squad/session/tmux.
  - Local telemetry - scans ~/.claude/statsig/ and ~/.claude/projects/*.jsonl for message timestamps, computes sliding windows (5h, 1 week).
- Also reads SessionStart / SessionEnd / PreToolUse hook events written by Claude Code into ~/.claude/hooks-audit.log (opt-in via settings.json) for precise counters.
- Emits QuotaWarning and SwapRequested events when thresholds in the config are crossed.
2.2.1 False-positive protection (v0.2.3+)
Four layers prevent transient upstream errors from being mistaken for quota exhaustion and triggering useless, destructive swaps:
- Strict pattern matching (isQuotaExhausted). quotaPatterns keys on specific phrases that only a real 429 surfaces: "you've hit your limit", "rate_limit_error" (Anthropic typed error), "quota exceeded", "usage limit reached", "claude pro usage", "too many requests", "5-hour limit". The generic substring "rate limit" is not a pattern, because it also matches inside the bodies of unrelated error transcripts.
- Server-error veto (hasServerError). If the same pane also contains "api_error", "overloaded_error", "internal server error", or "api error: 5", the quota match is vetoed. An Anthropic 500/503 response is surfaced in the Claude Code conversation transcript and stays visible until the user scrolls; without this veto it would be re-matched on every poll.
- Two-poll confirmation (Monitor.suspectedHitAt). A hit with no parseable reset time (real 429s always include one, see extractResetTime) is treated as suspected on the first poll and only emits SwapRequested if a second consecutive poll also detects the same condition. A single-poll flash is absorbed.
- Post-swap cooldown (state.QuotaState.LastSwapAt + quota.reactivate_cooldown, default 5m). After a swap, the monitor suppresses all detection for the cooldown window, breaking the ping-pong failure mode where both accounts appear exhausted in alternation.
Forensic logging on every SwapRequested includes the triggering
session name, the matched pattern, and a 120-char snippet of the pane
so production incidents can be diagnosed from journalctl alone.
2.3 session-watcher
- Maintains an in-memory table keyed by tmux session name (ccl-*).
- For each session tracks:
  - process liveness via tmux has-session -t <name> (exit code),
  - heartbeat timestamp parsed from the project's .agent-queue/status.json,
  - current state field (idle / working / shell),
  - last assigned task id and deadline.
- Sessions whose heartbeat is older than watchdog.stall_after while state == "working" raise a SessionStalled alert and become candidates for a forced restart (via account-switcher or launch-agent re-run).
- Replaces watchdog.sh entirely once Phase 2 ships.
2.4 checkpoint
- On checkpoint.interval (default 30m, i.e. 10 checkpoints per Anthropic 5h window), serializes per-session context:
  - current task id,
  - last recorded tool call (name + truncated args),
  - cwd as reported by the session,
  - the last N lines of the session's scrollback (tmux capture-pane -pS -<N>).
- Persists to SQLite at checkpoint.db (default /var/lib/claude-failover/checkpoints.db) with retention pruning based on the 30-minute rolling window.
  - Schema: checkpoints(id INTEGER PK, session TEXT, ts INTEGER, task_id TEXT, cwd TEXT, tool_calls JSON, scrollback TEXT).
  - Indexed on (session, ts DESC); old rows pruned by janitor.
- Also exports a JSON snapshot per session to checkpoint.dir/<session>/<timestamp>.json for out-of-band tooling.
2.5 janitor
Periodic housekeeping (runs every janitor.interval, default 5m):
- removes stale .md.dispatched markers whose source task is gone,
- archives done/ entries older than janitor.done_horizon,
- prunes checkpoint rows beyond checkpoint.keep,
- rotates the daemon's own log file when it exceeds log.max_size_mb.
2.6 notifier
- Fan-out of typed events (SwapFired, SessionStalled, TaskFailed, QuotaWarning, TaskDispatched) to configured sinks:
  - Telegram bot (alerts channel),
  - MCP control-plane push (HTTP callback),
  - structured log aggregator (stdout JSON).
- Each sink is a goroutine with its own bounded buffered channel. Back-pressure is handled by dropping low-priority events (TaskDispatched) and logging a NotifierSaturated warning.
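The drop-on-saturation behaviour maps onto a non-blocking channel send. The sketch below uses hypothetical Event and Sink types; the text does not say what happens to high-priority events when the buffer is full, so blocking until space frees up is an assumption of this example.

```go
package main

// Event is a minimal stand-in for the daemon's typed events.
type Event struct {
	Kind        string
	LowPriority bool // true for TaskDispatched in this sketch
}

// Sink owns one bounded buffer. Dropped counts events shed under
// back-pressure and is only touched by the producing goroutine.
type Sink struct {
	ch      chan Event
	Dropped int
}

// NewSink creates a sink with the given buffer capacity.
func NewSink(capacity int) *Sink {
	return &Sink{ch: make(chan Event, capacity)}
}

// Offer enqueues an event and reports whether it was accepted.
// Low-priority events are dropped when the buffer is full; all other
// events block until the sink goroutine makes room (assumption).
func (s *Sink) Offer(e Event) bool {
	if !e.LowPriority {
		s.ch <- e
		return true
	}
	select {
	case s.ch <- e:
		return true
	default:
		s.Dropped++ // the daemon would also log NotifierSaturated here
		return false
	}
}
```

Because the send for low-priority events sits in a select with a default case, a slow Telegram sink cannot stall the dispatcher.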
2.7 account-switcher (state machine)
The switcher is the only goroutine allowed to mutate State.ActiveAccount.
It consumes SwapRequested events from a single-consumer channel and
transitions through an atomic state machine:
    +--------+   SwapRequested    +--------+
    | normal | -----------------> | saving |
    +--------+                    +--------+
        ^                             |
        |                             v  (checkpoint all sessions)
        |                       +-----------+
        |                       | switching |
        |                       +-----------+
        |                             |
        |                             v  (relaunch tmux, flip HOME)
        |     ResumeComplete     +----------+
        +----------------------- | resuming |
                                 +----------+
- normal -> saving: halt dispatcher intake, request an immediate checkpoint for every session, wait for ack with a deadline.
- saving -> switching: kill tmux sessions in reverse launch order, repoint the ~/.claude symlink (or per-session HOME) to the target account's home directory.
- switching -> resuming: relaunch sessions via TmuxAdapter.NewSession; each session replays its most recent checkpoint (task pointer + cwd + truncated scrollback) so Claude Code reopens at the same task.
- resuming -> normal: unblock dispatcher, start cooldown timer on the old account, emit SwapFired.
Guarantees:
- Only one swap in flight at a time (the channel has capacity 1, extra requests coalesce).
- Swap is cancellable via context.Context; on cancel the switcher rolls back to the pre-swap ActiveAccount.
- All FS writes go through FSAdapter and are *.tmp + rename to keep state.json recoverable.
See session-switch-analysis.md for
why the shared-symlink approach is required (Claude Code bug #16103)
and how resuming interactive sessions differs from resuming autonomous
ones.
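The state machine and the coalescing guarantee can be sketched in a few lines. This is an illustration under assumptions, not the switcher itself: Phase, Switcher, Swap, and requestSwap are hypothetical names, each transition is reduced to one callback, and rollback is triggered by any step error (a step would typically return ctx.Err() when the swap is cancelled).

```go
package main

import (
	"errors"
	"fmt"
)

type Phase int

const (
	Normal Phase = iota
	Saving
	Switching
	Resuming
)

var ErrSwapInFlight = errors.New("swap already in flight")

type Switcher struct {
	phase         Phase
	ActiveAccount string
}

// Swap walks normal -> saving -> switching -> resuming -> normal, running
// one step per phase. If a step fails, the pre-swap ActiveAccount is
// restored and the machine returns to normal.
func (s *Switcher) Swap(target string, steps [3]func() error) error {
	if s.phase != Normal {
		return ErrSwapInFlight
	}
	prev := s.ActiveAccount
	for i, step := range steps {
		s.phase = Phase(i + 1) // Saving, Switching, Resuming
		if s.phase == Switching {
			s.ActiveAccount = target
		}
		if err := step(); err != nil {
			s.ActiveAccount = prev
			s.phase = Normal
			return fmt.Errorf("phase %d: %w", s.phase, err)
		}
	}
	s.phase = Normal
	return nil
}

// requestSwap is a non-blocking send. With a capacity-1 channel, a request
// that arrives while another is still queued is coalesced (dropped).
func requestSwap(ch chan<- string, target string) bool {
	select {
	case ch <- target:
		return true
	default:
		return false
	}
}
```

Both the quota-monitor and the HTTP handler would call requestSwap on the same channel, so a burst of triggers results in at most one queued swap.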
3. Shared state
    type State struct {
        mu            sync.RWMutex
        ActiveAccount string
        Accounts      map[string]*AccountState
        Sessions      map[string]*SessionState
        Pool          PoolState
        Dispatcher    DispatcherState
        LastSwap      time.Time
        PendingSwap   bool
        StartedAt     time.Time
    }

    type AccountState struct {
        Name        string
        Home        string
        Priority    int
        Usage5h     float64 // 0.0 - 1.0
        UsageWeek   float64
        CooldownEnd time.Time
        LastSwapIn  time.Time
        LastSwapOut time.Time
    }

    type SessionState struct {
        Name       string
        Kind       string // "autonomous" | "dedicated" | "interactive" | "orchestrator"
        Project    string
        State      string // "idle" | "working" | "shell" | "stalled"
        Task       string
        AssignedAt time.Time
        Heartbeat  time.Time
    }
- Readers take RLock; the account-switcher takes Lock for the duration of a swap.
- All other writers enqueue on a single-writer channel (stateOps chan func(*State)) consumed by a dedicated goroutine. This mirrors the actor pattern and eliminates lock-ordering issues.
- Every mutation triggers a debounced flush of state.json (persist.debounce, default 1s) written atomically via FSAdapter.
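The single-writer channel can be sketched as below. The State struct is trimmed to two fields, runStateOps and the Pending reader are hypothetical names, and the debounced flush of state.json is left out to keep the example short.

```go
package main

import "sync"

// State is a trimmed copy of the shared state struct.
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	PendingSwap   bool
}

// runStateOps is the single writer: it applies queued mutations one at a
// time under the write lock until ops is closed, then closes done.
func runStateOps(s *State, ops <-chan func(*State), done chan<- struct{}) {
	for op := range ops {
		s.mu.Lock()
		op(s)
		s.mu.Unlock()
	}
	close(done)
}

// Pending is an example reader; readers only ever take RLock.
func (s *State) Pending() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.PendingSwap
}
```

A producer mutates state by sending a closure, for example ops <- func(st *State) { st.PendingSwap = true }, and never touches the mutex itself.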
4. Event system
Internal events are plain Go structs on typed channels. No reflection.
| Channel | Producer(s) | Consumer | Buffer |
|---|---|---|---|
| swapReq | quota-monitor, HTTP | account-switcher | 1 |
| sessionEvt | session-watcher | notifier, state | 32 |
| taskEvt | dispatcher | notifier, state | 64 |
| checkpointEvt | checkpoint | state | 8 |
| notifierSink[i] | notifier | sink i | 128 |
Flow for a typical swap:
    quota-monitor --SwapRequested--> swapReq --> account-switcher
          |                                             |
          |                                             +--> state (lock)
          |                                             +--> TmuxAdapter (kill/launch)
          |                                             +--> FSAdapter (symlink flip)
          |                                             +--> sessionEvt (SwapFired)
          v
       notifier --> telegram/mcp/log
5. Adapters (mockable interfaces)
Tmux
Thin wrapper around exec.Command("tmux", ...), mockable via interface:
    type TmuxAdapter interface {
        HasSession(name string) (bool, error)
        NewSession(name, cwd string, env []string) error
        KillSession(name string) error
        SendKeys(name string, keys string) error
        CapturePane(name string, lines int) (string, error)
        ListSessions() ([]string, error)
    }
- Default implementation: ExecTmux shelling out to the tmux binary.
- Test implementation: FakeTmux backed by an in-memory map.
- Inspired by smtg-ai/claude-squad/session/tmux, which already battle-tests pane capture and send-keys escaping.
Filesystem & Claude CLI
    type FSAdapter interface {
        ReadFile(path string) ([]byte, error)
        WriteFileAtomic(path string, data []byte, perm os.FileMode) error
        Rename(oldp, newp string) error
        Stat(path string) (os.FileInfo, error)
        Watch(path string) (<-chan fsnotify.Event, error)
    }

    type ClaudeCLIAdapter interface {
        ResumeSession(sessionID, cwd string) error
        InjectPrompt(sessionID, prompt string) error
        Quota() (Usage, error)
    }
All three interfaces have Fake* implementations under
internal/testutil/ so every goroutine can be unit-tested without a
real tmux, filesystem, or Claude CLI.
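As an illustration of what such a fake could look like, here is a minimal in-memory FakeTmux that satisfies the TmuxAdapter interface shown earlier. The internals are assumptions: each session is modelled as the list of key strings sent to it, CapturePane returns the last entries of that list, and there is no locking, so it is only suitable for single-goroutine tests.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type TmuxAdapter interface {
	HasSession(name string) (bool, error)
	NewSession(name, cwd string, env []string) error
	KillSession(name string) error
	SendKeys(name string, keys string) error
	CapturePane(name string, lines int) (string, error)
	ListSessions() ([]string, error)
}

// FakeTmux maps a session name to the keys sent to it so far.
type FakeTmux struct {
	sessions map[string][]string
}

// Compile-time check that the fake implements the interface.
var _ TmuxAdapter = (*FakeTmux)(nil)

func NewFakeTmux() *FakeTmux { return &FakeTmux{sessions: map[string][]string{}} }

func (f *FakeTmux) HasSession(name string) (bool, error) {
	_, ok := f.sessions[name]
	return ok, nil
}

func (f *FakeTmux) NewSession(name, cwd string, env []string) error {
	if _, ok := f.sessions[name]; ok {
		return fmt.Errorf("duplicate session: %s", name)
	}
	f.sessions[name] = nil
	return nil
}

func (f *FakeTmux) KillSession(name string) error {
	if _, ok := f.sessions[name]; !ok {
		return fmt.Errorf("no such session: %s", name)
	}
	delete(f.sessions, name)
	return nil
}

func (f *FakeTmux) SendKeys(name string, keys string) error {
	if _, ok := f.sessions[name]; !ok {
		return fmt.Errorf("no such session: %s", name)
	}
	f.sessions[name] = append(f.sessions[name], keys)
	return nil
}

// CapturePane returns the last `lines` entries sent to the session.
func (f *FakeTmux) CapturePane(name string, lines int) (string, error) {
	sent, ok := f.sessions[name]
	if !ok {
		return "", fmt.Errorf("no such session: %s", name)
	}
	if lines >= 0 && lines < len(sent) {
		sent = sent[len(sent)-lines:]
	}
	return strings.Join(sent, "\n"), nil
}

func (f *FakeTmux) ListSessions() ([]string, error) {
	names := make([]string, 0, len(f.sessions))
	for n := range f.sessions {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}
```

A dispatcher test can then assert on CapturePane output instead of inspecting a real tmux server.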
6. HTTP control plane (MCP)
Bound to 127.0.0.1:8765 (config mcp_http.listen). Never exposed
publicly - the SecuAAS MCP gateway proxies from the host.
| Method | Path | Purpose |
|---|---|---|
| GET | /api/v1/session_status | Full state snapshot |
| GET | /api/v1/accounts | Account usage + limits |
| GET | /api/v1/sessions | Session table |
| POST | /api/v1/trigger_dispatch | Force inbox scan |
| POST | /api/v1/trigger/swap | Force failover (bearer required) |
| GET | /healthz | Liveness probe |
- All mutating routes require the bearer token from mcp_http.bearer_token_env.
- Read routes are unauthenticated but only reachable on loopback.
- Handlers are thin: they publish on the same internal channels used by goroutines, then wait on a reply channel with a timeout.
7. Graceful shutdown
- main installs a signal.Notify for SIGTERM, SIGINT, SIGHUP.
- On SIGTERM or SIGINT, the root context.Context is cancelled (SIGHUP triggers a config reload instead, see section 8).
- Goroutines drain in a defined order:
  - HTTP server (Shutdown(ctx) with 5s deadline),
  - dispatcher (stops accepting new tasks, finishes in-flight assignment),
  - quota-monitor (cancels poll),
  - session-watcher (flushes last state),
  - checkpoint (forces one final checkpoint for every working session),
  - account-switcher (refuses new swaps, completes an in-flight swap up to shutdown.swap_deadline),
  - notifier (flushes bounded buffers),
  - janitor (exits immediately).
- state.json is flushed one last time before exit.
- Exit code 0 on clean drain, 1 on deadline timeout (systemd will restart).
8. Configuration
Single YAML file. See config.example.yaml
for the authoritative schema. Key sections:
- accounts[] - ordered list of Anthropic accounts (name, home, limits, priority).
- pool - dedicated sessions, autonomous pool prefix and min/max, shared projects dir.
- quota - poll interval, 5h/weekly thresholds.
- checkpoint - interval, dir, db, keep count.
- janitor - interval, horizons.
- mcp_http - listen addr, bearer token env var.
- notifier - telegram, mcp callback, log sinks.
The config is loaded once at startup. SIGHUP triggers a re-parse;
only safe fields (thresholds, intervals, notifier sinks) are
hot-reloaded. Structural changes (accounts, pool) require a restart.
9. Migration plan
The existing orchestrator is a set of bash scripts under
agent-orchestrator/ (graceful-switch.sh, launch-agent.sh,
setup-tmux.sh, start-dedicated-agents.sh, watchdog.sh,
checkpoint-daemon.sh, smart-dispatcher.sh). Cutover is staged to
minimize risk.
Phase 1 - Daemon wrapper (weeks 1-2)
- Build claude-failover as a supervisor that invokes the existing .sh scripts through exec.Command.
- Goroutines are skeletons that call scripts:
  - dispatcher -> smart-dispatcher.sh,
  - account-switcher -> graceful-switch.sh <target>,
  - session-watcher -> parses watchdog.sh --json output,
  - checkpoint -> calls checkpoint-daemon.sh --once.
- Shared state, event system, HTTP control plane, and config are already native Go.
- Systemd unit replaces the existing per-script cron/tmux launchers.
Phase 2 - Native rewrite (weeks 3-6)
- Replace each shell wrapper with a native Go implementation one goroutine at a time, in this order: session-watcher -> checkpoint -> dispatcher -> account-switcher.
- Shell scripts remain as a fallback until each native path has run a full week without regression.
Phase 3 - Removal (week 7+)
- Delete agent-orchestrator/*.sh once Phase 2 is green.
- Keep the TUI (agent-orchestrator/tui/) as a read-only client of the daemon's HTTP API.
10. Testing strategy
- Adapters (TmuxAdapter, FSAdapter, ClaudeCLIAdapter) are interfaces with in-memory fakes under internal/testutil/.
- Unit tests cover every goroutine in isolation:
  - dispatcher: feed fake fsnotify events, assert tmux SendKeys calls.
  - quota-monitor: feed canned pane captures, assert SwapRequested.
  - account-switcher: drive the state machine through a full swap, including cancellation and rollback.
- Integration tests boot the full daemon with fake adapters and an httptest.Server on the MCP port. A golden-file test compares state.json after a scripted sequence.
- Race detector mandatory in CI (go test -race ./...).
- An e2e/ suite runs against a real tmux + real Claude Code CLI in a container, executed nightly.
11. Systemd unit
# /etc/systemd/system/claude-failover.service
[Unit]
Description=Claude Code failover & orchestration daemon
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu
Environment=CLAUDE_FAILOVER_CONFIG=/etc/claude-failover/config.yaml
ExecStart=/usr/local/bin/claude-failover --config ${CLAUDE_FAILOVER_CONFIG}
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
TimeoutStopSec=30
LimitNOFILE=65536
# State & logs
StateDirectory=claude-failover
LogsDirectory=claude-failover
# Hardening (tmux needs TTY access, keep minimal)
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/claude-failover /var/log/claude-failover /home/ubuntu/.claude /home/ubuntu/projects
[Install]
WantedBy=multi-user.target
Type=notify requires the daemon to call sd_notify(READY=1) once all
goroutines have registered and the HTTP server is listening. Use
github.com/coreos/go-systemd/v22/daemon.