2026-04-14 13:29:24 +00:00
# Architecture
`claude-failover` is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared `State` struct guarded by a
`sync.RWMutex`. A single-writer discipline is enforced: only the
**account-switcher** may mutate the active-account field.
> This document enriches the high-level design with concrete Go
> implementation notes distilled from the current shell-based
> `agent-orchestrator` (see `graceful-switch.sh`, `launch-agent.sh`,
> `setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`,
> `checkpoint-daemon.sh`) and the companion TUI written in Go
> (`agent-orchestrator/tui/*.go`). The daemon is the consolidation of
> those scripts into a single, testable process.
---
## 1. Process model
- Single static binary: `claude-failover` (package main in `cmd/claude-failover`).
- One OS process, many goroutines. No child daemons.
- Runs under systemd (`User=ubuntu`, `Restart=always`).
- Persists state at `/var/lib/claude-failover/state.json` (atomic write).
- Listens locally on `127.0.0.1:8765` for MCP HTTP control.
- Single YAML config (`/etc/claude-failover/config.yaml`), see
[`config.example.yaml`](../config.example.yaml).
---
## 2. Goroutines
Each goroutine is started from `main` with a `context.Context` derived
from the root context. All goroutines obey cancellation and drain their
pending work before returning.
### 2.1 dispatcher
- Watches every registered project's `.agent-queue/inbox/` directory via
[`fsnotify`](https://pkg.go.dev/github.com/fsnotify/fsnotify).
- For each new `*.md` (ignoring `*.dispatched` and `*.dispatch-meta`),
parses YAML frontmatter (title, priority, tags, `needs_claude_code`,
`capabilities`).
- Selects an idle session from the pool based on:
  - per-project priority,
  - agent capability tags,
  - `needs_claude_code: true` bypass (direct to Claude Code, skipping
    the GPU triage path inherited from the current dispatcher),
  - dispatcher-level cooldowns to avoid flooding a freshly-launched
    session.
- On successful assignment, renames `<task>.md` to `<task>.md.dispatched`
atomically and writes a pointer into the target session's tmux prompt
via `TmuxAdapter.SendKeys`.
- Emits a `TaskDispatched` event on the notifier channel.

Migration note: Phase 1 wraps the existing `smart-dispatcher.sh`
behind `exec.Command` and only takes over the fsnotify layer; Phase 2
re-implements dispatch selection natively in Go.
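The selection step of the native (Phase 2) dispatcher could look roughly like the sketch below. It only models priority and capability tags; the `Session` type, the "higher priority wins" rule, and the name-based tie-break are assumptions made for illustration, and cooldowns and the `needs_claude_code` bypass are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Session is a hypothetical, trimmed-down view of a pool entry.
type Session struct {
	Name         string
	State        string   // "idle" | "working" | "shell"
	Priority     int      // per-project priority; higher wins (assumption)
	Capabilities []string // agent capability tags
}

// hasAll reports whether have contains every tag in want.
func hasAll(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, c := range have {
		set[c] = true
	}
	for _, c := range want {
		if !set[c] {
			return false
		}
	}
	return true
}

// pickSession returns the highest-priority idle session that carries all
// required capability tags. Ties are broken by name for determinism.
func pickSession(pool []Session, required []string) (Session, bool) {
	var candidates []Session
	for _, s := range pool {
		if s.State == "idle" && hasAll(s.Capabilities, required) {
			candidates = append(candidates, s)
		}
	}
	if len(candidates) == 0 {
		return Session{}, false
	}
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].Priority != candidates[j].Priority {
			return candidates[i].Priority > candidates[j].Priority
		}
		return candidates[i].Name < candidates[j].Name
	})
	return candidates[0], true
}

func main() {
	pool := []Session{
		{Name: "ccl-a", State: "idle", Priority: 1, Capabilities: []string{"go"}},
		{Name: "ccl-b", State: "working", Priority: 9, Capabilities: []string{"go", "gpu"}},
		{Name: "ccl-c", State: "idle", Priority: 5, Capabilities: []string{"go", "gpu"}},
	}
	s, ok := pickSession(pool, []string{"gpu"})
	fmt.Println(s.Name, ok) // prints "ccl-c true"
}
```

Keeping selection a pure function of the pool snapshot makes it easy to unit-test without tmux or the filesystem.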
### 2.2 quota-monitor
- Polls every account's usage counters on a configurable interval
(`quota.poll_interval`, default `30s`).
- Two data sources, merged:
  1. **Tmux pane scrape** - runs `tmux capture-pane -p -t <session>`
     and parses the "Claude usage" footer lines emitted by Claude Code.
     Implementation reuses the tmux wrapper from
     [`claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux).
  2. **Local telemetry** - scans `~/.claude/statsig/` and
     `~/.claude/projects/*.jsonl` for message timestamps, computes
     sliding windows (5h, 1 week).
- Also reads `SessionStart` / `SessionEnd` / `PreToolUse` hook events
written by Claude Code into `~/.claude/hooks-audit.log` (opt-in via
`settings.json`) for precise counters.
- Emits `QuotaWarning` and `SwapRequested` events when thresholds in
the config are crossed.
### 2.3 session-watcher
- Maintains an in-memory table keyed by tmux session name (`ccl-*`).
- For each session tracks:
  - process liveness via `tmux has-session -t <name>` (exit code),
  - heartbeat timestamp parsed from the project's
    `.agent-queue/status.json`,
  - current `state` field (`idle` / `working` / `shell`),
  - last assigned task id and deadline.
- Sessions whose heartbeat is older than `watchdog.stall_after` while
`state == "working"` raise a `SessionStalled` alert and become
candidates for a forced restart (via `account-switcher` or
`launch-agent` re-run).
- Replaces `watchdog.sh` entirely once Phase 2 ships.
### 2.4 checkpoint
- On every `checkpoint.interval` tick (default `30m`, i.e. 10 checkpoints
per Anthropic 5h window), serializes per-session context:
  - current task id,
  - last recorded tool call (name + truncated args),
  - cwd as reported by the session,
  - the last N lines of the session's scrollback
    (`tmux capture-pane -pS -<N>`).
- Persists to SQLite at `checkpoint.db` (default
`/var/lib/claude-failover/checkpoints.db`); retention is bounded by
`checkpoint.keep`.
- Schema: `checkpoints(id INTEGER PK, session TEXT, ts INTEGER,
task_id TEXT, cwd TEXT, tool_calls JSON, scrollback TEXT)`.
- Indexed on `(session, ts DESC)`; old rows pruned by `janitor`.
- Also exports a JSON snapshot per session to
`checkpoint.dir/<session>/<timestamp>.json` for out-of-band tooling.
### 2.5 janitor
Periodic housekeeping (runs every `janitor.interval`, default `5m`):
- removes stale `.md.dispatched` markers whose source task is gone,
- archives `done/` entries older than `janitor.done_horizon`,
- prunes checkpoint rows beyond `checkpoint.keep`,
- rotates the daemon's own log file when it exceeds
`log.max_size_mb`.
### 2.6 notifier
- Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
`QuotaWarning`, `TaskDispatched`) to configured sinks:
  - Telegram bot (alerts channel),
  - MCP control-plane push (HTTP callback),
  - structured log aggregator (stdout JSON).
- Each sink is a goroutine with its own bounded buffered channel.
Back-pressure is handled by dropping low-priority events
(`TaskDispatched`) and logging a `NotifierSaturated` warning.
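The drop-on-saturation behaviour can be expressed with a non-blocking send. In this sketch `Event`, `lowPriority`, and `offer` are hypothetical names, and blocking on high-priority events when the buffer is full is an assumption; the text above only specifies what happens to low-priority ones.

```go
package main

import "fmt"

// Event is a hypothetical minimal event type; Kind carries the event name.
type Event struct {
	Kind string
}

// lowPriority reports whether an event may be dropped under back-pressure.
func lowPriority(e Event) bool { return e.Kind == "TaskDispatched" }

// offer tries a non-blocking send first. If the sink buffer is full,
// low-priority events are dropped (the caller would log NotifierSaturated);
// everything else waits for room (assumption).
func offer(sink chan<- Event, e Event) (delivered bool) {
	select {
	case sink <- e:
		return true
	default:
	}
	if lowPriority(e) {
		return false
	}
	sink <- e
	return true
}

func main() {
	sink := make(chan Event, 1)
	fmt.Println(offer(sink, Event{Kind: "SwapFired"}))      // prints "true": buffer had room
	fmt.Println(offer(sink, Event{Kind: "TaskDispatched"})) // prints "false": buffer full, dropped
}
```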
### 2.7 account-switcher (state machine)
The switcher is the only goroutine allowed to mutate `State.ActiveAccount`.
It consumes `SwapRequested` events from a single-consumer channel and
transitions through an atomic state machine:
```
+--------+ SwapRequested +--------+
| normal | ---------------> | saving |
+--------+ +--------+
^ |
| v (checkpoint all sessions)
| +-----------+
| | switching |
| +-----------+
| |
| v (relaunch tmux, flip HOME)
| ResumeComplete +----------+
+---------------------- | resuming |
+----------+
```
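The diagram can be encoded as an explicit transition table so that illegal edges are rejected at run time. This is a sketch only: `SwapState`, `next`, and `advance` are hypothetical names, and the rollback-on-cancel path is not modelled.

```go
package main

import "fmt"

// SwapState enumerates the four states in the diagram.
type SwapState int

const (
	Normal SwapState = iota
	Saving
	Switching
	Resuming
)

func (s SwapState) String() string {
	return [...]string{"normal", "saving", "switching", "resuming"}[s]
}

// next lists the single legal successor of each state.
var next = map[SwapState]SwapState{
	Normal:    Saving,    // SwapRequested
	Saving:    Switching, // all sessions checkpointed
	Switching: Resuming,  // tmux relaunched, HOME flipped
	Resuming:  Normal,    // ResumeComplete
}

// advance moves from cur to want and rejects any edge not in the diagram.
func advance(cur, want SwapState) (SwapState, error) {
	if next[cur] != want {
		return cur, fmt.Errorf("illegal transition %s -> %s", cur, want)
	}
	return want, nil
}

func main() {
	s := Normal
	for _, want := range []SwapState{Saving, Switching, Resuming, Normal} {
		var err error
		if s, err = advance(s, want); err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
	_, err := advance(Normal, Resuming)
	fmt.Println(err) // prints "illegal transition normal -> resuming"
}
```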
- `normal -> saving`: halt dispatcher intake, request an immediate
checkpoint for every session, wait for ack with a deadline.
- `saving -> switching`: kill tmux sessions in reverse launch order,
repoint the `~/.claude` symlink (or per-session `HOME`) to the target
account's home directory.
- `switching -> resuming`: relaunch sessions via
`TmuxAdapter.NewSession`; each session replays its most recent
checkpoint (task pointer + cwd + truncated scrollback) so Claude Code
reopens at the same task.
- `resuming -> normal`: unblock dispatcher, start cooldown timer on
the old account, emit `SwapFired`.

Guarantees:
- Only one swap in flight at a time (the channel has capacity 1, extra
requests coalesce).
- Swap is cancellable via `context.Context`; on cancel the switcher
rolls back to the pre-swap `ActiveAccount`.
- All FS writes go through `FSAdapter` and are `*.tmp + rename` to keep
`state.json` recoverable.

See [`session-switch-analysis.md`](./session-switch-analysis.md) for
why the shared-symlink approach is required (Claude Code bug #16103)
and how resuming interactive sessions differs from resuming autonomous
ones.

---
## 3. Shared state
```go
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	Accounts      map[string]*AccountState
	Sessions      map[string]*SessionState
	Pool          PoolState
	Dispatcher    DispatcherState
	LastSwap      time.Time
	PendingSwap   bool
	StartedAt     time.Time
}

type AccountState struct {
	Name        string
	Home        string
	Priority    int
	Usage5h     float64 // 0.0 - 1.0
	UsageWeek   float64
	CooldownEnd time.Time
	LastSwapIn  time.Time
	LastSwapOut time.Time
}

type SessionState struct {
	Name       string
	Kind       string // "autonomous" | "dedicated" | "interactive" | "orchestrator"
	Project    string
	State      string // "idle" | "working" | "shell" | "stalled"
	Task       string
	AssignedAt time.Time
	Heartbeat  time.Time
}
```
- Readers take `RLock`; the account-switcher takes `Lock` for the
duration of a swap.
- All other writers enqueue on a single-writer channel
(`stateOps chan func(*State)`) consumed by a dedicated goroutine.
This mirrors the actor pattern and eliminates lock-ordering issues.
- Every mutation triggers a debounced flush of `state.json`
(`persist.debounce`, default `1s`) written atomically via
`FSAdapter`.
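The `stateOps chan func(*State)` pattern can be sketched as follows. The `State` here is a simplified stand-in with a single map, and `runStateOps` and `sessionState` are hypothetical names; the debounced flush to `state.json` is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified stand-in for the daemon's shared state.
type State struct {
	mu       sync.RWMutex
	Sessions map[string]string // session name -> state
}

// runStateOps is the single writer: it applies queued mutations one at a
// time under the write lock until ops is closed, then signals done.
func runStateOps(s *State, ops <-chan func(*State), done chan<- struct{}) {
	for op := range ops {
		s.mu.Lock()
		op(s)
		s.mu.Unlock()
	}
	close(done)
}

// sessionState is a reader; readers only ever take the read lock.
func (s *State) sessionState(name string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.Sessions[name]
}

func main() {
	st := &State{Sessions: map[string]string{}}
	ops := make(chan func(*State), 16)
	done := make(chan struct{})
	go runStateOps(st, ops, done)

	// Writers never touch the lock; they enqueue closures instead.
	ops <- func(s *State) { s.Sessions["ccl-a"] = "working" }
	ops <- func(s *State) { s.Sessions["ccl-a"] = "idle" }
	close(ops)
	<-done

	fmt.Println(st.sessionState("ccl-a")) // prints "idle"
}
```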
---
## 4. Event system
Internal events are plain Go structs on typed channels. No reflection.

| Channel           | Producer(s)                       | Consumer          | Buffer |
|-------------------|-----------------------------------|-------------------|--------|
| `swapReq`         | quota-monitor, HTTP               | account-switcher  | 1      |
| `sessionEvt`      | session-watcher, account-switcher | notifier, state   | 32     |
| `taskEvt`         | dispatcher                        | notifier, state   | 64     |
| `checkpointEvt`   | checkpoint                        | state             | 8      |
| `notifierSink[i]` | notifier                          | sink i            | 128    |

Flow for a typical swap:
```
quota-monitor --SwapRequested--> swapReq --> account-switcher
| |
| +--> state (lock)
| +--> TmuxAdapter (kill/launch)
| +--> FSAdapter (symlink flip)
| +--> sessionEvt (SwapFired)
v
notifier --> telegram/mcp/log
```
---
## 5. Adapters (mockable interfaces)
### Tmux
Thin wrapper around `exec.Command("tmux", ...)`, mockable via interface:
```go
type TmuxAdapter interface {
	HasSession(name string) (bool, error)
	NewSession(name, cwd string, env []string) error
	KillSession(name string) error
	SendKeys(name string, keys string) error
	CapturePane(name string, lines int) (string, error)
	ListSessions() ([]string, error)
}
```
- Default implementation: `ExecTmux` shelling out to the `tmux` binary.
- Test implementation: `FakeTmux` backed by an in-memory map.
- Inspired by
[`smtg-ai/claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux),
which already battle-tests pane capture and send-keys escaping.
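An in-memory fake along the lines of `FakeTmux` might look like the sketch below. It is not the project's actual test double: the internal layout is an assumption, `ListSessions` is omitted for brevity, and `SendKeys` simply records each call as one pane line.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// FakeTmux keeps sessions in memory: session name -> lines "typed" so far.
type FakeTmux struct {
	mu    sync.Mutex
	panes map[string][]string
}

func NewFakeTmux() *FakeTmux { return &FakeTmux{panes: map[string][]string{}} }

// lookup returns the pane buffer or an error for unknown sessions.
// Callers must hold f.mu.
func (f *FakeTmux) lookup(name string) ([]string, error) {
	buf, ok := f.panes[name]
	if !ok {
		return nil, fmt.Errorf("no such session: %s", name)
	}
	return buf, nil
}

func (f *FakeTmux) HasSession(name string) (bool, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	_, ok := f.panes[name]
	return ok, nil
}

func (f *FakeTmux) NewSession(name, cwd string, env []string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.panes[name]; ok {
		return fmt.Errorf("duplicate session: %s", name)
	}
	f.panes[name] = nil // cwd and env are ignored by the fake
	return nil
}

func (f *FakeTmux) KillSession(name string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	_, err := f.lookup(name)
	delete(f.panes, name)
	return err
}

func (f *FakeTmux) SendKeys(name string, keys string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	buf, err := f.lookup(name)
	if err != nil {
		return err
	}
	f.panes[name] = append(buf, keys)
	return nil
}

// CapturePane returns the last `lines` entries (all of them if lines <= 0).
func (f *FakeTmux) CapturePane(name string, lines int) (string, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	buf, err := f.lookup(name)
	if err != nil {
		return "", err
	}
	if lines > 0 && len(buf) > lines {
		buf = buf[len(buf)-lines:]
	}
	return strings.Join(buf, "\n"), nil
}

func main() {
	tm := NewFakeTmux()
	_ = tm.NewSession("ccl-a", "/tmp", nil)
	_ = tm.SendKeys("ccl-a", "first")
	_ = tm.SendKeys("ccl-a", "second")
	out, _ := tm.CapturePane("ccl-a", 1)
	fmt.Println(out) // prints "second"
}
```

A dispatcher test can then assert on `CapturePane` output instead of inspecting a real tmux server.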
### Filesystem & Claude CLI
```go
type FSAdapter interface {
	ReadFile(path string) ([]byte, error)
	WriteFileAtomic(path string, data []byte, perm os.FileMode) error
	Rename(oldp, newp string) error
	Stat(path string) (os.FileInfo, error)
	Watch(path string) (<-chan fsnotify.Event, error)
}

type ClaudeCLIAdapter interface {
	ResumeSession(sessionID, cwd string) error
	InjectPrompt(sessionID, prompt string) error
	Quota() (Usage, error)
}
```
All three interfaces have `Fake*` implementations under
`internal/testutil/` so every goroutine can be unit-tested without a
real tmux, filesystem, or Claude CLI.

---
## 6. HTTP control plane (MCP)
Bound to `127.0.0.1:8765` (config `mcp_http.listen`). Never exposed
publicly - the SecuAAS MCP gateway proxies from the host.

| Method | Path                          | Purpose                            |
|--------|-------------------------------|------------------------------------|
| GET    | `/api/v1/session_status`      | Full state snapshot                |
| GET    | `/api/v1/accounts`            | Account usage + limits             |
| GET    | `/api/v1/sessions`            | Session table                      |
| POST   | `/api/v1/trigger_dispatch`    | Force inbox scan (bearer required) |
| POST   | `/api/v1/trigger/swap`        | Force failover (bearer required)   |
| GET    | `/healthz`                    | Liveness probe                     |
- All mutating routes require the bearer token from
`mcp_http.bearer_token_env`.
- Read routes are unauthenticated but only reachable on loopback.
- Handlers are thin: they publish on the same internal channels used
by goroutines, then wait on a reply channel with a timeout.
---
## 7. Graceful shutdown
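- `main` installs `signal.Notify` handlers for `SIGTERM`, `SIGINT`, and
`SIGHUP`.
- On `SIGTERM` or `SIGINT`, the root `context.Context` is cancelled.
`SIGHUP` does not start a shutdown; it only triggers a config reload
(see section 8).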
- Goroutines drain in a defined order:
  1. HTTP server (`Shutdown(ctx)` with 5s deadline),
  2. dispatcher (stops accepting new tasks, finishes in-flight assignment),
  3. quota-monitor (cancels poll),
  4. session-watcher (flushes last state),
  5. checkpoint (forces one final checkpoint for every working session),
  6. account-switcher (refuses new swaps, completes an in-flight swap
     up to `shutdown.swap_deadline`),
  7. notifier (flushes bounded buffers),
  8. janitor (exits immediately).
- `state.json` is flushed one last time before exit.
- Exit code `0` on clean drain, `1` on deadline timeout (systemd will
restart).
---
## 8. Configuration
Single YAML file. See [`config.example.yaml`](../config.example.yaml)
for the authoritative schema. Key sections:
- `accounts[]` - ordered list of Anthropic accounts (name, home,
limits, priority).
- `pool` - dedicated sessions, autonomous pool prefix and min/max,
shared projects dir.
- `quota` - poll interval, 5h/weekly thresholds.
- `checkpoint` - interval, dir, db, keep count.
- `janitor` - interval, horizons.
- `mcp_http` - listen addr, bearer token env var.
- `notifier` - telegram, mcp callback, log sinks.

The config is loaded once at startup. `SIGHUP` triggers a re-parse;
only safe fields (thresholds, intervals, notifier sinks) are
hot-reloaded. Structural changes (accounts, pool) require a restart.

---
## 9. Migration plan
The existing orchestrator is a set of bash scripts under
`agent-orchestrator/` (`graceful-switch.sh`, `launch-agent.sh`,
`setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`,
`checkpoint-daemon.sh`, `smart-dispatcher.sh`). Cutover is staged to
minimize risk.
### Phase 1 - Daemon wrapper (weeks 1-2)
- Build `claude-failover` as a supervisor that `exec.Command`s the
existing `.sh` scripts.
- Goroutines are skeletons that call scripts:
  - dispatcher -> `smart-dispatcher.sh`,
  - account-switcher -> `graceful-switch.sh <target>`,
  - session-watcher -> parses `watchdog.sh --json` output,
  - checkpoint -> calls `checkpoint-daemon.sh --once`.
- Shared state, event system, HTTP control plane, and config are
already native Go.
- Systemd unit replaces the existing per-script cron/tmux launchers.
### Phase 2 - Native rewrite (weeks 3-6)
- Replace each shell wrapper with a native Go implementation one
goroutine at a time, in this order: session-watcher -> checkpoint ->
dispatcher -> account-switcher.
- Shell scripts remain as a fallback until each native path has run a
full week without regression.
### Phase 3 - Removal (week 7+)
- Delete `agent-orchestrator/*.sh` once Phase 2 is green.
- Keep the TUI (`agent-orchestrator/tui/`) as a read-only client of
the daemon's HTTP API.
---
## 10. Testing strategy
- Adapters (`TmuxAdapter`, `FSAdapter`, `ClaudeCLIAdapter`) are
interfaces with in-memory fakes under `internal/testutil/`.
- Unit tests cover every goroutine in isolation:
  - dispatcher: feed fake fsnotify events, assert tmux `SendKeys`
    calls.
  - quota-monitor: feed canned pane captures, assert `SwapRequested`.
  - account-switcher: drive the state machine through a full swap,
    including cancellation and rollback.
- Integration tests boot the full daemon with fake adapters and a
`httptest.Server` on the MCP port. A golden-file test compares
`state.json` after a scripted sequence.
- Race detector mandatory in CI (`go test -race ./...`).
- An `e2e/` suite runs against a real tmux + real Claude Code CLI in a
container, executed nightly.
---
## 11. Systemd unit
```ini
# /etc/systemd/system/claude-failover.service
[Unit]
Description=Claude Code failover & orchestration daemon
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu
Environment=CLAUDE_FAILOVER_CONFIG=/etc/claude-failover/config.yaml
ExecStart=/usr/local/bin/claude-failover --config ${CLAUDE_FAILOVER_CONFIG}
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
TimeoutStopSec=30
LimitNOFILE=65536
# State & logs
StateDirectory=claude-failover
LogsDirectory=claude-failover
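# Hardening (tmux needs TTY access, keep minimal)
NoNewPrivileges=true
ProtectSystem=strict
# tmux creates its server socket under $TMUX_TMPDIR (default /tmp), which
# ProtectSystem=strict mounts read-only unless it is listed below.
ReadWritePaths=/var/lib/claude-failover /var/log/claude-failover /home/ubuntu/.claude /home/ubuntu/projects /tmp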
[Install]
WantedBy=multi-user.target
```
`Type=notify` requires the daemon to call `sd_notify(READY=1)` once all
goroutines have registered and the HTTP server is listening. Use
[`github.com/coreos/go-systemd/v22/daemon`](https://pkg.go.dev/github.com/coreos/go-systemd/v22/daemon).