From 65c25c895568cbd4010988c434a9f918b888b5eb Mon Sep 17 00:00:00 2001 From: claude-agent Date: Tue, 14 Apr 2026 13:36:20 +0000 Subject: [PATCH] docs: enrich architecture with Go implementation details --- docs/architecture.md | 493 ++++++++++++++++++++++++++++++++++++------- 1 file changed, 413 insertions(+), 80 deletions(-) diff --git a/docs/architecture.md b/docs/architecture.md index 4782ddf..5d0b944 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -6,120 +6,453 @@ through typed channels and a shared `State` struct guarded by a `sync.RWMutex`. A single-writer discipline is enforced: only the **account-switcher** may mutate the active-account field. -## Goroutines +> This document enriches the high-level design with concrete Go +> implementation notes distilled from the current shell-based +> `agent-orchestrator` (see `graceful-switch.sh`, `launch-agent.sh`, +> `setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`, +> `checkpoint-daemon.sh`) and the companion TUI written in Go +> (`agent-orchestrator/tui/*.go`). The daemon is the consolidation of +> those scripts into a single, testable process. -### dispatcher +--- -Watches `.agent-queue/inbox/` for every registered project (inotify on -Linux) and pairs each incoming task with an idle session from the pool. -It respects: +## 1. Process model -- per-project priority -- agent capability tags declared in the task frontmatter -- the `needs_claude_code: true` bypass flag -- dispatcher-level cooldowns to avoid flooding a freshly-launched session +- Single static binary: `claude-failover` (package main in `cmd/claude-failover`). +- One OS process, many goroutines. No child daemons. +- Runs under systemd (`User=ubuntu`, `Restart=always`). +- Persists state at `/var/lib/claude-failover/state.json` (atomic write). +- Listens locally on `127.0.0.1:8765` for MCP HTTP control. +- Single YAML config (`/etc/claude-failover/config.yaml`), see + [`config.example.yaml`](../config.example.yaml). -On successful assignment it renames `.md` to `.md.dispatched` -and writes a pointer into the target session's tmux prompt. +--- -### quota-monitor +## 2. Goroutines -Polls Anthropic usage counters for every configured account. Sources: +Each goroutine is started from `main` with a `context.Context` derived +from the root context. All goroutines obey cancellation and drain their +pending work before returning. -1. Claude Code's local telemetry files under `~/.claude/statsig/` and - `~/.claude/projects/*.jsonl` (message timestamps). -2. Optional: a reverse-engineered `/api/quota` endpoint if available. +### 2.1 dispatcher -It computes two sliding windows (5h, 1 week) and emits a `swap-requested` -event once thresholds in the config are crossed. +- Watches every registered project's `.agent-queue/inbox/` directory via + [`fsnotify`](https://pkg.go.dev/github.com/fsnotify/fsnotify). +- For each new `*.md` (ignoring `*.dispatched` and `*.dispatch-meta`), + parses YAML frontmatter (title, priority, tags, `needs_claude_code`, + `capabilities`). +- Selects an idle session from the pool based on: + - per-project priority, + - agent capability tags, + - `needs_claude_code: true` bypass (direct to Claude Code, skipping + the GPU triage path inherited from the current dispatcher), + - dispatcher-level cooldowns to avoid flooding a freshly-launched + session. +- On successful assignment, renames `.md` to `.md.dispatched` + atomically and writes a pointer into the target session's tmux prompt + via `TmuxAdapter.SendKeys`. +- Emits a `TaskDispatched` event on the notifier channel. -### session-watcher +Migration note: Phase 1 wraps the existing `smart-dispatcher.sh` +behind `exec.Command` and only takes over the fsnotify layer; Phase 2 +re-implements dispatch selection natively in Go. -Keeps a table of tmux sessions (`ccl-*`). For each one it tracks: +### 2.2 quota-monitor -- process liveness (via `tmux has-session`) -- heartbeat timestamp from `.agent-queue/status.json` -- current `state` field (idle / working / stalled) +- Polls every account's usage counters on a configurable interval + (`quota.poll_interval`, default `30s`). +- Two data sources, merged: + 1. **Tmux pane scrape** - runs `tmux capture-pane -p -t ` + and parses the "Claude usage" footer lines emitted by Claude Code. + Implementation reuses the tmux wrapper from + [`claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux). + 2. **Local telemetry** - scans `~/.claude/statsig/` and + `~/.claude/projects/*.jsonl` for message timestamps, computes + sliding windows (5h, 1 week). +- Also reads `SessionStart` / `SessionEnd` / `PreToolUse` hook events + written by Claude Code into `~/.claude/hooks-audit.log` (opt-in via + `settings.json`) for precise counters. +- Emits `QuotaWarning` and `SwapRequested` events when thresholds in + the config are crossed. -Stalled sessions (heartbeat older than N minutes while `state=working`) -raise an alert on the notifier channel and become candidates for a -forced restart. +### 2.3 session-watcher -### checkpoint +- Maintains an in-memory table keyed by tmux session name (`ccl-*`). +- For each session tracks: + - process liveness via `tmux has-session -t ` (exit code), + - heartbeat timestamp parsed from the project's + `.agent-queue/status.json`, + - current `state` field (`idle` / `working` / `shell`), + - last assigned task id and deadline. +- Sessions whose heartbeat is older than `watchdog.stall_after` while + `state == "working"` raise a `SessionStalled` alert and become + candidates for a forced restart (via `account-switcher` or + `launch-agent` re-run). +- Replaces `watchdog.sh` entirely once Phase 2 ships. -Every `checkpoint.interval`, serializes per-session context: +### 2.4 checkpoint -- current task id -- last recorded tool call (name + truncated args) -- cwd as reported by the session -- the last N lines of the session's scrollback +- On `checkpoint.interval` (default `30m`, a.k.a. the Anthropic 5h + window / 10 checkpoints), serializes per-session context: + - current task id, + - last recorded tool call (name + truncated args), + - cwd as reported by the session, + - the last N lines of the session's scrollback + (`tmux capture-pane -pS -`). +- Persists to SQLite at `checkpoint.db` (default + `/var/lib/claude-failover/checkpoints.db`) with retention pruning + based on the 30-minute rolling window. + - Schema: `checkpoints(id INTEGER PK, session TEXT, ts INTEGER, + task_id TEXT, cwd TEXT, tool_calls JSON, scrollback TEXT)`. + - Indexed on `(session, ts DESC)`; old rows pruned by `janitor`. +- Also exports a JSON snapshot per session to + `checkpoint.dir//.json` for out-of-band tooling. -Files are written atomically (`*.tmp` + rename) to -`checkpoint.dir//.json` and pruned to -`checkpoint.keep` entries. +### 2.5 janitor -### janitor +Periodic housekeeping (runs every `janitor.interval`, default `5m`): -Periodic housekeeping: +- removes stale `.md.dispatched` markers whose source task is gone, +- archives `done/` entries older than `janitor.done_horizon`, +- prunes checkpoint rows beyond `checkpoint.keep`, +- rotates the daemon's own log file when it exceeds + `log.max_size_mb`. -- removes stale `.md.dispatched` markers whose source task is gone -- archives `done/` older than a configurable horizon -- prunes expired checkpoints -- rotates the daemon's own log file when it exceeds a size threshold +### 2.6 notifier -### notifier +- Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`, + `QuotaWarning`, `TaskDispatched`) to configured sinks: + - Telegram bot (alerts channel), + - MCP control-plane push (HTTP callback), + - structured log aggregator (stdout JSON). +- Each sink is a goroutine with its own bounded buffered channel. + Back-pressure is handled by dropping low-priority events + (`TaskDispatched`) and logging a `NotifierSaturated` warning. -Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`, -`QuotaWarning`) to configured sinks: +### 2.7 account-switcher (state machine) -- Telegram bot (alerts channel) -- MCP control-plane push -- stdout / structured log aggregator +The switcher is the only goroutine allowed to mutate `State.ActiveAccount`. +It consumes `SwapRequested` events from a single-consumer channel and +transitions through an atomic state machine: -### account-switcher +``` + +--------+ SwapRequested +--------+ + | normal | ---------------> | saving | + +--------+ +--------+ + ^ | + | v (checkpoint all sessions) + | +-----------+ + | | switching | + | +-----------+ + | | + | v (relaunch tmux, flip HOME) + | ResumeComplete +----------+ + +---------------------- | resuming | + +----------+ +``` -Serializes all account swaps behind a single mutex. Swap protocol: +- `normal -> saving`: halt dispatcher intake, request an immediate + checkpoint for every session, wait for ack with a deadline. +- `saving -> switching`: kill tmux sessions in reverse launch order, + repoint the `~/.claude` symlink (or per-session `HOME`) to the target + account's home directory. +- `switching -> resuming`: relaunch sessions via + `TmuxAdapter.NewSession`; each session replays its most recent + checkpoint (task pointer + cwd + truncated scrollback) so Claude Code + reopens at the same task. +- `resuming -> normal`: unblock dispatcher, start cooldown timer on + the old account, emit `SwapFired`. -1. mark active account as `draining` -2. tell each session to flush its current tool call and checkpoint -3. stop tmux sessions in reverse launch order -4. repoint the `~/.claude` symlink (or equivalent per-session HOME) to - the target account's home directory -5. relaunch sessions; replay the latest checkpoint so each session - reopens the same project and task pointer -6. mark the new account `active`, start the cooldown timer on the old one +Guarantees: -See [`session-switch-analysis.md`](./session-switch-analysis.md) for why -the shared-symlink approach is required (Claude Code bug #16103). +- Only one swap in flight at a time (the channel has capacity 1, extra + requests coalesce). +- Swap is cancellable via `context.Context`; on cancel the switcher + rolls back to the pre-swap `ActiveAccount`. +- All FS writes go through `FSAdapter` and are `*.tmp + rename` to keep + `state.json` recoverable. -## Shared state +See [`session-switch-analysis.md`](./session-switch-analysis.md) for +why the shared-symlink approach is required (Claude Code bug #16103) +and how resuming interactive sessions differs from resuming autonomous +ones. + +--- + +## 3. Shared state ```go type State struct { - mu sync.RWMutex - ActiveAccount string - Accounts map[string]*AccountState - Sessions map[string]*SessionState - LastSwap time.Time - PendingSwap bool + mu sync.RWMutex + ActiveAccount string + Accounts map[string]*AccountState + Sessions map[string]*SessionState + Pool PoolState + Dispatcher DispatcherState + LastSwap time.Time + PendingSwap bool + StartedAt time.Time +} + +type AccountState struct { + Name string + Home string + Priority int + Usage5h float64 // 0.0 - 1.0 + UsageWeek float64 + CooldownEnd time.Time + LastSwapIn time.Time + LastSwapOut time.Time +} + +type SessionState struct { + Name string + Kind string // "autonomous" | "dedicated" | "interactive" | "orchestrator" + Project string + State string // "idle" | "working" | "shell" | "stalled" + Task string + AssignedAt time.Time + Heartbeat time.Time } ``` -Readers take `RLock`; the account-switcher takes `Lock` for the duration -of a swap. All other writers go through a single-writer channel owned by -the switcher, which guarantees swap atomicity. +- Readers take `RLock`; the account-switcher takes `Lock` for the + duration of a swap. +- All other writers enqueue on a single-writer channel + (`stateOps chan func(*State)`) consumed by a dedicated goroutine. + This mirrors the actor pattern and eliminates lock-ordering issues. +- Every mutation triggers a debounced flush of `state.json` + (`persist.debounce`, default `1s`) written atomically via + `FSAdapter`. -## HTTP control plane +--- -The daemon exposes a small HTTP server (`mcp_http.listen`) consumed by -the SecuAAS MCP gateway. Routes: +## 4. Event system -| Method | Path | Purpose | -|--------|-----------------------|--------------------------------| -| GET | `/status` | Full state snapshot | -| GET | `/accounts` | Account usage + limits | -| GET | `/sessions` | Session table | -| POST | `/trigger/swap` | Force failover (requires bearer) | -| POST | `/trigger/dispatch` | Force inbox scan | +Internal events are plain Go structs on typed channels. No reflection. -All routes require the bearer token from `mcp_http.bearer_token_env`. +| Channel | Producer(s) | Consumer | Buffer | +|-------------------|-----------------------|-------------------|--------| +| `swapReq` | quota-monitor, HTTP | account-switcher | 1 | +| `sessionEvt` | session-watcher | notifier, state | 32 | +| `taskEvt` | dispatcher | notifier, state | 64 | +| `checkpointEvt` | checkpoint | state | 8 | +| `notifierSink[i]` | notifier | sink i | 128 | + +Flow for a typical swap: + +``` +quota-monitor --SwapRequested--> swapReq --> account-switcher + | | + | +--> state (lock) + | +--> TmuxAdapter (kill/launch) + | +--> FSAdapter (symlink flip) + | +--> sessionEvt (SwapFired) + v + notifier --> telegram/mcp/log +``` + +--- + +## 5. Adapters (mockable interfaces) + +### Tmux + +Thin wrapper around `exec.Command("tmux", ...)`, mockable via interface: + +```go +type TmuxAdapter interface { + HasSession(name string) (bool, error) + NewSession(name, cwd string, env []string) error + KillSession(name string) error + SendKeys(name string, keys string) error + CapturePane(name string, lines int) (string, error) + ListSessions() ([]string, error) +} +``` + +- Default implementation: `ExecTmux` shelling out to the `tmux` binary. +- Test implementation: `FakeTmux` backed by an in-memory map. +- Inspired by + [`smtg-ai/claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux), + which already battle-tests pane capture and send-keys escaping. + +### Filesystem & Claude CLI + +```go +type FSAdapter interface { + ReadFile(path string) ([]byte, error) + WriteFileAtomic(path string, data []byte, perm os.FileMode) error + Rename(oldp, newp string) error + Stat(path string) (os.FileInfo, error) + Watch(path string) (<-chan fsnotify.Event, error) +} + +type ClaudeCLIAdapter interface { + ResumeSession(sessionID, cwd string) error + InjectPrompt(sessionID, prompt string) error + Quota() (Usage, error) +} +``` + +All three interfaces have `Fake*` implementations under +`internal/testutil/` so every goroutine can be unit-tested without a +real tmux, filesystem, or Claude CLI. + +--- + +## 6. HTTP control plane (MCP) + +Bound to `127.0.0.1:8765` (config `mcp_http.listen`). Never exposed +publicly - the SecuAAS MCP gateway proxies from the host. + +| Method | Path | Purpose | +|--------|-------------------------------|----------------------------------| +| GET | `/api/v1/session_status` | Full state snapshot | +| GET | `/api/v1/accounts` | Account usage + limits | +| GET | `/api/v1/sessions` | Session table | +| POST | `/api/v1/trigger_dispatch` | Force inbox scan | +| POST | `/api/v1/trigger/swap` | Force failover (bearer required) | +| GET | `/healthz` | Liveness probe | + +- All mutating routes require the bearer token from + `mcp_http.bearer_token_env`. +- Read routes are unauthenticated but only reachable on loopback. +- Handlers are thin: they publish on the same internal channels used + by goroutines, then wait on a reply channel with a timeout. + +--- + +## 7. Graceful shutdown + +- `main` installs a `signal.Notify` for `SIGTERM`, `SIGINT`, `SIGHUP`. +- On signal, the root `context.Context` is cancelled. +- Goroutines drain in a defined order: + 1. HTTP server (`Shutdown(ctx)` with 5s deadline), + 2. dispatcher (stops accepting new tasks, finishes in-flight assignment), + 3. quota-monitor (cancels poll), + 4. session-watcher (flushes last state), + 5. checkpoint (forces one final checkpoint for every working session), + 6. account-switcher (refuses new swaps, completes an in-flight swap + up to `shutdown.swap_deadline`), + 7. notifier (flushes bounded buffers), + 8. janitor (exits immediately). +- `state.json` is flushed one last time before exit. +- Exit code `0` on clean drain, `1` on deadline timeout (systemd will + restart). + +--- + +## 8. Configuration + +Single YAML file. See [`config.example.yaml`](../config.example.yaml) +for the authoritative schema. Key sections: + +- `accounts[]` - ordered list of Anthropic accounts (name, home, + limits, priority). +- `pool` - dedicated sessions, autonomous pool prefix and min/max, + shared projects dir. +- `quota` - poll interval, 5h/weekly thresholds. +- `checkpoint` - interval, dir, db, keep count. +- `janitor` - interval, horizons. +- `mcp_http` - listen addr, bearer token env var. +- `notifier` - telegram, mcp callback, log sinks. + +The config is loaded once at startup. `SIGHUP` triggers a re-parse; +only safe fields (thresholds, intervals, notifier sinks) are +hot-reloaded. Structural changes (accounts, pool) require a restart. + +--- + +## 9. Migration plan + +The existing orchestrator is a set of bash scripts under +`agent-orchestrator/` (`graceful-switch.sh`, `launch-agent.sh`, +`setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`, +`checkpoint-daemon.sh`, `smart-dispatcher.sh`). Cutover is staged to +minimize risk. + +### Phase 1 - Daemon wrapper (weeks 1-2) + +- Build `claude-failover` as a supervisor that `exec.Command`s the + existing `.sh` scripts. +- Goroutines are skeletons that call scripts: + - dispatcher -> `smart-dispatcher.sh`, + - account-switcher -> `graceful-switch.sh `, + - session-watcher -> parses `watchdog.sh --json` output, + - checkpoint -> calls `checkpoint-daemon.sh --once`. +- Shared state, event system, HTTP control plane, and config are + already native Go. +- Systemd unit replaces the existing per-script cron/tmux launchers. + +### Phase 2 - Native rewrite (weeks 3-6) + +- Replace each shell wrapper with a native Go implementation one + goroutine at a time, in this order: session-watcher -> checkpoint -> + dispatcher -> account-switcher. +- Shell scripts remain as a fallback until each native path has run a + full week without regression. + +### Phase 3 - Removal (week 7+) + +- Delete `agent-orchestrator/*.sh` once Phase 2 is green. +- Keep the TUI (`agent-orchestrator/tui/`) as a read-only client of + the daemon's HTTP API. + +--- + +## 10. Testing strategy + +- Adapters (`TmuxAdapter`, `FSAdapter`, `ClaudeCLIAdapter`) are + interfaces with in-memory fakes under `internal/testutil/`. +- Unit tests cover every goroutine in isolation: + - dispatcher: feed fake fsnotify events, assert tmux `SendKeys` + calls. + - quota-monitor: feed canned pane captures, assert `SwapRequested`. + - account-switcher: drive the state machine through a full swap, + including cancellation and rollback. +- Integration tests boot the full daemon with fake adapters and a + `httptest.Server` on the MCP port. A golden-file test compares + `state.json` after a scripted sequence. +- Race detector mandatory in CI (`go test -race ./...`). +- An `e2e/` suite runs against a real tmux + real Claude Code CLI in a + container, executed nightly. + +--- + +## 11. Systemd unit + +```ini +# /etc/systemd/system/claude-failover.service +[Unit] +Description=Claude Code failover & orchestration daemon +After=network-online.target +Wants=network-online.target + +[Service] +Type=notify +User=ubuntu +Group=ubuntu +WorkingDirectory=/home/ubuntu +Environment=CLAUDE_FAILOVER_CONFIG=/etc/claude-failover/config.yaml +ExecStart=/usr/local/bin/claude-failover --config ${CLAUDE_FAILOVER_CONFIG} +ExecReload=/bin/kill -HUP $MAINPID +Restart=always +RestartSec=3 +TimeoutStopSec=30 +LimitNOFILE=65536 +# State & logs +StateDirectory=claude-failover +LogsDirectory=claude-failover +# Hardening (tmux needs TTY access, keep minimal) +NoNewPrivileges=true +ProtectSystem=strict +ReadWritePaths=/var/lib/claude-failover /var/log/claude-failover /home/ubuntu/.claude /home/ubuntu/projects + +[Install] +WantedBy=multi-user.target +``` + +`Type=notify` requires the daemon to call `sd_notify(READY=1)` once all +goroutines have registered and the HTTP server is listening. Use +[`github.com/coreos/go-systemd/v22/daemon`](https://pkg.go.dev/github.com/coreos/go-systemd/v22/daemon).