2026-04-14 13:29:24 +00:00
|
|
|
# Architecture
|
|
|
|
|
|
|
|
|
|
`claude-failover` is a single Go binary structured as a set of cooperating
|
|
|
|
|
goroutines. Each goroutine owns a narrow responsibility and communicates
|
|
|
|
|
through typed channels and a shared `State` struct guarded by a
|
|
|
|
|
`sync.RWMutex`. A single-writer discipline is enforced: only the
|
|
|
|
|
**account-switcher** may mutate the active-account field.
|
|
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
> This document enriches the high-level design with concrete Go
|
|
|
|
|
> implementation notes distilled from the current shell-based
|
|
|
|
|
> `agent-orchestrator` (see `graceful-switch.sh`, `launch-agent.sh`,
|
|
|
|
|
> `setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`,
|
|
|
|
|
> `checkpoint-daemon.sh`) and the companion TUI written in Go
|
|
|
|
|
> (`agent-orchestrator/tui/*.go`). The daemon is the consolidation of
|
|
|
|
|
> those scripts into a single, testable process.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 1. Process model
|
|
|
|
|
|
|
|
|
|
- Single static binary: `claude-failover` (package main in `cmd/claude-failover`).
|
|
|
|
|
- One OS process, many goroutines. No child daemons.
|
|
|
|
|
- Runs under systemd (`User=ubuntu`, `Restart=always`).
|
|
|
|
|
- Persists state at `/var/lib/claude-failover/state.json` (atomic write).
|
|
|
|
|
- Listens locally on `127.0.0.1:8765` for MCP HTTP control.
|
|
|
|
|
- Single YAML config (`/etc/claude-failover/config.yaml`), see
|
|
|
|
|
[`config.example.yaml`](../config.example.yaml).
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 2. Goroutines
|
|
|
|
|
|
|
|
|
|
Each goroutine is started from `main` with a `context.Context` derived
|
|
|
|
|
from the root context. All goroutines obey cancellation and drain their
|
|
|
|
|
pending work before returning.
|
|
|
|
|
|
|
|
|
|
### 2.1 dispatcher
|
|
|
|
|
|
|
|
|
|
- Watches every registered project's `.agent-queue/inbox/` directory via
|
|
|
|
|
[`fsnotify`](https://pkg.go.dev/github.com/fsnotify/fsnotify).
|
|
|
|
|
- For each new `*.md` (ignoring `*.dispatched` and `*.dispatch-meta`),
|
|
|
|
|
parses YAML frontmatter (title, priority, tags, `needs_claude_code`,
|
|
|
|
|
`capabilities`).
|
|
|
|
|
- Selects an idle session from the pool based on:
|
|
|
|
|
- per-project priority,
|
|
|
|
|
- agent capability tags,
|
|
|
|
|
- `needs_claude_code: true` bypass (direct to Claude Code, skipping
|
|
|
|
|
the GPU triage path inherited from the current dispatcher),
|
|
|
|
|
- dispatcher-level cooldowns to avoid flooding a freshly-launched
|
|
|
|
|
session.
|
|
|
|
|
- On successful assignment, renames `<task>.md` to `<task>.md.dispatched`
|
|
|
|
|
atomically and writes a pointer into the target session's tmux prompt
|
|
|
|
|
via `TmuxAdapter.SendKeys`.
|
|
|
|
|
- Emits a `TaskDispatched` event on the notifier channel.
|
|
|
|
|
|
|
|
|
|
Migration note: Phase 1 wraps the existing `smart-dispatcher.sh`
|
|
|
|
|
behind `exec.Command` and only takes over the fsnotify layer; Phase 2
|
|
|
|
|
re-implements dispatch selection natively in Go.
|
|
|
|
|
|
|
|
|
|
### 2.2 quota-monitor
|
|
|
|
|
|
|
|
|
|
- Polls every account's usage counters on a configurable interval
|
|
|
|
|
(`quota.poll_interval`, default `30s`).
|
|
|
|
|
- Two data sources, merged:
|
|
|
|
|
1. **Tmux pane scrape** - runs `tmux capture-pane -p -t <session>`
|
|
|
|
|
and parses the "Claude usage" footer lines emitted by Claude Code.
|
|
|
|
|
Implementation reuses the tmux wrapper from
|
|
|
|
|
[`claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux).
|
|
|
|
|
2. **Local telemetry** - scans `~/.claude/statsig/` and
|
|
|
|
|
`~/.claude/projects/*.jsonl` for message timestamps, computes
|
|
|
|
|
sliding windows (5h, 1 week).
|
|
|
|
|
- Also reads `SessionStart` / `SessionEnd` / `PreToolUse` hook events
|
|
|
|
|
written by Claude Code into `~/.claude/hooks-audit.log` (opt-in via
|
|
|
|
|
`settings.json`) for precise counters.
|
|
|
|
|
- Emits `QuotaWarning` and `SwapRequested` events when thresholds in
|
|
|
|
|
the config are crossed.
|
|
|
|
|
|
|
|
|
|
### 2.3 session-watcher
|
|
|
|
|
|
|
|
|
|
- Maintains an in-memory table keyed by tmux session name (`ccl-*`).
|
|
|
|
|
- For each session tracks:
|
|
|
|
|
- process liveness via `tmux has-session -t <name>` (exit code),
|
|
|
|
|
- heartbeat timestamp parsed from the project's
|
|
|
|
|
`.agent-queue/status.json`,
|
|
|
|
|
- current `state` field (`idle` / `working` / `shell`),
|
|
|
|
|
- last assigned task id and deadline.
|
|
|
|
|
- Sessions whose heartbeat is older than `watchdog.stall_after` while
|
|
|
|
|
`state == "working"` raise a `SessionStalled` alert and become
|
|
|
|
|
candidates for a forced restart (via `account-switcher` or
|
|
|
|
|
`launch-agent` re-run).
|
|
|
|
|
- Replaces `watchdog.sh` entirely once Phase 2 ships.
|
|
|
|
|
|
|
|
|
|
### 2.4 checkpoint
|
|
|
|
|
|
|
|
|
|
- On `checkpoint.interval` (default `30m`, a.k.a. the Anthropic 5h
|
|
|
|
|
window / 10 checkpoints), serializes per-session context:
|
|
|
|
|
- current task id,
|
|
|
|
|
- last recorded tool call (name + truncated args),
|
|
|
|
|
- cwd as reported by the session,
|
|
|
|
|
- the last N lines of the session's scrollback
|
|
|
|
|
(`tmux capture-pane -pS -<N>`).
|
|
|
|
|
- Persists to SQLite at `checkpoint.db` (default
|
|
|
|
|
`/var/lib/claude-failover/checkpoints.db`) with retention pruning
|
|
|
|
|
based on the 30-minute rolling window.
|
|
|
|
|
- Schema: `checkpoints(id INTEGER PK, session TEXT, ts INTEGER,
|
|
|
|
|
task_id TEXT, cwd TEXT, tool_calls JSON, scrollback TEXT)`.
|
|
|
|
|
- Indexed on `(session, ts DESC)`; old rows pruned by `janitor`.
|
|
|
|
|
- Also exports a JSON snapshot per session to
|
|
|
|
|
`checkpoint.dir/<session>/<timestamp>.json` for out-of-band tooling.
|
|
|
|
|
|
|
|
|
|
### 2.5 janitor
|
|
|
|
|
|
|
|
|
|
Periodic housekeeping (runs every `janitor.interval`, default `5m`):
|
|
|
|
|
|
|
|
|
|
- removes stale `.md.dispatched` markers whose source task is gone,
|
|
|
|
|
- archives `done/` entries older than `janitor.done_horizon`,
|
|
|
|
|
- prunes checkpoint rows beyond `checkpoint.keep`,
|
|
|
|
|
- rotates the daemon's own log file when it exceeds
|
|
|
|
|
`log.max_size_mb`.
|
|
|
|
|
|
|
|
|
|
### 2.6 notifier
|
|
|
|
|
|
|
|
|
|
- Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
|
|
|
|
|
`QuotaWarning`, `TaskDispatched`) to configured sinks:
|
|
|
|
|
- Telegram bot (alerts channel),
|
|
|
|
|
- MCP control-plane push (HTTP callback),
|
|
|
|
|
- structured log aggregator (stdout JSON).
|
|
|
|
|
- Each sink is a goroutine with its own bounded buffered channel.
|
|
|
|
|
Back-pressure is handled by dropping low-priority events
|
|
|
|
|
(`TaskDispatched`) and logging a `NotifierSaturated` warning.
|
|
|
|
|
|
|
|
|
|
### 2.7 account-switcher (state machine)
|
|
|
|
|
|
|
|
|
|
The switcher is the only goroutine allowed to mutate `State.ActiveAccount`.
|
|
|
|
|
It consumes `SwapRequested` events from a single-consumer channel and
|
|
|
|
|
transitions through an atomic state machine:
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
```
|
|
|
|
|
+--------+ SwapRequested +--------+
|
|
|
|
|
| normal | ---------------> | saving |
|
|
|
|
|
+--------+ +--------+
|
|
|
|
|
^ |
|
|
|
|
|
| v (checkpoint all sessions)
|
|
|
|
|
| +-----------+
|
|
|
|
|
| | switching |
|
|
|
|
|
| +-----------+
|
|
|
|
|
| |
|
|
|
|
|
| v (relaunch tmux, flip HOME)
|
|
|
|
|
| ResumeComplete +----------+
|
|
|
|
|
+---------------------- | resuming |
|
|
|
|
|
+----------+
|
|
|
|
|
```
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
- `normal -> saving`: halt dispatcher intake, request an immediate
|
|
|
|
|
checkpoint for every session, wait for ack with a deadline.
|
|
|
|
|
- `saving -> switching`: kill tmux sessions in reverse launch order,
|
|
|
|
|
repoint the `~/.claude` symlink (or per-session `HOME`) to the target
|
|
|
|
|
account's home directory.
|
|
|
|
|
- `switching -> resuming`: relaunch sessions via
|
|
|
|
|
`TmuxAdapter.NewSession`; each session replays its most recent
|
|
|
|
|
checkpoint (task pointer + cwd + truncated scrollback) so Claude Code
|
|
|
|
|
reopens at the same task.
|
|
|
|
|
- `resuming -> normal`: unblock dispatcher, start cooldown timer on
|
|
|
|
|
the old account, emit `SwapFired`.
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
Guarantees:
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
- Only one swap in flight at a time (the channel has capacity 1, extra
|
|
|
|
|
requests coalesce).
|
|
|
|
|
- Swap is cancellable via `context.Context`; on cancel the switcher
|
|
|
|
|
rolls back to the pre-swap `ActiveAccount`.
|
|
|
|
|
- All FS writes go through `FSAdapter` and are `*.tmp + rename` to keep
|
|
|
|
|
`state.json` recoverable.
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
See [`session-switch-analysis.md`](./session-switch-analysis.md) for
|
|
|
|
|
why the shared-symlink approach is required (Claude Code bug #16103)
|
|
|
|
|
and how resuming interactive sessions differs from resuming autonomous
|
|
|
|
|
ones.
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
---
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
## 3. Shared state
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
```go
|
|
|
|
|
type State struct {
|
|
|
|
|
mu sync.RWMutex
|
|
|
|
|
ActiveAccount string
|
|
|
|
|
Accounts map[string]*AccountState
|
|
|
|
|
Sessions map[string]*SessionState
|
|
|
|
|
Pool PoolState
|
|
|
|
|
Dispatcher DispatcherState
|
|
|
|
|
LastSwap time.Time
|
|
|
|
|
PendingSwap bool
|
|
|
|
|
StartedAt time.Time
|
|
|
|
|
}
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
type AccountState struct {
|
|
|
|
|
Name string
|
|
|
|
|
Home string
|
|
|
|
|
Priority int
|
|
|
|
|
Usage5h float64 // 0.0 - 1.0
|
|
|
|
|
UsageWeek float64
|
|
|
|
|
CooldownEnd time.Time
|
|
|
|
|
LastSwapIn time.Time
|
|
|
|
|
LastSwapOut time.Time
|
|
|
|
|
}
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
type SessionState struct {
|
|
|
|
|
Name string
|
|
|
|
|
Kind string // "autonomous" | "dedicated" | "interactive" | "orchestrator"
|
|
|
|
|
Project string
|
|
|
|
|
State string // "idle" | "working" | "shell" | "stalled"
|
|
|
|
|
Task string
|
|
|
|
|
AssignedAt time.Time
|
|
|
|
|
Heartbeat time.Time
|
|
|
|
|
}
|
|
|
|
|
```
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
- Readers take `RLock`; the account-switcher takes `Lock` for the
|
|
|
|
|
duration of a swap.
|
|
|
|
|
- All other writers enqueue on a single-writer channel
|
|
|
|
|
(`stateOps chan func(*State)`) consumed by a dedicated goroutine.
|
|
|
|
|
This mirrors the actor pattern and eliminates lock-ordering issues.
|
|
|
|
|
- Every mutation triggers a debounced flush of `state.json`
|
|
|
|
|
(`persist.debounce`, default `1s`) written atomically via
|
|
|
|
|
`FSAdapter`.
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
---
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
## 4. Event system
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
Internal events are plain Go structs on typed channels. No reflection.
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
| Channel | Producer(s) | Consumer | Buffer |
|
|
|
|
|
|-------------------|-----------------------|-------------------|--------|
|
|
|
|
|
| `swapReq` | quota-monitor, HTTP | account-switcher | 1 |
|
|
|
|
|
| `sessionEvt` | session-watcher | notifier, state | 32 |
|
|
|
|
|
| `taskEvt` | dispatcher | notifier, state | 64 |
|
|
|
|
|
| `checkpointEvt` | checkpoint | state | 8 |
|
|
|
|
|
| `notifierSink[i]` | notifier | sink i | 128 |
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
Flow for a typical swap:
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
```
|
|
|
|
|
quota-monitor --SwapRequested--> swapReq --> account-switcher
|
|
|
|
|
| |
|
|
|
|
|
| +--> state (lock)
|
|
|
|
|
| +--> TmuxAdapter (kill/launch)
|
|
|
|
|
| +--> FSAdapter (symlink flip)
|
|
|
|
|
| +--> sessionEvt (SwapFired)
|
|
|
|
|
v
|
|
|
|
|
notifier --> telegram/mcp/log
|
|
|
|
|
```
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
---
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
## 5. Adapters (mockable interfaces)
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
### Tmux
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
Thin wrapper around `exec.Command("tmux", ...)`, mockable via interface:
|
2026-04-14 13:29:24 +00:00
|
|
|
|
|
|
|
|
```go
|
2026-04-14 13:36:20 +00:00
|
|
|
type TmuxAdapter interface {
|
|
|
|
|
HasSession(name string) (bool, error)
|
|
|
|
|
NewSession(name, cwd string, env []string) error
|
|
|
|
|
KillSession(name string) error
|
|
|
|
|
SendKeys(name string, keys string) error
|
|
|
|
|
CapturePane(name string, lines int) (string, error)
|
|
|
|
|
ListSessions() ([]string, error)
|
2026-04-14 13:29:24 +00:00
|
|
|
}
|
|
|
|
|
```
|
|
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
- Default implementation: `ExecTmux` shelling out to the `tmux` binary.
|
|
|
|
|
- Test implementation: `FakeTmux` backed by an in-memory map.
|
|
|
|
|
- Inspired by
|
|
|
|
|
[`smtg-ai/claude-squad/session/tmux`](https://github.com/smtg-ai/claude-squad/tree/main/session/tmux),
|
|
|
|
|
which already battle-tests pane capture and send-keys escaping.
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
### Filesystem & Claude CLI
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
```go
|
|
|
|
|
type FSAdapter interface {
|
|
|
|
|
ReadFile(path string) ([]byte, error)
|
|
|
|
|
WriteFileAtomic(path string, data []byte, perm os.FileMode) error
|
|
|
|
|
Rename(oldp, newp string) error
|
|
|
|
|
Stat(path string) (os.FileInfo, error)
|
|
|
|
|
Watch(path string) (<-chan fsnotify.Event, error)
|
|
|
|
|
}
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
type ClaudeCLIAdapter interface {
|
|
|
|
|
ResumeSession(sessionID, cwd string) error
|
|
|
|
|
InjectPrompt(sessionID, prompt string) error
|
|
|
|
|
Quota() (Usage, error)
|
|
|
|
|
}
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
All three interfaces have `Fake*` implementations under
|
|
|
|
|
`internal/testutil/` so every goroutine can be unit-tested without a
|
|
|
|
|
real tmux, filesystem, or Claude CLI.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 6. HTTP control plane (MCP)
|
|
|
|
|
|
|
|
|
|
Bound to `127.0.0.1:8765` (config `mcp_http.listen`). Never exposed
|
|
|
|
|
publicly - the SecuAAS MCP gateway proxies from the host.
|
|
|
|
|
|
|
|
|
|
| Method | Path | Purpose |
|
|
|
|
|
|--------|-------------------------------|----------------------------------|
|
|
|
|
|
| GET | `/api/v1/session_status` | Full state snapshot |
|
|
|
|
|
| GET | `/api/v1/accounts` | Account usage + limits |
|
|
|
|
|
| GET | `/api/v1/sessions` | Session table |
|
|
|
|
|
| POST | `/api/v1/trigger_dispatch` | Force inbox scan |
|
|
|
|
|
| POST | `/api/v1/trigger/swap` | Force failover (bearer required) |
|
|
|
|
|
| GET | `/healthz` | Liveness probe |
|
|
|
|
|
|
|
|
|
|
- All mutating routes require the bearer token from
|
|
|
|
|
`mcp_http.bearer_token_env`.
|
|
|
|
|
- Read routes are unauthenticated but only reachable on loopback.
|
|
|
|
|
- Handlers are thin: they publish on the same internal channels used
|
|
|
|
|
by goroutines, then wait on a reply channel with a timeout.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 7. Graceful shutdown
|
|
|
|
|
|
|
|
|
|
- `main` installs a `signal.Notify` for `SIGTERM`, `SIGINT`, `SIGHUP`.
|
|
|
|
|
- On signal, the root `context.Context` is cancelled.
|
|
|
|
|
- Goroutines drain in a defined order:
|
|
|
|
|
1. HTTP server (`Shutdown(ctx)` with 5s deadline),
|
|
|
|
|
2. dispatcher (stops accepting new tasks, finishes in-flight assignment),
|
|
|
|
|
3. quota-monitor (cancels poll),
|
|
|
|
|
4. session-watcher (flushes last state),
|
|
|
|
|
5. checkpoint (forces one final checkpoint for every working session),
|
|
|
|
|
6. account-switcher (refuses new swaps, completes an in-flight swap
|
|
|
|
|
up to `shutdown.swap_deadline`),
|
|
|
|
|
7. notifier (flushes bounded buffers),
|
|
|
|
|
8. janitor (exits immediately).
|
|
|
|
|
- `state.json` is flushed one last time before exit.
|
|
|
|
|
- Exit code `0` on clean drain, `1` on deadline timeout (systemd will
|
|
|
|
|
restart).
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 8. Configuration
|
|
|
|
|
|
|
|
|
|
Single YAML file. See [`config.example.yaml`](../config.example.yaml)
|
|
|
|
|
for the authoritative schema. Key sections:
|
|
|
|
|
|
|
|
|
|
- `accounts[]` - ordered list of Anthropic accounts (name, home,
|
|
|
|
|
limits, priority).
|
|
|
|
|
- `pool` - dedicated sessions, autonomous pool prefix and min/max,
|
|
|
|
|
shared projects dir.
|
|
|
|
|
- `quota` - poll interval, 5h/weekly thresholds.
|
|
|
|
|
- `checkpoint` - interval, dir, db, keep count.
|
|
|
|
|
- `janitor` - interval, horizons.
|
|
|
|
|
- `mcp_http` - listen addr, bearer token env var.
|
|
|
|
|
- `notifier` - telegram, mcp callback, log sinks.
|
|
|
|
|
|
|
|
|
|
The config is loaded once at startup. `SIGHUP` triggers a re-parse;
|
|
|
|
|
only safe fields (thresholds, intervals, notifier sinks) are
|
|
|
|
|
hot-reloaded. Structural changes (accounts, pool) require a restart.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 9. Migration plan
|
|
|
|
|
|
|
|
|
|
The existing orchestrator is a set of bash scripts under
|
|
|
|
|
`agent-orchestrator/` (`graceful-switch.sh`, `launch-agent.sh`,
|
|
|
|
|
`setup-tmux.sh`, `start-dedicated-agents.sh`, `watchdog.sh`,
|
|
|
|
|
`checkpoint-daemon.sh`, `smart-dispatcher.sh`). Cutover is staged to
|
|
|
|
|
minimize risk.
|
|
|
|
|
|
|
|
|
|
### Phase 1 - Daemon wrapper (weeks 1-2)
|
|
|
|
|
|
|
|
|
|
- Build `claude-failover` as a supervisor that `exec.Command`s the
|
|
|
|
|
existing `.sh` scripts.
|
|
|
|
|
- Goroutines are skeletons that call scripts:
|
|
|
|
|
- dispatcher -> `smart-dispatcher.sh`,
|
|
|
|
|
- account-switcher -> `graceful-switch.sh <target>`,
|
|
|
|
|
- session-watcher -> parses `watchdog.sh --json` output,
|
|
|
|
|
- checkpoint -> calls `checkpoint-daemon.sh --once`.
|
|
|
|
|
- Shared state, event system, HTTP control plane, and config are
|
|
|
|
|
already native Go.
|
|
|
|
|
- Systemd unit replaces the existing per-script cron/tmux launchers.
|
|
|
|
|
|
|
|
|
|
### Phase 2 - Native rewrite (weeks 3-6)
|
|
|
|
|
|
|
|
|
|
- Replace each shell wrapper with a native Go implementation one
|
|
|
|
|
goroutine at a time, in this order: session-watcher -> checkpoint ->
|
|
|
|
|
dispatcher -> account-switcher.
|
|
|
|
|
- Shell scripts remain as a fallback until each native path has run a
|
|
|
|
|
full week without regression.
|
|
|
|
|
|
|
|
|
|
### Phase 3 - Removal (week 7+)
|
|
|
|
|
|
|
|
|
|
- Delete `agent-orchestrator/*.sh` once Phase 2 is green.
|
|
|
|
|
- Keep the TUI (`agent-orchestrator/tui/`) as a read-only client of
|
|
|
|
|
the daemon's HTTP API.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 10. Testing strategy
|
|
|
|
|
|
|
|
|
|
- Adapters (`TmuxAdapter`, `FSAdapter`, `ClaudeCLIAdapter`) are
|
|
|
|
|
interfaces with in-memory fakes under `internal/testutil/`.
|
|
|
|
|
- Unit tests cover every goroutine in isolation:
|
|
|
|
|
- dispatcher: feed fake fsnotify events, assert tmux `SendKeys`
|
|
|
|
|
calls.
|
|
|
|
|
- quota-monitor: feed canned pane captures, assert `SwapRequested`.
|
|
|
|
|
- account-switcher: drive the state machine through a full swap,
|
|
|
|
|
including cancellation and rollback.
|
|
|
|
|
- Integration tests boot the full daemon with fake adapters and a
|
|
|
|
|
`httptest.Server` on the MCP port. A golden-file test compares
|
|
|
|
|
`state.json` after a scripted sequence.
|
|
|
|
|
- Race detector mandatory in CI (`go test -race ./...`).
|
|
|
|
|
- An `e2e/` suite runs against a real tmux + real Claude Code CLI in a
|
|
|
|
|
container, executed nightly.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## 11. Systemd unit
|
|
|
|
|
|
|
|
|
|
```ini
|
|
|
|
|
# /etc/systemd/system/claude-failover.service
|
|
|
|
|
[Unit]
|
|
|
|
|
Description=Claude Code failover & orchestration daemon
|
|
|
|
|
After=network-online.target
|
|
|
|
|
Wants=network-online.target
|
|
|
|
|
|
|
|
|
|
[Service]
|
|
|
|
|
Type=notify
|
|
|
|
|
User=ubuntu
|
|
|
|
|
Group=ubuntu
|
|
|
|
|
WorkingDirectory=/home/ubuntu
|
|
|
|
|
Environment=CLAUDE_FAILOVER_CONFIG=/etc/claude-failover/config.yaml
|
|
|
|
|
ExecStart=/usr/local/bin/claude-failover --config ${CLAUDE_FAILOVER_CONFIG}
|
|
|
|
|
ExecReload=/bin/kill -HUP $MAINPID
|
|
|
|
|
Restart=always
|
|
|
|
|
RestartSec=3
|
|
|
|
|
TimeoutStopSec=30
|
|
|
|
|
LimitNOFILE=65536
|
|
|
|
|
# State & logs
|
|
|
|
|
StateDirectory=claude-failover
|
|
|
|
|
LogsDirectory=claude-failover
|
|
|
|
|
# Hardening (tmux needs TTY access, keep minimal)
|
|
|
|
|
NoNewPrivileges=true
|
|
|
|
|
ProtectSystem=strict
|
|
|
|
|
ReadWritePaths=/var/lib/claude-failover /var/log/claude-failover /home/ubuntu/.claude /home/ubuntu/projects
|
|
|
|
|
|
|
|
|
|
[Install]
|
|
|
|
|
WantedBy=multi-user.target
|
|
|
|
|
```
|
2026-04-14 13:29:24 +00:00
|
|
|
|
2026-04-14 13:36:20 +00:00
|
|
|
`Type=notify` requires the daemon to call `sd_notify(READY=1)` once all
|
|
|
|
|
goroutines have registered and the HTTP server is listening. Use
|
|
|
|
|
[`github.com/coreos/go-systemd/v22/daemon`](https://pkg.go.dev/github.com/coreos/go-systemd/v22/daemon).
|