# Architecture

`claude-failover` is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared `State` struct guarded by a
`sync.RWMutex`. A single-writer discipline is enforced: only the
**account-switcher** may mutate the active-account field.

## Goroutines

### dispatcher

Watches `.agent-queue/inbox/` for every registered project (inotify on
Linux) and pairs each incoming task with an idle session from the pool.
It respects:

- per-project priority
- agent capability tags declared in the task frontmatter
- the `needs_claude_code: true` bypass flag
- dispatcher-level cooldowns to avoid flooding a freshly-launched session

On successful assignment it renames `<task>.md` to `<task>.md.dispatched`
and writes a pointer into the target session's tmux prompt.

### quota-monitor

Polls Anthropic usage counters for every configured account. Sources:

1. Claude Code's local telemetry files under `~/.claude/statsig/` and
   `~/.claude/projects/*.jsonl` (message timestamps).
2. Optionally, a reverse-engineered `/api/quota` endpoint, if available.

It computes two sliding windows (5 hours and 1 week) and emits a
`swap-requested` event once the thresholds set in the config are crossed.

### session-watcher

Keeps a table of tmux sessions (`ccl-*`). For each one it tracks:

- process liveness (via `tmux has-session`)
- heartbeat timestamp from `.agent-queue/status.json`
- current `state` field (idle / working / stalled)

Stalled sessions (heartbeat older than N minutes while `state=working`)
raise an alert on the notifier channel and become candidates for a
forced restart.

### checkpoint

Every `checkpoint.interval`, serializes per-session context:

- current task id
- last recorded tool call (name + truncated args)
- cwd as reported by the session
- the last N lines of the session's scrollback

Files are written atomically (`*.tmp` + rename) to
`checkpoint.dir/<session>/<timestamp>.json` and pruned to
`checkpoint.keep` entries.

### janitor

Periodic housekeeping:

- removes stale `.md.dispatched` markers whose source task is gone
- archives `done/` older than a configurable horizon
- prunes expired checkpoints
- rotates the daemon's own log file when it exceeds a size threshold

### notifier

Fans out typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
`QuotaWarning`) to the configured sinks:

- Telegram bot (alerts channel)
- MCP control-plane push
- stdout / structured log aggregator

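
One way to picture the fan-out is a loop that hands every event to every sink. The `Event` struct, the `Sink` interface, and `memorySink` below are hypothetical stand-ins; the daemon's real event types and sink implementations are not shown in this document.

```go
package main

import "fmt"

// Event is a simplified stand-in for the typed events named above.
type Event struct {
	Kind    string // e.g. "SwapFired", "SessionStalled"
	Message string
}

// Sink is a hypothetical interface that a Telegram, MCP or log sink
// could satisfy.
type Sink interface {
	Send(Event) error
}

// memorySink records events in memory; it stands in for a real sink.
type memorySink struct{ got []Event }

func (m *memorySink) Send(e Event) error {
	m.got = append(m.got, e)
	return nil
}

// fanOut delivers every event from in to every sink and returns the number
// of failed deliveries once in is closed. An error from one sink does not
// stop delivery to the others.
func fanOut(in <-chan Event, sinks []Sink) (failed int) {
	for e := range in {
		for _, s := range sinks {
			if err := s.Send(e); err != nil {
				failed++
			}
		}
	}
	return failed
}

func main() {
	in := make(chan Event, 2)
	in <- Event{Kind: "QuotaWarning", Message: "5h window near limit"}
	in <- Event{Kind: "SwapFired", Message: "switched accounts"}
	close(in)

	a, b := &memorySink{}, &memorySink{}
	failed := fanOut(in, []Sink{a, b})
	fmt.Println(len(a.got), len(b.got), failed) // 2 2 0
}
```

Delivery here is sequential, so a slow sink delays the others; a per-sink goroutine with its own buffered channel would be one way around that.
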
### account-switcher

Serializes all account swaps behind a single mutex. Swap protocol:

1. mark active account as `draining`
2. tell each session to flush its current tool call and checkpoint
3. stop tmux sessions in reverse launch order
4. repoint the `~/.claude` symlink (or equivalent per-session HOME) to
   the target account's home directory
5. relaunch sessions; replay the latest checkpoint so each session
   reopens the same project and task pointer
6. mark the new account `active`, start the cooldown timer on the old one

See [`session-switch-analysis.md`](./session-switch-analysis.md) for why
the shared-symlink approach is required (Claude Code bug #16103).

## Shared state

```go
type State struct {
	mu            sync.RWMutex // guards every field below
	ActiveAccount string       // mutated only by the account-switcher
	Accounts      map[string]*AccountState
	Sessions      map[string]*SessionState
	LastSwap      time.Time
	PendingSwap   bool
}
```

Readers take `RLock`; the account-switcher takes `Lock` for the duration
of a swap. All other writes are sent over a channel owned by the
switcher, so the switcher is the only goroutine that mutates `State` and
no write can interleave with a swap.

## HTTP control plane

The daemon exposes a small HTTP server (`mcp_http.listen`) consumed by
the SecuAAS MCP gateway. Routes:

| Method | Path                | Purpose                |
|--------|---------------------|------------------------|
| GET    | `/status`           | Full state snapshot    |
| GET    | `/accounts`         | Account usage + limits |
| GET    | `/sessions`         | Session table          |
| POST   | `/trigger/swap`     | Force failover         |
| POST   | `/trigger/dispatch` | Force inbox scan       |

All routes require the bearer token from `mcp_http.bearer_token_env`.
