feat: initial project structure

Olivier 2026-04-14 13:29:24 +00:00
commit cf4957010f
10 changed files with 621 additions and 0 deletions

docs/architecture.md Normal file

@@ -0,0 +1,125 @@
# Architecture
`claude-failover` is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared `State` struct guarded by a
`sync.RWMutex`. A single-writer discipline is enforced: only the
**account-switcher** may mutate the active-account field.
## Goroutines
### dispatcher
Watches `.agent-queue/inbox/` for every registered project (inotify on
Linux) and pairs each incoming task with an idle session from the pool.
It respects:
- per-project priority
- agent capability tags declared in the task frontmatter
- the `needs_claude_code: true` bypass flag
- dispatcher-level cooldowns to avoid flooding a freshly launched session
On successful assignment it renames `<task>.md` to `<task>.md.dispatched`
and writes a pointer into the target session's tmux prompt.
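
The pairing rule can be sketched as follows. This is a minimal illustration, not the daemon's code: the `Task` and `Session` shapes, the "higher number wins" priority convention, and the helper names are all assumptions, and the `needs_claude_code` bypass and cooldowns are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Task and Session are hypothetical shapes; the real field names are not
// given in this document.
type Task struct {
	File     string   // e.g. "hotfix.md"
	Priority int      // assumed convention: higher runs first
	Tags     []string // capability tags from the task frontmatter
}

type Session struct {
	Name string
	Idle bool
	Tags map[string]bool // capabilities the session offers
}

// canRun reports whether the session is idle and offers every tag the task needs.
func canRun(s Session, t Task) bool {
	if !s.Idle {
		return false
	}
	for _, tag := range t.Tags {
		if !s.Tags[tag] {
			return false
		}
	}
	return true
}

// assign pairs tasks with sessions, highest priority first, and returns a map
// from task file to session name. It reorders tasks and marks claimed sessions
// as busy in place; unmatched tasks are simply absent from the result.
func assign(tasks []Task, sessions []Session) map[string]string {
	sort.SliceStable(tasks, func(i, j int) bool {
		return tasks[i].Priority > tasks[j].Priority
	})
	out := make(map[string]string)
	for _, t := range tasks {
		for i := range sessions {
			if canRun(sessions[i], t) {
				sessions[i].Idle = false // claimed for this pass
				out[t.File] = sessions[i].Name
				break
			}
		}
	}
	return out
}

// dispatchedName is the marker name used after a successful assignment.
func dispatchedName(taskFile string) string { return taskFile + ".dispatched" }

func main() {
	tasks := []Task{
		{File: "docs.md", Priority: 1},
		{File: "hotfix.md", Priority: 9, Tags: []string{"go"}},
	}
	sessions := []Session{
		{Name: "ccl-a", Idle: true, Tags: map[string]bool{"go": true}},
	}
	for file, sess := range assign(tasks, sessions) {
		fmt.Printf("%s -> %s (marker: %s)\n", file, sess, dispatchedName(file))
	}
}
```

With one idle session and two tasks, only the higher-priority task is assigned; the other stays in the inbox for the next scan.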
### quota-monitor
Polls Anthropic usage counters for every configured account. Sources:
1. Claude Code's local telemetry files under `~/.claude/statsig/` and
`~/.claude/projects/*.jsonl` (message timestamps).
2. Optional: a reverse-engineered `/api/quota` endpoint if available.
It computes usage over two sliding windows (5 hours and 1 week) and emits a
`swap-requested` event once the thresholds set in the config are crossed.
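
A sliding-window check over message timestamps might look like the sketch below. It assumes usage is measured as a message count and that crossing either threshold is enough to request a swap; the `Limits` type and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	fiveHourWindow = 5 * time.Hour
	weeklyWindow   = 7 * 24 * time.Hour
)

// Limits holds hypothetical per-window message thresholds.
type Limits struct {
	FiveHour int
	Weekly   int
}

// refNow is a fixed reference time so the example is deterministic.
var refNow = time.Date(2026, 4, 14, 13, 0, 0, 0, time.UTC)

// countSince returns how many timestamps fall inside the window ending at now.
func countSince(stamps []time.Time, now time.Time, window time.Duration) int {
	cutoff := now.Add(-window)
	n := 0
	for _, ts := range stamps {
		if ts.After(cutoff) && !ts.After(now) {
			n++
		}
	}
	return n
}

// swapRequested reports whether either window has reached its threshold.
// Treating the two thresholds as an OR is an assumption of this sketch.
func swapRequested(stamps []time.Time, now time.Time, lim Limits) bool {
	return countSince(stamps, now, fiveHourWindow) >= lim.FiveHour ||
		countSince(stamps, now, weeklyWindow) >= lim.Weekly
}

// stampsAgo builds sample timestamps the given number of hours before now.
func stampsAgo(now time.Time, hours ...int) []time.Time {
	out := make([]time.Time, 0, len(hours))
	for _, h := range hours {
		out = append(out, now.Add(-time.Duration(h)*time.Hour))
	}
	return out
}

func main() {
	stamps := stampsAgo(refNow, 1, 4, 6, 100, 200)
	fmt.Println("5h window:", countSince(stamps, refNow, fiveHourWindow))
	fmt.Println("1w window:", countSince(stamps, refNow, weeklyWindow))
	fmt.Println("swap:", swapRequested(stamps, refNow, Limits{FiveHour: 2, Weekly: 100}))
}
```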
### session-watcher
Keeps a table of tmux sessions (`ccl-*`). For each one it tracks:
- process liveness (via `tmux has-session`)
- heartbeat timestamp from `.agent-queue/status.json`
- current `state` field (idle / working / stalled)
Stalled sessions (heartbeat older than N minutes while `state=working`)
raise an alert on the notifier channel and become candidates for a
forced restart.
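
The stall rule reduces to a small predicate. In this sketch the JSON field names of `status.json` and the 10-minute value standing in for N are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// stallAfter stands in for the "N minutes" setting; 10 is an arbitrary choice.
const stallAfter = 10 * time.Minute

// SessionStatus mirrors the two fields the watcher needs. The JSON keys are
// hypothetical; the real layout of status.json is not shown in this document.
type SessionStatus struct {
	State     string    `json:"state"`     // "idle", "working" or "stalled"
	Heartbeat time.Time `json:"heartbeat"` // RFC 3339 timestamp
}

// parseStatus decodes the contents of a status.json file.
func parseStatus(data []byte) (SessionStatus, error) {
	var s SessionStatus
	err := json.Unmarshal(data, &s)
	return s, err
}

// isStalled is true only for a working session whose heartbeat is too old.
func isStalled(s SessionStatus, now time.Time, maxAge time.Duration) bool {
	return s.State == "working" && now.Sub(s.Heartbeat) > maxAge
}

func main() {
	raw := []byte(`{"state":"working","heartbeat":"2026-04-14T13:00:00Z"}`)
	s, err := parseStatus(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	now := s.Heartbeat.Add(25 * time.Minute)
	fmt.Println("stalled:", isStalled(s, now, stallAfter))
}
```

An idle session with an old heartbeat is not flagged, which matches the `state=working` condition above.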
### checkpoint
Every `checkpoint.interval`, serializes per-session context:
- current task id
- last recorded tool call (name + truncated args)
- cwd as reported by the session
- the last N lines of the session's scrollback
Files are written atomically (`*.tmp` + rename) to
`checkpoint.dir/<session>/<timestamp>.json` and pruned to
`checkpoint.keep` entries.
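
The write-then-rename pattern and the pruning step could be sketched like this. The `Checkpoint` fields, the timestamp format in the file names, and the `demo` helper are assumptions; the sketch also skips `fsync`, which a durable implementation would likely want before the rename.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// Checkpoint is a hypothetical shape for the serialized per-session context.
type Checkpoint struct {
	TaskID     string   `json:"task_id"`
	LastTool   string   `json:"last_tool"`
	Cwd        string   `json:"cwd"`
	Scrollback []string `json:"scrollback"`
}

// writeAtomic writes to path+".tmp" and renames it into place, so readers
// never observe a partially written checkpoint.
func writeAtomic(path string, cp Checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// prune deletes the oldest *.json files in dir until at most keep remain.
// It relies on file names that sort chronologically (zero-padded timestamps).
func prune(dir string, keep int) error {
	if keep < 0 {
		keep = 0
	}
	names, err := filepath.Glob(filepath.Join(dir, "*.json"))
	if err != nil {
		return err
	}
	sort.Strings(names)
	for len(names) > keep {
		if err := os.Remove(names[0]); err != nil {
			return err
		}
		names = names[1:]
	}
	return nil
}

// demo writes five checkpoints into a temp dir, keeps three, and returns the
// base names of whatever is left.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "ccl-demo-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for i := 1; i <= 5; i++ {
		name := fmt.Sprintf("2026041413%02d00.json", i)
		cp := Checkpoint{TaskID: "task-1", Cwd: "/work"}
		if err := writeAtomic(filepath.Join(dir, name), cp); err != nil {
			return nil, err
		}
	}
	if err := prune(dir, 3); err != nil {
		return nil, err
	}
	left, err := filepath.Glob(filepath.Join(dir, "*"))
	if err != nil {
		return nil, err
	}
	for i := range left {
		left[i] = filepath.Base(left[i])
	}
	sort.Strings(left)
	return left, nil
}

func main() {
	left, err := demo()
	fmt.Println(left, err)
}
```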
### janitor
Periodic housekeeping:
- removes stale `.md.dispatched` markers whose source task is gone
- archives `done/` entries older than a configurable horizon
- prunes expired checkpoints
- rotates the daemon's own log file when it exceeds a size threshold
### notifier
Fans out typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
`QuotaWarning`) to configured sinks:
- Telegram bot (alerts channel)
- MCP control-plane push
- stdout / structured log aggregator
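
One way to model the fan-out is an `Event` interface plus a `Sink` interface, as sketched below. The interfaces, the struct fields, and the in-memory and failing sinks are illustrative assumptions; real sinks would wrap Telegram, the MCP push, and the logger.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is implemented by every typed event; Kind names the event type.
type Event interface{ Kind() string }

// Two of the event types named above, with hypothetical fields.
type SwapFired struct{ From, To string }

func (SwapFired) Kind() string { return "SwapFired" }

type SessionStalled struct{ Session string }

func (SessionStalled) Kind() string { return "SessionStalled" }

// Sink is anything that can deliver an event.
type Sink interface{ Send(Event) error }

// memorySink records event kinds; it stands in for a log sink.
type memorySink struct{ got []string }

func (m *memorySink) Send(e Event) error {
	m.got = append(m.got, e.Kind())
	return nil
}

// brokenSink always fails; it stands in for an unreachable remote sink.
type brokenSink struct{}

func (brokenSink) Send(Event) error { return errors.New("sink unreachable") }

// fanOut drains events until the channel is closed and delivers each one to
// every sink. A failing sink does not block the others; failures are counted.
func fanOut(events <-chan Event, sinks []Sink) (failed int) {
	for e := range events {
		for _, s := range sinks {
			if err := s.Send(e); err != nil {
				failed++
			}
		}
	}
	return failed
}

func main() {
	events := make(chan Event, 2)
	events <- SwapFired{From: "acct-a", To: "acct-b"}
	events <- SessionStalled{Session: "ccl-web"}
	close(events)

	mem := &memorySink{}
	failed := fanOut(events, []Sink{mem, brokenSink{}})
	fmt.Println(mem.got, "failed deliveries:", failed)
}
```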
### account-switcher
Serializes all account swaps behind a single mutex. Swap protocol:
1. mark active account as `draining`
2. tell each session to flush its current tool call and checkpoint
3. stop tmux sessions in reverse launch order
4. repoint the `~/.claude` symlink (or equivalent per-session HOME) to
the target account's home directory
5. relaunch sessions; replay the latest checkpoint so each session
reopens the same project and task pointer
6. mark the new account `active`, start the cooldown timer on the old one
See [`session-switch-analysis.md`](./session-switch-analysis.md) for why
the shared-symlink approach is required (Claude Code bug #16103).
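
The ordering of the six steps can be sketched with a step callback, which keeps the sequence testable without tmux. `Switcher`, `StepFunc`, and the step names are hypothetical, and aborting on the first error is an assumption; the document does not describe failure handling.

```go
package main

import (
	"fmt"
	"sync"
)

// StepFunc performs one protocol step against a target, which is an account
// name or a session name depending on the step.
type StepFunc func(step, target string) error

// Switcher serializes swaps behind a single mutex.
type Switcher struct {
	mu  sync.Mutex
	run StepFunc
}

// Swap walks the protocol in order and stops at the first failing step.
// sessions is expected in launch order.
func (sw *Switcher) Swap(from, to string, sessions []string) error {
	sw.mu.Lock()
	defer sw.mu.Unlock()

	if err := sw.run("drain", from); err != nil { // step 1
		return err
	}
	for _, s := range sessions { // step 2
		if err := sw.run("checkpoint", s); err != nil {
			return err
		}
	}
	for i := len(sessions) - 1; i >= 0; i-- { // step 3: reverse launch order
		if err := sw.run("stop", sessions[i]); err != nil {
			return err
		}
	}
	if err := sw.run("repoint", to); err != nil { // step 4
		return err
	}
	for _, s := range sessions { // step 5
		if err := sw.run("relaunch", s); err != nil {
			return err
		}
	}
	if err := sw.run("activate", to); err != nil { // step 6
		return err
	}
	return sw.run("cooldown", from)
}

func main() {
	sw := &Switcher{run: func(step, target string) error {
		fmt.Printf("%-10s %s\n", step, target)
		return nil
	}}
	if err := sw.Swap("acct-a", "acct-b", []string{"ccl-api", "ccl-web"}); err != nil {
		fmt.Println("swap failed:", err)
	}
}
```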
## Shared state
```go
type State struct {
mu sync.RWMutex
ActiveAccount string
Accounts map[string]*AccountState
Sessions map[string]*SessionState
LastSwap time.Time
PendingSwap bool
}
```
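Readers take `RLock`; the account-switcher takes `Lock` for the duration
of a swap. All other writes are sent over a channel owned by the switcher,
which applies them itself. Because the switcher is the only goroutine that
ever takes the write lock, no update can interleave with a swap in progress.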
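
A reduced sketch of the reader/writer split follows. It uses a trimmed copy of `State`, and the `Update` type, `Snapshot`, and `applyAll` are hypothetical names. Unlike a real swap, each update here holds the write lock only briefly.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a trimmed copy of the struct above, enough to show the locking.
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	PendingSwap   bool
}

// Snapshot is the reader path: it only ever takes the read lock.
func (s *State) Snapshot() (active string, pending bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.ActiveAccount, s.PendingSwap
}

// Update is a mutation requested by another goroutine.
type Update func(*State)

// applyAll is the single writer: the only place that takes the write lock.
// It runs until updates is closed, then signals done.
func (s *State) applyAll(updates <-chan Update, done chan<- struct{}) {
	defer close(done)
	for u := range updates {
		s.mu.Lock()
		u(s)
		s.mu.Unlock()
	}
}

func main() {
	st := &State{ActiveAccount: "acct-a"}
	updates := make(chan Update)
	done := make(chan struct{})
	go st.applyAll(updates, done)

	updates <- func(s *State) { s.PendingSwap = true }
	updates <- func(s *State) { s.ActiveAccount = "acct-b"; s.PendingSwap = false }
	close(updates)
	<-done

	fmt.Println(st.Snapshot())
}
```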
## HTTP control plane
The daemon exposes a small HTTP server (`mcp_http.listen`) consumed by
the SecuAAS MCP gateway. Routes:
| Method | Path | Purpose |
|--------|-----------------------|--------------------------------|
| GET | `/status` | Full state snapshot |
| GET | `/accounts` | Account usage + limits |
| GET | `/sessions` | Session table |
| POST | `/trigger/swap` | Force failover |
| POST | `/trigger/dispatch` | Force inbox scan |
All routes require the bearer token from `mcp_http.bearer_token_env`.
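
Bearer enforcement for every route can be done with one wrapping handler, as in this sketch built on the standard `net/http` package. The handler bodies, the response payload, and the choice to reject all requests when the token is empty are assumptions; `probe` exists only to exercise the handler in-process.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireBearer rejects any request whose Authorization header is not exactly
// "Bearer <token>". An empty token rejects everything (fail closed).
func requireBearer(token string, next http.Handler) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		if token == "" || subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newHandler wires two of the routes from the table behind the bearer check.
func newHandler(token string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		fmt.Fprintln(w, `{"active_account":"example"}`) // placeholder payload
	})
	mux.HandleFunc("/trigger/swap", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusAccepted) // a real handler would queue a swap
	})
	return requireBearer(token, mux)
}

// probe sends one in-process request and returns the response status code.
func probe(h http.Handler, method, path, auth string) int {
	req := httptest.NewRequest(method, path, nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newHandler("example-token")
	fmt.Println(probe(h, "GET", "/status", ""))
	fmt.Println(probe(h, "GET", "/status", "Bearer example-token"))
	fmt.Println(probe(h, "POST", "/trigger/swap", "Bearer example-token"))
}
```

In the daemon the token would come from the environment variable named by `mcp_http.bearer_token_env` rather than a literal.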