feat: initial project structure

Olivier 2026-04-14 13:29:24 +00:00
commit cf4957010f
10 changed files with 621 additions and 0 deletions

docs/architecture.md Normal file

@@ -0,0 +1,125 @@
# Architecture
`claude-failover` is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared `State` struct guarded by a
`sync.RWMutex`. A single-writer discipline is enforced: only the
**account-switcher** may mutate the active-account field.
## Goroutines
### dispatcher
Watches `.agent-queue/inbox/` for every registered project (inotify on
Linux) and pairs each incoming task with an idle session from the pool.
It respects:
- per-project priority
- agent capability tags declared in the task frontmatter
- the `needs_claude_code: true` bypass flag
- dispatcher-level cooldowns to avoid flooding a freshly-launched session
On successful assignment it renames `<task>.md` to `<task>.md.dispatched`
and writes a pointer into the target session's tmux prompt.
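
The rename doubles as a claim on the task. Below is a minimal sketch of that step, assuming the inbox lives on a single POSIX filesystem; the helper names `dispatchedName` and `claimTask` are hypothetical and not taken from the codebase.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dispatchedName returns the marker path for a task file,
// e.g. "inbox/fix-bug.md" -> "inbox/fix-bug.md.dispatched".
func dispatchedName(taskPath string) string {
	return taskPath + ".dispatched"
}

// claimTask renames the task file to its dispatched marker. The rename is
// atomic on one filesystem, so a task that was already claimed (or removed)
// surfaces as an error instead of being assigned twice.
func claimTask(taskPath string) (string, error) {
	dst := dispatchedName(taskPath)
	if err := os.Rename(taskPath, dst); err != nil {
		return "", fmt.Errorf("claim %s: %w", taskPath, err)
	}
	return dst, nil
}

func main() {
	dir, err := os.MkdirTemp("", "inbox")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	task := filepath.Join(dir, "fix-bug.md")
	if err := os.WriteFile(task, []byte("---\nneeds_claude_code: true\n---\n"), 0o644); err != nil {
		panic(err)
	}

	marker, err := claimTask(task)
	fmt.Println(filepath.Base(marker), err)

	_, err = claimTask(task)
	fmt.Println(err != nil) // true: the source file is gone after the first claim
}
```

Because the second claim fails once the source file has moved, a repeated inbox scan cannot hand the same task to two sessions.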
### quota-monitor
Polls Anthropic usage counters for every configured account. Sources:
1. Claude Code's local telemetry files under `~/.claude/statsig/` and
`~/.claude/projects/*.jsonl` (message timestamps).
2. Optional: a reverse-engineered `/api/quota` endpoint if available.
It computes two sliding windows (5h, 1 week) and emits a `swap-requested`
event once thresholds in the config are crossed.
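
As an illustration of the window arithmetic, the sketch below counts message timestamps inside a sliding window and compares the result with a fractional threshold. The function names and the `limit` parameter are assumptions; how the daemon obtains the per-window cap is not specified here.

```go
package main

import (
	"fmt"
	"time"
)

// countSince returns how many timestamps fall inside the window ending at now.
func countSince(stamps []time.Time, now time.Time, window time.Duration) int {
	cutoff := now.Add(-window)
	n := 0
	for _, ts := range stamps {
		if ts.After(cutoff) && !ts.After(now) {
			n++
		}
	}
	return n
}

// overThreshold reports whether used/limit exceeds the fraction threshold.
// A limit of 0 disables the check (assumed to mirror the config's soft caps).
func overThreshold(used, limit int, threshold float64) bool {
	if limit <= 0 {
		return false
	}
	return float64(used)/float64(limit) > threshold
}

func main() {
	now := time.Now()
	stamps := []time.Time{
		now.Add(-6 * time.Hour), // outside the 5h window
		now.Add(-2 * time.Hour),
		now.Add(-10 * time.Minute),
	}
	used := countSince(stamps, now, 5*time.Hour)
	fmt.Println(used, overThreshold(used, 2, 0.9))
}
```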
### session-watcher
Keeps a table of tmux sessions (`ccl-*`). For each one it tracks:
- process liveness (via `tmux has-session`)
- heartbeat timestamp from `.agent-queue/status.json`
- current `state` field (idle / working / stalled)
Stalled sessions (heartbeat older than N minutes while `state=working`)
raise an alert on the notifier channel and become candidates for a
forced restart.
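
The stall rule can be expressed as a small predicate. This is a sketch only: the `SessionStatus` type and its field names are assumed, and `maxAge` stands in for the "N minutes" above.

```go
package main

import (
	"fmt"
	"time"
)

// SessionStatus mirrors the three things the watcher tracks per session.
// The type and field names are hypothetical.
type SessionStatus struct {
	Alive     bool      // result of the tmux liveness probe
	State     string    // "idle", "working" or "stalled"
	Heartbeat time.Time // last heartbeat read from status.json
}

// isStalled reports whether a live session claims to be working but has not
// written a heartbeat within maxAge.
func isStalled(s SessionStatus, now time.Time, maxAge time.Duration) bool {
	return s.Alive && s.State == "working" && now.Sub(s.Heartbeat) > maxAge
}

func main() {
	now := time.Now()
	s := SessionStatus{Alive: true, State: "working", Heartbeat: now.Add(-15 * time.Minute)}
	fmt.Println(isStalled(s, now, 10*time.Minute)) // true: 15 min of silence > 10 min
}
```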
### checkpoint
Every `checkpoint.interval`, serializes per-session context:
- current task id
- last recorded tool call (name + truncated args)
- cwd as reported by the session
- the last N lines of the session's scrollback
Files are written atomically (`*.tmp` + rename) to
`checkpoint.dir/<session>/<timestamp>.json` and pruned to
`checkpoint.keep` entries.
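
A minimal sketch of the write-then-rename pattern and of retention pruning follows. The `Checkpoint` field names, the helper names, and the assumption that file names sort chronologically are all illustrative; `fsync` handling is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// Checkpoint holds the per-session context listed above (names assumed).
type Checkpoint struct {
	TaskID     string   `json:"task_id"`
	LastTool   string   `json:"last_tool"`
	Cwd        string   `json:"cwd"`
	Scrollback []string `json:"scrollback"`
}

// writeAtomic writes data to a sibling *.tmp file and renames it into place,
// so readers never observe a half-written checkpoint.
func writeAtomic(path string, data []byte) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// toPrune returns the names to delete so that only the newest keep remain.
// It assumes names sort chronologically (for example, fixed-width timestamps).
func toPrune(names []string, keep int) []string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	if keep < 0 || len(sorted) <= keep {
		return nil
	}
	return sorted[:len(sorted)-keep]
}

func main() {
	dir, err := os.MkdirTemp("", "ckpt")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	data, err := json.Marshal(Checkpoint{TaskID: "t-42", Cwd: "/srv/app"})
	if err != nil {
		panic(err)
	}
	if err := writeAtomic(filepath.Join(dir, "20260414T132924Z.json"), data); err != nil {
		panic(err)
	}
	fmt.Println(toPrune([]string{"3.json", "1.json", "2.json"}, 2)) // [1.json]
}
```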
### janitor
Periodic housekeeping:
- removes stale `.md.dispatched` markers whose source task is gone
- archives `done/` older than a configurable horizon
- prunes expired checkpoints
- rotates the daemon's own log file when it exceeds a size threshold
### notifier
Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
`QuotaWarning`) to configured sinks:
- Telegram bot (alerts channel)
- MCP control-plane push
- stdout / structured log aggregator
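
One possible shape for the fan-out is sketched below, with each sink modeled as a plain function. The `Event` and `Sink` types are hypothetical; the real sinks (Telegram, MCP push, log) would sit behind the same signature.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified stand-in for the typed events named above.
type Event struct {
	Kind   string // e.g. "SwapFired", "SessionStalled"
	Detail string
}

// Sink consumes one event; each configured destination provides one.
type Sink func(Event)

// fanOut delivers every event from in to all sinks concurrently and returns
// once in is closed and every delivery has finished.
func fanOut(in <-chan Event, sinks ...Sink) {
	for ev := range in {
		var wg sync.WaitGroup
		for _, sink := range sinks {
			wg.Add(1)
			go func(s Sink, e Event) {
				defer wg.Done()
				s(e)
			}(sink, ev)
		}
		wg.Wait() // keep events ordered per sink
	}
}

func main() {
	events := make(chan Event, 2)
	events <- Event{Kind: "SwapFired", Detail: "compte1 -> compte2"}
	events <- Event{Kind: "SessionStalled", Detail: "ccl-api"}
	close(events)

	var logged, alerted int
	fanOut(events,
		func(Event) { logged++ },
		func(Event) { alerted++ },
	)
	fmt.Println(logged, alerted) // 2 2
}
```

Waiting for all sinks before taking the next event keeps ordering simple, at the cost of letting one slow sink delay the others; a per-sink buffered queue would be the usual alternative.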
### account-switcher
Serializes all account swaps behind a single mutex. Swap protocol:
1. mark active account as `draining`
2. tell each session to flush its current tool call and checkpoint
3. stop tmux sessions in reverse launch order
4. repoint the `~/.claude` symlink (or equivalent per-session HOME) to
the target account's home directory
5. relaunch sessions; replay the latest checkpoint so each session
reopens the same project and task pointer
6. mark the new account `active`, start the cooldown timer on the old one
See [`session-switch-analysis.md`](./session-switch-analysis.md) for why
the shared-symlink approach is required (Claude Code bug #16103).
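
Step 4 of the protocol can be done without a window in which the link is missing. The sketch below assumes a POSIX filesystem, where renaming a new symlink over the old one replaces it atomically; `repoint` and the `.swap` suffix are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// repoint makes link point at target by creating a temporary symlink next to
// it and renaming that into place, so the link never disappears mid-swap.
func repoint(link, target string) error {
	tmp := link + ".swap"
	_ = os.Remove(tmp) // clear a leftover from an interrupted swap
	if err := os.Symlink(target, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, link)
}

// demo points a ".claude" link at two account directories in turn, inside a
// throwaway directory, and returns the base name of the final target.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "swap")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	link := filepath.Join(root, ".claude")
	for _, name := range []string{".claude-compte1", ".claude-compte2"} {
		target := filepath.Join(root, name)
		if err := os.Mkdir(target, 0o700); err != nil {
			return "", err
		}
		if err := repoint(link, target); err != nil {
			return "", err
		}
	}
	cur, err := os.Readlink(link)
	if err != nil {
		return "", err
	}
	return filepath.Base(cur), nil
}

func main() {
	fmt.Println(demo())
}
```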
## Shared state
```go
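import (
	"sync"
	"time"
)

// State is the daemon-wide shared state; mu guards every field below.
type State struct {
	mu            sync.RWMutex
	ActiveAccount string // written only by the account-switcher
	Accounts      map[string]*AccountState
	Sessions      map[string]*SessionState
	LastSwap      time.Time
	PendingSwap   bool
}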
```
Readers take `RLock`; the account-switcher takes `Lock` for the duration
of a swap. All other writers go through a single-writer channel owned by
the switcher, which guarantees swap atomicity.
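
The locking discipline can be sketched as follows. This is a simplified illustration, not the daemon's code: the `Snapshot` type and both method names are assumed, the maps are left out (a real snapshot would need to deep-copy them), and the write lock is held only for the final commit rather than for the whole swap.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// State is a trimmed-down copy of the struct above, without the maps.
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	LastSwap      time.Time
	PendingSwap   bool
}

// Snapshot is a lock-free copy that readers can keep after RUnlock.
type Snapshot struct {
	ActiveAccount string
	LastSwap      time.Time
	PendingSwap   bool
}

// Snapshot copies the scalar fields under the read lock.
func (s *State) Snapshot() Snapshot {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return Snapshot{s.ActiveAccount, s.LastSwap, s.PendingSwap}
}

// completeSwap is what the account-switcher would call under the write lock.
func (s *State) completeSwap(target string, at time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ActiveAccount = target
	s.LastSwap = at
	s.PendingSwap = false
}

func main() {
	st := &State{ActiveAccount: "compte1", PendingSwap: true}
	st.completeSwap("compte2", time.Now())
	snap := st.Snapshot()
	fmt.Println(snap.ActiveAccount, snap.PendingSwap) // compte2 false
}
```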
## HTTP control plane
The daemon exposes a small HTTP server (`mcp_http.listen`) consumed by
the SecuAAS MCP gateway. Routes:
| Method | Path | Purpose |
|--------|-----------------------|--------------------------------|
| GET | `/status` | Full state snapshot |
| GET | `/accounts` | Account usage + limits |
| GET | `/sessions` | Session table |
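| POST | `/trigger/swap` | Force failover |
| POST | `/trigger/dispatch` | Force inbox scan |

All routes require the bearer token read from the environment variable
named by `mcp_http.bearer_token_env`. The mutating `/trigger/*` routes
are additionally gated by `mcp_http.enable_trigger`.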
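
A bearer check of this kind is commonly written as `net/http` middleware. The sketch below is one way to do it, using a constant-time comparison; `tokenOK`, `requireBearer`, and the sample token are hypothetical, and rejecting all requests when no token is configured is an assumed policy.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// tokenOK checks an Authorization header against the expected bearer token
// in constant time. An empty expected token rejects every request.
func tokenOK(header, want string) bool {
	const prefix = "Bearer "
	if want == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// requireBearer wraps next and answers 401 unless the token matches.
func requireBearer(want string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !tokenOK(r.Header.Get("Authorization"), want) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	status := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"active_account":"compte1"}`)
	})
	h := requireBearer("s3cret", status)

	req := httptest.NewRequest(http.MethodGet, "/status", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Code) // 401: no Authorization header was sent
}
```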

docs/configuration.md Normal file

@@ -0,0 +1,71 @@
# Configuration
Configuration is a single YAML file (`config.yaml`) read at startup.
`config.example.yaml` at the repository root is the annotated template:
copy it to `config.yaml` and edit.
## Sections
### `accounts`
List of Anthropic accounts the daemon may rotate through.
| Field | Type | Description |
|------------------|----------|-------------|
| `name` | string | Short identifier used in logs and metrics. |
| `home` | path | Per-account home directory (houses the private `~/.claude` credentials). |
| `limits.hourly_msgs` | int | Local soft cap on messages per hour. `0` disables. |
| `limits.weekly_msgs` | int | Local soft cap per 7-day window. `0` disables. |
| `priority` | int | Lower numbers preferred when selecting a primary. |
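
To illustrate how `priority` might drive selection, here is a small sketch that picks the eligible account with the lowest number. The `Account` type, the `Draining` flag used as the eligibility test, and `pickPrimary` are assumptions made for the example.

```go
package main

import "fmt"

// Account carries only the fields needed for selection (names assumed).
type Account struct {
	Name     string
	Priority int
	Draining bool // true while the account is drained or cooling down
}

// pickPrimary returns the non-draining account with the lowest priority
// number. The boolean is false when no account is eligible.
func pickPrimary(accounts []Account) (Account, bool) {
	best, found := Account{}, false
	for _, a := range accounts {
		if a.Draining {
			continue
		}
		if !found || a.Priority < best.Priority {
			best, found = a, true
		}
	}
	return best, found
}

func main() {
	accounts := []Account{
		{Name: "compte1", Priority: 1, Draining: true},
		{Name: "compte2", Priority: 5},
		{Name: "compte3", Priority: 2},
	}
	primary, ok := pickPrimary(accounts)
	fmt.Println(primary.Name, ok) // compte3 true
}
```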
### `pool`
Session pool layout.
- `dedicated` — long-lived sessions bound to a specific project path.
Each entry has `name` (tmux session) and `project` (absolute path).
- `autonomous` — autoscaling pool for the inbox dispatcher.
`prefix` + serial → session names; `min`/`max` bound the pool size.
- `shared_projects_dir` — canonical Claude Code `projects/` tree,
symlinked from every account home (see
`docs/session-switch-analysis.md`).
### `quota`
Failover thresholds.
- `poll_interval` — how often the quota-monitor samples usage.
- `window_5h_threshold` — fraction (0..1) of the 5h cap above which a
swap is requested.
- `window_week_threshold` — same idea for the weekly cap.
- `reactivate_cooldown` — minimum time before a drained account is
eligible again.
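
Since the thresholds are fractions, a startup check can catch values such as `80` written where `0.8` was meant. The sketch below is one possible validation; the `QuotaConfig` struct, its field names, and the accepted range `(0, 1]` are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// QuotaConfig mirrors the keys of the quota section (struct name assumed).
type QuotaConfig struct {
	PollInterval        time.Duration
	Window5hThreshold   float64
	WindowWeekThreshold float64
	ReactivateCooldown  time.Duration
}

// Validate returns the first problem found, or nil if the section is usable.
func (q QuotaConfig) Validate() error {
	if q.PollInterval <= 0 {
		return errors.New("quota.poll_interval must be positive")
	}
	if q.ReactivateCooldown < 0 {
		return errors.New("quota.reactivate_cooldown must not be negative")
	}
	checks := []struct {
		key string
		val float64
	}{
		{"window_5h_threshold", q.Window5hThreshold},
		{"window_week_threshold", q.WindowWeekThreshold},
	}
	for _, c := range checks {
		if c.val <= 0 || c.val > 1 {
			return fmt.Errorf("quota.%s = %v, want a fraction in (0, 1]", c.key, c.val)
		}
	}
	return nil
}

func main() {
	good := QuotaConfig{PollInterval: time.Minute, Window5hThreshold: 0.8, WindowWeekThreshold: 0.9}
	bad := good
	bad.Window5hThreshold = 80 // a percentage written where a fraction is expected
	fmt.Println(good.Validate())
	fmt.Println(bad.Validate())
}
```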
### `checkpoint`
- `dir` — where per-session snapshots are written.
- `interval` — time between snapshots.
- `keep` — retention per session (older files are pruned).
### `mcp_http`
HTTP control plane for the MCP gateway.
- `listen` — bind address (keep it loopback unless fronted by a reverse
proxy).
- `bearer_token_env` — name of the env var from which to read the
bearer token required on every request.
- `enable_trigger` — gate for mutating routes (`/trigger/*`).
## Secrets
The config file itself holds no secrets. Bearer tokens and account
credentials are read from the environment. In production deploy via a
systemd unit with a drop-in that sources `/run/secrets/claude-failover.env`
(populated by `secuops` / OVH KMS at boot).
## Reload
Config reload is not supported in the initial version. A SIGHUP handler
is planned — the account-switcher makes a clean live reload feasible
(drain, reconfigure, resume).


@@ -0,0 +1,61 @@
# Cross-account session resume — bug analysis
## Symptom
When Claude Code is relaunched under a different Anthropic account
(different `~/.claude/` directory), invoking `claude --resume <id>`
fails with a "session not found" error even though the session
transcript JSONL still exists on disk. The new account has no record of
the session id because Claude Code tracks resumable sessions per-account
in its local state store.
Upstream report: Claude Code issue **#16103** (cross-account resume).
## Consequence for claude-failover
A naïve account swap (stop sessions on account A, start them on account
B pointing at a different `~/.claude/projects/` tree) loses every running
session. That would defeat the purpose of failover: we would be forced
to kill in-flight tasks at every quota boundary.
## Solution — shared projects tree via symlink
The `projects/` subtree under `~/.claude/` is where the session
transcripts live. We keep a single canonical copy at
`/home/ubuntu/.claude-projects-shared/` and symlink each account's
`~/.claude/projects` to it:
```
/home/ubuntu/.claude-compte1/projects -> /home/ubuntu/.claude-projects-shared/
/home/ubuntu/.claude-compte2/projects -> /home/ubuntu/.claude-projects-shared/
```
With this layout:
- Account A records session `S1` while running. Its transcript lands in
the shared directory.
- On swap, account B's Claude Code process starts with its own
credentials but sees the same `projects/` tree.
- `claude --resume S1` finds the transcript and replays it.
The per-account state that is **not** shared — credentials, telemetry
cache, statsig flags — stays isolated because only `projects/` is
symlinked, not the whole `~/.claude`.
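
The layout can be set up and checked with a few filesystem calls. The sketch below works in a throwaway directory rather than under `/home/ubuntu`; `linkProjects` is a hypothetical helper that refuses to replace an existing real `projects/` directory, on the assumption that it may hold transcripts that need migrating first.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkProjects makes <accountDir>/projects a symlink to shared. It is
// idempotent and never clobbers an existing file, directory or other link.
func linkProjects(accountDir, shared string) error {
	link := filepath.Join(accountDir, "projects")
	if cur, err := os.Readlink(link); err == nil {
		if cur == shared {
			return nil // already set up
		}
		return fmt.Errorf("%s already points at %s", link, cur)
	}
	if _, err := os.Lstat(link); err == nil {
		return fmt.Errorf("%s exists and is not a symlink", link)
	}
	return os.Symlink(shared, link)
}

// demo writes a transcript through one account's link and reads it back
// through the other's, returning what the second account sees.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "shared")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	shared := filepath.Join(root, ".claude-projects-shared")
	if err := os.Mkdir(shared, 0o700); err != nil {
		return "", err
	}
	for _, name := range []string{".claude-compte1", ".claude-compte2"} {
		acct := filepath.Join(root, name)
		if err := os.Mkdir(acct, 0o700); err != nil {
			return "", err
		}
		if err := linkProjects(acct, shared); err != nil {
			return "", err
		}
	}
	a := filepath.Join(root, ".claude-compte1", "projects", "S1.jsonl")
	b := filepath.Join(root, ".claude-compte2", "projects", "S1.jsonl")
	if err := os.WriteFile(a, []byte("transcript"), 0o600); err != nil {
		return "", err
	}
	data, err := os.ReadFile(b)
	return string(data), err
}

func main() {
	fmt.Println(demo())
}
```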
## Validation checklist (for implementers)
- [ ] `claude --resume` succeeds across accounts when `projects/` is
symlinked
- [ ] No transcript corruption: confirm the two accounts never write
concurrently (by design, only one account is active at a time)
- [ ] Permissions on the shared dir allow the daemon user to read/write
- [ ] Backup strategy in place before first production failover
## Open questions
- Does Claude Code cache the session id in a per-account index that
needs to be pre-populated? If yes, the account-switcher must write a
small stub entry there on swap.
- Does a running session survive the HOME symlink flip, or must it be
restarted? Current assumption: restart is required, hence the
checkpoint goroutine.