feat: initial project structure
commit
cf4957010f
10 changed files with 621 additions and 0 deletions
125
docs/architecture.md
Normal file
@ -0,0 +1,125 @@
# Architecture

`claude-failover` is a single Go binary structured as a set of cooperating
goroutines. Each goroutine owns a narrow responsibility and communicates
through typed channels and a shared `State` struct guarded by a
`sync.RWMutex`. A single-writer discipline is enforced: only the
**account-switcher** may mutate the active-account field.

## Goroutines

### dispatcher

Watches `.agent-queue/inbox/` for every registered project (inotify on
Linux) and pairs each incoming task with an idle session from the pool.
It respects:

- per-project priority
- agent capability tags declared in the task frontmatter
- the `needs_claude_code: true` bypass flag
- dispatcher-level cooldowns to avoid flooding a freshly-launched session

On successful assignment it renames `<task>.md` to `<task>.md.dispatched`
and writes a pointer into the target session's tmux prompt.

### quota-monitor

Polls Anthropic usage counters for every configured account. Sources:

1. Claude Code's local telemetry files under `~/.claude/statsig/` and
   `~/.claude/projects/*.jsonl` (message timestamps).
2. Optional: a reverse-engineered `/api/quota` endpoint if available.

It computes two sliding windows (5h, 1 week) and emits a `swap-requested`
event once thresholds in the config are crossed.

### session-watcher

Keeps a table of tmux sessions (`ccl-*`). For each one it tracks:

- process liveness (via `tmux has-session`)
- heartbeat timestamp from `.agent-queue/status.json`
- current `state` field (idle / working / stalled)

Stalled sessions (heartbeat older than N minutes while `state=working`)
raise an alert on the notifier channel and become candidates for a
forced restart.

### checkpoint

Every `checkpoint.interval`, serializes per-session context:

- current task id
- last recorded tool call (name + truncated args)
- cwd as reported by the session
- the last N lines of the session's scrollback

Files are written atomically (`*.tmp` + rename) to
`checkpoint.dir/<session>/<timestamp>.json` and pruned to
`checkpoint.keep` entries.

### janitor

Periodic housekeeping:

- removes stale `.md.dispatched` markers whose source task is gone
- archives `done/` older than a configurable horizon
- prunes expired checkpoints
- rotates the daemon's own log file when it exceeds a size threshold

### notifier

Fan-out of typed events (`SwapFired`, `SessionStalled`, `TaskFailed`,
`QuotaWarning`) to configured sinks:

- Telegram bot (alerts channel)
- MCP control-plane push
- stdout / structured log aggregator
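
One way to picture the fan-out is sketched below. The `Event` and `Sink` types are hypothetical stand-ins for the daemon's real ones, and delivery is done synchronously for brevity; a failing sink is counted but does not block the others.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a simplified stand-in for the typed events named above.
type Event struct {
	Kind   string // e.g. "SwapFired", "SessionStalled"
	Detail string
}

// Sink is a hypothetical interface each notification target implements.
type Sink interface {
	Send(Event) error
}

// memorySink records events; it stands in for a real sink in this sketch.
type memorySink struct{ got []Event }

func (m *memorySink) Send(e Event) error {
	m.got = append(m.got, e)
	return nil
}

// brokenSink always fails, to show that one bad sink does not stop delivery.
type brokenSink struct{}

func (brokenSink) Send(Event) error { return errors.New("sink unavailable") }

// fanOut delivers every event to every sink until the channel is closed
// and returns the number of failed deliveries.
func fanOut(events <-chan Event, sinks []Sink) int {
	failures := 0
	for ev := range events {
		for _, s := range sinks {
			if err := s.Send(ev); err != nil {
				failures++
			}
		}
	}
	return failures
}

func main() {
	events := make(chan Event, 2)
	events <- Event{Kind: "QuotaWarning", Detail: "5h window at 85%"}
	events <- Event{Kind: "SwapFired", Detail: "compte1 -> compte2"}
	close(events)

	mem := &memorySink{}
	failures := fanOut(events, []Sink{mem, brokenSink{}})
	fmt.Println(len(mem.got), failures)
}
```
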

### account-switcher

Serializes all account swaps behind a single mutex. Swap protocol:

1. mark active account as `draining`
2. tell each session to flush its current tool call and checkpoint
3. stop tmux sessions in reverse launch order
4. repoint the `~/.claude` symlink (or equivalent per-session HOME) to
   the target account's home directory
5. relaunch sessions; replay the latest checkpoint so each session
   reopens the same project and task pointer
6. mark the new account `active`, start the cooldown timer on the old one

See [`session-switch-analysis.md`](./session-switch-analysis.md) for why
the shared-symlink approach is required (Claude Code bug #16103).

## Shared state

```go
type State struct {
	mu            sync.RWMutex
	ActiveAccount string
	Accounts      map[string]*AccountState
	Sessions      map[string]*SessionState
	LastSwap      time.Time
	PendingSwap   bool
}
```

Readers take `RLock`; the account-switcher takes `Lock` for the duration
of a swap. All other writers go through a single-writer channel owned by
the switcher, which guarantees swap atomicity.

## HTTP control plane

The daemon exposes a small HTTP server (`mcp_http.listen`) consumed by
the SecuAAS MCP gateway. Routes:

| Method | Path                | Purpose             |
|--------|---------------------|---------------------|
| GET    | `/status`           | Full state snapshot |
| GET    | `/accounts`         | Account usage + limits |
| GET    | `/sessions`         | Session table       |
| POST   | `/trigger/swap`     | Force failover      |
| POST   | `/trigger/dispatch` | Force inbox scan    |

All routes require the bearer token read from the environment variable
named by `mcp_http.bearer_token_env`. The mutating `/trigger/*` routes
are additionally gated by `mcp_http.enable_trigger`.
71
docs/configuration.md
Normal file
@ -0,0 +1,71 @@
# Configuration

Configuration is a single YAML file (`config.yaml`) read at startup.
`config.example.yaml` at the repository root is the annotated template;
copy it and edit.

## Sections

### `accounts`

List of Anthropic accounts the daemon may rotate through.

| Field                | Type   | Description |
|----------------------|--------|-------------|
| `name`               | string | Short identifier used in logs and metrics. |
| `home`               | path   | Per-account home directory (houses the private `~/.claude` credentials). |
| `limits.hourly_msgs` | int    | Local soft cap on messages per hour. `0` disables. |
| `limits.weekly_msgs` | int    | Local soft cap per 7-day window. `0` disables. |
| `priority`           | int    | Lower numbers are preferred when selecting a primary. |

### `pool`

Session pool layout.

- `dedicated`: long-lived sessions bound to a specific project path.
  Each entry has `name` (tmux session) and `project` (absolute path).
- `autonomous`: autoscaling pool for the inbox dispatcher.
  Session names are `prefix` + serial; `min`/`max` bound the pool size.
- `shared_projects_dir`: canonical Claude Code `projects/` tree,
  symlinked from every account home (see
  `docs/session-switch-analysis.md`).

### `quota`

Failover thresholds.

- `poll_interval`: how often the quota-monitor samples usage.
- `window_5h_threshold`: fraction (0..1) of the 5h cap above which a
  swap is requested.
- `window_week_threshold`: same idea for the weekly cap.
- `reactivate_cooldown`: minimum time before a drained account is
  eligible again.

### `checkpoint`

- `dir`: where per-session snapshots are written.
- `interval`: time between snapshots.
- `keep`: retention per session (older files are pruned).

### `mcp_http`

HTTP control plane for the MCP gateway.

- `listen`: bind address (keep it loopback unless fronted by a reverse
  proxy).
- `bearer_token_env`: name of the env var from which to read the
  bearer token required on every request.
- `enable_trigger`: gate for mutating routes (`/trigger/*`).

## Secrets

The config file itself holds no secrets. Bearer tokens and account
credentials are read from the environment. In production, deploy via a
systemd unit with a drop-in that sources `/run/secrets/claude-failover.env`
(populated by `secuops` / OVH KMS at boot).

## Reload

Config reload is not supported in the initial version. A SIGHUP handler
is planned; the account-switcher makes a clean live reload feasible
(drain, reconfigure, resume).
61
docs/session-switch-analysis.md
Normal file
@ -0,0 +1,61 @@
# Cross-account session resume: bug analysis

## Symptom

When Claude Code is relaunched under a different Anthropic account
(different `~/.claude/` directory), invoking `claude --resume <id>`
fails with a "session not found" error even though the session
transcript JSONL still exists on disk. The new account has no record of
the session id because Claude Code tracks resumable sessions per-account
in its local state store.

Upstream report: Claude Code issue **#16103** (cross-account resume).

## Consequence for claude-failover

A naive account swap (stop sessions on account A, start on account B
pointing at a different `~/.claude/projects/` tree) loses every running
session. That would defeat the purpose of failover: we would be forced
to kill in-flight tasks on every quota boundary.

## Solution: shared projects tree via symlink

The `projects/` subtree under `~/.claude/` is where the session
transcripts live. We keep a single canonical copy at
`/home/ubuntu/.claude-projects-shared/` and symlink each account's
`~/.claude/projects` to it:

```
/home/ubuntu/.claude-compte1/projects -> /home/ubuntu/.claude-projects-shared/
/home/ubuntu/.claude-compte2/projects -> /home/ubuntu/.claude-projects-shared/
```

With this layout:

- Account A records session `S1` while running. Its transcript lands in
  the shared directory.
- On swap, account B's Claude Code process starts with its own
  credentials but sees the same `projects/` tree.
- `claude --resume S1` finds the transcript and replays it.

The per-account state that is **not** shared (credentials, telemetry
cache, statsig flags) stays isolated because only `projects/` is
symlinked, not the whole `~/.claude`.

## Validation checklist (for implementers)

- [ ] `claude --resume` succeeds across accounts when `projects/` is
  symlinked
- [ ] No transcript corruption when both accounts write concurrently
  (by design they cannot: only one account is active at a time)
- [ ] Permissions on the shared dir allow the daemon user to read/write
- [ ] Backup strategy in place before first production failover

## Open questions

- Does Claude Code cache the session id in a per-account index that
  needs to be pre-populated? If yes, the account-switcher must write a
  small stub entry there on swap.
- Does a running session survive the HOME symlink flip, or must it be
  restarted? Current assumption: restart is required, hence the
  checkpoint goroutine.