Wire symlinks.EnsureForAccount into executeSwitch, called immediately after the ~/.claude flip.

Guarantees the three shared-state links (session-env, file-history, projects) exist on the target account home even for freshly-provisioned accounts, preventing silent transcript duplication and undo-history divergence on first resume.

Best-effort: errors are logged as WARN but never abort the swap. If we returned here the daemon would be left inconsistent (symlink flipped, SetActiveAccount never called). Operator sees the warning in logs and resolves divergent links manually.

Tests:
- TestFlipReconcilesSharedSymlinksOnTargetHome: empty target home gets all three links pointing at canonical targets after the flip.
- TestFlipEnsureSymlinksFailureDoesNotAbortSwap: a planted divergent link triggers the symlinks-package error; the swap completes anyway and the active account is updated.

Hermetic: added AccountSwitcher.sharedSymlinks override so tests scope the reconcile inside t.TempDir() and never touch /home/ubuntu/.claude-*-shared. Existing tests migrated to this pattern and hardcoded /tmp/claude-*-xxxx paths replaced with tmpdirs.

Phase 1 / Chantier A, task A3.
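The best-effort ordering described in the commit message above (flip, reconcile links, then mark the account active) might look roughly like the sketch below. The switchSteps type, the function signatures, and errDivergent are hypothetical stand-ins for illustration; the real executeSwitch and symlinks.EnsureForAccount are not shown in this document and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errDivergent is a hypothetical stand-in for the error the symlinks package
// reports when a link exists but points somewhere unexpected.
var errDivergent = errors.New("divergent symlink")

// switchSteps bundles the phases of a swap as plain functions so the ordering
// can be shown without touching the filesystem. All names are assumptions.
type switchSteps struct {
	flip        func(account string) error
	ensureLinks func(account string) error
	setActive   func(account string)
}

// executeSwitch runs the swap. A reconcile failure is logged and swallowed:
// returning at that point would leave the flip done but the active account
// never updated.
func executeSwitch(s switchSteps, account string) error {
	if err := s.flip(account); err != nil {
		// Assumption: a failed flip changed nothing, so aborting is safe.
		return err
	}
	if err := s.ensureLinks(account); err != nil {
		log.Printf("WARN: shared symlink reconcile failed for %s: %v", account, err)
	}
	s.setActive(account)
	return nil
}

func main() {
	active := "acct-a"
	steps := switchSteps{
		flip:        func(string) error { return nil },
		ensureLinks: func(string) error { return errDivergent },
		setActive:   func(a string) { active = a },
	}
	err := executeSwitch(steps, "acct-b")
	fmt.Println("active:", active, "err:", err)
}
```

Passing the steps as functions is only a device to keep the sketch self-contained; it mirrors the idea behind the sharedSymlinks test override without reproducing it.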
claude-failover
Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover.
Overview
claude-failover orchestrates a pool of Claude Code sessions running under
multiple Anthropic accounts. When the active account reaches its quota
threshold (5-hour usage window or weekly cap), the daemon transparently fails
over the workload to a backup account without losing in-flight session state.
It is the runtime glue behind the SecuAAS agent pool (ccl-0..ccl-9,
ccl-auto-11..ccl-auto-20) and is engineered to hold sessions warm across
account swaps by sharing the ~/.claude/projects/ transcript tree via
symlinks.
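As a rough illustration of the symlink sharing mentioned above, the following sketch creates a projects link idempotently and reports, rather than overwrites, a link that points elsewhere. The function name ensureSymlink, the ErrDivergent value, and the directory names are assumptions made for this example, not the daemon's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// ErrDivergent (hypothetical) reports that link exists but is not the
// expected symlink.
var ErrDivergent = errors.New("path exists but is not the expected symlink")

// ensureSymlink makes link point at target. A correct link is left alone,
// a missing one is created, and anything else is reported untouched.
func ensureSymlink(target, link string) error {
	got, err := os.Readlink(link)
	if err == nil {
		if got == target {
			return nil // already correct
		}
		return fmt.Errorf("%s -> %s, want %s: %w", link, got, target, ErrDivergent)
	}
	if !errors.Is(err, os.ErrNotExist) {
		// The path exists but is not a symlink, or cannot be read.
		return fmt.Errorf("%s: %w (%v)", link, ErrDivergent, err)
	}
	if err := os.MkdirAll(filepath.Dir(link), 0o755); err != nil {
		return err
	}
	return os.Symlink(target, link)
}

// demo exercises ensureSymlink inside a throwaway directory.
func demo() error {
	root, err := os.MkdirTemp("", "failover-demo-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)

	shared := filepath.Join(root, "shared-projects")
	if err := os.MkdirAll(shared, 0o755); err != nil {
		return err
	}
	link := filepath.Join(root, "acct-b", "projects")
	for i := 0; i < 2; i++ { // the second call must be a no-op
		if err := ensureSymlink(shared, link); err != nil {
			return err
		}
	}
	// A different target must be flagged, not silently replaced.
	err = ensureSymlink(filepath.Join(root, "other"), link)
	if !errors.Is(err, ErrDivergent) {
		return fmt.Errorf("expected divergence error, got %v", err)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("ok")
}
```

Refusing to overwrite a divergent link is a design choice for this sketch: it leaves the decision to an operator instead of risking the loss of an account-local transcript directory.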
Architecture (goroutines)
The daemon is a single Go binary composed of cooperating goroutines:
- dispatcher: reads .agent-queue/inbox/*.md across registered projects and assigns tasks to idle sessions.
- quota-monitor: polls each configured Anthropic account's usage window and triggers a failover when the active account crosses its threshold.
- session-watcher: tracks tmux session liveness (ccl-*), heartbeats, and .agent-queue/status.json transitions (idle / working).
- checkpoint: periodically snapshots session context (current task, last tool call, working dir) so an interrupted session can resume on a different account.
- janitor: cleans stale .dispatched markers, archives old done/ tasks, prunes expired checkpoints.
- notifier: pushes state changes (failover fired, session degraded, task failed) to Telegram / MCP dashboard / log aggregator.
- account-switcher: performs the actual swap (stop sessions on account A, rehome symlinks, relaunch sessions on account B, replay last checkpoint). Serialized via a single mutex so only one swap can happen at a time.
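To make the quota-monitor and account-switcher roles above more concrete, here is a minimal sketch of a threshold check feeding a mutex-serialized swap. The usage and thresholds types, the fraction-based representation, and the example threshold values are all assumptions; the real configuration format is not described in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// usage is a hypothetical snapshot of one account's quota consumption,
// expressed as fractions between 0 and 1.
type usage struct {
	FiveHour float64 // share of the 5-hour window consumed
	Weekly   float64 // share of the weekly cap consumed
}

// thresholds holds the (assumed) failover trigger points.
type thresholds struct {
	FiveHour float64
	Weekly   float64
}

// shouldFailover reports whether either quota has reached its threshold.
func shouldFailover(u usage, t thresholds) bool {
	return u.FiveHour >= t.FiveHour || u.Weekly >= t.Weekly
}

// switcher serializes swaps: the mutex is held for the whole operation,
// so a second caller waits until the first swap has finished.
type switcher struct {
	mu     sync.Mutex
	active string
}

func (s *switcher) switchTo(account string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Stopping sessions, rehoming symlinks, relaunching, and replaying the
	// checkpoint would happen here, all under the same lock.
	s.active = account
}

func main() {
	s := &switcher{active: "acct-a"}
	limits := thresholds{FiveHour: 0.90, Weekly: 0.95}

	if shouldFailover(usage{FiveHour: 0.93, Weekly: 0.40}, limits) {
		s.switchTo("acct-b")
	}
	fmt.Println("active account:", s.active)
}
```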
All goroutines communicate through typed channels plus a shared state struct
behind a sync.RWMutex. The daemon exposes an HTTP control plane for the
MCP server to query status and force-trigger operations.
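The channel-plus-shared-state pattern described above could be sketched as follows. The event and state types and their fields are hypothetical, chosen only to show one goroutine mutating shared state under the write lock while readers use the read lock; the HTTP control plane is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical typed message passed between goroutines.
type event struct {
	Kind    string // e.g. "failover"
	Account string
}

// state is the shared struct; readers take the read lock, writers the
// write lock.
type state struct {
	mu     sync.RWMutex
	active string
}

func (s *state) Active() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.active
}

func (s *state) SetActive(account string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.active = account
}

// consume applies events to the shared state until the channel is closed.
func consume(st *state, events <-chan event) {
	for ev := range events {
		if ev.Kind == "failover" {
			st.SetActive(ev.Account)
		}
	}
}

func main() {
	st := &state{active: "acct-a"}
	events := make(chan event)
	done := make(chan struct{})

	go func() {
		consume(st, events)
		close(done)
	}()

	events <- event{Kind: "failover", Account: "acct-b"}
	close(events)
	<-done // wait for the consumer to drain the channel

	fmt.Println("active account:", st.Active())
}
```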
Relationship to SecuAAS agent-orchestrator
This project extracts the session-management and failover logic that
currently lives in dev-management/agent-orchestrator/ (shell scripts:
launch-agent.sh, graceful-switch.sh, watchdog.sh,
checkpoint-daemon.sh, start-dedicated-agents.sh) and reimplements it
as a single Go service. See the orchestrator docs for the operational
context this daemon is designed to replace.
Repository layout
cmd/claude-failover/ Main entrypoint
docs/ Architecture, configuration, analysis notes
scripts/ Setup helpers (shared-projects symlink, etc.)
config.example.yaml Annotated example config
Status
Pre-alpha. Design and scaffolding only — no working binary yet.
License
MIT — see LICENSE.