Observed: tasks from filesecure/.agent-queue/inbox and SecuScan/ .agent-queue/inbox were being routed into ccl-1-conformvault and ccl-2-scanyze whenever those sessions happened to be idle. Those are the operator's manual interactive Claude sessions, not dispatch targets — the auto-dispatch was (a) hijacking a Claude instance the operator was using and (b) triggering /exit via the watcher's completion path when the side-task finished, kicking the operator out mid-conversation. findFreeSession was iterating Pool.Dedicated before the autonomous pool, so any idle dedicated session was the first candidate. - Dispatcher.findFreeSession: remove the Dedicated loop entirely. Auto-dispatch is now pool-only (ccl-auto-11..20). - Watcher.completeSession: defense-in-depth — even if a dedicated session ever ends up in "working" state, it is no longer /exit'd; just marked idle. Pool /exit behaviour unchanged (context recycle). - Tests: new TestFindFreeSessionSkipsDedicated proves the routing; 3 existing tests rewritten to use the autonomous pool instead of relying on Dedicated as a fake pool. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|---|---|---|
| cmd/claude-failover | ||
| docs | ||
| internal | ||
| scripts | ||
| .gitignore | ||
| CLAUDE.md | ||
| config.example.yaml | ||
| go.mod | ||
| go.sum | ||
| LICENSE | ||
| README.md | ||
| VERSION.md | ||
| WORK_IN_PROGRESS.md | ||
claude-failover
Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover.
Overview
claude-failover orchestrates a pool of Claude Code sessions running under
multiple Anthropic accounts. When the active account reaches its quota
threshold (5-hour usage window or weekly cap), the daemon transparently fails
over the workload to a backup account without losing in-flight session state.
It is the runtime glue behind the SecuAAS agent pool (ccl-0..ccl-9,
ccl-auto-11..ccl-auto-20) and is engineered to hold sessions warm across
account swaps by sharing the ~/.claude/projects/ transcript tree via
symlinks.
Architecture (goroutines)
The daemon is a single Go binary composed of cooperating goroutines:
- dispatcher — reads
.agent-queue/inbox/*.mdacross registered projects and assigns tasks to idle sessions. - quota-monitor — polls each configured Anthropic account's usage window and triggers a failover when the active account crosses its threshold.
- session-watcher — tracks tmux session liveness (
ccl-*), heartbeats, and.agent-queue/status.jsontransitions (idle / working). - checkpoint — periodically snapshots session context (current task, last tool call, working dir) so an interrupted session can resume on a different account.
- janitor — cleans stale
.dispatchedmarkers, archives olddone/tasks, prunes expired checkpoints. - notifier — pushes state changes (failover fired, session degraded, task failed) to Telegram / MCP dashboard / log aggregator.
- account-switcher — performs the actual swap: stop sessions on account A, rehome symlinks, relaunch sessions on account B, replay last checkpoint. Serialized via a single mutex so only one swap can happen at a time.
All goroutines communicate through typed channels plus a shared state struct
behind a sync.RWMutex. The daemon exposes an HTTP control plane for the
MCP server to query status and force-trigger operations.
Relationship to SecuAAS agent-orchestrator
This project extracts the session-management and failover logic that
currently lives in dev-management/agent-orchestrator/ (shell scripts:
launch-agent.sh, graceful-switch.sh, watchdog.sh,
checkpoint-daemon.sh, start-dedicated-agents.sh) and reimplements it
as a single Go service. See the orchestrator docs for the operational
context this daemon is designed to replace.
Repository layout
cmd/claude-failover/ Main entrypoint
docs/ Architecture, configuration, analysis notes
scripts/ Setup helpers (shared-projects symlink, etc.)
config.example.yaml Annotated example config
Status
Pre-alpha. Design and scaffolding only — no working binary yet.
License
MIT — see LICENSE.