Before claiming a session for a task, the dispatcher now:

1. Parses the task's frontmatter.
2. If `depends_on: [project:task_id]` is non-empty, checks each entry against `<projectsDir>/<project>/.agent-queue/done/<task_id>.md`.
3. If any dep is unresolved, skips the task and writes `<task>.md.blocked` next to it. The watchdog (G1) will resolve this marker on its next tick.

The `.blocked` marker is idempotent: re-running the dispatcher does not refresh its mtime, so the watchdog can compute the blocked-since timestamp from the FIRST detection (timeout precision).

Path-traversal hardening: project / task_id segments must match `[A-Za-z0-9._-]+` and cannot be `.` or `..`. A malicious frontmatter like `depends_on: ../../tmp:foo` is rejected before any filesystem lookup.

assignNextTask (the doneChan path) applies the same gate so that a session freed mid-cycle cannot bypass enforcement.

Tests (-race clean):
- DependsOnUnresolved -> .blocked marker, no dispatch
- DependsOnResolved -> normal dispatch, no marker
- PartialResolution -> stay blocked
- RejectPathTraversal -> blocked, not dispatched
- BlockedMarker idempotent (mtime stable across passes)
- NoDependsOn regression guard
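The dependency gate described in the commit note above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names (`parseDep`, `unresolvedDeps`, `markBlocked`) are hypothetical, and only the segment pattern, the done-file path, and the `.blocked` suffix come from the note itself.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// segmentRe mirrors the allowed character set quoted in the commit note.
var segmentRe = regexp.MustCompile(`^[A-Za-z0-9._-]+$`)

// validSegment rejects empty strings, ".", "..", and anything outside the set.
func validSegment(s string) bool {
	return s != "." && s != ".." && segmentRe.MatchString(s)
}

// parseDep (hypothetical name) splits a "project:task_id" entry and validates
// both segments before any path is built from them.
func parseDep(entry string) (string, string, error) {
	project, taskID, ok := strings.Cut(entry, ":")
	if !ok || !validSegment(project) || !validSegment(taskID) {
		return "", "", fmt.Errorf("invalid depends_on entry %q", entry)
	}
	return project, taskID, nil
}

// unresolvedDeps returns the entries that are invalid or whose done-file is
// missing. Invalid entries are never looked up on disk.
func unresolvedDeps(projectsDir string, deps []string) []string {
	var missing []string
	for _, d := range deps {
		project, taskID, err := parseDep(d)
		if err != nil {
			missing = append(missing, d)
			continue
		}
		done := filepath.Join(projectsDir, project, ".agent-queue", "done", taskID+".md")
		if _, err := os.Stat(done); err != nil {
			missing = append(missing, d)
		}
	}
	return missing
}

// markBlocked creates <task>.md.blocked only if it does not exist yet, so the
// marker's mtime keeps recording the first detection.
func markBlocked(taskPath string) error {
	f, err := os.OpenFile(taskPath+".blocked", os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if errors.Is(err, os.ErrExist) {
		return nil // already blocked: leave the existing marker untouched
	}
	if err != nil {
		return err
	}
	return f.Close()
}

func main() {
	root, err := os.MkdirTemp("", "projects")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	// Simulate one completed dependency: api:t1.
	doneDir := filepath.Join(root, "api", ".agent-queue", "done")
	if err := os.MkdirAll(doneDir, 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(doneDir, "t1.md"), nil, 0o644); err != nil {
		panic(err)
	}

	task := filepath.Join(root, "task.md")
	missing := unresolvedDeps(root, []string{"api:t1", "api:t2", "../../tmp:foo"})
	if len(missing) > 0 {
		// Calling markBlocked twice leaves a single marker with a stable mtime.
		if err := markBlocked(task); err != nil {
			panic(err)
		}
		if err := markBlocked(task); err != nil {
			panic(err)
		}
	}
	fmt.Println(missing)
}
```

Creating the marker with `O_EXCL` is one way to get the "mtime stable across passes" property without a separate existence check; the actual implementation may do this differently.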
claude-failover
Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover.
Overview
claude-failover orchestrates a pool of Claude Code sessions running under
multiple Anthropic accounts. When the active account reaches its quota
threshold (5-hour usage window or weekly cap), the daemon transparently fails
over the workload to a backup account without losing in-flight session state.
It is the runtime glue behind the SecuAAS agent pool (ccl-0..ccl-9,
ccl-auto-11..ccl-auto-20) and is engineered to hold sessions warm across
account swaps by sharing the ~/.claude/projects/ transcript tree via
symlinks.
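As a rough illustration of the symlink sharing mentioned above, the sketch below points a per-account `projects` directory at one shared tree. The directory layout, the function name `linkSharedProjects`, and the refusal to replace a real directory are all assumptions made for the example; the repository's setup helper may work differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkSharedProjects (hypothetical helper) makes <accountDir>/projects a
// symlink to sharedDir. It is safe to call repeatedly.
func linkSharedProjects(accountDir, sharedDir string) error {
	link := filepath.Join(accountDir, "projects")
	info, err := os.Lstat(link)
	switch {
	case err == nil && info.Mode()&os.ModeSymlink != 0:
		target, err := os.Readlink(link)
		if err != nil {
			return err
		}
		if target == sharedDir {
			return nil // already linked to the shared tree
		}
		if err := os.Remove(link); err != nil { // stale link: replace it
			return err
		}
	case err == nil:
		// A real directory may hold transcripts; do not delete it silently.
		return fmt.Errorf("%s exists and is not a symlink; refusing to replace it", link)
	case !os.IsNotExist(err):
		return err
	}
	return os.Symlink(sharedDir, link)
}

// demo links two hypothetical account directories to one shared tree and
// reports whether both links resolve to it.
func demo() (bool, error) {
	root, err := os.MkdirTemp("", "failover")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)

	shared := filepath.Join(root, "shared-projects")
	accounts := []string{filepath.Join(root, "account-a"), filepath.Join(root, "account-b")}
	for _, dir := range append([]string{shared}, accounts...) {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return false, err
		}
	}
	for _, acct := range accounts {
		for i := 0; i < 2; i++ { // the second call exercises the idempotent path
			if err := linkSharedProjects(acct, shared); err != nil {
				return false, err
			}
		}
		target, err := os.Readlink(filepath.Join(acct, "projects"))
		if err != nil {
			return false, err
		}
		if target != shared {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := demo()
	fmt.Println(ok, err)
}
```

With both accounts resolving to the same tree, a session relaunched under the backup account would find the transcripts written under the previous one, which is the property the overview relies on.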
Architecture (goroutines)
The daemon is a single Go binary composed of cooperating goroutines:
- dispatcher: reads .agent-queue/inbox/*.md across registered projects and assigns tasks to idle sessions.
- quota-monitor: polls each configured Anthropic account's usage window and triggers a failover when the active account crosses its threshold.
- session-watcher: tracks tmux session liveness (ccl-*), heartbeats, and .agent-queue/status.json transitions (idle / working).
- checkpoint: periodically snapshots session context (current task, last tool call, working dir) so an interrupted session can resume on a different account.
- janitor: cleans stale .dispatched markers, archives old done/ tasks, prunes expired checkpoints.
- notifier: pushes state changes (failover fired, session degraded, task failed) to Telegram / MCP dashboard / log aggregator.
- account-switcher: performs the actual swap (stop sessions on account A, rehome symlinks, relaunch sessions on account B, replay last checkpoint). Serialized via a single mutex so only one swap can happen at a time.
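To make the quota-monitor and account-switcher roles more concrete, here is a minimal sketch of a threshold check feeding a mutex-serialized swap. The types `Usage` and `Switcher`, the method `FailoverIfNeeded`, and the 0.90 threshold are hypothetical; the real swap steps (stopping sessions, rehoming symlinks, replaying a checkpoint) are reduced to a comment.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Usage is a hypothetical snapshot of one account's quota consumption,
// expressed as fractions in [0, 1].
type Usage struct {
	FiveHour float64 // share of the 5-hour usage window consumed
	Weekly   float64 // share of the weekly cap consumed
}

// overThreshold reports whether either window has reached the threshold.
func overThreshold(u Usage, threshold float64) bool {
	return u.FiveHour >= threshold || u.Weekly >= threshold
}

// ErrNoBackup is returned when every candidate account is over its threshold.
var ErrNoBackup = errors.New("no backup account below threshold")

// Switcher serializes swaps: the mutex is held for the whole operation, so
// only one swap can be in progress at a time.
type Switcher struct {
	mu     sync.Mutex
	active string
}

// FailoverIfNeeded keeps the active account while it is under the threshold;
// otherwise it switches to the first account in order that still has headroom.
func (s *Switcher) FailoverIfNeeded(usage map[string]Usage, order []string, threshold float64) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !overThreshold(usage[s.active], threshold) {
		return s.active, nil
	}
	for _, name := range order {
		if name != s.active && !overThreshold(usage[name], threshold) {
			// A real swap would stop sessions, rehome symlinks, relaunch the
			// sessions, and replay the last checkpoint at this point.
			s.active = name
			return name, nil
		}
	}
	return s.active, ErrNoBackup
}

func main() {
	s := &Switcher{active: "account-a"}
	usage := map[string]Usage{
		"account-a": {FiveHour: 0.95, Weekly: 0.40},
		"account-b": {FiveHour: 0.10, Weekly: 0.20},
	}
	active, err := s.FailoverIfNeeded(usage, []string{"account-a", "account-b"}, 0.90)
	fmt.Println(active, err) // prints: account-b <nil>
}
```

Holding the lock across both the check and the swap means two goroutines that notice the same quota breach cannot both start a swap; the second caller simply sees the new active account.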
All goroutines communicate through typed channels plus a shared state struct
behind a sync.RWMutex. The daemon exposes an HTTP control plane for the
MCP server to query status and force-trigger operations.
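The combination of typed channels, a shared state struct behind a sync.RWMutex, and an HTTP control plane might look something like the sketch below. Every name here (`SessionEvent`, `State`, `Snapshot`, `statusHandler`, the `/status` path, the JSON field names) is an assumption for illustration; the daemon's actual types and routes are not specified in this README.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// SessionEvent is a hypothetical typed message sent when a session changes status.
type SessionEvent struct {
	Session string
	Status  string // "idle" or "working"
}

// State is the shared struct; every access goes through the RWMutex.
type State struct {
	mu            sync.RWMutex
	activeAccount string
	sessions      map[string]string
}

// Snapshot is the JSON view served by the status endpoint.
type Snapshot struct {
	ActiveAccount string            `json:"active_account"`
	Sessions      map[string]string `json:"sessions"`
}

func NewState(active string) *State {
	return &State{activeAccount: active, sessions: make(map[string]string)}
}

// Apply records one event under the write lock.
func (s *State) Apply(ev SessionEvent) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[ev.Session] = ev.Status
}

// Snapshot copies the state under the read lock so callers never share the map.
func (s *State) Snapshot() Snapshot {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := Snapshot{ActiveAccount: s.activeAccount, Sessions: make(map[string]string, len(s.sessions))}
	for k, v := range s.sessions {
		out.Sessions[k] = v
	}
	return out
}

// statusHandler serves the current snapshot as JSON.
func statusHandler(s *State) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(s.Snapshot()); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	}
}

// fetchStatus calls the handler in-process and returns the response body.
func fetchStatus(s *State) string {
	rec := httptest.NewRecorder()
	statusHandler(s)(rec, httptest.NewRequest(http.MethodGet, "/status", nil))
	return rec.Body.String()
}

func main() {
	st := NewState("account-a")
	events := make(chan SessionEvent)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // consumer goroutine: the only writer in this sketch
		defer wg.Done()
		for ev := range events {
			st.Apply(ev)
		}
	}()

	events <- SessionEvent{Session: "ccl-0", Status: "working"}
	close(events)
	wg.Wait()

	fmt.Print(fetchStatus(st))
}
```

Returning a copy from `Snapshot` keeps the read lock short and lets the HTTP handler encode the data without holding any lock, at the cost of one map copy per request.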
Relationship to SecuAAS agent-orchestrator
This project extracts the session-management and failover logic that
currently lives in dev-management/agent-orchestrator/ (shell scripts:
launch-agent.sh, graceful-switch.sh, watchdog.sh,
checkpoint-daemon.sh, start-dedicated-agents.sh) and reimplements it
as a single Go service. See the orchestrator docs for the operational
context this daemon is designed to replace.
Repository layout
cmd/claude-failover/ Main entrypoint
docs/ Architecture, configuration, analysis notes
scripts/ Setup helpers (shared-projects symlink, etc.)
config.example.yaml Annotated example config
Status
Pre-alpha. Design and scaffolding only — no working binary yet.
License
MIT — see LICENSE.