Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover
Find a file
Ubuntu 62e98cb9e7 fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps
v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503
errors are printed into Claude Code's conversation transcript and stay
visible in every tmux capture until the user scrolls. A persistent
server error would confirm on the second poll and still trigger a swap.

Root cause: the pattern "rate limit" (bare substring) matched any 500
payload that happened to mention rate limits in its error text. Real
HTTP 429s from Anthropic are typed as "rate_limit_error" in the error
payload — and that's the signature we should actually key on.

- Remove "rate limit" from quotaPatterns (too generic — matches transcripts).
- Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit".
- Add serverErrorPatterns veto: "api_error", "overloaded_error",
  "internal server error", "api error: 5". When any is present in the
  pane, isQuotaExhausted returns false even if a quota pattern matched.
- 4 new subtests covering the veto paths + sanity that real 429s pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:26:00 +00:00
cmd/claude-failover feat: Phase 2.7+3 — full integration, config update, systemd unit 2026-04-15 00:15:06 +00:00
docs feat(watcher): Phase 2.1 — SessionWatcher goroutine 2026-04-14 20:27:51 +00:00
internal fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps 2026-04-15 19:26:00 +00:00
scripts chore: add test-and-migrate.sh script 2026-04-15 01:12:49 +00:00
.gitignore chore(gitignore): ignore built binary and .security-reviewed marker 2026-04-15 00:00:23 +00:00
CLAUDE.md chore: add CLAUDE.md and update gitignore 2026-04-14 17:55:29 +00:00
config.example.yaml feat: Phase 2.7+3 — full integration, config update, systemd unit 2026-04-15 00:15:06 +00:00
go.mod feat(dispatcher): Phase 2.2 — Task Dispatcher avec fsnotify 2026-04-14 20:30:08 +00:00
go.sum feat(dispatcher): Phase 2.2 — Task Dispatcher avec fsnotify 2026-04-14 20:30:08 +00:00
LICENSE feat: initial project structure 2026-04-14 13:29:24 +00:00
README.md feat: initial project structure 2026-04-14 13:29:24 +00:00
VERSION.md fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps 2026-04-15 19:26:00 +00:00

claude-failover

Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover.

Overview

claude-failover orchestrates a pool of Claude Code sessions running under multiple Anthropic accounts. When the active account reaches its quota threshold (5-hour usage window or weekly cap), the daemon transparently fails over the workload to a backup account without losing in-flight session state.

It is the runtime glue behind the SecuAAS agent pool (ccl-0..ccl-9, ccl-auto-11..ccl-auto-20) and is engineered to hold sessions warm across account swaps by sharing the ~/.claude/projects/ transcript tree via symlinks.

Architecture (goroutines)

The daemon is a single Go binary composed of cooperating goroutines:

  • dispatcher — reads .agent-queue/inbox/*.md across registered projects and assigns tasks to idle sessions.
  • quota-monitor — polls each configured Anthropic account's usage window and triggers a failover when the active account crosses its threshold.
  • session-watcher — tracks tmux session liveness (ccl-*), heartbeats, and .agent-queue/status.json transitions (idle / working).
  • checkpoint — periodically snapshots session context (current task, last tool call, working dir) so an interrupted session can resume on a different account.
  • janitor — cleans stale .dispatched markers, archives old done/ tasks, prunes expired checkpoints.
  • notifier — pushes state changes (failover fired, session degraded, task failed) to Telegram / MCP dashboard / log aggregator.
  • account-switcher — performs the actual swap: stop sessions on account A, rehome symlinks, relaunch sessions on account B, replay last checkpoint. Serialized via a single mutex so only one swap can happen at a time.

All goroutines communicate through typed channels plus a shared state struct behind a sync.RWMutex. The daemon exposes an HTTP control plane for the MCP server to query status and force-trigger operations.

Relationship to SecuAAS agent-orchestrator

This project extracts the session-management and failover logic that currently lives in dev-management/agent-orchestrator/ (shell scripts: launch-agent.sh, graceful-switch.sh, watchdog.sh, checkpoint-daemon.sh, start-dedicated-agents.sh) and reimplements it as a single Go service. See the orchestrator docs for the operational context this daemon is designed to replace.

Repository layout

cmd/claude-failover/     Main entrypoint
docs/                    Architecture, configuration, analysis notes
scripts/                 Setup helpers (shared-projects symlink, etc.)
config.example.yaml      Annotated example config

Status

Pre-alpha. Design and scaffolding only — no working binary yet.

License

MIT — see LICENSE.