Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover
Ubuntu 20063b1939 fix(switcher+symlinks): rollback on ensure failure (Bug #1) + requiredShared contract test (Bug #10)
Bug #1 (CRITICAL): A3 flip+ensure inconsistency
- Before: EnsureForAccount failure after flip was WARN-only, SetActiveAccount
  still fired → daemon declared target active while shared symlinks were
  absent/divergent → transcripts silently duplicated, resume broken.
- After: ensure failure triggers rollback flip to previous account home;
  if rollback succeeds → explicit error, ActiveAccount stays on previous.
  If rollback ALSO fails → sticky partialSwap flag + ErrPartialSwap; all
  further swaps refused until operator intervention (daemon restart).
- New public IsPartialSwap() for watchdog / health-check integration.

Bug #10 (MEDIUM): requiredShared contract never exercised
- All existing tests override a.sharedSymlinks with tmpdir-scoped lists,
  so symlinks.RequiredShared itself was never tested. A rename or drop
  would pass every test but silently break prod failover.
- TestRequiredSharedIsCoherent asserts (no filesystem): 3 entries with
  the exact required names, absolute targets, and a single shared parent
  directory (invariant EnsureForAccount depends on).
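
A filesystem-free coherence check of this kind might look like the sketch below. It is illustrative only: the `SharedLink` type, the `requiredShared` function, the `/srv/claude-shared` path, and the entry names other than `projects` are placeholders, since the real contents of `symlinks.RequiredShared` are not shown here. Unix-style absolute paths are assumed.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// SharedLink is a hypothetical shape for one required shared entry.
type SharedLink struct {
	Name   string // entry name inside an account home
	Target string // absolute path inside the shared directory
}

// requiredShared stands in for the real list; "shared-b" and "shared-c" are
// placeholder names, not the actual required entries.
func requiredShared() []SharedLink {
	return []SharedLink{
		{Name: "projects", Target: "/srv/claude-shared/projects"},
		{Name: "shared-b", Target: "/srv/claude-shared/shared-b"},
		{Name: "shared-c", Target: "/srv/claude-shared/shared-c"},
	}
}

// checkCoherent verifies the exact names, absolute targets, and a single
// shared parent directory, without touching the filesystem.
func checkCoherent(links []SharedLink, wantNames []string) error {
	if len(links) != len(wantNames) {
		return fmt.Errorf("got %d entries, want %d", len(links), len(wantNames))
	}
	parent := ""
	for i, l := range links {
		if l.Name != wantNames[i] {
			return fmt.Errorf("entry %d: name %q, want %q", i, l.Name, wantNames[i])
		}
		if !filepath.IsAbs(l.Target) {
			return fmt.Errorf("entry %q: target %q is not absolute", l.Name, l.Target)
		}
		dir := filepath.Dir(l.Target)
		if i == 0 {
			parent = dir
		} else if dir != parent {
			return fmt.Errorf("entry %q: parent %q differs from %q", l.Name, dir, parent)
		}
	}
	return nil
}

func main() {
	want := []string{"projects", "shared-b", "shared-c"}
	fmt.Println("coherent:", checkCoherent(requiredShared(), want) == nil)
}
```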

Tests:
- go test ./... PASS
- go test -race ./... PASS (no data race)
- 2 new switcher tests: TestFlipEnsureFailureTriggersRollback,
  TestFlipEnsureAndRollbackFailure
- 1 new symlinks test: TestRequiredSharedIsCoherent
- 1 obsolete test replaced: TestFlipEnsureSymlinksFailureDoesNotAbortSwap
  (encoded the old buggy best-effort behaviour)
2026-04-16 19:53:48 +00:00
cmd/claude-failover feat(lifecycle): validate shared symlinks at daemon startup (A2) 2026-04-16 19:03:43 +00:00
docs docs: add WORK_IN_PROGRESS.md and document false-positive protection 2026-04-15 19:51:15 +00:00
internal fix(switcher+symlinks): rollback on ensure failure (Bug #1) + requiredShared contract test (Bug #10) 2026-04-16 19:53:48 +00:00
scripts chore: add test-and-migrate.sh script 2026-04-15 01:12:49 +00:00
.gitignore chore(gitignore): ignore built binary and .security-reviewed marker 2026-04-15 00:00:23 +00:00
CLAUDE.md chore: add CLAUDE.md and update gitignore 2026-04-14 17:55:29 +00:00
config.example.yaml feat: Phase 2.7+3 — full integration, config update, systemd unit 2026-04-15 00:15:06 +00:00
go.mod feat(dispatcher): Phase 2.2: Task Dispatcher with fsnotify 2026-04-14 20:30:08 +00:00
go.sum feat(dispatcher): Phase 2.2: Task Dispatcher with fsnotify 2026-04-14 20:30:08 +00:00
LICENSE feat: initial project structure 2026-04-14 13:29:24 +00:00
README.md feat: initial project structure 2026-04-14 13:29:24 +00:00
VERSION.md fix(switcher+symlinks): rollback on ensure failure (Bug #1) + requiredShared contract test (Bug #10) 2026-04-16 19:53:48 +00:00
WORK_IN_PROGRESS.md feat(symlinks): add shared-state symlink manager (A1) 2026-04-16 18:55:32 +00:00

claude-failover

Go daemon for Claude Code multi-account session orchestration with automatic quota-based failover.

Overview

claude-failover orchestrates a pool of Claude Code sessions running under multiple Anthropic accounts. When the active account reaches its quota threshold (5-hour usage window or weekly cap), the daemon transparently fails over the workload to a backup account without losing in-flight session state.
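
As a rough illustration of the threshold check described above, the sketch below decides whether a failover should fire. The `Usage` struct, its fields, and the single fractional threshold are assumptions made for the example; the daemon's real quota model and configuration keys may differ.

```go
package main

import "fmt"

// Usage is a hypothetical snapshot of one account's consumption.
type Usage struct {
	FiveHourUsed, FiveHourLimit float64 // rolling 5-hour window
	WeeklyUsed, WeeklyLimit     float64 // weekly cap
}

// fraction returns used/limit, treating a missing limit as fully consumed
// so that a misconfigured account is never preferred.
func fraction(used, limit float64) float64 {
	if limit <= 0 {
		return 1
	}
	return used / limit
}

// ShouldFailover reports whether either window has reached the threshold
// (for example 0.9 for 90%).
func ShouldFailover(u Usage, threshold float64) bool {
	return fraction(u.FiveHourUsed, u.FiveHourLimit) >= threshold ||
		fraction(u.WeeklyUsed, u.WeeklyLimit) >= threshold
}

func main() {
	u := Usage{FiveHourUsed: 92, FiveHourLimit: 100, WeeklyUsed: 300, WeeklyLimit: 1000}
	fmt.Println("failover needed:", ShouldFailover(u, 0.9)) // prints "failover needed: true"
}
```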

It is the runtime glue behind the SecuAAS agent pool (ccl-0..ccl-9, ccl-auto-11..ccl-auto-20) and is engineered to keep sessions warm across account swaps by sharing the ~/.claude/projects/ transcript tree via symlinks.

Architecture (goroutines)

The daemon is a single Go binary composed of cooperating goroutines:

  • dispatcher — reads .agent-queue/inbox/*.md across registered projects and assigns tasks to idle sessions.
  • quota-monitor — polls each configured Anthropic account's usage window and triggers a failover when the active account crosses its threshold.
  • session-watcher — tracks tmux session liveness (ccl-*), heartbeats, and .agent-queue/status.json transitions (idle / working).
  • checkpoint — periodically snapshots session context (current task, last tool call, working dir) so an interrupted session can resume on a different account.
  • janitor — cleans stale .dispatched markers, archives old done/ tasks, prunes expired checkpoints.
  • notifier — pushes state changes (failover fired, session degraded, task failed) to Telegram / MCP dashboard / log aggregator.
  • account-switcher — performs the actual swap: stop sessions on account A, rehome symlinks, relaunch sessions on account B, replay last checkpoint. Serialized via a single mutex so only one swap can happen at a time.

All goroutines communicate through typed channels plus a shared state struct behind a sync.RWMutex. The daemon exposes an HTTP control plane for the MCP server to query status and force-trigger operations.
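
The communication pattern described above can be illustrated with a small sketch: one goroutine receives typed failover requests over a channel and updates shared state guarded by a `sync.RWMutex`. The `FailoverRequest` and `State` types and the `runSwitcher` function are hypothetical names chosen for the example, not the daemon's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// FailoverRequest is a hypothetical typed message from the quota monitor.
type FailoverRequest struct {
	From, To string
	Reason   string
}

// State is shared between goroutines; readers take the read lock so status
// queries do not block each other.
type State struct {
	mu     sync.RWMutex
	active string
}

func (s *State) Active() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.active
}

func (s *State) SetActive(account string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.active = account
}

// runSwitcher consumes requests one at a time, which serializes swaps, and
// closes done once the request channel is closed and drained.
func runSwitcher(reqs <-chan FailoverRequest, st *State, done chan<- struct{}) {
	defer close(done)
	for r := range reqs {
		st.SetActive(r.To)
	}
}

func main() {
	st := &State{active: "account-a"}
	reqs := make(chan FailoverRequest)
	done := make(chan struct{})
	go runSwitcher(reqs, st, done)

	reqs <- FailoverRequest{From: "account-a", To: "account-b", Reason: "5-hour window exhausted"}
	close(reqs)
	<-done

	fmt.Println("active account:", st.Active()) // prints "active account: account-b"
}
```

Consuming requests from a single goroutine is one simple way to get the "only one swap at a time" property; the README states the real switcher uses a mutex for this instead.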

Relationship to SecuAAS agent-orchestrator

This project extracts the session-management and failover logic that currently lives in dev-management/agent-orchestrator/ (shell scripts: launch-agent.sh, graceful-switch.sh, watchdog.sh, checkpoint-daemon.sh, start-dedicated-agents.sh) and reimplements it as a single Go service. See the orchestrator docs for the operational context this daemon is designed to replace.

Repository layout

cmd/claude-failover/     Main entrypoint
internal/                Internal packages (e.g. switcher, symlinks)
docs/                    Architecture, configuration, analysis notes
scripts/                 Setup helpers (shared-projects symlink, etc.)
config.example.yaml      Annotated example config

Status

Pre-alpha. Design and scaffolding only — no working binary yet.

License

MIT — see LICENSE.