claude-failover

Author	SHA1	Message	Date
Ubuntu	e16e3526a0	feat(lifecycle): validate shared symlinks at daemon startup (A2) Wire symlinks.ValidateAll into the lifecycle manager so the daemon refuses to start if any configured account is missing one of the shared-state symlinks or if a link diverges from the canonical target. Previously, a missing link on a freshly deployed VM would silently create a divergent state tree per account (duplicate JSONL transcripts, broken undo history), which is exactly the failure mode the symlinks package (A1) was introduced to prevent. The check runs once at startup before EnsureAllSessions, guarding a single well-defined invariant: "every account home shares the same projects/, file-history/ and session-env/ roots". No auto-heal on divergence: we fail fast with an explicit error so the operator fixes it manually rather than one account's state being overwritten. Part of Phase 1, Workstream A: robust failover. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 19:03:43 +00:00
Ubuntu	91091d7abf	feat(symlinks): add shared-state symlink manager (A1) Adds internal/symlinks package that encodes in code the convention previously maintained by hand on the VM: every Claude account home must expose `session-env`, `file-history` and `projects` as symlinks to a single shared target, so account failover does not create divergent state (duplicate JSONL transcripts, broken undo history). - EnsureForAccount(home, required) creates missing links and target directories, refuses to auto-correct a divergent link (risks data loss), and errors when a regular file sits where the link belongs. - ValidateAll(homes, required) aggregates errors across both accounts so the operator sees every problem at once rather than fixing one per restart cycle. - RequiredShared exposes the production defaults so lifecycle and switcher (A2/A3) can depend on it directly. 9/9 unit tests green. Part of Phase 1, Workstream A: robust failover.	2026-04-16 18:55:32 +00:00
Ubuntu	4cbdcf143a	fix(dispatcher+watcher): never auto-dispatch into dedicated sessions Observed: tasks from filesecure/.agent-queue/inbox and SecuScan/ .agent-queue/inbox were being routed into ccl-1-conformvault and ccl-2-scanyze whenever those sessions happened to be idle. Those are the operator's manual interactive Claude sessions, not dispatch targets — the auto-dispatch was (a) hijacking a Claude instance the operator was using and (b) triggering /exit via the watcher's completion path when the side-task finished, kicking the operator out mid-conversation. findFreeSession was iterating Pool.Dedicated before the autonomous pool, so any idle dedicated session was the first candidate. - Dispatcher.findFreeSession: remove the Dedicated loop entirely. Auto-dispatch is now pool-only (ccl-auto-11..20). - Watcher.completeSession: defense-in-depth — even if a dedicated session ever ends up in "working" state, it is no longer /exit'd; just marked idle. Pool /exit behaviour unchanged (context recycle). - Tests: new TestFindFreeSessionSkipsDedicated proves the routing; 3 existing tests rewritten to use the autonomous pool instead of relying on Dedicated as a fake pool. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:30:26 +00:00
Ubuntu	6b109ed1bc	fix(dispatcher): send a lone Enter after the task paste to submit it Multi-line task bodies arrived in Claude Code as "[Pasted text #N +M lines]" and sat in the input buffer forever — the trailing Enter that SendKeys appends to the paste is consumed as a newline inside the paste, not as a submit. Observed live on ccl-auto-11 (secumon) and ccl-auto-12 (secuops): prompt visible, agent idle. - tmux.Client grows a SendEnter(session) method. ExecClient runs `tmux send-keys -t <sess> Enter` (no preceding text), which Claude's TUI accepts as the explicit submit action after a paste. - Dispatcher: after SendKeys(msg), sleep 500ms for the paste to register, then SendEnter. Same sequence a human would perform. - Five mockTmux implementations updated (quota, dispatcher, switcher, lifecycle, watcher tests). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:49:59 +00:00
Ubuntu	eb6b74c547	feat(pool): add start_index so manual and auto pools can coexist Production had two disjoint tmux pools named alike but for different purposes: ccl-0..ccl-9 — manual/interactive sessions (operator) ccl-auto-11..ccl-auto-20 — autonomous dispatcher pool Until now the daemon's loops iterated prefix + 0..Max, so with the deployed config ("prefix: ccl-auto", min=2, max=10) the dispatcher looked for sessions "ccl-auto0..ccl-auto9" that never existed, while the real auto pool ccl-auto-11..20 was invisible. Net effect: no task was ever dispatched, and killAllPoolSessions fabricated phantom "ccl-auto0/1" sessions on each swap. - AutonomousConfig gains StartIndex (yaml start_index, default 0). Behaviour is unchanged when StartIndex is 0. - Monitor, switcher (kill + recreate), dispatcher (findFreeSession), and lifecycle (EnsureAll + reconcile) all iterate [StartIndex, StartIndex+Max) so the daemon only touches its own range and leaves ccl-0..ccl-9 alone. - Production config updated to prefix: "ccl-auto-", start_index: 11, min: 10, max: 10 — covering the 10 real ccl-auto-11..20 sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:39:57 +00:00
Ubuntu	8fdb1937fc	feat(switcher): auto-resume dedicated sessions after a swap When a legitimate quota hit triggered a swap, killAllPoolSessions tore down the dedicated interactive sessions (ccl-1-conformvault, ccl-2-scanyze) along with the pool, then recreatePoolSessions re-opened them at a bare bash prompt. The operator had to manually re-run CLAUDE_CONFIG_DIR=<target> claude --dangerously-skip-permissions --resume <uuid> after every swap, losing whatever conversation was mid-flight. saveAllSessions only iterates sessions tracked as "working" in state; user-driven dedicated sessions are rarely in that state so their resume UUIDs were never saved. - saveDedicatedUUIDs: capture resume UUID for every configured dedicated session regardless of tracked state, before kill. - relaunchDedicatedSessions(targetHome): after recreate, send a resume command on each dedicated session pointing CLAUDE_CONFIG_DIR at the target account's home. Missing UUID → leave at shell, no blind launch. - isValidResumeUUID hardens against a corrupted resume-id.txt. New TestDedicatedRelaunchAfterSwap verifies end-to-end: pane capture → UUID persisted → resume command sent with the correct CLAUDE_CONFIG_DIR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:24:38 +00:00
Ubuntu	62e98cb9e7	fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503 errors are printed into Claude Code's conversation transcript and stay visible in every tmux capture until the user scrolls. A persistent server error would confirm on the second poll and still trigger a swap. Root cause: the pattern "rate limit" (bare substring) matched any 500 payload that happened to mention rate limits in its error text. Real HTTP 429s from Anthropic are typed as "rate_limit_error" in the error payload — and that's the signature we should actually key on. - Remove "rate limit" from quotaPatterns (too generic — matches transcripts). - Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit". - Add serverErrorPatterns veto: "api_error", "overloaded_error", "internal server error", "api error: 5". When any is present in the pane, isQuotaExhausted returns false even if a quota pattern matched. - 4 new subtests covering the veto paths + sanity that real 429s pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 19:26:00 +00:00
Ubuntu	7c5f8384fa	fix(quota): add cooldown + 2-poll confirmation to prevent swap ping-pong Anthropic HTTP 500 errors surface in the TUI with payloads containing "rate limit" text, which the monitor was matching against quotaPatterns and treating as a real 429 quota hit. With no cooldown and no confirmation, a burst of 500s produced sub-minute ping-pong swaps that tore down user sessions. Two-layer fix: - quota.reactivate_cooldown (already in config, 5m) now gates the monitor too — not just the dispatcher. A completed swap suppresses further detection for the cooldown window. - A hit with no parseable reset time is treated as suspected only on the first poll; a second consecutive poll is required before emitting SwapRequested. Legitimate 429s with "resets in ..." still swap instantly on the first detection. Adds state.RecordSwap / LastSwapInfo for the cooldown, and a forensic log line on every detection: trigger_session, matched pattern, 120-char pane snippet. Tests cover: instant swap with reset, 2-poll confirmation without reset, and suspected-state reset on recovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 19:18:27 +00:00
Ubuntu	c87145ea0b	feat(watcher): Phase 2.1: SessionWatcher goroutine - internal/watcher: detects task completion via signal file, ❯ prompt, or idle timeout - state: ForEachWorking, SetStalled, SetActiveAccount, ActiveAccount - config: WatcherConfig, DispatcherConfig, JanitorConfig, NotificationsConfig + defaults - 5 unit tests, go test ./... -race OK Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 20:27:51 +00:00
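
The server-error veto described in commit 62e98cb9e7 can be illustrated with a minimal sketch. It is not the repository's code: the pattern lists contain only the strings named in that commit message (the real quotaPatterns may hold more), the function signature and the case-insensitive substring matching are assumptions, and the sample pane texts are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Only the patterns named in the commit message; the real list may be longer.
var quotaPatterns = []string{"rate_limit_error", "5-hour limit"}

// Patterns that veto a quota match, as listed in the commit message.
var serverErrorPatterns = []string{
	"api_error",
	"overloaded_error",
	"internal server error",
	"api error: 5",
}

// containsAny reports whether pane contains any pattern, ignoring case.
// Case-insensitive matching is an assumption of this sketch.
func containsAny(pane string, patterns []string) bool {
	lower := strings.ToLower(pane)
	for _, p := range patterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

// isQuotaExhausted returns false whenever a server-error pattern is visible
// in the pane, even if a quota pattern also matched (hypothetical signature).
func isQuotaExhausted(pane string) bool {
	if containsAny(pane, serverErrorPatterns) {
		return false
	}
	return containsAny(pane, quotaPatterns)
}

func main() {
	// Invented pane captures, for illustration only.
	typed429 := `{"type":"error","error":{"type":"rate_limit_error"}}`
	server500 := `API Error: 500 {"type":"api_error","message":"rate_limit_error upstream"}`

	fmt.Println(isQuotaExhausted(typed429))
	fmt.Println(isQuotaExhausted(server500))
}
```

Checking the veto list first means a 5xx payload that stays visible in the transcript cannot confirm a swap, no matter how many polls see it.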

9 commits