When a legitimate quota hit triggered a swap, killAllPoolSessions tore
down the dedicated interactive sessions (ccl-1-conformvault, ccl-2-scanyze)
along with the pool, then recreatePoolSessions re-opened them at a bare
bash prompt. The operator had to manually re-run
CLAUDE_CONFIG_DIR=<target> claude --dangerously-skip-permissions --resume <uuid>
after every swap, losing whatever conversation was mid-flight.
saveAllSessions only iterates sessions tracked as "working" in state;
user-driven dedicated sessions are rarely in that state, so their resume
UUIDs were never saved.
- saveDedicatedUUIDs: capture resume UUID for every configured dedicated
session regardless of tracked state, before kill.
- relaunchDedicatedSessions(targetHome): after recreate, send a resume
command on each dedicated session pointing CLAUDE_CONFIG_DIR at the
target account's home. Missing UUID → leave at shell, no blind launch.
- isValidResumeUUID hardens against a corrupted resume-id.txt.
New TestDedicatedRelaunchAfterSwap verifies end-to-end: pane capture →
UUID persisted → resume command sent with the correct CLAUDE_CONFIG_DIR.
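A minimal sketch of the validate-then-resume step described above. It
assumes the resume id is a canonical 8-4-4-4-12 hex UUID; the helper
resumeCommand and the example paths are hypothetical, and the real
isValidResumeUUID may apply different rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidRe matches the canonical 8-4-4-4-12 hex UUID layout (assumed format).
var uuidRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// isValidResumeUUID reports whether s, ignoring surrounding whitespace,
// looks like a UUID. Anything else is treated as a corrupted resume id.
func isValidResumeUUID(s string) bool {
	return uuidRe.MatchString(strings.TrimSpace(s))
}

// resumeCommand (hypothetical helper) builds the line sent to a dedicated
// pane. ok is false when the UUID is missing or invalid, in which case the
// caller leaves the pane at the shell instead of launching blindly.
// targetHome is interpolated unquoted, so it must not contain spaces.
func resumeCommand(targetHome, uuid string) (cmd string, ok bool) {
	uuid = strings.TrimSpace(uuid)
	if !isValidResumeUUID(uuid) {
		return "", false
	}
	return fmt.Sprintf(
		"CLAUDE_CONFIG_DIR=%s claude --dangerously-skip-permissions --resume %s",
		targetHome, uuid), true
}

func main() {
	cmd, ok := resumeCommand("/tmp/account-b", "123e4567-e89b-12d3-a456-426614174000")
	fmt.Println(ok, cmd)

	_, ok = resumeCommand("/tmp/account-b", "garbage")
	fmt.Println(ok)
}
```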
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503
errors are printed into Claude Code's conversation transcript and stay
visible in every tmux capture until the user scrolls. A persistent
server error would confirm on the second poll and still trigger a swap.
Root cause: the pattern "rate limit" (bare substring) matched any 500
payload that happened to mention rate limits in its error text. Real
HTTP 429s from Anthropic are typed as "rate_limit_error" in the error
payload — and that's the signature we should actually key on.
- Remove "rate limit" from quotaPatterns (too generic — matches transcripts).
- Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit".
- Add serverErrorPatterns veto: "api_error", "overloaded_error",
"internal server error", "api error: 5". When any is present in the
pane, isQuotaExhausted returns false even if a quota pattern matched.
- 4 new subtests covering the veto paths + sanity that real 429s pass.
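A sketch of the veto logic, assuming case-insensitive substring matching
on the captured pane text. Only the patterns named in this message are
listed; the real quotaPatterns slice likely holds more entries, and
containsAny is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// Patterns named in this commit; the real lists may be longer.
var quotaPatterns = []string{"rate_limit_error", "5-hour limit"}

var serverErrorPatterns = []string{
	"api_error", "overloaded_error", "internal server error", "api error: 5",
}

// containsAny (hypothetical helper) reports whether text contains any pattern.
func containsAny(text string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(text, p) {
			return true
		}
	}
	return false
}

// isQuotaExhausted returns true only when a quota pattern is present and
// no server-error pattern is visible anywhere in the pane.
func isQuotaExhausted(pane string) bool {
	lower := strings.ToLower(pane) // assumption: matching is case-insensitive
	if containsAny(lower, serverErrorPatterns) {
		return false // veto: a 5xx in the transcript must not trigger a swap
	}
	return containsAny(lower, quotaPatterns)
}

func main() {
	real429 := `{"type":"error","error":{"type":"rate_limit_error"}}`
	fake := `API Error: 500 {"type":"api_error","message":"rate limit issue"}`
	fmt.Println(isQuotaExhausted(real429), isQuotaExhausted(fake))
}
```

One consequence of this design: a stale 5xx left in the visible transcript
also suppresses a genuine 429 until it scrolls out of the capture.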
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Anthropic HTTP 500 errors surface in the TUI with payloads containing
"rate limit" text, which the monitor was matching against quotaPatterns
and treating as a real 429 quota hit. With no cooldown and no
confirmation, a burst of 500s produced sub-minute ping-pong swaps that
tore down user sessions.
Two-layer fix:
- quota.reactivate_cooldown (already in config, 5m) now gates the
monitor too — not just the dispatcher. A completed swap suppresses
further detection for the cooldown window.
- A hit with no parseable reset time is only marked as suspected on
the first poll; a second consecutive hit on the next poll is required
before emitting SwapRequested. Legitimate 429s with "resets in ..." still
swap instantly on the first detection.
Adds state.RecordSwap / LastSwapInfo for the cooldown, and a
forensic log line on every detection: trigger_session, matched
pattern, 120-char pane snippet.
Tests cover: instant swap with reset, 2-poll confirmation without
reset, and suspected-state reset on recovery.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full test script for the claude-failover daemon (30 tests), plus migration
to daemon-only operation by commenting out the 5 orchestrator crons.
Fixes applied to the script:
- ((PASS++)) → PASS=$((PASS + 1)): the arithmetic command returns a non-zero
status when PASS=0, which makes the script exit under set -e
- grep -c || echo 0 → || errors=0: grep -c already prints 0 on no match, so
the fallback echo put two values in the variable
- Explicit HOME/GOPATH/GOCACHE for go test ./... (module cache outside /root)
- sed pattern extended to match the `* * * * *` dispatcher entry (not only */N)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestKillAndRecreatePoolSessions called executeSwitch(), which ran
flipSymlink() against the REAL $HOME via os.UserHomeDir(). As a result,
~/.claude was repointed at a /tmp/... target that disappeared on reboot,
leaving Claude Code unusable after a restart.
Fix:
- Add the AccountSwitcher.homeDir field (override for tests).
- New helper resolveHomeDir() returns homeDir when set, otherwise
os.UserHomeDir().
- flipSymlink() and resumeContextDir() now use resolveHomeDir().
- TestKillAndRecreatePoolSessions sets a.homeDir = t.TempDir()
before executeSwitch().
Verified: go test ./... passes and /home/ubuntu/.claude stays intact.
- Add internal/lifecycle/manager.go with Manager struct, Run() ticker loop
(15s interval), EnsureAllSessions() for boot-time session creation, and
reconcile() that recreates idle sessions and recovers working ones via
SetFailed + CreateSession
- Add state.SetFailed() to record crash timestamp on SessionState
- Add internal/lifecycle/manager_test.go with mock tmux client and 3 tests:
TestReconcileCreatesDeadSession, TestReconcileRecoversCrashedSession,
TestEnsureAllSessions — all pass
- Wire lifecycle.Manager into cmd/claude-failover/main.go after state init
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>