claude-failover

Author	SHA1	Message	Date
Ubuntu	58690da69f	feat(safety): PreToolUse hook gating destructive tool calls (FNDG-04b, Option A) Adds internal/safety/, the in-repo source of truth for the PreToolUse hook deployed into every project before a Claude Code agent is launched. The hook blocks destructive Bash/Edit/Write patterns on sessions running with --dangerously-skip-permissions, closing the exploitation path where a prompt injection via MCP sessions.send could otherwise trigger arbitrary destruction without interactive confirmation. Wire-up: - internal/dispatcher/dispatcher.go launchAgent: deploys hook before claude launch; fail-closed if deployment fails. - internal/switcher/account_switcher.go relaunchDedicatedSessions: redeploys hook before --resume after account failover; fail-open (log + continue) since the initial deployment is still in place. Blocks (exit 2, stderr shown to model): - rm -rf targeting /, ~, $HOME, /etc, /var, /usr, /boot - dd of=/dev/{sd,nvme,disk,hd,mmcblk}, mkfs - git push --force (but allows --force-with-lease) - git reset --hard on main|master|production - sudo outside short allowlist (systemctl, journalctl, cp, install, apt) - curl|sh, bash <(curl ...), eval "$(curl ...)", fork bomb, crontab -e - chmod 777 on system paths / home - Writes to .claude/settings.json, .claude/hooks/, ~/.ssh/authorized_keys, shell rc files, /etc/sudoers, /etc/systemd/ Warn-only (logged, not blocked): - kubectl delete, helm uninstall, terraform destroy - DROP TABLE, TRUNCATE TABLE, DELETE FROM ... WHERE 1=1 Hook script is embedded via //go:embed so a single binary release carries the authoritative copy. Every launch rewrites the deployed file with mode 0555 (anti-tamper); the hook itself also blocks writes to .claude/hooks/ for defense in depth. Decision: Olivier, 2026-04-19: Option A now, Option C (two pools) tracked separately. Complements FNDG-04 input sanitization in secuaas-mcp. Tests: 8 unit/integration tests in internal/safety/, plus a dispatcher-level test verifying the hook is written before launch. go vet clean, go test ./... all pass. Refs: FNDG-04 audit (secuaas-mcp branch audit/mcp-stdio-2026-04-18) Task: .agent-queue/inbox/20260418-211102-fndg-04b-*.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-19 17:48:27 +00:00
Ubuntu	336f1f27bb	chore(deps): go mod tidy — promote fsnotify to direct No functional change. Groups yaml.v3 and fsnotify as direct deps, isolates golang.org/x/sys as indirect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 21:00:16 +00:00
Ubuntu	20063b1939	fix(switcher+symlinks): rollback on ensure failure (Bug #1) + requiredShared contract test (Bug #10) Bug #1 (CRITICAL): A3 flip+ensure inconsistency - Before: EnsureForAccount failure after flip was WARN-only, SetActiveAccount still fired → daemon declared target active while shared symlinks were absent/divergent → transcripts silently duplicated, resume broken. - After: ensure failure triggers rollback flip to previous account home; if rollback succeeds → explicit error, ActiveAccount stays on previous. If rollback ALSO fails → sticky partialSwap flag + ErrPartialSwap; all further swaps refused until operator intervention (daemon restart). - New public IsPartialSwap() for watchdog / health-check integration. Bug #10 (MEDIUM): requiredShared contract never exercised - All existing tests override a.sharedSymlinks with tmpdir-scoped lists, so symlinks.RequiredShared itself was never tested. A rename or drop would pass every test but silently break prod failover. - TestRequiredSharedIsCoherent asserts (no filesystem): 3 entries with the exact required names, absolute targets, and a single shared parent directory (invariant EnsureForAccount depends on). Tests: - go test ./... PASS - go test -race ./... PASS (no data race) - 2 new switcher tests: TestFlipEnsureFailureTriggersRollback, TestFlipEnsureAndRollbackFailure - 1 new symlinks test: TestRequiredSharedIsCoherent - 1 obsolete test replaced: TestFlipEnsureSymlinksFailureDoesNotAbortSwap (encoded the old buggy best-effort behaviour)	2026-04-16 19:53:48 +00:00
Ubuntu	8eaf0bbd35	feat(switcher): ensure shared symlinks on target home after flip (A3) Wire symlinks.EnsureForAccount into executeSwitch, called immediately after the ~/.claude flip. Guarantees the three shared-state links (session-env, file-history, projects) exist on the target account home even for freshly-provisioned accounts, preventing silent transcript duplication and undo-history divergence on first resume. Best-effort: errors are logged as WARN but never abort the swap. If we returned here the daemon would be left inconsistent (symlink flipped, SetActiveAccount never called). Operator sees the warning in logs and resolves divergent links manually. Tests: - TestFlipReconcilesSharedSymlinksOnTargetHome: empty target home gets all three links pointing at canonical targets after the flip. - TestFlipEnsureSymlinksFailureDoesNotAbortSwap: a planted divergent link triggers the symlinks-package error; the swap completes anyway and the active account is updated. Hermetic: added AccountSwitcher.sharedSymlinks override so tests scope the reconcile inside t.TempDir() and never touch /home/ubuntu/.claude--shared. Existing tests migrated to this pattern and hardcoded /tmp/claude--xxxx paths replaced with tmpdirs. Phase 1 / Workstream A, task A3.	2026-04-16 19:34:03 +00:00
Ubuntu	e16e3526a0	feat(lifecycle): validate shared symlinks at daemon startup (A2) Wire symlinks.ValidateAll into the lifecycle manager so the daemon refuses to start if any configured account is missing one of the shared-state symlinks or if a link diverges from the canonical target. Previously, a missing link on a freshly deployed VM would silently create a divergent state tree per account (duplicate JSONL transcripts, broken undo history), exactly the failure mode the symlinks package (A1) was introduced to prevent. The check runs once at startup before EnsureAllSessions, guarding a single well-defined invariant: "every account home shares the same projects/, file-history/ and session-env/ roots". No auto-heal on divergence: we fail fast with an explicit error so the operator fixes it manually rather than one account's state being overwritten. Part of Phase 1, Workstream A: Robust failover. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 19:03:43 +00:00
Ubuntu	91091d7abf	feat(symlinks): add shared-state symlink manager (A1) Adds internal/symlinks package that encodes in code the convention previously maintained by hand on the VM: every Claude account home must expose `session-env`, `file-history` and `projects` as symlinks to a single shared target, so account failover does not create divergent state (duplicate JSONL transcripts, broken undo history). - EnsureForAccount(home, required) creates missing links and target directories, refuses to auto-correct a divergent link (risks data loss), and errors when a regular file sits where the link belongs. - ValidateAll(homes, required) aggregates errors across both accounts so the operator sees every problem at once rather than fixing one per restart cycle. - RequiredShared exposes the production defaults so lifecycle and switcher (A2/A3) can depend on it directly. 9/9 unit tests green. Part of Phase 1, Workstream A: Robust failover.	2026-04-16 18:55:32 +00:00
Ubuntu	4cbdcf143a	fix(dispatcher+watcher): never auto-dispatch into dedicated sessions Observed: tasks from filesecure/.agent-queue/inbox and SecuScan/ .agent-queue/inbox were being routed into ccl-1-conformvault and ccl-2-scanyze whenever those sessions happened to be idle. Those are the operator's manual interactive Claude sessions, not dispatch targets — the auto-dispatch was (a) hijacking a Claude instance the operator was using and (b) triggering /exit via the watcher's completion path when the side-task finished, kicking the operator out mid-conversation. findFreeSession was iterating Pool.Dedicated before the autonomous pool, so any idle dedicated session was the first candidate. - Dispatcher.findFreeSession: remove the Dedicated loop entirely. Auto-dispatch is now pool-only (ccl-auto-11..20). - Watcher.completeSession: defense-in-depth — even if a dedicated session ever ends up in "working" state, it is no longer /exit'd; just marked idle. Pool /exit behaviour unchanged (context recycle). - Tests: new TestFindFreeSessionSkipsDedicated proves the routing; 3 existing tests rewritten to use the autonomous pool instead of relying on Dedicated as a fake pool. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:30:26 +00:00
Ubuntu	6b109ed1bc	fix(dispatcher): send a lone Enter after the task paste to submit it Multi-line task bodies arrived in Claude Code as "[Pasted text #N +M lines]" and sat in the input buffer forever — the trailing Enter that SendKeys appends to the paste is consumed as a newline inside the paste, not as a submit. Observed live on ccl-auto-11 (secumon) and ccl-auto-12 (secuops): prompt visible, agent idle. - tmux.Client grows a SendEnter(session) method. ExecClient runs `tmux send-keys -t <sess> Enter` (no preceding text), which Claude's TUI accepts as the explicit submit action after a paste. - Dispatcher: after SendKeys(msg), sleep 500ms for the paste to register, then SendEnter. Same sequence a human would perform. - Five mockTmux implementations updated (quota, dispatcher, switcher, lifecycle, watcher tests). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:49:59 +00:00
Ubuntu	eb6b74c547	feat(pool): add start_index so manual and auto pools can coexist Production had two disjoint tmux pools named alike but for different purposes: ccl-0..ccl-9 — manual/interactive sessions (operator) ccl-auto-11..ccl-auto-20 — autonomous dispatcher pool Until now the daemon's loops iterated prefix + 0..Max, so with the deployed config ("prefix: ccl-auto", min=2, max=10) the dispatcher looked for sessions "ccl-auto0..ccl-auto9" that never existed, while the real auto pool ccl-auto-11..20 was invisible. Net effect: no task was ever dispatched, and killAllPoolSessions fabricated phantom "ccl-auto0/1" sessions on each swap. - AutonomousConfig gains StartIndex (yaml start_index, default 0). Behaviour is unchanged when StartIndex is 0. - Monitor, switcher (kill + recreate), dispatcher (findFreeSession), and lifecycle (EnsureAll + reconcile) all iterate [StartIndex, StartIndex+Max) so the daemon only touches its own range and leaves ccl-0..ccl-9 alone. - Production config updated to prefix: "ccl-auto-", start_index: 11, min: 10, max: 10 — covering the 10 real ccl-auto-11..20 sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:39:57 +00:00
Ubuntu	8fdb1937fc	feat(switcher): auto-resume dedicated sessions after a swap When a legitimate quota hit triggered a swap, killAllPoolSessions tore down the dedicated interactive sessions (ccl-1-conformvault, ccl-2-scanyze) along with the pool, then recreatePoolSessions re-opened them at a bare bash prompt. The operator had to manually re-run CLAUDE_CONFIG_DIR=<target> claude --dangerously-skip-permissions --resume <uuid> after every swap, losing whatever conversation was mid-flight. saveAllSessions only iterates sessions tracked as "working" in state; user-driven dedicated sessions are rarely in that state so their resume UUIDs were never saved. - saveDedicatedUUIDs: capture resume UUID for every configured dedicated session regardless of tracked state, before kill. - relaunchDedicatedSessions(targetHome): after recreate, send a resume command on each dedicated session pointing CLAUDE_CONFIG_DIR at the target account's home. Missing UUID → leave at shell, no blind launch. - isValidResumeUUID hardens against a corrupted resume-id.txt. New TestDedicatedRelaunchAfterSwap verifies end-to-end: pane capture → UUID persisted → resume command sent with the correct CLAUDE_CONFIG_DIR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 20:24:38 +00:00
Ubuntu	62e98cb9e7	fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503 errors are printed into Claude Code's conversation transcript and stay visible in every tmux capture until the user scrolls. A persistent server error would confirm on the second poll and still trigger a swap. Root cause: the pattern "rate limit" (bare substring) matched any 500 payload that happened to mention rate limits in its error text. Real HTTP 429s from Anthropic are typed as "rate_limit_error" in the error payload — and that's the signature we should actually key on. - Remove "rate limit" from quotaPatterns (too generic — matches transcripts). - Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit". - Add serverErrorPatterns veto: "api_error", "overloaded_error", "internal server error", "api error: 5". When any is present in the pane, isQuotaExhausted returns false even if a quota pattern matched. - 4 new subtests covering the veto paths + sanity that real 429s pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 19:26:00 +00:00
Ubuntu	7c5f8384fa	fix(quota): add cooldown + 2-poll confirmation to prevent swap ping-pong Anthropic HTTP 500 errors surface in the TUI with payloads containing "rate limit" text, which the monitor was matching against quotaPatterns and treating as a real 429 quota hit. With no cooldown and no confirmation, a burst of 500s produced sub-minute ping-pong swaps that tore down user sessions. Two-layer fix: - quota.reactivate_cooldown (already in config, 5m) now gates the monitor too — not just the dispatcher. A completed swap suppresses further detection for the cooldown window. - A hit with no parseable reset time is treated as suspected only on the first poll; a second consecutive poll is required before emitting SwapRequested. Legitimate 429s with "resets in ..." still swap instantly on the first detection. Adds state.RecordSwap / LastSwapInfo for the cooldown, and a forensic log line on every detection: trigger_session, matched pattern, 120-char pane snippet. Tests cover: instant swap with reset, 2-poll confirmation without reset, and suspected-state reset on recovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 19:18:27 +00:00
Ubuntu	c87145ea0b	feat(watcher): Phase 2.1: SessionWatcher goroutine - internal/watcher: detects end of task via signal file, ❯ prompt, idle timeout - state: ForEachWorking, SetStalled, SetActiveAccount, ActiveAccount - config: WatcherConfig, DispatcherConfig, JanitorConfig, NotificationsConfig + defaults - 5 unit tests, go test ./... -race OK Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 20:27:51 +00:00

13 commits
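
The quota fix in 62e98cb9e7 describes a match-then-veto rule: a pane counts as quota-exhausted only if it contains a quota pattern and no server-error pattern. The Go sketch below illustrates that rule under stated assumptions. The pattern strings are the ones named in the commit message, but the function signature, the case-insensitive matching, the `containsAny` helper, and the sample payloads are hypothetical and not taken from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// Pattern strings as named in the 62e98cb9e7 commit message. The real
// quotaPatterns list may contain further entries not mentioned there.
var quotaPatterns = []string{"rate_limit_error", "5-hour limit"}

var serverErrorPatterns = []string{
	"api_error",
	"overloaded_error",
	"internal server error",
	"api error: 5",
}

// containsAny is a hypothetical helper: it reports whether haystack
// contains at least one of the needles as a substring.
func containsAny(haystack string, needles []string) bool {
	for _, n := range needles {
		if strings.Contains(haystack, n) {
			return true
		}
	}
	return false
}

// isQuotaExhausted sketches the veto rule. The signature and the
// lowercasing step are assumptions made for this illustration.
func isQuotaExhausted(pane string) bool {
	lower := strings.ToLower(pane)
	// Veto first: any 5xx signature in the pane wins over a quota match.
	if containsAny(lower, serverErrorPatterns) {
		return false
	}
	return containsAny(lower, quotaPatterns)
}

func main() {
	// Illustrative payloads, not captured from a real session.
	typed429 := `{"type":"error","error":{"type":"rate_limit_error"}}`
	server500 := `API Error: 500 {"type":"api_error","message":"rate limit mentioned"}`

	fmt.Println(isQuotaExhausted(typed429))  // true
	fmt.Println(isQuotaExhausted(server500)) // false
}
```

Checking the veto list before the quota list keeps the rule easy to audit: a persistent 5xx transcript can never trigger a swap, whichever quota pattern it happens to contain.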