16 commits

Author SHA1 Message Date
Ubuntu
5cad53ac7a docs: add WORK_IN_PROGRESS.md and document false-positive protection
- WORK_IN_PROGRESS.md captures the v0.2.1→v0.2.3 incident, root cause,
  and the optional follow-ups (preserve dedicated sessions during swap,
  Telegram alert on SwapRequested, /quota/status endpoint).
- architecture.md §2.2.1 describes the four-layer defense:
  strict patterns, 5xx veto, two-poll confirmation, post-swap cooldown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:51:15 +00:00
Ubuntu
62e98cb9e7 fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps
v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503
errors are printed into Claude Code's conversation transcript and stay
visible in every tmux capture until the user scrolls. A persistent
server error would confirm on the second poll and still trigger a swap.

Root cause: the pattern "rate limit" (bare substring) matched any 500
payload that happened to mention rate limits in its error text. Real
HTTP 429s from Anthropic are typed as "rate_limit_error" in the error
payload — and that's the signature we should actually key on.

- Remove "rate limit" from quotaPatterns (too generic — matches transcripts).
- Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit".
- Add serverErrorPatterns veto: "api_error", "overloaded_error",
  "internal server error", "api error: 5". When any is present in the
  pane, isQuotaExhausted returns false even if a quota pattern matched.
- 4 new subtests covering the veto paths + sanity that real 429s pass.
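
The veto logic described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the pattern lists contain only the strings named in this commit message, and the case-insensitive matching and the helper name containsAny are assumptions.

```go
package main

import "strings"

// quotaPatterns holds only the two patterns this commit adds; the real list
// may contain more entries.
var quotaPatterns = []string{"rate_limit_error", "5-hour limit"}

// serverErrorPatterns are the veto strings named in the commit message.
var serverErrorPatterns = []string{
	"api_error",
	"overloaded_error",
	"internal server error",
	"api error: 5",
}

// containsAny reports whether text contains at least one of the patterns.
// (Hypothetical helper, introduced for this sketch.)
func containsAny(text string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(text, p) {
			return true
		}
	}
	return false
}

// isQuotaExhausted returns true only when a quota pattern is present and no
// server-error pattern appears anywhere in the captured pane.
// Lowercasing the pane before matching is an assumption of this sketch.
func isQuotaExhausted(pane string) bool {
	text := strings.ToLower(pane)
	if containsAny(text, serverErrorPatterns) {
		return false // the 5xx veto wins over any quota match
	}
	return containsAny(text, quotaPatterns)
}
```

Checking the veto first means a pane that shows both a typed 429 and a lingering 5xx error is treated as not exhausted, which matches the "returns false even if a quota pattern matched" rule above.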

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:26:00 +00:00
Ubuntu
7c5f8384fa fix(quota): add cooldown + 2-poll confirmation to prevent swap ping-pong
Anthropic HTTP 500 errors surface in the TUI with payloads containing
"rate limit" text, which the monitor was matching against quotaPatterns
and treating as a real 429 quota hit. With no cooldown and no
confirmation, a burst of 500s produced sub-minute ping-pong swaps that
tore down user sessions.

Two-layer fix:
- quota.reactivate_cooldown (already in config, 5m) now gates the
  monitor too — not just the dispatcher. A completed swap suppresses
  further detection for the cooldown window.
- A hit with no parseable reset time is treated as suspected only on
  the first poll; a second consecutive poll is required before
  emitting SwapRequested. Legitimate 429s with "resets in ..." still
  swap instantly on the first detection.

Adds state.RecordSwap / LastSwapInfo for the cooldown, and a
forensic log line on every detection: trigger_session, matched
pattern, 120-char pane snippet.

Tests cover: instant swap with reset, 2-poll confirmation without
reset, and suspected-state reset on recovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:18:27 +00:00
Ubuntu
75b5110748 chore: add test-and-migrate.sh script
Full test script for the claude-failover daemon (30 tests), plus migration
to a daemon-only setup by commenting out the 5 orchestrator cron jobs.

Fixes applied to the script:
- ((PASS++)) → PASS=$((PASS + 1)): ((PASS++)) returns a non-zero status
  when PASS=0, which aborts the script under set -e
- grep -c || echo 0 → || errors=0: grep -c already prints 0 before exiting
  non-zero, so the echo left two values in the variable
- Explicit HOME/GOPATH/GOCACHE for go test ./... (module cache outside /root)
- Extended sed pattern to match the `* * * * *` dispatcher entry (not only */N)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 01:12:49 +00:00
Ubuntu
28f82a66c8 feat: Phase 2.7+3 — full integration, config update, systemd unit
- Wire all goroutines in main.go: watcher, quota monitor, account
  switcher, dispatcher, janitor, and 10s state flush loop
- Add missing sections to config.example.yaml: notifications,
  dispatcher, watcher, janitor
- Add scripts/claude-failover.service systemd unit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 00:15:06 +00:00
Ubuntu
d4260f71b4 chore(gitignore): ignore built binary and .security-reviewed marker
The previous commit accidentally included the compiled binary and the
security-review marker. These files must never be versioned.
2026-04-15 00:00:23 +00:00
Ubuntu
9f7da110d2 fix(tests): isolate test symlink manipulation via t.TempDir() to prevent overwriting ~/.claude
The TestKillAndRecreatePoolSessions test called executeSwitch(), which ran
flipSymlink() against the REAL $HOME via os.UserHomeDir(). As a result,
~/.claude was repointed to a /tmp/... target that disappeared on reboot,
leaving Claude Code unusable after a restart.

Fix:
- Added the AccountSwitcher.homeDir field (override for tests).
- New resolveHomeDir() helper that returns homeDir if set, otherwise
  os.UserHomeDir().
- flipSymlink() and resumeContextDir() now use resolveHomeDir().
- TestKillAndRecreatePoolSessions sets a.homeDir = t.TempDir()
  before executeSwitch().

Verified: go test ./... passes and /home/ubuntu/.claude remains intact.
2026-04-14 23:59:59 +00:00
Ubuntu
133165b432 feat(quota): Phase 2.3 - QuotaMonitor (tmux pane scraping)
- internal/quota: SwitchRequest, poll() every 30s
- isQuotaExhausted: 5 patterns (hit limit, rate limit, quota exceeded, etc.)
- extractResetTime: regex for "resets 8pm / resets at 11:30pm / resets in N min"
- Thresholds: >=2 pool sessions OR >=1 dedicated session → SwitchRequest channel(1)
- 5 tests: patterns, reset time, trigger 2 pool, trigger 1 dedicated, no-trigger

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:31:04 +00:00
Ubuntu
0a7e5efcfd feat(dispatcher): Phase 2.2 - Task Dispatcher with fsnotify
- internal/dispatcher: fsnotify on inbox/, 60s polling fallback, launchAgent
- parseFrontmatter (YAML), modelForPriority (critical→opus, everything else→sonnet)
- waitForPrompt polls for ❯, buildTaskMessage, 1 task per session
- isSessionFree: checks tmux liveness + idle state + 5min cooldown
- 5 unit tests (parse, model, dispatch, no-session, missing-tmux)
- go.mod: add fsnotify v1.9.0
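
The frontmatter-to-model step can be illustrated with a simplified sketch. The hand-rolled "key: value" parser below stands in for real YAML parsing, and the priority key name is an assumption; only the critical→opus, otherwise→sonnet mapping comes from the commit message.

```go
package main

import (
	"bufio"
	"strings"
)

// parseFrontmatter reads flat "key: value" lines between the leading "---"
// fences. This sketch handles string values only and is not a YAML parser.
func parseFrontmatter(doc string) map[string]string {
	meta := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(doc))
	if !sc.Scan() || strings.TrimSpace(sc.Text()) != "---" {
		return meta // no frontmatter block at the top of the file
	}
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "---" {
			break // closing fence reached
		}
		key, value, found := strings.Cut(line, ":")
		if found {
			meta[strings.TrimSpace(key)] = strings.TrimSpace(value)
		}
	}
	return meta
}

// modelForPriority maps critical tasks to opus and everything else to sonnet.
func modelForPriority(priority string) string {
	if priority == "critical" {
		return "opus"
	}
	return "sonnet"
}
```

A task file without a priority key yields an empty string from the map, so it falls through to sonnet.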

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:30:08 +00:00
Ubuntu
46c49d0f2f feat(notify): Phase 2.5 - Telegram notifier + Resend email
- internal/notify: Telegram (POST /sendMessage) and Resend email (POST /emails)
- Credentials read from env vars (telegramBaseURL/resendBaseURL overridable in tests)
- Graceful no-op when the token/key is absent
- 5 unit tests with httptest.Server

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:28:46 +00:00
Ubuntu
c87145ea0b feat(watcher): Phase 2.1 - SessionWatcher goroutine
- internal/watcher: detects task completion via signal file, ❯ prompt, idle timeout
- state: ForEachWorking, SetStalled, SetActiveAccount, ActiveAccount
- config: WatcherConfig, DispatcherConfig, JanitorConfig, NotificationsConfig + defaults
- 5 unit tests, go test ./... -race OK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:27:51 +00:00
Ubuntu
978b60ccf7 feat: SessionLifecycleManager — auto-detect and repair dead tmux sessions
- Add internal/lifecycle/manager.go with Manager struct, Run() ticker loop
  (15s interval), EnsureAllSessions() for boot-time session creation, and
  reconcile() that recreates idle sessions and recovers working ones via
  SetFailed + CreateSession
- Add state.SetFailed() to record crash timestamp on SessionState
- Add internal/lifecycle/manager_test.go with mock tmux client and 3 tests:
  TestReconcileCreatesDeadSession, TestReconcileRecoversCrashedSession,
  TestEnsureAllSessions — all pass
- Wire lifecycle.Manager into cmd/claude-failover/main.go after state init

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 18:02:25 +00:00
Ubuntu
2d43580c18 chore: add CLAUDE.md and update gitignore
2026-04-14 17:55:29 +00:00
claude-agent
65c25c8955 docs: enrich architecture with Go implementation details
2026-04-14 13:36:20 +00:00
481e276d7d chore: update module path after repo transfer to olivier/
2026-04-14 13:30:18 +00:00
cf4957010f feat: initial project structure
2026-04-14 13:29:24 +00:00