diff --git a/VERSION.md b/VERSION.md
index 9fa2f32..fe1d813 100644
--- a/VERSION.md
+++ b/VERSION.md
@@ -1,161 +1,4 @@
-# Version actuelle : 0.3.8
-
-## [0.3.8] - 2026-04-16
-**Type:** Patch — Bug #1 (A3 flip+ensure inconsistency) + Bug #10 (requiredShared contract test)
-
-### Corrigé — Bug #1 (CRITIQUE)
-- `AccountSwitcher.executeSwitch` ne continue plus silencieusement quand
-  `symlinks.EnsureForAccount` échoue après le flip : il **roll-back** le lien
-  `~/.claude` vers le home du compte précédent et **n'appelle pas**
-  `SetActiveAccount`. Évite l'état incohérent où le daemon déclare le compte
-  cible actif alors que ses shared symlinks sont divergents → transcripts
-  dupliqués silencieusement, resume cassé.
-- Si le rollback réussit : swap annulé, état filesystem = état pré-swap,
-  erreur explicite retournée par `executeSwitchE`.
-- Si ensure ET rollback échouent : `partialSwap` atomique sticky set,
-  `ErrPartialSwap` retourné, tout futur swap est refusé tant que le
-  daemon n'est pas redémarré par l'opérateur.
-- Nouvelle méthode publique `AccountSwitcher.IsPartialSwap() bool` pour
-  que health-checks et watchdog exposent l'état dégradé.
-
-### Ajouté — Tests Bug #1
-- `TestFlipEnsureFailureTriggersRollback` : plant un lien divergent sur
-  le home cible → ensure échoue → rollback réussit → `ActiveAccount` reste
-  compte1 → `~/.claude` pointe sur previousHome → `IsPartialSwap` = false.
-- `TestFlipEnsureAndRollbackFailure` : force les deux flips à échouer
-  (homeDir = fichier régulier) → `ErrPartialSwap` retourné, flag sticky
-  set, swap suivant refusé.
-
-### Ajouté — Bug #10
-- `TestRequiredSharedIsCoherent` (`internal/symlinks/shared_test.go`) :
-  valide le contrat de la constante package-level `RequiredShared` jamais
-  exercée auparavant (tous les autres tests utilisent un override scoped
-  en `t.TempDir()`). Vérifie sans toucher au filesystem :
-  - exactement 3 entrées (`session-env`, `file-history`, `projects`)
-  - targets absolus
-  - `filepath.Dir(target)` identique pour les 3 entrées (invariant
-    "3 liens sous un même shared root" sur lequel repose `EnsureForAccount`).
-
-### Rationale
-- Continuer après un ensure échoué revient à valider que le compte cible
-  est "sain" alors que les shared symlinks sont absents ou divergents.
-  Conséquence en prod : premier `claude --resume` écrit dans
-  `~/.claude/projects/` (privé) → transcripts dupliqués, undo
-  désynchronisé, failover complètement cassé sans log d'alerte.
-- Le rollback garantit qu'un compte cible mal configuré ne peut PAS
-  dégrader le state du daemon : on retourne à l'état pré-swap et on
-  signale l'erreur à l'appelant.
-- `ErrPartialSwap` + `IsPartialSwap()` documente un état où l'intervention
-  humaine est obligatoire — préférable à un retry automatique qui
-  empirerait la divergence.
-
-### Tests
-- ✅ `go test ./...` : tous les packages PASS
-- ✅ `go test -race ./...` : PASS, aucun data race
-- ✅ `go vet ./...` : clean
-- ✅ `go build ./...` : clean
-
-### Fichiers modifiés
-- `internal/switcher/account_switcher.go` (+rollback + IsPartialSwap + ErrPartialSwap)
-- `internal/switcher/account_switcher_test.go` (2 nouveaux tests, 1 test obsolète remplacé)
-- `internal/symlinks/shared_test.go` (+TestRequiredSharedIsCoherent)
-
-## [0.3.7] - 2026-04-16
-**Type:** Patch — Phase 1 / Chantier A3 : wire EnsureForAccount post-flip
-
-### Ajouté
-- `AccountSwitcher.executeSwitch` appelle désormais
-  `symlinks.EnsureForAccount(target.Home, ...)` **juste après** le flip
-  du lien principal `~/.claude`. Garantit que les 3 liens partagés
-  (`session-env`, `file-history`, `projects`) existent et pointent aux
-  bons targets sur le compte cible, même si celui-ci vient juste
-  d'être provisionné.
-- `AccountSwitcher.sharedSymlinks` : override test-only (accepte une
-  liste `[]symlinks.SharedSymlink`). Défaut = `symlinks.RequiredShared`.
-  Les tests peuvent scoper la réconciliation dans un `t.TempDir()` pour
-  ne jamais toucher `/home/ubuntu/.claude-*-shared`.
-- 2 tests unitaires :
-  - `TestFlipReconcilesSharedSymlinksOnTargetHome` : target home vide →
-    les 3 liens sont créés après le flip et pointent aux targets canoniques.
-  - `TestFlipEnsureSymlinksFailureDoesNotAbortSwap` : lien divergent
-    planté à la main → `EnsureForAccount` renvoie une erreur, logguée
-    en WARN, mais le swap complète quand même (best-effort post-flip).
-
-### Rationale
-- Sans cet appel, un compte cible fraîchement provisionné n'aurait
-  pas encore ses 3 liens ; au premier `claude --resume`, Claude Code
-  écrirait dans `~/.claude/projects/` (privé) au lieu de
-  `/home/ubuntu/.claude-projects-shared` → transcripts dupliqués,
-  undo désynchronisé, resume silencieusement cassé.
-- L'ensure est **best-effort** : une erreur est logguée en WARN mais
-  NE bloque PAS le flip. Si on abortait ici, on laisserait le daemon
-  dans un état incohérent (symlink déjà flippé mais `SetActiveAccount`
-  pas appelé).
-- L'opérateur voit le WARN dans les logs et peut corriger la
-  divergence manuellement (ex: lien pointant sur le mauvais target).
-
-### Tests
-- ✅ `go test ./...` : tous les packages PASS (incluant
-  `internal/switcher` et `internal/symlinks`).
-- ✅ `go test -race ./internal/switcher/...` : PASS.
-- ✅ `go vet ./...` : clean.
-
-### Fichiers modifiés
-- `internal/switcher/account_switcher.go`
-- `internal/switcher/account_switcher_test.go`
-
-## [0.3.6] - 2026-04-16
-**Type:** Patch — Phase 1 / Chantier A2 : validation des symlinks au startup
-
-### Ajouté
-- `Manager.ValidateSharedSymlinks()` : nouvelle méthode dans
-  `internal/lifecycle` qui agrège les `Home` de tous les comptes
-  configurés et délègue à `symlinks.ValidateAll`. Échoue dur si un
-  compte n'a pas de `home` défini ou si un lien est absent/divergent.
-- `cmd/claude-failover/main.go` appelle cette validation **avant**
-  `EnsureAllSessions()` : un état partagé cassé ne laissera plus le
-  daemon démarrer et divergér silencieusement.
-
-### Rationale
-- Un opérateur qui copie la config sur une nouvelle VM ne peut plus
-  oublier les liens — le daemon refuse de démarrer jusqu'à ce qu'ils
-  soient corrects.
-- Pas d'auto-heal sur divergence : on préfère un message d'erreur
-  explicite à un `rm -f` silencieux qui détruirait l'autre compte.
-
-### Tests
-- ✅ `go test ./...` : tous les packages PASS (incluant
-  `internal/lifecycle` et `internal/symlinks`).
-
-### Fichiers modifiés
-- `cmd/claude-failover/main.go` (+9)
-- `internal/lifecycle/manager.go` (+31)
-
-## [0.3.5] - 2026-04-16
-**Type:** Patch — Phase 1 / Chantier A1 : package `internal/symlinks`
-
-### Ajouté
-- `internal/symlinks/shared.go` : `EnsureForAccount` + `ValidateAll` qui
-  encodent en code la convention des 3 symlinks partagés par compte
-  (`session-env`, `file-history`, `projects`). Jusqu'à aujourd'hui ces
-  liens étaient maintenus à la main et leur absence silencieuse cassait
-  le failover (JSONL dupliqués, undo désynchronisé).
-- Tests unitaires couvrant : création missing, idempotence, divergence
-  (refus d'auto-correction pour éviter la perte de données), fichier
-  régulier à la place du lien, home vide, agrégation d'erreurs multi-comptes.
-
-### Rationale
-- Un déploiement sur une nouvelle VM ne peut plus omettre les liens.
-- Divergent link → erreur explicite, jamais de correction silencieuse.
-- Préparation des tâches A2 (ValidateAll au startup) et A3 (EnsureForAccount
-  post-flipSymlink dans le switcher).
-
-### Tests
-- ✅ `go test ./internal/symlinks/...` : 9/9 PASS
-
-### Fichiers ajoutés
-- `internal/symlinks/shared.go`
-- `internal/symlinks/shared_test.go`
+# Version actuelle : 0.3.4
 
 ## [0.3.4] - 2026-04-16
 **Type:** Patch — Dispatcher ne route JAMAIS vers les sessions dédiées
diff --git a/WORK_IN_PROGRESS.md b/WORK_IN_PROGRESS.md
index 7538918..049b34a 100644
--- a/WORK_IN_PROGRESS.md
+++ b/WORK_IN_PROGRESS.md
@@ -1,26 +1,13 @@
 # Travaux en Cours - claude-failover
 
 ## Dernière mise à jour
-2026-04-16 19:00:00
+2026-04-15 19:30:00
 
 ## Version Actuelle
-0.3.5 (en cours de progression vers 0.4.0)
+0.3.0
 
 ## Demande Actuelle
-**Phase 1 / Chantier A — Failover robuste** (spec dans `ccl-platform/phases/phase1/A-failover.md`).
-Rendre le failover compte1 ↔ compte2 déterministe en intégrant dans le code les fixes manuels
-(symlinks partagés), en ajoutant un registre UUID fiable, et en durcissant tmux send-keys.
-
-Branche : `feat/phase1-A-failover-robust`.
-
-## Sous-tâches Chantier A
-- [x] A1 — `internal/symlinks/shared.go` (+ tests) — v0.3.5
-- [ ] A2 — `lifecycle/manager.go` : `ValidateAll` au startup
-- [ ] A3 — `switcher/account_switcher.go` : `EnsureForAccount` post-flip
-- [ ] A4 — `internal/registry/uuid_registry.go` (+ tests)
-- [ ] A5 — `internal/tmux/send.go` avec retry exponentiel (+ tests)
-- [ ] A6 — Capture UUID 200 → 500 lignes
-- [ ] A7 — `scripts/test-failover.sh` dans ccl-platform + scripts associés
+Aucune — v0.2.3 shippée, service stable.
 
 ## Étapes Complétées
 - [x] v0.2.1 — Cooldown post-swap + log forensique (trigger_session, pattern, snippet)
diff --git a/cmd/claude-failover/main.go b/cmd/claude-failover/main.go
index 2c29f89..8bc8fc5 100644
--- a/cmd/claude-failover/main.go
+++ b/cmd/claude-failover/main.go
@@ -51,15 +51,6 @@ func main() {
 	// Initialise tmux client and lifecycle manager.
 	tmuxClient := tmux.NewExecClient()
 	lm := lifecycle.New(tmuxClient, s, cfg)
-
-	// Validate (and self-heal) the shared-state symlinks BEFORE spawning
-	// any sessions. A divergent link would silently fork transcripts
-	// between accounts and make failover destructive, so we fail fast here
-	// rather than after work is in flight.
-	if err := lm.ValidateSharedSymlinks(); err != nil {
-		log.Fatalf("shared symlinks validation failed: %v", err)
-	}
-
 	lm.EnsureAllSessions()
 
 	// Block until SIGINT or SIGTERM.
diff --git a/internal/lifecycle/manager.go b/internal/lifecycle/manager.go
index eeed9cc..40fa4b0 100644
--- a/internal/lifecycle/manager.go
+++ b/internal/lifecycle/manager.go
@@ -4,13 +4,11 @@ package lifecycle
 
 import (
 	"context"
-	"fmt"
 	"log"
 	"time"
 
 	"forge.secuaas.ovh/olivier/claude-failover/internal/config"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/state"
-	"forge.secuaas.ovh/olivier/claude-failover/internal/symlinks"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/tmux"
 )
 
@@ -49,35 +47,6 @@ func (m *Manager) Run(ctx context.Context) {
 	}
 }
 
-// ValidateSharedSymlinks verifies that every configured account home has
-// the three shared-state symlinks (session-env, file-history, projects)
-// in place and pointing at the canonical shared targets.
-//
-// Called once at daemon startup BEFORE sessions are recreated. A missing
-// or divergent link would silently fork the state tree between the two
-// accounts, breaking failover. We fail fast so the operator fixes it
-// before any work is in flight.
-//
-// EnsureForAccount creates missing links but refuses to touch divergent
-// ones — see internal/symlinks for the rationale.
-func (m *Manager) ValidateSharedSymlinks() error {
-	if len(m.config.Accounts) == 0 {
-		return fmt.Errorf("[lifecycle] no accounts configured — cannot validate shared symlinks")
-	}
-	homes := make([]string, 0, len(m.config.Accounts))
-	for _, acc := range m.config.Accounts {
-		if acc.Home == "" {
-			return fmt.Errorf("[lifecycle] account %q has empty home — refusing to continue", acc.Name)
-		}
-		homes = append(homes, acc.Home)
-	}
-	if err := symlinks.ValidateAll(homes, symlinks.RequiredShared); err != nil {
-		return fmt.Errorf("shared symlinks invalid, refusing to start: %w", err)
-	}
-	m.logger.Printf("[lifecycle] shared symlinks OK for %d account(s)", len(homes))
-	return nil
-}
-
 // EnsureAllSessions creates all configured sessions that are not yet present in tmux.
 // It is intended to be called once at daemon startup before Run is launched.
 func (m *Manager) EnsureAllSessions() {
diff --git a/internal/switcher/account_switcher.go b/internal/switcher/account_switcher.go
index 3778403..e8de796 100644
--- a/internal/switcher/account_switcher.go
+++ b/internal/switcher/account_switcher.go
@@ -4,7 +4,6 @@ package switcher
 
 import (
 	"context"
-	"errors"
 	"fmt"
 	"log"
 	"os"
@@ -12,27 +11,15 @@ import (
 	"regexp"
 	"strconv"
 	"strings"
-	"sync/atomic"
 	"time"
 
 	"forge.secuaas.ovh/olivier/claude-failover/internal/config"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/notify"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/quota"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/state"
-	"forge.secuaas.ovh/olivier/claude-failover/internal/symlinks"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/tmux"
 )
 
-// ErrPartialSwap is returned (and wrapped) when the switcher flipped
-// ~/.claude to the target home, EnsureForAccount failed on the target,
-// and the rollback flip back to the previous home ALSO failed. The daemon
-// is in a documented degraded state: the active-account setter was NOT
-// called, but the filesystem symlink may point at an account whose shared
-// state is inconsistent. Operator intervention is required. Callers can
-// interrogate AccountSwitcher.IsPartialSwap() to expose the flag to
-// health-checks / watchdogs.
-var ErrPartialSwap = errors.New("switcher: partial swap — flip succeeded but ensure + rollback both failed")
-
 // SwitchState represents the current phase of a failover operation.
 type SwitchState string
 
@@ -65,19 +52,6 @@ type AccountSwitcher struct {
 	// homeDir is the directory containing the .claude symlink. Overridable for tests.
 	// When empty, os.UserHomeDir() is used.
 	homeDir string
-	// sharedSymlinks is the list of shared-state links reconciled on the
-	// target account home after every flip. Overridable for tests so the
-	// suite never touches the operator's real /home/ubuntu/.claude-*
-	// shared directories. When nil, symlinks.RequiredShared is used.
-	sharedSymlinks []symlinks.SharedSymlink
-	// partialSwap is set to 1 when a flip+ensure+rollback sequence left the
-	// daemon in an inconsistent state (symlink possibly flipped, but active
-	// account NOT updated, and rollback flip ALSO failed). Health-checks /
-	// watchdogs read this flag via IsPartialSwap(). It is sticky: once set,
-	// it stays set until the operator restarts the daemon after fixing the
-	// filesystem state. We use atomic access so watchdog goroutines can read
-	// it without blocking the switcher.
-	partialSwap atomic.Bool
 }
 
 // New creates an AccountSwitcher.
@@ -114,30 +88,8 @@ func (a *AccountSwitcher) Run(ctx context.Context) {
 
 // executeSwitch performs the full failover sequence.
 func (a *AccountSwitcher) executeSwitch(req quota.SwitchRequest) {
-	if err := a.executeSwitchE(req); err != nil {
-		// executeSwitchE already logs the detail; we swallow the error here
-		// because the public Run loop has no return channel. The partialSwap
-		// flag (if set) remains visible via IsPartialSwap().
-		a.logger.Printf("[switcher] SWAP aborted: %v", err)
-	}
-}
-
-// executeSwitchE runs the swap and returns an error describing any abort or
-// partial-swap condition. Split out from executeSwitch so tests can assert
-// on the error value without routing through a channel.
-func (a *AccountSwitcher) executeSwitchE(req quota.SwitchRequest) error {
 	a.logger.Printf("[switcher] SWAP initiated from=%q reset=%q", req.From, req.ResetTime)
 
-	// Refuse to proceed if a previous swap left the daemon in an
-	// inconsistent state. The operator must intervene (fix the filesystem,
-	// restart the daemon) before any further failover can be attempted —
-	// otherwise we'd stack symlink flips on top of a broken state.
-	if a.partialSwap.Load() {
-		err := fmt.Errorf("refusing swap: daemon is in partial-swap degraded state (operator intervention required)")
-		a.logger.Printf("[switcher] %v", err)
-		return err
-	}
-
 	// 1. SAVING — capture resume UUIDs from all working sessions plus
 	// every dedicated session unconditionally (dedicated sessions are
 	// user-driven and may not be tracked as "working" in state, but their
@@ -152,52 +104,12 @@ func (a *AccountSwitcher) executeSwitchE(req quota.SwitchRequest) error {
 	if target == nil {
 		a.logger.Printf("[switcher] no alternate account found for %q — aborting swap", req.From)
 		a.currentState = StateNormal
-		return nil
+		return
 	}
 
-	previous := a.findAccountByName(req.From)
 	if err := a.flipSymlink(target.Home); err != nil {
 		a.logger.Printf("[switcher] flipSymlink error: %v", err)
 	}
-	// Ensure the target account home exposes the three shared-state
-	// symlinks (session-env, file-history, projects). If this fails we
-	// MUST NOT proceed with SetActiveAccount — the daemon would otherwise
-	// declare the target "active" while its shared state is divergent,
-	// silently writing transcripts into private /projects directories and
-	// breaking `claude --resume` across sessions. Instead we attempt to
-	// roll back the ~/.claude flip to the previous account. If the
-	// rollback also fails, the daemon is in a documented degraded state
-	// (ErrPartialSwap) and the operator must intervene.
-	if err := symlinks.EnsureForAccount(target.Home, a.requiredShared()); err != nil {
-		a.logger.Printf("[switcher] ensure shared symlinks for %q failed: %v — attempting rollback", target.Home, err)
-		if previous == nil || previous.Home == "" {
-			// No known previous home to roll back to — set the degraded
-			// flag and bail out. This is equivalent to a rollback failure
-			// because the filesystem is pointed at a broken target.
-			a.partialSwap.Store(true)
-			a.currentState = StateNormal
-			return fmt.Errorf("%w: ensure failed (%v) and no previous account home is known for rollback", ErrPartialSwap, err)
-		}
-		if rbErr := a.flipSymlink(previous.Home); rbErr != nil {
-			// Both the ensure AND the rollback failed. The daemon is now
-			// in a documented inconsistent state: ~/.claude may point at
-			// target whose shared-state is divergent, but SetActiveAccount
-			// has NOT been called so state.ActiveAccount is still the
-			// previous account. No further failover can be attempted
-			// until the operator intervenes.
-			a.partialSwap.Store(true)
-			a.logger.Printf("[switcher] CRITICAL partial swap: ensure=%v rollback=%v — daemon in degraded state, operator intervention required", err, rbErr)
-			a.currentState = StateNormal
-			return fmt.Errorf("%w: ensure=%v rollback=%v", ErrPartialSwap, err, rbErr)
-		}
-		// Rollback succeeded — symlink is back on the previous account,
-		// SetActiveAccount was NEVER called, state is consistent with
-		// "no swap happened". Return an explicit error so the caller
-		// knows the swap was cancelled.
-		a.logger.Printf("[switcher] rollback successful: ~/.claude → %s (swap cancelled)", previous.Home)
-		a.currentState = StateNormal
-		return fmt.Errorf("swap cancelled: ensure shared symlinks failed on target %q: %w", target.Home, err)
-	}
 	a.killAllPoolSessions()
 	a.recreatePoolSessions()
 	a.relaunchDedicatedSessions(target.Home)
@@ -223,31 +135,6 @@ func (a *AccountSwitcher) executeSwitchE(req quota.SwitchRequest) error {
 	}
 
 	a.currentState = StateNormal
-	return nil
-}
-
-// IsPartialSwap reports whether the switcher is in a degraded state after a
-// flip+ensure+rollback sequence all failed. Health-checks and watchdogs use
-// this signal to surface an operator-actionable alert. The flag is sticky
-// for the lifetime of the process: once set, it remains set until the daemon
-// is restarted (after the operator has fixed the filesystem).
-func (a *AccountSwitcher) IsPartialSwap() bool {
-	return a.partialSwap.Load()
-}
-
-// findAccountByName returns the account config entry matching name, or nil.
-// Unlike findTargetAccount (which returns the first NON-matching account),
-// this is used by the rollback path to recover the previous home.
-func (a *AccountSwitcher) findAccountByName(name string) *config.AccountConfig {
-	if name == "" {
-		return nil
-	}
-	for i := range a.config.Accounts {
-		if a.config.Accounts[i].Name == name {
-			return &a.config.Accounts[i]
-		}
-	}
-	return nil
 }
 
 // saveDedicatedUUIDs captures the resume UUID for every configured dedicated
@@ -338,16 +225,6 @@ func (a *AccountSwitcher) saveAllSessions() {
 	})
 }
 
-// requiredShared returns the shared-symlink list used when reconciling the
-// target account home after a flip. Tests may set a.sharedSymlinks to a
-// tmpdir-scoped list so they never touch /home/ubuntu/.claude-*-shared.
-func (a *AccountSwitcher) requiredShared() []symlinks.SharedSymlink {
-	if a.sharedSymlinks != nil {
-		return a.sharedSymlinks
-	}
-	return symlinks.RequiredShared
-}
-
 // resolveHomeDir returns the configured homeDir (test override) or the real
 // user home. Tests MUST set a.homeDir to a tmpdir to avoid clobbering the
 // production ~/.claude symlink.
diff --git a/internal/switcher/account_switcher_test.go b/internal/switcher/account_switcher_test.go
index 14df33e..8c3b292 100644
--- a/internal/switcher/account_switcher_test.go
+++ b/internal/switcher/account_switcher_test.go
@@ -1,9 +1,6 @@
 package switcher
 
 import (
-	"errors"
-	"os"
-	"path/filepath"
 	"strings"
 	"testing"
 	"time"
@@ -11,19 +8,8 @@ import (
 	"forge.secuaas.ovh/olivier/claude-failover/internal/config"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/quota"
 	"forge.secuaas.ovh/olivier/claude-failover/internal/state"
-	"forge.secuaas.ovh/olivier/claude-failover/internal/symlinks"
 )
 
-// tmpShared returns a SharedSymlink list whose targets live entirely under
-// tmpDir, so switcher tests never touch /home/ubuntu/.claude-*-shared.
-func tmpShared(tmpDir string) []symlinks.SharedSymlink {
-	return []symlinks.SharedSymlink{
-		{Target: filepath.Join(tmpDir, "session-env-shared"), Name: "session-env"},
-		{Target: filepath.Join(tmpDir, "file-history-shared"), Name: "file-history"},
-		{Target: filepath.Join(tmpDir, "projects-shared"), Name: "projects"},
-	}
-}
-
 // mockTmux for switcher tests.
 type mockTmux struct {
 	sessions map[string]bool
@@ -157,9 +143,6 @@ func TestKillAndRecreatePoolSessions(t *testing.T) {
 	// touches the real ~/.claude (regression: a reboot used to leave Claude
 	// Code unusable because the test had repointed ~/.claude to /tmp/...).
 	a.homeDir = t.TempDir()
-	// Scope shared-symlink targets to a tmpdir so the post-flip ensure
-	// pass does not write inside /home/ubuntu/.claude-*-shared.
-	a.sharedSymlinks = tmpShared(t.TempDir())
 	a.executeSwitch(quota.SwitchRequest{From: "compte1"})
 
 	// Active account must have changed.
@@ -203,12 +186,10 @@ func TestDedicatedRelaunchAfterSwap(t *testing.T) {
 	s := state.New("")
 	s.SetActiveAccount("compte1")
 
-	home1 := filepath.Join(t.TempDir(), "claude-1-xxxx")
-	home2 := filepath.Join(t.TempDir(), "claude-2-xxxx")
 	cfg := &config.Config{
 		Accounts: []config.AccountConfig{
-			{Name: "compte1", Home: home1},
-			{Name: "compte2", Home: home2},
+			{Name: "compte1", Home: "/tmp/claude-1-xxxx"},
+			{Name: "compte2", Home: "/tmp/claude-2-xxxx"},
 		},
 		Pool: config.PoolConfig{
 			Dedicated: []config.DedicatedSession{{Name: "dedicated-1", Project: "/tmp"}},
@@ -218,7 +199,6 @@ func TestDedicatedRelaunchAfterSwap(t *testing.T) {
 
 	a := New(tc, s, cfg, make(chan quota.SwitchRequest), nil)
 	a.homeDir = t.TempDir()
-	a.sharedSymlinks = tmpShared(t.TempDir())
 	a.executeSwitch(quota.SwitchRequest{From: "compte1"})
 
 	// The relaunch must send a resume command on the dedicated session,
@@ -233,225 +213,10 @@ func TestDedicatedRelaunchAfterSwap(t *testing.T) {
 	if relaunch == "" {
 		t.Fatalf("expected dedicated-1 relaunch send-keys; got %v", tc.sendKeyCalls)
 	}
-	if !strings.Contains(relaunch, "CLAUDE_CONFIG_DIR="+home2) {
+	if !strings.Contains(relaunch, "CLAUDE_CONFIG_DIR=/tmp/claude-2-xxxx") {
 		t.Errorf("relaunch should set CLAUDE_CONFIG_DIR to target home; got %q", relaunch)
 	}
 	if !strings.Contains(relaunch, "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee") {
 		t.Errorf("relaunch should include captured UUID; got %q", relaunch)
 	}
 }
-
-// TestFlipReconcilesSharedSymlinksOnTargetHome verifies A3: after the main
-// ~/.claude flip, the switcher reconciles the three shared-state symlinks
-// (session-env / file-history / projects) on the TARGET account home.
-// Scenario: the target home has NO links yet — a freshly-provisioned account
-// that has never been flipped into. Post-switch, all three links must exist
-// inside the target home and point at the canonical shared targets.
-func TestFlipReconcilesSharedSymlinksOnTargetHome(t *testing.T) {
-	tc := newMockTmux()
-
-	s := state.New("")
-	s.SetActiveAccount("compte1")
-
-	// Target home starts empty: EnsureForAccount will mkdir + create links.
-	targetHome := filepath.Join(t.TempDir(), "claude-compte2")
-	cfg := &config.Config{
-		Accounts: []config.AccountConfig{
-			{Name: "compte1", Home: filepath.Join(t.TempDir(), "claude-compte1")},
-			{Name: "compte2", Home: targetHome},
-		},
-		Pool: config.PoolConfig{
-			Autonomous: config.AutonomousConfig{Prefix: "ccl-auto-", Min: 0, Max: 0},
-		},
-	}
-
-	a := New(tc, s, cfg, make(chan quota.SwitchRequest), nil)
-	a.homeDir = t.TempDir()
-	shared := tmpShared(t.TempDir())
-	a.sharedSymlinks = shared
-
-	// Pre-assert: no link exists in targetHome.
-	for _, sl := range shared {
-		if _, err := os.Lstat(filepath.Join(targetHome, sl.Name)); !os.IsNotExist(err) {
-			t.Fatalf("pre-condition: %q should not exist yet (err=%v)", sl.Name, err)
-		}
-	}
-
-	a.executeSwitch(quota.SwitchRequest{From: "compte1"})
-
-	// Post-assert: every required link exists and points at the canonical
-	// target under the tmpdir-scoped shared root.
-	for _, sl := range shared {
-		linkPath := filepath.Join(targetHome, sl.Name)
-		info, err := os.Lstat(linkPath)
-		if err != nil {
-			t.Errorf("expected link at %s after flip: %v", linkPath, err)
-			continue
-		}
-		if info.Mode()&os.ModeSymlink == 0 {
-			t.Errorf("%s exists but is not a symlink", linkPath)
-			continue
-		}
-		got, err := os.Readlink(linkPath)
-		if err != nil {
-			t.Errorf("readlink %s: %v", linkPath, err)
-			continue
-		}
-		if got != sl.Target {
-			t.Errorf("link %s points to %q, want %q", linkPath, got, sl.Target)
-		}
-	}
-}
-
-// TestFlipEnsureFailureTriggersRollback verifies the fix for the A3 bug
-// (flip+ensure inconsistency): if EnsureForAccount fails on the target home
-// after the ~/.claude flip, the switcher MUST NOT mark the target account
-// active. It must instead roll back the ~/.claude symlink to the previous
-// account's home, leaving the daemon in the pre-swap state so subsequent
-// session work keeps writing to the known-good shared state.
-//
-// Old (buggy) behaviour: ensure error was WARN-only, SetActiveAccount still
-// happened, dedicated sessions were relaunched against a target whose
-// /projects, /session-env, /file-history were missing or divergent →
-// transcripts duplicated silently, resume broke, undo history diverged.
-func TestFlipEnsureFailureTriggersRollback(t *testing.T) {
-	tc := newMockTmux()
-
-	s := state.New("")
-	s.SetActiveAccount("compte1")
-
-	previousHome := filepath.Join(t.TempDir(), "claude-compte1")
-	targetHome := filepath.Join(t.TempDir(), "claude-compte2")
-	if err := os.MkdirAll(previousHome, 0700); err != nil {
-		t.Fatalf("mkdir previous home: %v", err)
-	}
-	if err := os.MkdirAll(targetHome, 0700); err != nil {
-		t.Fatalf("mkdir target home: %v", err)
-	}
-	// Plant a divergent link at /session-env. The symlinks
-	// package refuses to auto-correct this (data-loss safeguard) and will
-	// return an error, which must now trigger a rollback.
-	bogus := filepath.Join(t.TempDir(), "somewhere-else")
-	if err := os.MkdirAll(bogus, 0700); err != nil {
-		t.Fatalf("mkdir bogus: %v", err)
-	}
-	if err := os.Symlink(bogus, filepath.Join(targetHome, "session-env")); err != nil {
-		t.Fatalf("plant divergent link: %v", err)
-	}
-
-	cfg := &config.Config{
-		Accounts: []config.AccountConfig{
-			{Name: "compte1", Home: previousHome},
-			{Name: "compte2", Home: targetHome},
-		},
-		Pool: config.PoolConfig{
-			Autonomous: config.AutonomousConfig{Prefix: "ccl-auto-", Min: 0, Max: 0},
-		},
-	}
-
-	a := New(tc, s, cfg, make(chan quota.SwitchRequest), nil)
-	homeDir := t.TempDir()
-	a.homeDir = homeDir
-	a.sharedSymlinks = tmpShared(t.TempDir())
-
-	err := a.executeSwitchE(quota.SwitchRequest{From: "compte1"})
-	if err == nil {
-		t.Fatalf("executeSwitchE: expected cancellation error, got nil")
-	}
-	// The public symmetric swap-cancelled error must mention ensure and
-	// wrap the underlying symlinks package message. ErrPartialSwap must
-	// NOT be set (rollback succeeded → recoverable condition).
-	if errors.Is(err, ErrPartialSwap) {
-		t.Errorf("did not expect ErrPartialSwap; rollback succeeded; got %v", err)
-	}
-	if a.IsPartialSwap() {
-		t.Errorf("IsPartialSwap should be false when rollback succeeds")
-	}
-
-	// Active account must remain the previous one — SetActiveAccount must
-	// NOT have been called.
-	if got := s.ActiveAccount(); got != "compte1" {
-		t.Errorf("active account should stay compte1 after rollback; got %q", got)
-	}
-
-	// ~/.claude must now point at the previous home (rollback target).
-	link, rlErr := os.Readlink(filepath.Join(homeDir, ".claude"))
-	if rlErr != nil {
-		t.Fatalf("readlink ~/.claude: %v", rlErr)
-	}
-	if link != previousHome {
-		t.Errorf("~/.claude should point at previous home %q after rollback; got %q", previousHome, link)
-	}
-}
-
-// TestFlipEnsureAndRollbackFailure verifies that when BOTH EnsureForAccount
-// AND the rollback flip fail, the switcher sets the sticky partial-swap
-// flag and returns ErrPartialSwap. The daemon is then in a documented
-// degraded state where any further swap is refused until the operator
-// restarts it.
-func TestFlipEnsureAndRollbackFailure(t *testing.T) {
-	tc := newMockTmux()
-
-	s := state.New("")
-	s.SetActiveAccount("compte1")
-
-	previousHome := filepath.Join(t.TempDir(), "claude-compte1")
-	targetHome := filepath.Join(t.TempDir(), "claude-compte2")
-	if err := os.MkdirAll(previousHome, 0700); err != nil {
-		t.Fatalf("mkdir previous home: %v", err)
-	}
-	if err := os.MkdirAll(targetHome, 0700); err != nil {
-		t.Fatalf("mkdir target home: %v", err)
-	}
-	// Plant the divergent link that will cause EnsureForAccount to fail.
-	bogus := filepath.Join(t.TempDir(), "somewhere-else")
-	if err := os.MkdirAll(bogus, 0700); err != nil {
-		t.Fatalf("mkdir bogus: %v", err)
-	}
-	if err := os.Symlink(bogus, filepath.Join(targetHome, "session-env")); err != nil {
-		t.Fatalf("plant divergent link: %v", err)
-	}
-
-	cfg := &config.Config{
-		Accounts: []config.AccountConfig{
-			{Name: "compte1", Home: previousHome},
-			{Name: "compte2", Home: targetHome},
-		},
-		Pool: config.PoolConfig{
-			Autonomous: config.AutonomousConfig{Prefix: "ccl-auto-", Min: 0, Max: 0},
-		},
-	}
-
-	a := New(tc, s, cfg, make(chan quota.SwitchRequest), nil)
-
-	// Force the rollback flip to fail: point homeDir at a file that cannot
-	// host a .claude symlink. We use a regular file; the flipSymlink
-	// implementation does os.Remove() then os.Symlink() under homeDir,
-	// which fails when homeDir is itself a file (ENOTDIR).
-	badHomeFile := filepath.Join(t.TempDir(), "not-a-dir")
-	if err := os.WriteFile(badHomeFile, []byte("block"), 0600); err != nil {
-		t.Fatalf("write bad home: %v", err)
-	}
-	a.homeDir = badHomeFile
-	a.sharedSymlinks = tmpShared(t.TempDir())
-
-	err := a.executeSwitchE(quota.SwitchRequest{From: "compte1"})
-	if err == nil {
-		t.Fatalf("expected ErrPartialSwap, got nil")
-	}
-	if !errors.Is(err, ErrPartialSwap) {
-		t.Errorf("expected ErrPartialSwap, got %v", err)
-	}
-	if !a.IsPartialSwap() {
-		t.Errorf("IsPartialSwap should be true when both ensure AND rollback fail")
-	}
-	// SetActiveAccount must still not have been called.
-	if got := s.ActiveAccount(); got != "compte1" {
-		t.Errorf("active account must stay compte1 in partial-swap; got %q", got)
-	}
-
-	// A subsequent swap attempt must be refused while the flag is set.
-	if err2 := a.executeSwitchE(quota.SwitchRequest{From: "compte1"}); err2 == nil {
-		t.Errorf("expected subsequent swap to be refused in degraded state")
-	}
-}
diff --git a/internal/symlinks/shared.go b/internal/symlinks/shared.go
deleted file mode 100644
index 7687fb0..0000000
--- a/internal/symlinks/shared.go
+++ /dev/null
@@ -1,165 +0,0 @@
-// Package symlinks manages the shared-state symlinks that every Claude
-// account home must expose, so that account failover does not create state
-// divergence (duplicated JSONL transcripts, broken undo history, drifted
-// session env).
-//
-// Rationale
-//
-// Claude Code stores three directories whose content MUST be identical
-// across the two configured accounts for failover to be a no-op:
-//
-//   - projects/ — session JSONL transcripts (used by `claude --resume`)
-//   - session-env/ — per-session environment and working-dir metadata
-//   - file-history/ — undo/redo history persistence
-//
-// If account A writes under `~/.claude-compte1/projects/...` while account
-// B later runs under `~/.claude-compte2/projects/...`, resume fails
-// silently with "session not found" and the operator loses every in-flight
-// conversation.
-//
-// Historically we fixed this by creating symlinks manually on the
-// operator's VM. Any fresh deployment that forgets those links silently
-// reintroduces the bug. This package encodes the convention in code:
-// EnsureForAccount creates missing links, ValidateAll fails fast at
-// startup when an account home is misconfigured.
-package symlinks
-
-import (
-	"errors"
-	"fmt"
-	"os"
-	"path/filepath"
-	"strings"
-)
-
-// DefaultSharedRoot is the directory under which the three shared targets
-// live. All SharedSymlink.Target values default to a subdirectory of this
-// root so tests can override the root without rewriting the shared list.
-const DefaultSharedRoot = "/home/ubuntu"
-
-// SharedSymlink describes one required link inside a Claude account home.
-//
-// Target is the absolute path on disk that holds the real shared
-// directory. Name is the basename of the link that must exist inside each
-// account home (e.g. `session-env`, `file-history`, `projects`).
-type SharedSymlink struct {
-	Target string
-	Name   string
-}
-
-// RequiredShared lists the three symlinks every Claude account home must
-// expose. The list is package-level so integration tests can read it, but
-// callers SHOULD prefer the EnsureForAccount / ValidateAll entry points
-// that accept an override list for isolation.
-var RequiredShared = []SharedSymlink{
-	{Target: "/home/ubuntu/.claude-session-env-shared", Name: "session-env"},
-	{Target: "/home/ubuntu/.claude-file-history-shared", Name: "file-history"},
-	{Target: "/home/ubuntu/.claude-projects-shared", Name: "projects"},
-}
-
-// EnsureForAccount verifies (and creates if missing) every required shared
-// symlink for a single account home. Behaviour:
-//
-//   - If accountHome does not exist, it is created (mode 0700).
-//   - If Target (the shared destination) does not exist, it is created
-//     (mode 0700). Both accounts pointing at a non-existent target would
-//     produce two separate state trees on first write.
-//   - If the link is absent, it is created.
-//   - If the link is present and points at Target, nothing happens.
-//   - If the link is present but points elsewhere, an error is returned.
-//     We REFUSE to auto-correct a divergent link because fixing it blindly
-//     could delete user data: the "wrong" target may contain the only copy
-//     of the session transcripts.
-//   - If a regular file or directory exists where the link should be,
-//     an error is returned for the same reason.
-func EnsureForAccount(accountHome string, required []SharedSymlink) error {
-	if accountHome == "" {
-		return errors.New("symlinks: accountHome is empty")
-	}
-
-	if err := os.MkdirAll(accountHome, 0700); err != nil {
-		return fmt.Errorf("symlinks: create account home %q: %w", accountHome, err)
-	}
-
-	for _, sl := range required {
-		if err := ensureTarget(sl.Target); err != nil {
-			return err
-		}
-		if err := ensureLink(accountHome, sl); err != nil {
-			return err
-		}
-	}
-	return nil
-}
-
-// ValidateAll runs EnsureForAccount on every account home. It aggregates
-// all errors and returns a single error with every failure inlined, so the
-// operator sees the full picture at startup rather than fixing one link,
-// restarting, hitting the next one, repeat.
-func ValidateAll(accountHomes []string, required []SharedSymlink) error {
-	if len(accountHomes) == 0 {
-		return errors.New("symlinks: no account homes provided")
-	}
-	var errs []string
-	for _, home := range accountHomes {
-		if err := EnsureForAccount(home, required); err != nil {
-			errs = append(errs, err.Error())
-		}
-	}
-	if len(errs) > 0 {
-		return fmt.Errorf("symlinks: validation failed for %d account home(s): %s",
-			len(errs), strings.Join(errs, "; "))
-	}
-	return nil
-}
-
-// ensureTarget creates Target as an empty directory when absent.
-// An existing file (non-directory, non-symlink) at Target is an operator
-// error we cannot resolve automatically.
-func ensureTarget(target string) error {
-	info, err := os.Stat(target)
-	if err != nil {
-		if !os.IsNotExist(err) {
-			return fmt.Errorf("symlinks: stat shared target %q: %w", target, err)
-		}
-		if mkErr := os.MkdirAll(target, 0700); mkErr != nil {
-			return fmt.Errorf("symlinks: create shared target %q: %w", target, mkErr)
-		}
-		return nil
-	}
-	if !info.IsDir() {
-		return fmt.Errorf("symlinks: shared target %q is not a directory", target)
-	}
-	return nil
-}
-
-// ensureLink reconciles one link entry inside accountHome.
-func ensureLink(accountHome string, sl SharedSymlink) error {
-	linkPath := filepath.Join(accountHome, sl.Name)
-
-	info, err := os.Lstat(linkPath)
-	if err != nil {
-		if os.IsNotExist(err) {
-			if linkErr := os.Symlink(sl.Target, linkPath); linkErr != nil {
-				return fmt.Errorf("symlinks: create %q → %q: %w", linkPath, sl.Target, linkErr)
-			}
-			return nil
-		}
-		return fmt.Errorf("symlinks: lstat %q: %w", linkPath, err)
-	}
-
-	// Path exists — must be a symlink pointing at Target.
-	if info.Mode()&os.ModeSymlink == 0 {
-		return fmt.Errorf("symlinks: %q exists but is not a symlink (expected → %q)",
-			linkPath, sl.Target)
-	}
-	currentTarget, err := os.Readlink(linkPath)
-	if err != nil {
-		return fmt.Errorf("symlinks: readlink %q: %w", linkPath, err)
-	}
-	if currentTarget != sl.Target {
-		return fmt.Errorf("symlinks: divergent link at %q: points to %q, expected %q (refusing to auto-correct to avoid data loss)",
-			linkPath, currentTarget, sl.Target)
-	}
-	return nil
-}
diff --git a/internal/symlinks/shared_test.go b/internal/symlinks/shared_test.go
deleted file mode 100644
index e472678..0000000
--- a/internal/symlinks/shared_test.go
+++ /dev/null
@@ -1,276 +0,0 @@
-package symlinks
-
-import (
-	"os"
-	"path/filepath"
-	"strings"
-	"testing"
-)
-
-// testRequired returns a SharedSymlink list whose Targets live entirely
-// under tmpDir, so the tests never touch the operator's real home.
-func testRequired(tmpDir string) []SharedSymlink {
-	return []SharedSymlink{
-		{Target: filepath.Join(tmpDir, "session-env-shared"), Name: "session-env"},
-		{Target: filepath.Join(tmpDir, "file-history-shared"), Name: "file-history"},
-		{Target: filepath.Join(tmpDir, "projects-shared"), Name: "projects"},
-	}
-}
-
-func TestEnsureForAccount_missingCreatesLinksAndTargets(t *testing.T) {
-	tmp := t.TempDir()
-	home := filepath.Join(tmp, "account1")
-	req := testRequired(tmp)
-
-	if err := EnsureForAccount(home, req); err != nil {
-		t.Fatalf("EnsureForAccount: %v", err)
-	}
-
-	for _, sl := range req {
-		linkPath := filepath.Join(home, sl.Name)
-		info, err := os.Lstat(linkPath)
-		if err != nil {
-			t.Errorf("expected link at %s: %v", linkPath, err)
-			continue
-		}
-		if info.Mode()&os.ModeSymlink == 0 {
-			t.Errorf("%s exists but is not a symlink", linkPath)
-		}
-		got, err := os.Readlink(linkPath)
-		if err != nil {
-			t.Errorf("readlink %s: %v", linkPath, err)
-			continue
-		}
-		if got != sl.Target {
-			t.Errorf("link %s points to %q, want %q", linkPath, got, sl.Target)
-		}
-		// Target directory must exist too.
-		if st, err := os.Stat(sl.Target); err != nil || !st.IsDir() {
-			t.Errorf("target %s should be a directory, err=%v", sl.Target, err)
-		}
-	}
-}
-
-func TestEnsureForAccount_idempotent(t *testing.T) {
-	tmp := t.TempDir()
-	home := filepath.Join(tmp, "account1")
-	req := testRequired(tmp)
-
-	if err := EnsureForAccount(home, req); err != nil {
-		t.Fatalf("first pass: %v", err)
-	}
-	if err := EnsureForAccount(home, req); err != nil {
-		t.Fatalf("second pass should be a no-op, got: %v", err)
-	}
-}
-
-func TestEnsureForAccount_divergentLinkReturnsError(t *testing.T) {
-	tmp := t.TempDir()
-	home := filepath.Join(tmp, "account1")
-	req := testRequired(tmp)
-
-	// Pre-create a wrong symlink for "projects".
-	if err := os.MkdirAll(home, 0700); err != nil {
-		t.Fatalf("mkdir home: %v", err)
-	}
-	wrongTarget := filepath.Join(tmp, "someone-elses-dir")
-	if err := os.MkdirAll(wrongTarget, 0700); err != nil {
-		t.Fatalf("mkdir wrong target: %v", err)
-	}
-	linkPath := filepath.Join(home, "projects")
-	if err := os.Symlink(wrongTarget, linkPath); err != nil {
-		t.Fatalf("seed wrong symlink: %v", err)
-	}
-
-	err := EnsureForAccount(home, req)
-	if err == nil {
-		t.Fatal("expected error for divergent link, got nil")
-	}
-	if !strings.Contains(err.Error(), "divergent") {
-		t.Errorf("error should mention 'divergent': %v", err)
-	}
-
-	// The wrong symlink MUST be preserved (no auto-correction).
-	got, err := os.Readlink(linkPath)
-	if err != nil {
-		t.Fatalf("readlink after error: %v", err)
-	}
-	if got != wrongTarget {
-		t.Errorf("divergent link was mutated: now %q, want preserved %q", got, wrongTarget)
-	}
-}
-
-func TestEnsureForAccount_regularFileInsteadOfLinkFails(t *testing.T) {
-	tmp := t.TempDir()
-	home := filepath.Join(tmp, "account1")
-	req := testRequired(tmp)
-
-	if err := os.MkdirAll(home, 0700); err != nil {
-		t.Fatalf("mkdir home: %v", err)
-	}
-	// Create a regular file at the session-env path.
-	bogus := filepath.Join(home, "session-env")
-	if err := os.WriteFile(bogus, []byte("oops"), 0600); err != nil {
-		t.Fatalf("seed regular file: %v", err)
-	}
-
-	err := EnsureForAccount(home, req)
-	if err == nil {
-		t.Fatal("expected error for regular-file-at-link-path, got nil")
-	}
-	if !strings.Contains(err.Error(), "not a symlink") {
-		t.Errorf("error should mention 'not a symlink': %v", err)
-	}
-}
-
-func TestEnsureForAccount_emptyHomeReturnsError(t *testing.T) {
-	if err := EnsureForAccount("", nil); err == nil {
-		t.Fatal("expected error for empty home, got nil")
-	}
-}
-
-func TestValidateAll_multipleAccountsAllOK(t *testing.T) {
-	tmp := t.TempDir()
-	req := testRequired(tmp)
-	homes := []string{
-		filepath.Join(tmp, "a"),
-		filepath.Join(tmp, "b"),
-	}
-	if err := ValidateAll(homes, req); err != nil {
-		t.Fatalf("ValidateAll: %v", err)
-	}
-}
-
-func TestValidateAll_aggregatesErrors(t *testing.T) {
-	tmp := t.TempDir()
-	req := testRequired(tmp)
-	homes := []string{
-		filepath.Join(tmp, "a"),
-		filepath.Join(tmp, "b"),
-	}
-
-	// Pre-seed account `a` with a divergent link so ValidateAll must
-	// surface that error while still processing account `b`.
-	if err := os.MkdirAll(homes[0], 0700); err != nil {
-		t.Fatalf("mkdir a: %v", err)
-	}
-	wrongTarget := filepath.Join(tmp, "bad")
-	if err := os.MkdirAll(wrongTarget, 0700); err != nil {
-		t.Fatalf("mkdir bad: %v", err)
-	}
-	if err := os.Symlink(wrongTarget, filepath.Join(homes[0], "projects")); err != nil {
-		t.Fatalf("seed wrong link: %v", err)
-	}
-
-	err := ValidateAll(homes, req)
-	if err == nil {
-		t.Fatal("expected aggregated error, got nil")
-	}
-	if !strings.Contains(err.Error(), "divergent") {
-		t.Errorf("should surface divergent: %v", err)
-	}
-
-	// Account `b` must have been configured successfully even though `a`
-	// failed. Otherwise the operator cannot see the full state at once.
-	for _, sl := range req {
-		if _, err := os.Lstat(filepath.Join(homes[1], sl.Name)); err != nil {
-			t.Errorf("account b link %s should have been created despite a's failure: %v", sl.Name, err)
-		}
-	}
-}
-
-func TestValidateAll_emptyListReturnsError(t *testing.T) {
-	if err := ValidateAll(nil, nil); err == nil {
-		t.Fatal("expected error for empty account list")
-	}
-}
-
-// TestRequiredShared_defaultsAreReasonable pins the default SharedSymlink
-// list so an accidental edit that breaks production is caught.
-func TestRequiredShared_defaultsAreReasonable(t *testing.T) {
-	want := map[string]string{
-		"session-env":  "/home/ubuntu/.claude-session-env-shared",
-		"file-history": "/home/ubuntu/.claude-file-history-shared",
-		"projects":     "/home/ubuntu/.claude-projects-shared",
-	}
-	if len(RequiredShared) != len(want) {
-		t.Fatalf("RequiredShared has %d entries, want %d", len(RequiredShared), len(want))
-	}
-	for _, sl := range RequiredShared {
-		if got, ok := want[sl.Name]; !ok {
-			t.Errorf("unexpected RequiredShared entry %q", sl.Name)
-		} else if got != sl.Target {
-			t.Errorf("RequiredShared %q target = %q, want %q", sl.Name, sl.Target, got)
-		}
-	}
-}
-
-// TestRequiredSharedIsCoherent validates the contract of the package-level
-// RequiredShared constant that the switcher and lifecycle manager consume
-// in production. The rest of the suite exercises EnsureForAccount /
-// ValidateAll with tmpdir-scoped override lists, so the actual prod
-// constant (pointing under /home/ubuntu/.claude-*-shared) is never touched
-// by those tests — a regression that shrinks or renames RequiredShared
-// would pass every other test but silently break failover on real VMs
-// (missing a link → writes to private state → transcripts duplicated).
-//
-// The test is filesystem-free: it only asserts the shape of the constant.
-//
-//  1. Exactly three entries, one per Name required by the A-failover design.
-//  2. Every Target is absolute.
-//  3. All three Targets share the same parent directory — there is no
-//     mode of operation where one shared dir lives elsewhere than the
-//     others. `filepath.Dir(target)` must be identical across entries.
-//
-// This encodes the "3 links under one shared root" invariant that
-// EnsureForAccount relies on. Any future change to RequiredShared that
-// breaks this invariant should force the author to update the switcher
-// contract explicitly.
-func TestRequiredSharedIsCoherent(t *testing.T) {
-	expectedNames := map[string]bool{
-		"session-env":  false,
-		"file-history": false,
-		"projects":     false,
-	}
-	if len(RequiredShared) != len(expectedNames) {
-		t.Fatalf("RequiredShared must contain exactly %d entries (session-env, file-history, projects); got %d: %+v",
-			len(expectedNames), len(RequiredShared), RequiredShared)
-	}
-
-	var sharedParent string
-	for i, sl := range RequiredShared {
-		if _, ok := expectedNames[sl.Name]; !ok {
-			t.Errorf("RequiredShared[%d]: unexpected Name %q (allowed: session-env / file-history / projects)", i, sl.Name)
-			continue
-		}
-		if expectedNames[sl.Name] {
-			t.Errorf("RequiredShared[%d]: duplicate Name %q", i, sl.Name)
-		}
-		expectedNames[sl.Name] = true
-
-		if sl.Target == "" {
-			t.Errorf("RequiredShared[%d] (%q): empty Target", i, sl.Name)
-			continue
-		}
-		if !filepath.IsAbs(sl.Target) {
-			t.Errorf("RequiredShared[%d] (%q): Target %q must be absolute", i, sl.Name, sl.Target)
-			continue
-		}
-
-		parent := filepath.Dir(sl.Target)
-		if i == 0 {
-			sharedParent = parent
-			continue
-		}
-		if parent != sharedParent {
-			t.Errorf("RequiredShared[%d] (%q): parent dir %q diverges from %q — all shared targets must live under the same root",
-				i, sl.Name, parent, sharedParent)
-		}
-	}
-
-	for name, seen := range expectedNames {
-		if !seen {
-			t.Errorf("RequiredShared is missing the required %q entry", name)
-		}
-	}
-}
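
Reviewer note: the three rules pinned by `TestRequiredSharedIsCoherent` (expected names present exactly once, absolute targets, one common parent directory) could also be expressed as a reusable, filesystem-free validator. The following is a minimal sketch under stated assumptions, not code from this repository: `sharedSymlink` only mirrors the shape of `symlinks.SharedSymlink`, `checkCoherent` is a hypothetical helper name, and the `/srv/shared` paths are made up for illustration. It assumes Go 1.20 or later (for `errors.Join`) and Unix-style paths.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// sharedSymlink mirrors the shape of symlinks.SharedSymlink from the diff.
// It is redeclared here only so the sketch compiles on its own.
type sharedSymlink struct {
	Target string
	Name   string
}

// checkCoherent is a hypothetical helper (not part of the original package).
// It collects every violation of the three rules instead of stopping at the
// first one, and returns nil when the list is coherent.
func checkCoherent(entries []sharedSymlink, names []string) error {
	seen := make(map[string]bool, len(names))
	for _, n := range names {
		seen[n] = false
	}

	var errs []error
	if len(entries) != len(names) {
		errs = append(errs, fmt.Errorf("got %d entries, want %d", len(entries), len(names)))
	}

	parent := "" // parent directory of the first valid target
	for _, e := range entries {
		done, known := seen[e.Name]
		if !known {
			errs = append(errs, fmt.Errorf("unexpected name %q", e.Name))
			continue
		}
		if done {
			errs = append(errs, fmt.Errorf("duplicate name %q", e.Name))
		}
		seen[e.Name] = true

		if !filepath.IsAbs(e.Target) {
			errs = append(errs, fmt.Errorf("%q: target %q is not absolute", e.Name, e.Target))
			continue
		}
		dir := filepath.Dir(e.Target)
		if parent == "" {
			parent = dir
		} else if dir != parent {
			errs = append(errs, fmt.Errorf("%q: parent %q differs from %q", e.Name, dir, parent))
		}
	}

	for _, n := range names {
		if !seen[n] {
			errs = append(errs, fmt.Errorf("missing required name %q", n))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	names := []string{"session-env", "file-history", "projects"}

	good := []sharedSymlink{
		{Target: "/srv/shared/session-env", Name: "session-env"},
		{Target: "/srv/shared/file-history", Name: "file-history"},
		{Target: "/srv/shared/projects", Name: "projects"},
	}
	fmt.Println("coherent list:", checkCoherent(good, names))

	bad := []sharedSymlink{
		{Target: "/srv/shared/session-env", Name: "session-env"},
		{Target: "/srv/other/file-history", Name: "file-history"},
		{Target: "relative/projects", Name: "projects"},
	}
	fmt.Println("incoherent list:", checkCoherent(bad, names))
}
```

One design difference from the test in the diff: that test records the shared parent only when `i == 0`, so if the first entry is skipped by an earlier `continue`, later entries are compared against an empty string. The sketch instead takes the parent from the first entry that passes the name and absoluteness checks. Whether that distinction matters in practice depends on how the list is expected to evolve.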