Wire symlinks.ValidateAll into the lifecycle manager so the daemon
refuses to start if any configured account is missing one of the
shared-state symlinks or if a link diverges from the canonical target.
Previously, a missing link on a freshly deployed VM would silently
create a divergent state tree per account (duplicate JSONL transcripts,
broken undo history) — exactly the failure mode the symlinks package
(A1) was introduced to prevent.
The check runs once at startup before EnsureAllSessions, guarding a
single well-defined invariant: "every account home shares the same
projects/, file-history/ and session-env/ roots". No auto-heal on
divergence — we fail fast with an explicit error so the operator fixes
it manually rather than one account's state being overwritten.
Part of Phase 1 Chantier A — Failover robuste.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production had two disjoint tmux pools named alike but for different
purposes:
ccl-0..ccl-9 — manual/interactive sessions (operator)
ccl-auto-11..ccl-auto-20 — autonomous dispatcher pool
Until now the daemon's loops iterated prefix + 0..Max, so with the
deployed config ("prefix: ccl-auto", min=2, max=10) the dispatcher
looked for sessions "ccl-auto0..ccl-auto9" that never existed, while
the real auto pool ccl-auto-11..20 was invisible. Net effect: no task
was ever dispatched, and killAllPoolSessions fabricated phantom
"ccl-auto0/1" sessions on each swap.
- AutonomousConfig gains StartIndex (yaml start_index, default 0).
Behaviour is unchanged when StartIndex is 0.
- Monitor, switcher (kill + recreate), dispatcher (findFreeSession),
and lifecycle (EnsureAll + reconcile) all iterate
[StartIndex, StartIndex+Max) so the daemon only touches its own
range and leaves ccl-0..ccl-9 alone.
- Production config updated to prefix: "ccl-auto-", start_index: 11,
min: 10, max: 10 — covering the 10 real ccl-auto-11..20 sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add internal/lifecycle/manager.go with Manager struct, Run() ticker loop
(15s interval), EnsureAllSessions() for boot-time session creation, and
reconcile() that recreates idle sessions and recovers working ones via
SetFailed + CreateSession
- Add state.SetFailed() to record crash timestamp on SessionState
- Add internal/lifecycle/manager_test.go with mock tmux client and 3 tests:
TestReconcileCreatesDeadSession, TestReconcileRecoversCrashedSession,
TestEnsureAllSessions — all pass
- Wire lifecycle.Manager into cmd/claude-failover/main.go after state init
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>