Production had two disjoint tmux pools named alike but for different
purposes:
ccl-0..ccl-9 — manual/interactive sessions (operator)
ccl-auto-11..ccl-auto-20 — autonomous dispatcher pool
Until now the daemon's loops iterated prefix + 0..Max, so with the
deployed config ("prefix: ccl-auto", min=2, max=10) the dispatcher
looked for sessions "ccl-auto0..ccl-auto9" that never existed, while
the real auto pool ccl-auto-11..20 was invisible. Net effect: no task
was ever dispatched, and killAllPoolSessions fabricated phantom
"ccl-auto0/1" sessions on each swap.
- AutonomousConfig gains StartIndex (yaml start_index, default 0).
Behaviour is unchanged when StartIndex is 0.
- Monitor, switcher (kill + recreate), dispatcher (findFreeSession),
and lifecycle (EnsureAll + reconcile) all iterate
[StartIndex, StartIndex+Max) so the daemon only touches its own
range and leaves ccl-0..ccl-9 alone.
- Production config updated to prefix: "ccl-auto-", start_index: 11,
min: 10, max: 10 — covering the 10 real ccl-auto-11..20 sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add internal/lifecycle/manager.go with Manager struct, Run() ticker loop
(15s interval), EnsureAllSessions() for boot-time session creation, and
reconcile() that recreates idle sessions and recovers working ones via
SetFailed + CreateSession
- Add state.SetFailed() to record crash timestamp on SessionState
- Add internal/lifecycle/manager_test.go with mock tmux client and 3 tests:
TestReconcileCreatesDeadSession, TestReconcileRecoversCrashedSession,
TestEnsureAllSessions — all pass
- Wire lifecycle.Manager into cmd/claude-failover/main.go after state init
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>