Wire symlinks.ValidateAll into the lifecycle manager so the daemon
refuses to start if any configured account is missing one of the
shared-state symlinks or if a link diverges from the canonical target.
Previously, a missing link on a freshly deployed VM would silently
create a divergent state tree per account (duplicate JSONL transcripts,
broken undo history) — exactly the failure mode the symlinks package
(A1) was introduced to prevent.
The check runs once at startup before EnsureAllSessions, guarding a
single well-defined invariant: "every account home shares the same
projects/, file-history/ and session-env/ roots". No auto-heal on
divergence — we fail fast with an explicit error so the operator fixes
it manually rather than one account's state being overwritten.
Part of Phase 1 Chantier A — Failover robuste.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds internal/symlinks package that encodes in code the convention
previously maintained by hand on the VM: every Claude account home
must expose `session-env`, `file-history` and `projects` as symlinks
to a single shared target, so account failover does not create
divergent state (duplicate JSONL transcripts, broken undo history).
- EnsureForAccount(home, required) creates missing links and target
directories, refuses to auto-correct a divergent link (risks data
loss), and errors when a regular file sits where the link belongs.
- ValidateAll(homes, required) aggregates errors across both accounts
so the operator sees every problem at once rather than fixing one
per restart cycle.
- RequiredShared exposes the production defaults so lifecycle and
switcher (A2/A3) can depend on it directly.
9/9 unit tests green.
Part of Phase 1 Chantier A — Failover robuste.
Observed: tasks from filesecure/.agent-queue/inbox and SecuScan/
.agent-queue/inbox were being routed into ccl-1-conformvault and
ccl-2-scanyze whenever those sessions happened to be idle. Those are
the operator's manual interactive Claude sessions, not dispatch
targets — the auto-dispatch was (a) hijacking a Claude instance the
operator was using and (b) triggering /exit via the watcher's
completion path when the side-task finished, kicking the operator out
mid-conversation.
findFreeSession was iterating Pool.Dedicated before the autonomous
pool, so any idle dedicated session was the first candidate.
- Dispatcher.findFreeSession: remove the Dedicated loop entirely.
Auto-dispatch is now pool-only (ccl-auto-11..20).
- Watcher.completeSession: defense-in-depth — even if a dedicated
session ever ends up in "working" state, it is no longer /exit'd;
just marked idle. Pool /exit behaviour unchanged (context recycle).
- Tests: new TestFindFreeSessionSkipsDedicated proves the routing;
3 existing tests rewritten to use the autonomous pool instead of
relying on Dedicated as a fake pool.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-line task bodies arrived in Claude Code as "[Pasted text #N +M lines]"
and sat in the input buffer forever — the trailing Enter that SendKeys
appends to the paste is consumed as a newline inside the paste, not as a
submit. Observed live on ccl-auto-11 (secumon) and ccl-auto-12 (secuops):
prompt visible, agent idle.
- tmux.Client grows a SendEnter(session) method. ExecClient runs
`tmux send-keys -t <sess> Enter` (no preceding text), which Claude's
TUI accepts as the explicit submit action after a paste.
- Dispatcher: after SendKeys(msg), sleep 500ms for the paste to register,
then SendEnter. Same sequence a human would perform.
- Five mockTmux implementations updated (quota, dispatcher, switcher,
lifecycle, watcher tests).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Production had two disjoint tmux pools named alike but for different
purposes:
ccl-0..ccl-9 — manual/interactive sessions (operator)
ccl-auto-11..ccl-auto-20 — autonomous dispatcher pool
Until now the daemon's loops iterated prefix + 0..Max, so with the
deployed config ("prefix: ccl-auto", min=2, max=10) the dispatcher
looked for sessions "ccl-auto0..ccl-auto9" that never existed, while
the real auto pool ccl-auto-11..20 was invisible. Net effect: no task
was ever dispatched, and killAllPoolSessions fabricated phantom
"ccl-auto0/1" sessions on each swap.
- AutonomousConfig gains StartIndex (yaml start_index, default 0).
Behaviour is unchanged when StartIndex is 0.
- Monitor, switcher (kill + recreate), dispatcher (findFreeSession),
and lifecycle (EnsureAll + reconcile) all iterate
[StartIndex, StartIndex+Max) so the daemon only touches its own
range and leaves ccl-0..ccl-9 alone.
- Production config updated to prefix: "ccl-auto-", start_index: 11,
min: 10, max: 10 — covering the 10 real ccl-auto-11..20 sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a legitimate quota hit triggered a swap, killAllPoolSessions tore
down the dedicated interactive sessions (ccl-1-conformvault, ccl-2-scanyze)
along with the pool, then recreatePoolSessions re-opened them at a bare
bash prompt. The operator had to manually re-run
CLAUDE_CONFIG_DIR=<target> claude --dangerously-skip-permissions --resume <uuid>
after every swap, losing whatever conversation was mid-flight.
saveAllSessions only iterates sessions tracked as "working" in state;
user-driven dedicated sessions are rarely in that state so their resume
UUIDs were never saved.
- saveDedicatedUUIDs: capture resume UUID for every configured dedicated
session regardless of tracked state, before kill.
- relaunchDedicatedSessions(targetHome): after recreate, send a resume
command on each dedicated session pointing CLAUDE_CONFIG_DIR at the
target account's home. Missing UUID → leave at shell, no blind launch.
- isValidResumeUUID hardens against a corrupted resume-id.txt.
New TestDedicatedRelaunchAfterSwap verifies end-to-end: pane capture →
UUID persisted → resume command sent with the correct CLAUDE_CONFIG_DIR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503
errors are printed into Claude Code's conversation transcript and stay
visible in every tmux capture until the user scrolls. A persistent
server error would confirm on the second poll and still trigger a swap.
Root cause: the pattern "rate limit" (bare substring) matched any 500
payload that happened to mention rate limits in its error text. Real
HTTP 429s from Anthropic are typed as "rate_limit_error" in the error
payload — and that's the signature we should actually key on.
- Remove "rate limit" from quotaPatterns (too generic — matches transcripts).
- Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit".
- Add serverErrorPatterns veto: "api_error", "overloaded_error",
"internal server error", "api error: 5". When any is present in the
pane, isQuotaExhausted returns false even if a quota pattern matched.
- 4 new subtests covering the veto paths + sanity that real 429s pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Anthropic HTTP 500 errors surface in the TUI with payloads containing
"rate limit" text, which the monitor was matching against quotaPatterns
and treating as a real 429 quota hit. With no cooldown and no
confirmation, a burst of 500s produced sub-minute ping-pong swaps that
tore down user sessions.
Two-layer fix:
- quota.reactivate_cooldown (already in config, 5m) now gates the
monitor too — not just the dispatcher. A completed swap suppresses
further detection for the cooldown window.
- A hit with no parseable reset time is treated as suspected only on
the first poll; a second consecutive poll is required before
emitting SwapRequested. Legitimate 429s with "resets in ..." still
swap instantly on the first detection.
Adds state.RecordSwap / LastSwapInfo for the cooldown, and a
forensic log line on every detection: trigger_session, matched
pattern, 120-char pane snippet.
Tests cover: instant swap with reset, 2-poll confirmation without
reset, and suspected-state reset on recovery.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>