Commit graph

12 commits

Author SHA1 Message Date
Ubuntu
3e20085204 feat(phase2-E): multi-provider routing via secutools delegation
Adds optional delegation of agent-queue tasks to the SecuAAS secutools
AI platform (GPU / Gemini / Claude API) instead of dispatching to a
local Claude Code tmux session. Per-task opt-in via YAML frontmatter
fields preferred_ai, allow_delegation, complexity_hint — absence keeps
the Phase 1 behaviour exactly (zero breaking change).

Go side:
- internal/secutools: HTTP client with exponential-backoff retries
  (SubmitJob/GetJob/WaitForResult), DecideProvider map adapter for CLI
  use, table tests.
- internal/router: struct-typed Decide() with strict precedence
  (needs_claude_code > preferred_ai=claude-code > allow_delegation=false
  > preferred_ai > fail-safe local on unknown).
- internal/delegation: Manager submits jobs, writes .md.delegated
  markers for on-restart recovery, runs a periodic reaper that moves
  completed jobs into done/ with provider/cost footer and failed jobs
  into failed/.
- internal/dispatcher: WithDelegation() opt-in, routeTask hook before
  findFreeSession, skips .md.delegated in assignNextTask.
- internal/api: /api/delegated/status (active jobs + counters),
  /watchdog/status extended with delegation counters.
- cmd/ccl-delegate: small CLI exposing submit/get/result/decide so the
  bash dispatcher can call the same contract without duplicating logic.
- cmd/claude-failover: delegation wired opt-in via SECUTOOLS_API_KEY.

Tests:
- 29+ new unit tests across router, secutools, delegation, dispatcher,
  api packages. go test -race -count=1 clean.
- tests/phase2-E-integration.sh: bash end-to-end against a Python
  stdlib mock HTTP server, exercising the dev-management scripts.

Forward-compat with watchdog (Phase 1 B1 already ignores
state=delegated_to_secutools) so delegated tasks aren't flagged stale.
2026-04-17 02:17:19 +00:00
Ubuntu
20063b1939 fix(switcher+symlinks): rollback on ensure failure (Bug #1) + requiredShared contract test (Bug #10)
Bug #1 (CRITIQUE) — A3 flip+ensure inconsistency
- Before: EnsureForAccount failure after flip was WARN-only, SetActiveAccount
  still fired → daemon declared target active while shared symlinks were
  absent/divergent → transcripts silently duplicated, resume broken.
- After: ensure failure triggers rollback flip to previous account home;
  if rollback succeeds → explicit error, ActiveAccount stays on previous.
  If rollback ALSO fails → sticky partialSwap flag + ErrPartialSwap; all
  further swaps refused until operator intervention (daemon restart).
- New public IsPartialSwap() for watchdog / health-check integration.

Bug #10 (MOYENNE) — requiredShared contract never exercised
- All existing tests override a.sharedSymlinks with tmpdir-scoped lists,
  so symlinks.RequiredShared itself was never tested. A rename or drop
  would pass every test but silently break prod failover.
- TestRequiredSharedIsCoherent asserts (no filesystem): 3 entries with
  the exact required names, absolute targets, and a single shared parent
  directory (invariant EnsureForAccount depends on).

Tests:
- go test ./... PASS
- go test -race ./... PASS (no data race)
- 2 new switcher tests: TestFlipEnsureFailureTriggersRollback,
  TestFlipEnsureAndRollbackFailure
- 1 new symlinks test: TestRequiredSharedIsCoherent
- 1 obsolete test replaced: TestFlipEnsureSymlinksFailureDoesNotAbortSwap
  (encoded the old buggy best-effort behaviour)
2026-04-16 19:53:48 +00:00
Ubuntu
8eaf0bbd35 feat(switcher): ensure shared symlinks on target home after flip (A3)
Wire symlinks.EnsureForAccount into executeSwitch, called immediately
after the ~/.claude flip. Guarantees the three shared-state links
(session-env, file-history, projects) exist on the target account home
even for freshly-provisioned accounts, preventing silent transcript
duplication and undo-history divergence on first resume.

Best-effort: errors are logged as WARN but never abort the swap. If we
returned here the daemon would be left inconsistent (symlink flipped,
SetActiveAccount never called). Operator sees the warning in logs and
resolves divergent links manually.

Tests:
- TestFlipReconcilesSharedSymlinksOnTargetHome: empty target home gets
  all three links pointing at canonical targets after the flip.
- TestFlipEnsureSymlinksFailureDoesNotAbortSwap: a planted divergent
  link triggers the symlinks-package error; the swap completes anyway
  and the active account is updated.

Hermetic: added AccountSwitcher.sharedSymlinks override so tests scope
the reconcile inside t.TempDir() and never touch
/home/ubuntu/.claude-*-shared. Existing tests migrated to this pattern
and hardcoded /tmp/claude-*-xxxx paths replaced with tmpdirs.

Phase 1 / Chantier A — task A3.
2026-04-16 19:34:03 +00:00
Ubuntu
e16e3526a0 feat(lifecycle): validate shared symlinks at daemon startup (A2)
Wire symlinks.ValidateAll into the lifecycle manager so the daemon
refuses to start if any configured account is missing one of the
shared-state symlinks or if a link diverges from the canonical target.

Previously, a missing link on a freshly deployed VM would silently
create a divergent state tree per account (duplicate JSONL transcripts,
broken undo history) — exactly the failure mode the symlinks package
(A1) was introduced to prevent.

The check runs once at startup before EnsureAllSessions, guarding a
single well-defined invariant: "every account home shares the same
projects/, file-history/ and session-env/ roots". No auto-heal on
divergence — we fail fast with an explicit error so the operator fixes
it manually rather than one account's state being overwritten.

Part of Phase 1 Chantier A — Failover robuste.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 19:03:43 +00:00
Ubuntu
91091d7abf feat(symlinks): add shared-state symlink manager (A1)
Adds internal/symlinks package that encodes in code the convention
previously maintained by hand on the VM: every Claude account home
must expose `session-env`, `file-history` and `projects` as symlinks
to a single shared target, so account failover does not create
divergent state (duplicate JSONL transcripts, broken undo history).

- EnsureForAccount(home, required) creates missing links and target
  directories, refuses to auto-correct a divergent link (risks data
  loss), and errors when a regular file sits where the link belongs.
- ValidateAll(homes, required) aggregates errors across both accounts
  so the operator sees every problem at once rather than fixing one
  per restart cycle.
- RequiredShared exposes the production defaults so lifecycle and
  switcher (A2/A3) can depend on it directly.

9/9 unit tests green.

Part of Phase 1 Chantier A — Failover robuste.
2026-04-16 18:55:32 +00:00
Ubuntu
4cbdcf143a fix(dispatcher+watcher): never auto-dispatch into dedicated sessions
Observed: tasks from filesecure/.agent-queue/inbox and SecuScan/
.agent-queue/inbox were being routed into ccl-1-conformvault and
ccl-2-scanyze whenever those sessions happened to be idle. Those are
the operator's manual interactive Claude sessions, not dispatch
targets — the auto-dispatch was (a) hijacking a Claude instance the
operator was using and (b) triggering /exit via the watcher's
completion path when the side-task finished, kicking the operator out
mid-conversation.

findFreeSession was iterating Pool.Dedicated before the autonomous
pool, so any idle dedicated session was the first candidate.

- Dispatcher.findFreeSession: remove the Dedicated loop entirely.
  Auto-dispatch is now pool-only (ccl-auto-11..20).
- Watcher.completeSession: defense-in-depth — even if a dedicated
  session ever ends up in "working" state, it is no longer /exit'd;
  just marked idle. Pool /exit behaviour unchanged (context recycle).
- Tests: new TestFindFreeSessionSkipsDedicated proves the routing;
  3 existing tests rewritten to use the autonomous pool instead of
  relying on Dedicated as a fake pool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:30:26 +00:00
Ubuntu
6b109ed1bc fix(dispatcher): send a lone Enter after the task paste to submit it
Multi-line task bodies arrived in Claude Code as "[Pasted text #N +M lines]"
and sat in the input buffer forever — the trailing Enter that SendKeys
appends to the paste is consumed as a newline inside the paste, not as a
submit. Observed live on ccl-auto-11 (secumon) and ccl-auto-12 (secuops):
prompt visible, agent idle.

- tmux.Client grows a SendEnter(session) method. ExecClient runs
  `tmux send-keys -t <sess> Enter` (no preceding text), which Claude's
  TUI accepts as the explicit submit action after a paste.
- Dispatcher: after SendKeys(msg), sleep 500ms for the paste to register,
  then SendEnter. Same sequence a human would perform.
- Five mockTmux implementations updated (quota, dispatcher, switcher,
  lifecycle, watcher tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:49:59 +00:00
Ubuntu
eb6b74c547 feat(pool): add start_index so manual and auto pools can coexist
Production had two disjoint tmux pools named alike but for different
purposes:
  ccl-0..ccl-9           — manual/interactive sessions (operator)
  ccl-auto-11..ccl-auto-20 — autonomous dispatcher pool

Until now the daemon's loops iterated prefix + 0..Max, so with the
deployed config ("prefix: ccl-auto", min=2, max=10) the dispatcher
looked for sessions "ccl-auto0..ccl-auto9" that never existed, while
the real auto pool ccl-auto-11..20 was invisible. Net effect: no task
was ever dispatched, and killAllPoolSessions fabricated phantom
"ccl-auto0/1" sessions on each swap.

- AutonomousConfig gains StartIndex (yaml start_index, default 0).
  Behaviour is unchanged when StartIndex is 0.
- Monitor, switcher (kill + recreate), dispatcher (findFreeSession),
  and lifecycle (EnsureAll + reconcile) all iterate
  [StartIndex, StartIndex+Max) so the daemon only touches its own
  range and leaves ccl-0..ccl-9 alone.
- Production config updated to prefix: "ccl-auto-", start_index: 11,
  min: 10, max: 10 — covering the 10 real ccl-auto-11..20 sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:39:57 +00:00
Ubuntu
8fdb1937fc feat(switcher): auto-resume dedicated sessions after a swap
When a legitimate quota hit triggered a swap, killAllPoolSessions tore
down the dedicated interactive sessions (ccl-1-conformvault, ccl-2-scanyze)
along with the pool, then recreatePoolSessions re-opened them at a bare
bash prompt. The operator had to manually re-run
  CLAUDE_CONFIG_DIR=<target> claude --dangerously-skip-permissions --resume <uuid>
after every swap, losing whatever conversation was mid-flight.

saveAllSessions only iterates sessions tracked as "working" in state;
user-driven dedicated sessions are rarely in that state so their resume
UUIDs were never saved.

- saveDedicatedUUIDs: capture resume UUID for every configured dedicated
  session regardless of tracked state, before kill.
- relaunchDedicatedSessions(targetHome): after recreate, send a resume
  command on each dedicated session pointing CLAUDE_CONFIG_DIR at the
  target account's home. Missing UUID → leave at shell, no blind launch.
- isValidResumeUUID hardens against a corrupted resume-id.txt.

New TestDedicatedRelaunchAfterSwap verifies end-to-end: pane capture →
UUID persisted → resume command sent with the correct CLAUDE_CONFIG_DIR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:24:38 +00:00
Ubuntu
62e98cb9e7 fix(quota): veto 5xx errors + tighten patterns to stop false-positive swaps
v0.2.2's 2-poll confirmation was insufficient because Anthropic 500/503
errors are printed into Claude Code's conversation transcript and stay
visible in every tmux capture until the user scrolls. A persistent
server error would confirm on the second poll and still trigger a swap.

Root cause: the pattern "rate limit" (bare substring) matched any 500
payload that happened to mention rate limits in its error text. Real
HTTP 429s from Anthropic are typed as "rate_limit_error" in the error
payload — and that's the signature we should actually key on.

- Remove "rate limit" from quotaPatterns (too generic — matches transcripts).
- Add "rate_limit_error" (Anthropic's typed 429 error) and "5-hour limit".
- Add serverErrorPatterns veto: "api_error", "overloaded_error",
  "internal server error", "api error: 5". When any is present in the
  pane, isQuotaExhausted returns false even if a quota pattern matched.
- 4 new subtests covering the veto paths + sanity that real 429s pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:26:00 +00:00
Ubuntu
7c5f8384fa fix(quota): add cooldown + 2-poll confirmation to prevent swap ping-pong
Anthropic HTTP 500 errors surface in the TUI with payloads containing
"rate limit" text, which the monitor was matching against quotaPatterns
and treating as a real 429 quota hit. With no cooldown and no
confirmation, a burst of 500s produced sub-minute ping-pong swaps that
tore down user sessions.

Two-layer fix:
- quota.reactivate_cooldown (already in config, 5m) now gates the
  monitor too — not just the dispatcher. A completed swap suppresses
  further detection for the cooldown window.
- A hit with no parseable reset time is treated as suspected only on
  the first poll; a second consecutive poll is required before
  emitting SwapRequested. Legitimate 429s with "resets in ..." still
  swap instantly on the first detection.

Adds state.RecordSwap / LastSwapInfo for the cooldown, and a
forensic log line on every detection: trigger_session, matched
pattern, 120-char pane snippet.

Tests cover: instant swap with reset, 2-poll confirmation without
reset, and suspected-state reset on recovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:18:27 +00:00
Ubuntu
c87145ea0b feat(watcher): Phase 2.1 — SessionWatcher goroutine
- internal/watcher: detecte fin de tache via signal file, prompt ❯, idle timeout
- state: ForEachWorking, SetStalled, SetActiveAccount, ActiveAccount
- config: WatcherConfig, DispatcherConfig, JanitorConfig, NotificationsConfig + defaults
- 5 tests unitaires, go test ./... -race OK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:27:51 +00:00