docs: add baseline-aware orchestrator design spec

sladro 2026-04-02 14:38:07 +08:00
parent 6560cdde97
commit 2d2e89eed4


# Baseline-Aware Single-Iteration Orchestrator Design
Date: 2026-04-02
## Summary
This document defines the next-stage design for the Artifact Loop Engine: a baseline-aware single-iteration orchestrator.
The current engine can load a task, run it, score it, decide keep or discard, and log the result. However, it is still too close to a task runner. To become a usable optimizer, it must manage three things correctly:
1. A stable baseline
2. A candidate workspace isolated from the main working tree
3. A deterministic keep or discard transition
This design adds those guarantees without yet introducing a multi-iteration loop.
## Problem Statement
The current implementation has a usable evaluation pipeline, but it is missing the orchestration layer that makes optimization safe and real.
Current limitations:
- The CLI can run and score a task, but it is not responsible for generating a candidate edit
- Mutation validation cannot be applied meaningfully in the current CLI flow because there is no pre-edit baseline view of a candidate
- Keep or discard semantics are still too close to "run in place and recover" rather than "evaluate isolated candidate and selectively adopt result"
- The current design is therefore closer to an evaluation runner than a true optimizer
To satisfy the intended use case, the engine must support one complete optimization iteration with real candidate generation, validation, execution, scoring, and state transition.
## Goals
- Add a true baseline-aware orchestration layer
- Support one complete candidate iteration from mutation to keep or discard
- Keep the main workspace stable unless a candidate is explicitly accepted
- Make mutation validation meaningful by comparing baseline artifacts to candidate artifacts
- Preserve the existing engine modules where possible
## Non-Goals
- Multi-iteration search loops
- Search strategies beyond one candidate at a time
- Parallel candidate execution
- Git-backed orchestration
- Multi-agent orchestration
- Changes to the existing training workflow
## Core Decision
The orchestrator will use a candidate sandbox.
It will not mutate the main working tree directly during candidate evaluation.
Why:
- Baseline semantics become simple and reliable
- Mutation validation can compare baseline artifacts to candidate artifacts
- Discard and crash are cheap and safe because the sandbox can be deleted
- Keep becomes an explicit sync of allowed artifact files back into the main workspace
This is more robust than in-place mutation with rollback.
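A minimal sketch of the sandbox lifecycle, assuming a plain `shutil`-based copy (the function names here are illustrative, not the engine's actual API):

```python
import shutil
import tempfile
from pathlib import Path


def create_sandbox(workspace: Path) -> Path:
    """Copy the workspace into a disposable temporary directory."""
    root = Path(tempfile.mkdtemp(prefix="candidate-"))
    sandbox = root / "workspace"
    shutil.copytree(workspace, sandbox)
    return sandbox


def delete_sandbox(sandbox: Path) -> None:
    """Discard and crash are cheap: just remove the directory tree."""
    shutil.rmtree(sandbox.parent, ignore_errors=True)
```

Because the sandbox is the only thing mutated, "rollback" is never needed: the main workspace is untouched until an explicit keep.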
## High-Level Architecture
The existing engine remains the foundation.
The new orchestrator adds one new layer:
- `task_loader`: load and validate task spec
- `artifact_manager`: snapshot, restore, diff, managed-root behavior
- `mutation_engine`: validate candidate changes against mutation budget
- `runner`: execute commands
- `scorer`: parse score output
- `decision_engine`: compare baseline vs candidate and apply policy
- `orchestrator`: create sandbox, run mutator, validate candidate, run candidate, decide keep or discard, sync accepted artifacts
## Runtime Model
Each single iteration works with three states:
### 1. Baseline Artifacts
These are the current accepted artifact files in the main workspace, limited to the task's allowed artifact set.
They are the source of truth for:
- mutation validation
- candidate comparison
- keep or discard decisions
### 2. Candidate Workspace
This is a temporary sandbox directory that contains a candidate version of the repository or task workspace.
The candidate workspace is where:
- mutator commands run
- task runner commands run
- scorer commands run
The candidate workspace is disposable.
### 3. Iteration Record
Each iteration produces a structured record containing:
- `task_id`
- `status`
- `reason`
- `baseline_score`
- `candidate_score`
- `diff_summary`
- `mutator_result`
- `runner_result`
- `scorer_result`
This record is the audit trail for a single candidate attempt.
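A possible shape for this record, sketched as a frozen dataclass (the field types are assumptions; the engine may represent the sub-results differently):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class IterationRecord:
    task_id: str
    status: str  # "keep" | "discard" | "crash"
    reason: str
    baseline_score: Optional[float]
    candidate_score: Optional[float]
    diff_summary: dict[str, Any] = field(default_factory=dict)
    mutator_result: Optional[dict[str, Any]] = None
    runner_result: Optional[dict[str, Any]] = None
    scorer_result: Optional[dict[str, Any]] = None
```

Keeping the record frozen matches its role as an audit trail: once an iteration is over, nothing should rewrite its outcome.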
## Task Spec Changes
The current task spec has `mutation`, but it does not describe how a candidate is generated.
The task spec must add a new `mutator` section.
Example:
```yaml
id: skill-quality
description: Optimize one skill file against a deterministic rubric.

artifacts:
  include:
    - fixtures/SKILL.md
  exclude: []
  max_files_per_iteration: 1

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
  max_changed_lines: 20

mutator:
  type: command
  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
  cwd: "tasks/skill-quality"
  timeout_seconds: 60

runner:
  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

scorer:
  type: command
  command: "python scripts/score_skill_task.py --input work/skill-run.json"
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```
### Why `mutator` is separate from `mutation`
- `mutation` describes limits and policy
- `mutator` describes how candidate edits are produced
This separation is required because the engine must distinguish:
- what is allowed
- how a candidate is generated
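One way to see why the split matters: validation consumes only the `mutation` limits, while candidate generation consumes only the `mutator` spec. A hedged sketch of budget enforcement (function and field names are illustrative; `max_files` comes from the `artifacts` section in the example above):

```python
def validate_candidate(diff_stats: dict, mutation: dict,
                       max_files: int) -> tuple[bool, str]:
    """Apply the mutation *policy* to a diff produced by the *mutator*."""
    if diff_stats["files_changed"] > max_files:
        return False, "too many files changed"
    if diff_stats["lines_changed"] > mutation["max_changed_lines"]:
        return False, "changed-line budget exceeded"
    disallowed = [
        f for f in diff_stats["files"]
        if not any(f.endswith(ext) for ext in mutation["allowed_file_types"])
    ]
    if disallowed:
        return False, "disallowed file types: " + ", ".join(disallowed)
    return True, "ok"
```

Note that nothing here knows how the diff was produced; the mutator could be swapped out without touching the policy check.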
## Proposed Model Changes
The following additions are needed in `engine/models.py`:
- `MutatorSpec`
- optional `MutatorResult`
- optional `IterationRecord`
Suggested shape:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutatorSpec:
    type: str
    command: str
    cwd: str
    timeout_seconds: int
```
The current task model should be extended so `TaskSpec` includes `mutator`.
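Loading the new section could look roughly like this (a sketch, assuming the task YAML has already been parsed into a dict; `parse_mutator` and its defaults are illustrative, not the real loader API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutatorSpec:
    type: str
    command: str
    cwd: str
    timeout_seconds: int


def parse_mutator(raw: dict) -> MutatorSpec:
    """Build a MutatorSpec from the task spec's `mutator` mapping."""
    return MutatorSpec(
        type=raw.get("type", "command"),
        command=raw["command"],  # required: a mutator without a command is meaningless
        cwd=raw.get("cwd", "."),
        timeout_seconds=int(raw.get("timeout_seconds", 60)),
    )
```

Keeping `mutator` optional on `TaskSpec` preserves backward compatibility with tasks that are evaluation-only.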
## Orchestration Flow
The baseline-aware single iteration runs as follows:
1. Load task spec
2. Capture baseline artifact snapshot from the main workspace
3. Create a candidate sandbox directory
4. Copy the required workspace into the sandbox
5. Run the `mutator` inside the sandbox
6. Compare baseline artifacts to candidate artifacts
7. Run mutation validation using the baseline snapshot and the candidate artifact state
8. If mutation validation fails:
- mark `discard`
- do not modify main workspace
- write iteration record
- delete sandbox
9. Run the task `runner` inside the sandbox
10. If runner fails:
- mark `crash`
- do not modify main workspace
- write iteration record
- delete sandbox
11. Run the `scorer` inside the sandbox or against sandbox outputs
12. If scoring fails:
- mark `crash`
- do not modify main workspace
- write iteration record
- delete sandbox
13. Compare candidate vs baseline using `decision_engine`
14. If decision is `keep`:
- copy allowed artifact files from sandbox back to main workspace
- write iteration record
15. If decision is `discard`:
- do not modify main workspace
- write iteration record
16. Delete sandbox
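The sixteen steps above can be compressed into a control-flow sketch. Every operation is injected as a callable so the sketch stays independent of the real engine modules (the `ops` keys are stand-ins, not actual module names):

```python
from typing import Callable


def run_iteration(ops: dict[str, Callable]) -> str:
    """One baseline-aware iteration: sandbox, mutate, validate, run, score, decide."""
    baseline = ops["snapshot"]()               # step 2: baseline artifacts
    sandbox = ops["create_sandbox"]()          # steps 3-4: disposable copy
    try:
        ops["mutate"](sandbox)                 # step 5: produce candidate edit
        if not ops["validate"](baseline, sandbox):   # steps 6-8: budget check
            return "discard"
        if not ops["run"](sandbox):            # steps 9-10: task runner
            return "crash"
        score = ops["score"](sandbox)          # steps 11-12: scorer
        if score is None:
            return "crash"
        if ops["decide"](baseline, score):     # step 13: policy comparison
            ops["sync"](sandbox)               # step 14: keep -> sync artifacts
            return "keep"
        return "discard"                       # step 15
    finally:
        ops["delete_sandbox"](sandbox)         # step 16: always clean up
```

The `finally` block is the point: the sandbox is deleted on every path, so crash handling never needs workspace recovery.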
## Keep, Discard, and Crash Semantics
### Keep
- Candidate is accepted
- Only allowed artifact files are copied from sandbox to main workspace
- Main workspace becomes the new accepted baseline
### Discard
- Candidate is rejected
- Main workspace remains unchanged
- Sandbox is deleted
### Crash
- Candidate evaluation failed
- Main workspace remains unchanged
- Sandbox is deleted
- CLI should return non-zero
These semantics are stricter and simpler than in-place mutation with rollback.
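Since crash must return non-zero while keep and discard are both successful evaluations, one plausible exit-code mapping is (the specific codes are an assumption, not a fixed contract):

```python
EXIT_CODES = {
    "keep": 0,     # candidate accepted, baseline advanced
    "discard": 0,  # evaluation succeeded, candidate rejected
    "crash": 1,    # evaluation itself failed; callers must notice
}


def exit_code_for(status: str) -> int:
    """Map an iteration status to a CLI exit code; unknown statuses fail loudly."""
    return EXIT_CODES.get(status, 1)
```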
## Candidate Sync Rules
Only files in the allowed artifact set may be copied back from sandbox to main workspace.
The orchestrator must not sync:
- runner outputs outside allowed artifacts
- task logs
- scorer temp files
- unrelated files inside the sandbox
This prevents accidental expansion of the accepted state.
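A sync-back sketch that enforces these rules by construction, copying only matches of the include patterns (names and the glob-based matching are assumptions about how the artifact set is expressed):

```python
import fnmatch
import shutil
from pathlib import Path


def sync_accepted_artifacts(sandbox: Path, workspace: Path,
                            include: list[str],
                            exclude: list[str]) -> list[str]:
    """Copy only allowed artifact files back; everything else stays behind."""
    synced = []
    for pattern in include:
        for src in sandbox.glob(pattern):
            rel = src.relative_to(sandbox)
            if any(fnmatch.fnmatch(str(rel), ex) for ex in exclude):
                continue
            dst = workspace / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            synced.append(str(rel))
    return synced
```

Because the copy is driven by the include list rather than by "whatever changed in the sandbox", runner outputs, logs, and scorer temp files can never leak into the accepted state.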
## Path Model
The orchestrator must make path anchoring explicit.
Recommended rules:
- Task-relative paths resolve from the task file location
- Repo-managed paths resolve from repository root
- Candidate execution paths resolve inside the sandbox root
- Iteration results may still be logged in the main workspace when they are intended as global orchestration output rather than candidate state
This avoids the `Path.cwd()` ambiguity that previously caused path fragility.
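The rules above reduce to one discipline: every relative path is joined to an explicit anchor, never to the process working directory. A minimal sketch (the anchor names in the comments are assumptions):

```python
from pathlib import Path


def resolve(raw: str, anchor: Path) -> Path:
    """Resolve a spec path against an explicit anchor instead of Path.cwd()."""
    p = Path(raw)
    return p if p.is_absolute() else anchor / p

# Illustrative anchors, one per rule:
#   resolve("fixtures/SKILL.md", task_dir)   -> task-relative path
#   resolve("work/skill-run.json", repo_root) -> repo-managed path
#   resolve(runner_cwd, sandbox_root)         -> candidate execution path
```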
## Testing Strategy
The minimum required tests for this orchestrator are:
### Keep Case
- candidate mutates one allowed artifact
- candidate scores better than baseline
- artifact is synced back to main workspace
### Discard Case
- candidate mutates one allowed artifact
- candidate does not improve
- main workspace remains unchanged
### Crash Case
- mutator, runner, or scorer fails
- main workspace remains unchanged
- CLI returns non-zero for crash
### Mutation Budget Case
- candidate exceeds allowed file count or changed line budget
- candidate is discarded before runner execution
### Path Isolation Case
- orchestrator runs from outside repo root
- absolute task path still works
These tests are required for the orchestrator to be considered usable.
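The discard case, for instance, could be pinned down along these lines: mutate a copy, delete the copy, and assert the original is byte-identical (a self-contained sketch, not the real test suite):

```python
import shutil
import tempfile
from pathlib import Path


def test_discard_leaves_workspace_unchanged():
    workspace = Path(tempfile.mkdtemp())
    artifact = workspace / "SKILL.md"
    artifact.write_text("baseline content")

    # Candidate sandbox: copy, mutate, then discard without syncing back.
    sandbox = Path(tempfile.mkdtemp())
    shutil.copy2(artifact, sandbox / "SKILL.md")
    (sandbox / "SKILL.md").write_text("worse candidate")
    shutil.rmtree(sandbox)  # discard: delete the sandbox only

    assert artifact.read_text() == "baseline content"
```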
## Recommended File Changes
The next implementation phase should touch only the following areas:
- `engine/models.py`
- `engine/task_loader.py`
- `engine/mutation_engine.py`
- new `engine/orchestrator.py`
- `scripts/run_task.py`
- `tasks/skill-quality/task.yaml`
- tests for orchestration and sync-back behavior
This keeps the scope focused on adding one reliable orchestration layer.
## Open Questions Deferred
The following topics are explicitly deferred:
- Multi-iteration looping
- Search strategies
- Branch-per-candidate workflows
- Git-backed isolation
- Parallel execution
- Human review checkpoints
These are all downstream of the single-iteration orchestrator.
## Recommendation
Proceed with a candidate-sandbox baseline-aware orchestrator as the next implementation step.
Do not extend directly to multi-iteration autonomy yet. First make one iteration correct, isolated, and safe.