From 2d2e89eed498f5b3167170da787d3543b1531302 Mon Sep 17 00:00:00 2001
From: sladro
Date: Thu, 2 Apr 2026 14:38:07 +0800
Subject: [PATCH] docs: add baseline-aware orchestrator design spec

---
 ...re-single-iteration-orchestrator-design.md | 344 ++++++++++++++++++
 1 file changed, 344 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-02-baseline-aware-single-iteration-orchestrator-design.md

diff --git a/docs/superpowers/specs/2026-04-02-baseline-aware-single-iteration-orchestrator-design.md b/docs/superpowers/specs/2026-04-02-baseline-aware-single-iteration-orchestrator-design.md
new file mode 100644
index 0000000..efcac27
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-02-baseline-aware-single-iteration-orchestrator-design.md
@@ -0,0 +1,344 @@
+# Baseline-Aware Single-Iteration Orchestrator Design
+
+Date: 2026-04-02
+
+## Summary
+
+This document defines the next-stage design for the Artifact Loop Engine: a baseline-aware single-iteration orchestrator.
+
+The current engine can load a task, run it, score it, decide keep or discard, and log the result. However, it is still too close to a task runner. To become a usable optimizer, it must manage three things correctly:
+
+1. A stable baseline
+2. A candidate workspace isolated from the main working tree
+3. A deterministic keep or discard transition
+
+This design adds those guarantees without yet introducing a multi-iteration loop.
+
+## Problem Statement
+
+The current implementation has a usable evaluation pipeline, but it is missing the orchestration layer that makes optimization safe and real.
+
+Current limitations:
+
+- The CLI can run and score a task, but it is not responsible for generating a candidate edit
+- Mutation validation cannot be applied meaningfully in the current CLI flow because there is no pre-edit baseline view of a candidate
+- Keep or discard semantics are still too close to "run in place and recover" rather than "evaluate isolated candidate and selectively adopt result"
+- The current design is therefore closer to an evaluation runner than a true optimizer
+
+To satisfy the intended use case, the engine must support one complete optimization iteration with real candidate generation, validation, execution, scoring, and state transition.
+
+## Goals
+
+- Add a true baseline-aware orchestration layer
+- Support one complete candidate iteration from mutation to keep or discard
+- Keep the main workspace stable unless a candidate is explicitly accepted
+- Make mutation validation meaningful by comparing baseline artifacts to candidate artifacts
+- Preserve the existing engine modules where possible
+
+## Non-Goals
+
+- Multi-iteration search loops
+- Search strategies beyond one candidate at a time
+- Parallel candidate execution
+- Git-backed orchestration
+- Multi-agent orchestration
+- Changes to the existing training workflow
+
+## Core Decision
+
+The orchestrator will use a candidate sandbox.
+
+It will not mutate the main working tree directly during candidate evaluation.
+
+Why:
+
+- Baseline semantics become simple and reliable
+- Mutation validation can compare baseline artifacts to candidate artifacts
+- Discard and crash are cheap and safe because the sandbox can be deleted
+- Keep becomes an explicit sync of allowed artifact files back into the main workspace
+
+This is more robust than in-place mutation with rollback.
+
+## High-Level Architecture
+
+The existing engine remains the foundation.
+
+The new orchestrator adds one new layer:
+
+- `task_loader`: load and validate task spec
+- `artifact_manager`: snapshot, restore, diff, managed-root behavior
+- `mutation_engine`: validate candidate changes against mutation budget
+- `runner`: execute commands
+- `scorer`: parse score output
+- `decision_engine`: compare baseline vs candidate and apply policy
+- `orchestrator`: create sandbox, run mutator, validate candidate, run candidate, decide keep or discard, sync accepted artifacts
+
+## Runtime Model
+
+Each single iteration works with three states:
+
+### 1. Baseline Artifacts
+
+These are the current accepted artifact files in the main workspace, limited to the task's allowed artifact set.
+
+They are the source of truth for:
+
+- mutation validation
+- candidate comparison
+- keep or discard decisions
+
+### 2. Candidate Workspace
+
+This is a temporary sandbox directory that contains a candidate version of the repository or task workspace.
+
+The candidate workspace is where:
+
+- mutator commands run
+- task runner commands run
+- scorer commands run
+
+The candidate workspace is disposable.
+
+### 3. Iteration Record
+
+Each iteration produces a structured record containing:
+
+- `task_id`
+- `status`
+- `reason`
+- `baseline_score`
+- `candidate_score`
+- `diff_summary`
+- `mutator_result`
+- `runner_result`
+- `scorer_result`
+
+This record is the audit trail for a single candidate attempt.
+
+## Task Spec Changes
+
+The current task spec has `mutation`, but it does not describe how a candidate is generated.
+
+The task spec must add a new `mutator` section.
+
+Example:
+
+```yaml
+id: skill-quality
+description: Optimize one skill file against a deterministic rubric.
+
+artifacts:
+  include:
+    - fixtures/SKILL.md
+  exclude: []
+  max_files_per_iteration: 1
+
+mutation:
+  mode: direct_edit
+  allowed_file_types:
+    - .md
+  max_changed_lines: 20
+
+mutator:
+  type: command
+  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
+  cwd: "tasks/skill-quality"
+  timeout_seconds: 60
+
+runner:
+  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
+  cwd: "tasks/skill-quality"
+  timeout_seconds: 30
+
+scorer:
+  type: command
+  command: "python scripts/score_skill_task.py --input work/skill-run.json"
+  parse:
+    format: json
+    score_field: score
+    metrics_field: metrics
+```
+
+### Why `mutator` is separate from `mutation`
+
+- `mutation` describes limits and policy
+- `mutator` describes how candidate edits are produced
+
+This separation is required because the engine must distinguish:
+
+- what is allowed
+- how a candidate is generated
+
+## Proposed Model Changes
+
+The following additions are needed in `engine/models.py`:
+
+- `MutatorSpec`
+- optional `MutatorResult`
+- optional `IterationRecord`
+
+Suggested shape:
+
+```python
+@dataclass(frozen=True)
+class MutatorSpec:
+    type: str
+    command: str
+    cwd: str
+    timeout_seconds: int
+```
+
+The current task model should be extended so `TaskSpec` includes `mutator`.
+
+## Orchestration Flow
+
+The baseline-aware single iteration runs as follows:
+
+1. Load task spec
+2. Capture baseline artifact snapshot from the main workspace
+3. Create a candidate sandbox directory
+4. Copy the required workspace into the sandbox
+5. Run the `mutator` inside the sandbox
+6. Compare baseline artifacts to candidate artifacts
+7. Run mutation validation using the baseline snapshot and the candidate artifact state
+8. If mutation validation fails:
+   - mark `discard`
+   - do not modify main workspace
+   - write iteration record
+   - delete sandbox
+9. Run the task `runner` inside the sandbox
+10. If runner fails:
+   - mark `crash`
+   - do not modify main workspace
+   - write iteration record
+   - delete sandbox
+11. Run the `scorer` inside the sandbox or against sandbox outputs
+12. If scoring fails:
+   - mark `crash`
+   - do not modify main workspace
+   - write iteration record
+   - delete sandbox
+13. Compare candidate vs baseline using `decision_engine`
+14. If decision is `keep`:
+   - copy allowed artifact files from sandbox back to main workspace
+   - write iteration record
+15. If decision is `discard`:
+   - do not modify main workspace
+   - write iteration record
+16. Delete sandbox
+
+## Keep, Discard, and Crash Semantics
+
+### Keep
+
+- Candidate is accepted
+- Only allowed artifact files are copied from sandbox to main workspace
+- Main workspace becomes the new accepted baseline
+
+### Discard
+
+- Candidate is rejected
+- Main workspace remains unchanged
+- Sandbox is deleted
+
+### Crash
+
+- Candidate evaluation failed
+- Main workspace remains unchanged
+- Sandbox is deleted
+- CLI should return non-zero
+
+These semantics are stricter and simpler than in-place mutation with rollback.
+
+## Candidate Sync Rules
+
+Only files in the allowed artifact set may be copied back from sandbox to main workspace.
+
+The orchestrator must not sync:
+
+- runner outputs outside allowed artifacts
+- task logs
+- scorer temp files
+- unrelated files inside the sandbox
+
+This prevents accidental expansion of the accepted state.
+
+## Path Model
+
+The orchestrator must make path anchoring explicit.
+
+Recommended rules:
+
+- Task-relative paths resolve from the task file location
+- Repo-managed paths resolve from repository root
+- Candidate execution paths resolve inside the sandbox root
+- Results logging for the current iteration may remain in the main workspace if it is intended as global orchestration output
+
+This avoids the `Path.cwd()` ambiguity that previously caused path fragility.
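The path anchoring rules in the Path Model section could be made concrete along the following lines. This is only a sketch under stated assumptions: `PathAnchors` and its method names are hypothetical and do not appear in the spec, and the escape check on sandbox paths is one possible reading of "resolve inside the sandbox root", not a confirmed requirement.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PathAnchors:
    """Hypothetical helper: every relative path names its anchor explicitly."""

    task_file: Path     # absolute path to the task spec file
    repo_root: Path     # absolute path to the repository root
    sandbox_root: Path  # absolute path to the candidate sandbox

    def task_path(self, relative: str) -> Path:
        # Task-relative paths resolve from the directory holding the task file.
        return (self.task_file.parent / relative).resolve()

    def repo_path(self, relative: str) -> Path:
        # Repo-managed paths resolve from the repository root.
        return (self.repo_root / relative).resolve()

    def sandbox_path(self, relative: str) -> Path:
        # Candidate execution paths must stay inside the sandbox root
        # (assumption: escaping paths are rejected rather than clamped).
        root = self.sandbox_root.resolve()
        resolved = (root / relative).resolve()
        try:
            resolved.relative_to(root)
        except ValueError:
            raise ValueError(f"{relative!r} escapes the sandbox root") from None
        return resolved
```

None of these methods consult `Path.cwd()`, so the result does not depend on where the CLI is invoked from, provided the three anchors are passed in as absolute paths.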
+
+## Testing Strategy
+
+The minimum required tests for this orchestrator are:
+
+### Keep Case
+
+- candidate mutates one allowed artifact
+- candidate scores better than baseline
+- artifact is synced back to main workspace
+
+### Discard Case
+
+- candidate mutates one allowed artifact
+- candidate does not improve
+- main workspace remains unchanged
+
+### Crash Case
+
+- mutator, runner, or scorer fails
+- main workspace remains unchanged
+- CLI returns non-zero for crash
+
+### Mutation Budget Case
+
+- candidate exceeds allowed file count or changed line budget
+- candidate is discarded before runner execution
+
+### Path Isolation Case
+
+- orchestrator runs from outside repo root
+- absolute task path still works
+
+These tests are required for the orchestrator to be considered usable.
+
+## Recommended File Changes
+
+The next implementation phase should touch only the following areas:
+
+- `engine/models.py`
+- `engine/task_loader.py`
+- `engine/mutation_engine.py`
+- new `engine/orchestrator.py`
+- `scripts/run_task.py`
+- `tasks/skill-quality/task.yaml`
+- tests for orchestration and sync-back behavior
+
+This keeps the scope focused on adding one reliable orchestration layer.
+
+## Open Questions Deferred
+
+The following topics are explicitly deferred:
+
+- Multi-iteration looping
+- Search strategies
+- Branch-per-candidate workflows
+- Git-backed isolation
+- Parallel execution
+- Human review checkpoints
+
+These are all downstream of the single-iteration orchestrator.
+
+## Recommendation
+
+Proceed with a candidate-sandbox baseline-aware orchestrator as the next implementation step.
+
+Do not extend directly to multi-iteration autonomy yet. First make one iteration correct, isolated, and safe.
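As a rough illustration of what "one iteration correct, isolated, and safe" might look like, the sketch below compresses the keep, discard, and crash semantics into a single function. Everything named here is hypothetical: `run_single_iteration`, the `mutate` and `score` callables (standing in for the command-based mutator, runner, and scorer), and the rule that a strictly higher score wins are assumptions, not part of the spec. Mutation validation and the iteration record are omitted for brevity.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_single_iteration(
    workspace: Path,
    allowed_artifacts: Iterable[str],
    mutate: Callable[[Path], None],
    score: Callable[[Path], float],
) -> str:
    """Hypothetical sketch: evaluate one candidate in a sandbox, return its status."""
    allowed = list(allowed_artifacts)
    # Assumption: the baseline score comes from scoring the main workspace as-is.
    baseline_score = score(workspace)
    sandbox = Path(tempfile.mkdtemp(prefix="candidate-"))
    try:
        candidate = sandbox / "workspace"
        shutil.copytree(workspace, candidate)
        try:
            mutate(candidate)
            candidate_score = score(candidate)
        except Exception:
            return "crash"  # main workspace is left untouched
        # Assumption: higher is better, and ties are discarded.
        if candidate_score <= baseline_score:
            return "discard"
        # Keep: sync only the allowed artifact files back to the main workspace.
        for relative in allowed:
            source = candidate / relative
            if source.is_file():
                target = workspace / relative
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source, target)
        return "keep"
    finally:
        # The sandbox is disposable on every path: keep, discard, or crash.
        shutil.rmtree(sandbox, ignore_errors=True)
```

A function of this shape lends itself to the keep, discard, and crash tests listed under Testing Strategy: each test can build a temporary workspace, pass in a mutator that improves, worsens, or raises, and then assert on both the returned status and the contents of the main workspace.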