docs: add baseline-aware orchestrator design spec

sladro 2026-04-02 14:38:07 +08:00
parent 6560cdde97
commit 2d2e89eed4


# Baseline-Aware Single-Iteration Orchestrator Design
Date: 2026-04-02
## Summary
This document defines the next-stage design for the Artifact Loop Engine: a baseline-aware single-iteration orchestrator.
The current engine can load a task, run it, score it, decide keep or discard, and log the result. However, it is still too close to a task runner. To become a usable optimizer, it must manage three things correctly:
1. A stable baseline
2. A candidate workspace isolated from the main working tree
3. A deterministic keep or discard transition
This design adds those guarantees without yet introducing a multi-iteration loop.
## Problem Statement
The current implementation has a usable evaluation pipeline, but it is missing the orchestration layer that makes optimization safe and real.
Current limitations:
- The CLI can run and score a task, but it is not responsible for generating a candidate edit
- Mutation validation cannot be applied meaningfully in the current CLI flow because there is no pre-edit baseline view of a candidate
- Keep or discard semantics are still too close to "run in place and recover" rather than "evaluate isolated candidate and selectively adopt result"
- The current design is therefore closer to an evaluation runner than a true optimizer
To satisfy the intended use case, the engine must support one complete optimization iteration with real candidate generation, validation, execution, scoring, and state transition.
## Goals
- Add a true baseline-aware orchestration layer
- Support one complete candidate iteration from mutation to keep or discard
- Keep the main workspace stable unless a candidate is explicitly accepted
- Make mutation validation meaningful by comparing baseline artifacts to candidate artifacts
- Preserve the existing engine modules where possible
## Non-Goals
- Multi-iteration search loops
- Search strategies beyond one candidate at a time
- Parallel candidate execution
- Git-backed orchestration
- Multi-agent orchestration
- Changes to the existing training workflow
## Core Decision
The orchestrator will use a candidate sandbox.
It will not mutate the main working tree directly during candidate evaluation.
Why:
- Baseline semantics become simple and reliable
- Mutation validation can compare baseline artifacts to candidate artifacts
- Discard and crash are cheap and safe because the sandbox can be deleted
- Keep becomes an explicit sync of allowed artifact files back into the main workspace
This is more robust than in-place mutation with rollback.
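A minimal sketch of the sandbox lifecycle, assuming a plain `shutil`-based copy (the function names here are illustrative, not the engine's actual API):

```python
import shutil
import tempfile
from pathlib import Path


def create_sandbox(workspace: Path) -> Path:
    """Copy the workspace into a disposable temporary directory."""
    root = Path(tempfile.mkdtemp(prefix="candidate-"))
    sandbox = root / "workspace"
    shutil.copytree(workspace, sandbox)
    return sandbox


def delete_sandbox(sandbox: Path) -> None:
    """Discard and crash are cheap: just remove the directory tree."""
    shutil.rmtree(sandbox.parent, ignore_errors=True)
```

Because the sandbox is the only thing mutated, "rollback" is never needed: the main workspace is untouched until an explicit keep.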
## High-Level Architecture
The existing engine remains the foundation.
The new orchestrator adds one new layer:
- `task_loader`: load and validate task spec
- `artifact_manager`: snapshot, restore, diff, managed-root behavior
- `mutation_engine`: validate candidate changes against mutation budget
- `runner`: execute commands
- `scorer`: parse score output
- `decision_engine`: compare baseline vs candidate and apply policy
- `orchestrator`: create sandbox, run mutator, validate candidate, run candidate, decide keep or discard, sync accepted artifacts
## Runtime Model
Each single iteration works with three states:
### 1. Baseline Artifacts
These are the current accepted artifact files in the main workspace, limited to the task's allowed artifact set.
They are the source of truth for:
- mutation validation
- candidate comparison
- keep or discard decisions
### 2. Candidate Workspace
This is a temporary sandbox directory that contains a candidate version of the repository or task workspace.
The candidate workspace is where:
- mutator commands run
- task runner commands run
- scorer commands run
The candidate workspace is disposable.
### 3. Iteration Record
Each iteration produces a structured record containing:
- `task_id`
- `status`
- `reason`
- `baseline_score`
- `candidate_score`
- `diff_summary`
- `mutator_result`
- `runner_result`
- `scorer_result`
This record is the audit trail for a single candidate attempt.
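A possible shape for this record, sketched as a frozen dataclass (the field types are assumptions; the engine may represent the sub-results differently):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class IterationRecord:
    task_id: str
    status: str  # "keep" | "discard" | "crash"
    reason: str
    baseline_score: Optional[float]
    candidate_score: Optional[float]
    diff_summary: dict[str, Any] = field(default_factory=dict)
    mutator_result: Optional[dict[str, Any]] = None
    runner_result: Optional[dict[str, Any]] = None
    scorer_result: Optional[dict[str, Any]] = None
```

Keeping the record frozen matches its role as an audit trail: once an iteration is over, nothing should rewrite its outcome.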
## Task Spec Changes
The current task spec has `mutation`, but it does not describe how a candidate is generated.
The task spec must add a new `mutator` section.
Example:
```yaml
id: skill-quality
description: Optimize one skill file against a deterministic rubric.

artifacts:
  include:
    - fixtures/SKILL.md
  exclude: []
  max_files_per_iteration: 1

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
  max_changed_lines: 20

mutator:
  type: command
  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
  cwd: "tasks/skill-quality"
  timeout_seconds: 60

runner:
  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

scorer:
  type: command
  command: "python scripts/score_skill_task.py --input work/skill-run.json"
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```
### Why `mutator` is separate from `mutation`
- `mutation` describes limits and policy
- `mutator` describes how candidate edits are produced
This separation is required because the engine must distinguish:
- what is allowed
- how a candidate is generated
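One way to see why the split matters: validation consumes only the `mutation` limits, while candidate generation consumes only the `mutator` spec. A hedged sketch of budget enforcement (function and field names are illustrative; `max_files` comes from the `artifacts` section in the example above):

```python
def validate_candidate(diff_stats: dict, mutation: dict,
                       max_files: int) -> tuple[bool, str]:
    """Apply the mutation *policy* to a diff produced by the *mutator*."""
    if diff_stats["files_changed"] > max_files:
        return False, "too many files changed"
    if diff_stats["lines_changed"] > mutation["max_changed_lines"]:
        return False, "changed-line budget exceeded"
    disallowed = [
        f for f in diff_stats["files"]
        if not any(f.endswith(ext) for ext in mutation["allowed_file_types"])
    ]
    if disallowed:
        return False, "disallowed file types: " + ", ".join(disallowed)
    return True, "ok"
```

Note that nothing here knows how the diff was produced; the mutator could be swapped out without touching the policy check.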
## Proposed Model Changes
The following additions are needed in `engine/models.py`:
- `MutatorSpec`
- optional `MutatorResult`
- optional `IterationRecord`
Suggested shape:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutatorSpec:
    type: str
    command: str
    cwd: str
    timeout_seconds: int
```
The current task model should be extended so `TaskSpec` includes `mutator`.
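Loading the new section could look roughly like this (a sketch, assuming the task YAML has already been parsed into a dict; `parse_mutator` and its defaults are illustrative, not the real loader API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutatorSpec:
    type: str
    command: str
    cwd: str
    timeout_seconds: int


def parse_mutator(raw: dict) -> MutatorSpec:
    """Build a MutatorSpec from the task spec's `mutator` mapping."""
    return MutatorSpec(
        type=raw.get("type", "command"),
        command=raw["command"],  # required: a mutator without a command is meaningless
        cwd=raw.get("cwd", "."),
        timeout_seconds=int(raw.get("timeout_seconds", 60)),
    )
```

Keeping `mutator` optional on `TaskSpec` preserves backward compatibility with tasks that are evaluation-only.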
## Orchestration Flow
The baseline-aware single iteration runs as follows:
1. Load task spec
2. Capture baseline artifact snapshot from the main workspace
3. Create a candidate sandbox directory
4. Copy the required workspace into the sandbox
5. Run the `mutator` inside the sandbox
6. Compare baseline artifacts to candidate artifacts
7. Run mutation validation using the baseline snapshot and the candidate artifact state
8. If mutation validation fails:
- mark `discard`
- do not modify main workspace
- write iteration record
- delete sandbox
9. Run the task `runner` inside the sandbox
10. If runner fails:
- mark `crash`
- do not modify main workspace
- write iteration record
- delete sandbox
11. Run the `scorer` inside the sandbox or against sandbox outputs
12. If scoring fails:
- mark `crash`
- do not modify main workspace
- write iteration record
- delete sandbox
13. Compare candidate vs baseline using `decision_engine`
14. If decision is `keep`:
- copy allowed artifact files from sandbox back to main workspace
- write iteration record
15. If decision is `discard`:
- do not modify main workspace
- write iteration record
16. Delete sandbox
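The sixteen steps above can be compressed into a control-flow sketch. Every operation is injected as a callable so the sketch stays independent of the real engine modules (the `ops` keys are stand-ins, not actual module names):

```python
from typing import Callable


def run_iteration(ops: dict[str, Callable]) -> str:
    """One baseline-aware iteration: sandbox, mutate, validate, run, score, decide."""
    baseline = ops["snapshot"]()               # step 2: baseline artifacts
    sandbox = ops["create_sandbox"]()          # steps 3-4: disposable copy
    try:
        ops["mutate"](sandbox)                 # step 5: produce candidate edit
        if not ops["validate"](baseline, sandbox):   # steps 6-8: budget check
            return "discard"
        if not ops["run"](sandbox):            # steps 9-10: task runner
            return "crash"
        score = ops["score"](sandbox)          # steps 11-12: scorer
        if score is None:
            return "crash"
        if ops["decide"](baseline, score):     # step 13: policy comparison
            ops["sync"](sandbox)               # step 14: keep -> sync artifacts
            return "keep"
        return "discard"                       # step 15
    finally:
        ops["delete_sandbox"](sandbox)         # step 16: always clean up
```

The `finally` block is the point: the sandbox is deleted on every path, so crash handling never needs workspace recovery.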
## Keep, Discard, and Crash Semantics
### Keep
- Candidate is accepted
- Only allowed artifact files are copied from sandbox to main workspace
- Main workspace becomes the new accepted baseline
### Discard
- Candidate is rejected
- Main workspace remains unchanged
- Sandbox is deleted
### Crash
- Candidate evaluation failed
- Main workspace remains unchanged
- Sandbox is deleted
- CLI should return non-zero
These semantics are stricter and simpler than in-place mutation with rollback.
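Since crash must return non-zero while keep and discard are both successful evaluations, one plausible exit-code mapping is (the specific codes are an assumption, not a fixed contract):

```python
EXIT_CODES = {
    "keep": 0,     # candidate accepted, baseline advanced
    "discard": 0,  # evaluation succeeded, candidate rejected
    "crash": 1,    # evaluation itself failed; callers must notice
}


def exit_code_for(status: str) -> int:
    """Map an iteration status to a CLI exit code; unknown statuses fail loudly."""
    return EXIT_CODES.get(status, 1)
```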
## Candidate Sync Rules
Only files in the allowed artifact set may be copied back from sandbox to main workspace.
The orchestrator must not sync:
- runner outputs outside allowed artifacts
- task logs
- scorer temp files
- unrelated files inside the sandbox
This prevents accidental expansion of the accepted state.
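A sync-back sketch that enforces these rules by construction, copying only matches of the include patterns (names and the glob-based matching are assumptions about how the artifact set is expressed):

```python
import fnmatch
import shutil
from pathlib import Path


def sync_accepted_artifacts(sandbox: Path, workspace: Path,
                            include: list[str],
                            exclude: list[str]) -> list[str]:
    """Copy only allowed artifact files back; everything else stays behind."""
    synced = []
    for pattern in include:
        for src in sandbox.glob(pattern):
            rel = src.relative_to(sandbox)
            if any(fnmatch.fnmatch(str(rel), ex) for ex in exclude):
                continue
            dst = workspace / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            synced.append(str(rel))
    return synced
```

Because the copy is driven by the include list rather than by "whatever changed in the sandbox", runner outputs, logs, and scorer temp files can never leak into the accepted state.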
## Path Model
The orchestrator must make path anchoring explicit.
Recommended rules:
- Task-relative paths resolve from the task file location
- Repo-managed paths resolve from repository root
- Candidate execution paths resolve inside the sandbox root
- Iteration results may still be logged in the main workspace when they are intended as global orchestration output rather than candidate state
This avoids the `Path.cwd()` ambiguity that previously caused path fragility.
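The rules above reduce to one discipline: every relative path is joined to an explicit anchor, never to the process working directory. A minimal sketch (the anchor names in the comments are assumptions):

```python
from pathlib import Path


def resolve(raw: str, anchor: Path) -> Path:
    """Resolve a spec path against an explicit anchor instead of Path.cwd()."""
    p = Path(raw)
    return p if p.is_absolute() else anchor / p

# Illustrative anchors, one per rule:
#   resolve("fixtures/SKILL.md", task_dir)   -> task-relative path
#   resolve("work/skill-run.json", repo_root) -> repo-managed path
#   resolve(runner_cwd, sandbox_root)         -> candidate execution path
```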
## Testing Strategy
The minimum required tests for this orchestrator are:
### Keep Case
- candidate mutates one allowed artifact
- candidate scores better than baseline
- artifact is synced back to main workspace
### Discard Case
- candidate mutates one allowed artifact
- candidate does not improve
- main workspace remains unchanged
### Crash Case
- mutator, runner, or scorer fails
- main workspace remains unchanged
- CLI returns non-zero for crash
### Mutation Budget Case
- candidate exceeds allowed file count or changed line budget
- candidate is discarded before runner execution
### Path Isolation Case
- orchestrator runs from outside repo root
- absolute task path still works
These tests are required for the orchestrator to be considered usable.
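The discard case, for instance, could be pinned down along these lines: mutate a copy, delete the copy, and assert the original is byte-identical (a self-contained sketch, not the real test suite):

```python
import shutil
import tempfile
from pathlib import Path


def test_discard_leaves_workspace_unchanged():
    workspace = Path(tempfile.mkdtemp())
    artifact = workspace / "SKILL.md"
    artifact.write_text("baseline content")

    # Candidate sandbox: copy, mutate, then discard without syncing back.
    sandbox = Path(tempfile.mkdtemp())
    shutil.copy2(artifact, sandbox / "SKILL.md")
    (sandbox / "SKILL.md").write_text("worse candidate")
    shutil.rmtree(sandbox)  # discard: delete the sandbox only

    assert artifact.read_text() == "baseline content"
```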
## Recommended File Changes
The next implementation phase should touch only the following areas:
- `engine/models.py`
- `engine/task_loader.py`
- `engine/mutation_engine.py`
- new `engine/orchestrator.py`
- `scripts/run_task.py`
- `tasks/skill-quality/task.yaml`
- tests for orchestration and sync-back behavior
This keeps the scope focused on adding one reliable orchestration layer.
## Open Questions Deferred
The following topics are explicitly deferred:
- Multi-iteration looping
- Search strategies
- Branch-per-candidate workflows
- Git-backed isolation
- Parallel execution
- Human review checkpoints
These are all downstream of the single-iteration orchestrator.
## Recommendation
Proceed with a candidate-sandbox baseline-aware orchestrator as the next implementation step.
Do not extend directly to multi-iteration autonomy yet. First make one iteration correct, isolated, and safe.