docs: add baseline-aware orchestrator design spec
Commit 2d2e89eed4 (parent 6560cdde97)

# Baseline-Aware Single-Iteration Orchestrator Design

Date: 2026-04-02

## Summary

This document defines the next-stage design for the Artifact Loop Engine: a baseline-aware single-iteration orchestrator.

The current engine can load a task, run it, score it, decide keep or discard, and log the result. However, it is still too close to a task runner. To become a usable optimizer, it must manage three things correctly:

1. A stable baseline
2. A candidate workspace isolated from the main working tree
3. A deterministic keep or discard transition

This design adds those guarantees without yet introducing a multi-iteration loop.

## Problem Statement

The current implementation has a usable evaluation pipeline, but it is missing the orchestration layer that makes optimization safe and meaningful.

Current limitations:

- The CLI can run and score a task, but it is not responsible for generating a candidate edit
- Mutation validation cannot be applied meaningfully in the current CLI flow because there is no pre-edit baseline to compare a candidate against
- Keep or discard semantics are still too close to "run in place and recover" rather than "evaluate an isolated candidate and selectively adopt the result"
- The current design is therefore closer to an evaluation runner than a true optimizer

To satisfy the intended use case, the engine must support one complete optimization iteration with real candidate generation, validation, execution, scoring, and state transition.

## Goals

- Add a true baseline-aware orchestration layer
- Support one complete candidate iteration from mutation to keep or discard
- Keep the main workspace stable unless a candidate is explicitly accepted
- Make mutation validation meaningful by comparing baseline artifacts to candidate artifacts
- Preserve the existing engine modules where possible

## Non-Goals

- Multi-iteration search loops
- Search strategies beyond one candidate at a time
- Parallel candidate execution
- Git-backed orchestration
- Multi-agent orchestration
- Changes to the existing training workflow

## Core Decision

The orchestrator will use a candidate sandbox.

It will not mutate the main working tree directly during candidate evaluation.

Why:

- Baseline semantics become simple and reliable
- Mutation validation can compare baseline artifacts to candidate artifacts
- Discard and crash are cheap and safe because the sandbox can be deleted
- Keep becomes an explicit sync of allowed artifact files back into the main workspace

This is more robust than in-place mutation with rollback.

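
To make the sandbox idea concrete, here is a minimal sketch of a disposable candidate workspace. The helper name `candidate_sandbox` and the choice of a full directory copy via `tempfile` and `shutil` are assumptions for illustration; the design does not prescribe how the sandbox is created.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def candidate_sandbox(workspace: Path) -> Iterator[Path]:
    """Yield a disposable copy of `workspace` and always delete it on exit.

    Hypothetical helper: the spec only requires that candidates never run
    in the main working tree.
    """
    root = Path(tempfile.mkdtemp(prefix="candidate-"))
    sandbox = root / "workspace"
    try:
        shutil.copytree(workspace, sandbox)
        yield sandbox
    finally:
        # Discard and crash both reduce to removing the sandbox.
        shutil.rmtree(root, ignore_errors=True)
```

Because cleanup sits in `finally`, an exception raised while evaluating the candidate still leaves the main workspace untouched and the sandbox removed.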
## High-Level Architecture

The existing engine remains the foundation.

The new design adds one layer, `orchestrator`, on top of the existing modules:

- `task_loader`: load and validate task spec
- `artifact_manager`: snapshot, restore, diff, managed-root behavior
- `mutation_engine`: validate candidate changes against mutation budget
- `runner`: execute commands
- `scorer`: parse score output
- `decision_engine`: compare baseline vs candidate and apply policy
- `orchestrator`: create sandbox, run mutator, validate candidate, run candidate, decide keep or discard, sync accepted artifacts

## Runtime Model

Each single iteration works with three states:

### 1. Baseline Artifacts

These are the current accepted artifact files in the main workspace, limited to the task's allowed artifact set.

They are the source of truth for:

- mutation validation
- candidate comparison
- keep or discard decisions

### 2. Candidate Workspace

This is a temporary sandbox directory that contains a candidate version of the repository or task workspace.

The candidate workspace is where:

- mutator commands run
- task runner commands run
- scorer commands run

The candidate workspace is disposable.

### 3. Iteration Record

Each iteration produces a structured record containing:

- `task_id`
- `status`
- `reason`
- `baseline_score`
- `candidate_score`
- `diff_summary`
- `mutator_result`
- `runner_result`
- `scorer_result`

This record is the audit trail for a single candidate attempt.

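
One possible shape for the iteration record is sketched below. The field names come from the list above; the field types, the defaults, and the `to_json_line` helper are assumptions, since the spec does not fix them.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class IterationRecord:
    task_id: str
    status: str  # "keep", "discard", or "crash"
    reason: str
    # Scores are optional because a crash may happen before scoring.
    baseline_score: Optional[float] = None
    candidate_score: Optional[float] = None
    # Assumed to be plain dicts so the record serializes without extra code.
    diff_summary: dict[str, Any] = field(default_factory=dict)
    mutator_result: Optional[dict[str, Any]] = None
    runner_result: Optional[dict[str, Any]] = None
    scorer_result: Optional[dict[str, Any]] = None

    def to_json_line(self) -> str:
        """Serialize to one JSON line (hypothetical log format)."""
        return json.dumps(asdict(self), sort_keys=True)
```

Writing one JSON object per line keeps the audit trail append-only and easy to filter by `status`.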
## Task Spec Changes

The current task spec has `mutation`, but it does not describe how a candidate is generated.

The task spec must add a new `mutator` section.

Example:

```yaml
id: skill-quality
description: Optimize one skill file against a deterministic rubric.

artifacts:
  include:
    - fixtures/SKILL.md
  exclude: []
  max_files_per_iteration: 1

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
  max_changed_lines: 20

mutator:
  type: command
  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
  cwd: "tasks/skill-quality"
  timeout_seconds: 60

runner:
  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

scorer:
  type: command
  command: "python scripts/score_skill_task.py --input work/skill-run.json"
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```

### Why `mutator` is separate from `mutation`

- `mutation` describes limits and policy
- `mutator` describes how candidate edits are produced

This separation is required because the engine must distinguish:

- what is allowed
- how a candidate is generated

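
To illustrate the "what is allowed" side, here is a sketch of a budget check driven by the `mutation` limits. The function names, the in-memory `path -> text` mapping, and the definition of "changed lines" as removed plus added lines are all assumptions; the existing `mutation_engine` may count differently.

```python
import difflib
from pathlib import PurePosixPath
from typing import Mapping, Sequence


def count_changed_lines(before: str, after: str) -> int:
    """Count removed plus added lines between two versions of a file."""
    matcher = difflib.SequenceMatcher(
        None, before.splitlines(), after.splitlines(), autojunk=False
    )
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def validate_mutation(
    baseline: Mapping[str, str],
    candidate: Mapping[str, str],
    allowed_file_types: Sequence[str],
    max_files: int,
    max_changed_lines: int,
) -> tuple[bool, str]:
    """Compare baseline and candidate artifact text against the budget."""
    changed = sorted(
        path
        for path in set(baseline) | set(candidate)
        if baseline.get(path) != candidate.get(path)
    )
    if len(changed) > max_files:
        return False, f"{len(changed)} files changed, limit is {max_files}"
    for path in changed:
        if PurePosixPath(path).suffix not in allowed_file_types:
            return False, f"file type not allowed: {path}"
    total = sum(
        count_changed_lines(baseline.get(p, ""), candidate.get(p, ""))
        for p in changed
    )
    if total > max_changed_lines:
        return False, f"{total} lines changed, limit is {max_changed_lines}"
    return True, "ok"
```

The check needs both the baseline and the candidate text, which is why it only becomes meaningful once a pre-edit baseline snapshot exists.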
## Proposed Model Changes

The following additions are needed in `engine/models.py`:

- `MutatorSpec`
- optional `MutatorResult`
- optional `IterationRecord`

Suggested shape:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutatorSpec:
    type: str
    command: str
    cwd: str
    timeout_seconds: int
```

The current task model should be extended so `TaskSpec` includes `mutator`.

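
As a sketch of how `task_loader` might turn the `mutator` section into a `MutatorSpec`, consider the following. The `parse_mutator` function and its validation rules are hypothetical; in particular, accepting only `type: command` is an assumption based on the single example in this document.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class MutatorSpec:
    type: str
    command: str
    cwd: str
    timeout_seconds: int


def parse_mutator(raw: Mapping[str, Any]) -> MutatorSpec:
    """Validate the raw `mutator` mapping and build a MutatorSpec."""
    required = ("type", "command", "cwd", "timeout_seconds")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"mutator is missing fields: {', '.join(missing)}")
    if raw["type"] != "command":
        # Assumption: only command-based mutators exist at this stage.
        raise ValueError(f"unsupported mutator type: {raw['type']!r}")
    timeout = int(raw["timeout_seconds"])
    if timeout <= 0:
        raise ValueError("mutator timeout_seconds must be positive")
    return MutatorSpec(
        type=raw["type"],
        command=str(raw["command"]),
        cwd=str(raw["cwd"]),
        timeout_seconds=timeout,
    )
```

Failing at load time keeps a malformed spec from reaching the sandbox at all.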
## Orchestration Flow

The baseline-aware single iteration runs as follows:

1. Load task spec
2. Capture baseline artifact snapshot from the main workspace
3. Create a candidate sandbox directory
4. Copy the required workspace into the sandbox
5. Run the `mutator` inside the sandbox
6. Compare baseline artifacts to candidate artifacts
7. Run mutation validation using the baseline snapshot and the candidate artifact state
8. If mutation validation fails:
    - mark `discard`
    - do not modify main workspace
    - write iteration record
    - delete sandbox
9. Run the task `runner` inside the sandbox
10. If runner fails:
    - mark `crash`
    - do not modify main workspace
    - write iteration record
    - delete sandbox
11. Run the `scorer` inside the sandbox or against sandbox outputs
12. If scoring fails:
    - mark `crash`
    - do not modify main workspace
    - write iteration record
    - delete sandbox
13. Compare candidate vs baseline using `decision_engine`
14. If decision is `keep`:
    - copy allowed artifact files from sandbox back to main workspace
    - write iteration record
15. If decision is `discard`:
    - do not modify main workspace
    - write iteration record
16. Delete sandbox

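
The steps above can be sketched as a single function. This is an illustration under stated assumptions, not the planned `engine/orchestrator.py`: the mutator, validator, runner, and scorer are injected as plain callables, the baseline score is passed in, a higher score is treated as better, a mutator failure is treated as `crash` (as the Crash Case test suggests), and writing the iteration record is left out.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Outcome:
    status: str  # "keep", "discard", or "crash"
    reason: str
    candidate_score: Optional[float] = None


def run_single_iteration(
    workspace: Path,
    artifacts: list[str],
    baseline_score: float,
    mutate: Callable[[Path], bool],
    validate: Callable[[dict[str, str], dict[str, str]], bool],
    run: Callable[[Path], bool],
    score: Callable[[Path], Optional[float]],
) -> Outcome:
    def read(root: Path) -> dict[str, str]:
        return {
            rel: (root / rel).read_text()
            for rel in artifacts
            if (root / rel).exists()
        }

    baseline = read(workspace)                          # step 2
    tmp = Path(tempfile.mkdtemp(prefix="candidate-"))   # step 3
    sandbox = tmp / "workspace"
    try:
        shutil.copytree(workspace, sandbox)             # step 4
        if not mutate(sandbox):                         # step 5
            return Outcome("crash", "mutator failed")
        if not validate(baseline, read(sandbox)):       # steps 6-8
            return Outcome("discard", "mutation validation failed")
        if not run(sandbox):                            # steps 9-10
            return Outcome("crash", "runner failed")
        candidate_score = score(sandbox)                # steps 11-12
        if candidate_score is None:
            return Outcome("crash", "scoring failed")
        if candidate_score <= baseline_score:           # steps 13, 15
            return Outcome("discard", "no improvement", candidate_score)
        for rel in artifacts:                           # step 14
            if (sandbox / rel).exists():
                shutil.copy2(sandbox / rel, workspace / rel)
        return Outcome("keep", "candidate improved", candidate_score)
    finally:
        shutil.rmtree(tmp, ignore_errors=True)          # step 16
```

Every early return passes through `finally`, so the sandbox is deleted on all paths and the main workspace is written only in the `keep` branch.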
## Keep, Discard, and Crash Semantics

### Keep

- Candidate is accepted
- Only allowed artifact files are copied from sandbox to main workspace
- Main workspace becomes the new accepted baseline

### Discard

- Candidate is rejected
- Main workspace remains unchanged
- Sandbox is deleted

### Crash

- Candidate evaluation failed
- Main workspace remains unchanged
- Sandbox is deleted
- CLI should return non-zero

These semantics are stricter and simpler than in-place mutation with rollback.

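
A small sketch of how the three outcomes might map to CLI behavior follows. The `Status` enum and both helpers are hypothetical. The spec only states that a crash returns non-zero; returning `0` for both `keep` and `discard` is an assumption, on the reasoning that a rejected candidate is a normal result rather than an error.

```python
from enum import Enum


class Status(str, Enum):
    KEEP = "keep"
    DISCARD = "discard"
    CRASH = "crash"


def exit_code(status: Status) -> int:
    """Return the process exit code for an iteration outcome."""
    # Assumption: only a crash is an error from the CLI's point of view.
    return 1 if status is Status.CRASH else 0


def modifies_main_workspace(status: Status) -> bool:
    """Only an accepted candidate may change the main workspace."""
    return status is Status.KEEP
```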
## Candidate Sync Rules

Only files in the allowed artifact set may be copied back from sandbox to main workspace.

The orchestrator must not sync:

- runner outputs outside allowed artifacts
- task logs
- scorer temp files
- unrelated files inside the sandbox

This prevents accidental expansion of the accepted state.

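
The sync rule can be sketched as an allowlist copy: iterate over the allowed artifact paths rather than over the sandbox contents, so nothing else can leak back. The function name `sync_allowed_artifacts` and the extra guard against paths that escape the sandbox are assumptions for illustration.

```python
import shutil
from pathlib import Path
from typing import Iterable


def sync_allowed_artifacts(
    sandbox: Path, workspace: Path, allowed: Iterable[str]
) -> list[str]:
    """Copy only allowed artifact files from sandbox to workspace.

    Returns the relative paths that were actually copied.
    """
    sandbox_root = sandbox.resolve()
    synced = []
    for rel in allowed:
        src = (sandbox_root / rel).resolve()
        # Skip anything that resolves outside the sandbox or is not a file.
        if not src.is_relative_to(sandbox_root) or not src.is_file():
            continue
        dst = workspace / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        synced.append(rel)
    return synced
```

Runner outputs, logs, and temp files are never considered because the loop is driven by the allowed set, not by a directory walk. `Path.is_relative_to` requires Python 3.9 or later.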
## Path Model

The orchestrator must make path anchoring explicit.

Recommended rules:

- Task-relative paths resolve from the task file location
- Repo-managed paths resolve from repository root
- Candidate execution paths resolve inside the sandbox root
- Results logging for the current iteration may remain in the main workspace if it is intended as global orchestration output

This avoids the `Path.cwd()` ambiguity that previously caused path fragility.

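
A minimal sketch of explicit anchoring, assuming two hypothetical helpers: one resolves a task-relative path from the task file's directory, the other maps a repository path to the same relative location inside the sandbox. Neither reads `Path.cwd()` for its result when given absolute inputs.

```python
from pathlib import Path


def resolve_task_relative(task_file: Path, relative: str) -> Path:
    """Resolve `relative` against the directory containing the task file."""
    return (task_file.resolve().parent / relative).resolve()


def resolve_in_sandbox(repo_root: Path, sandbox_root: Path, path: Path) -> Path:
    """Map a path under `repo_root` to the same location under `sandbox_root`.

    Raises ValueError if `path` is not inside `repo_root`.
    """
    relative = path.resolve().relative_to(repo_root.resolve())
    return sandbox_root / relative
```

With absolute inputs, both helpers return the same result regardless of the directory the orchestrator was started from, which is the property the Path Isolation test case is meant to check.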
## Testing Strategy

The minimum required tests for this orchestrator are:

### Keep Case

- candidate mutates one allowed artifact
- candidate scores better than baseline
- artifact is synced back to main workspace

### Discard Case

- candidate mutates one allowed artifact
- candidate does not improve
- main workspace remains unchanged

### Crash Case

- mutator, runner, or scorer fails
- main workspace remains unchanged
- CLI returns non-zero for crash

### Mutation Budget Case

- candidate exceeds allowed file count or changed line budget
- candidate is discarded before runner execution

### Path Isolation Case

- orchestrator runs from outside repo root
- absolute task path still works

These tests are required for the orchestrator to be considered usable.

## Recommended File Changes

The next implementation phase should touch only the following areas:

- `engine/models.py`
- `engine/task_loader.py`
- `engine/mutation_engine.py`
- new `engine/orchestrator.py`
- `scripts/run_task.py`
- `tasks/skill-quality/task.yaml`
- tests for orchestration and sync-back behavior

This keeps the scope focused on adding one reliable orchestration layer.

## Open Questions Deferred

The following topics are explicitly deferred:

- Multi-iteration looping
- Search strategies
- Branch-per-candidate workflows
- Git-backed isolation
- Parallel execution
- Human review checkpoints

These are all downstream of the single-iteration orchestrator.

## Recommendation

Proceed with a candidate-sandbox baseline-aware orchestrator as the next implementation step.

Do not extend directly to multi-iteration autonomy yet. First make one iteration correct, isolated, and safe.