# Baseline-Aware Single-Iteration Orchestrator Usage

## What It Does

The Artifact Loop Engine now runs one baseline-aware optimization iteration in a sandbox:

1. Load a task spec.
2. Snapshot the current accepted artifact baseline.
3. Copy the repo into a temporary sandbox.
4. Run the task mutator in the sandbox.
5. Validate candidate changes against mutation limits.
6. Run and score the candidate in the sandbox.
7. Keep or discard the candidate.
8. Sync back only allowed artifact files on `keep`.

The main workspace stays unchanged on `discard` and `crash`.

## Quick Start

Run the sample task from the repo root:

```bash
uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml
```

Expected behavior:

- The command prints one JSON record to stdout.
- A matching JSON line is appended to `work/results.jsonl`.
- `tasks/skill-quality/fixtures/SKILL.md` is updated only if the candidate is kept.

Example result:

```json
{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}
```

## Task Schema

A task file must include these sections:

- `id`
- `description`
- `artifacts`
- `mutation`
- `mutator`
- `runner`
- `scorer`
- `objective`
- `constraints`
- `policy`
- `budget`
- `logging`

Important runtime fields:

```yaml
mutator:
  type: command
  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

runner:
  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

scorer:
  type: command
  command: "python scripts/score_skill_task.py --input work/skill-run.json"
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```

## Path Rules

- `task.root_dir` is the directory containing `task.yaml`.
- `artifacts.include` paths are resolved relative to the task directory.
- `mutator.cwd` and `runner.cwd` are repo-relative paths.
- Absolute `cwd` values are rejected.
- `..` segments in `cwd` are rejected.

## Keep, Discard, Crash

### Keep

- Candidate is accepted.
- Only allowed artifact files are copied back into the main workspace.

### Discard

- Candidate is rejected.
- Main workspace remains unchanged.

### Crash

- Mutator, runner, or scorer execution failed.
- Main workspace remains unchanged.
- CLI exits non-zero.

## Validation Rules

The orchestrator rejects a candidate before runner execution when:

- the changed file count exceeds `artifacts.max_files_per_iteration`
- the changed line count exceeds `mutation.max_changed_lines`
- a changed file has a disallowed file type
- a non-artifact file was mutated

The orchestrator also revalidates artifact state before syncing back on `keep`, so edits made later by the runner or scorer cannot bypass the mutation limits.

## Repo Directories Ignored By Sandbox State Checks

These repo-root directories are intentionally ignored during sandbox copying and hash validation:

- `work`
- `.venv`
- `.pytest_cache`

Reason:

- They hold runtime or cache state, not accepted source artifacts.
- Including them can distort keep/discard validation or make real runs unnecessarily slow.

## Output Record

Each CLI run appends one JSON line to `work/results.jsonl` with:

- `task_id`
- `status`
- `reason`
- `candidate_score`
- `diff_summary`

## Recommended Workflow For Adding A New Task

1. Create a task directory under `tasks/`.
2. Define the artifact set narrowly.
3. Set conservative mutation limits first.
4. Add a deterministic mutator command.
5. Add a deterministic runner and scorer.
6. Run `scripts/run_task.py` directly.
7. Inspect the latest line in `work/results.jsonl`.
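For step 7, the most recent record can be read with a few lines of Python. This is a minimal sketch, assuming each line of `work/results.jsonl` is one JSON object carrying the fields listed under Output Record; `latest_result` is a hypothetical helper, not part of the repo.

```python
import json
from pathlib import Path


def latest_result(path: Path = Path("work/results.jsonl")) -> dict:
    """Hypothetical helper: return the last non-empty JSON record in the results log."""
    # Drop blank lines so a trailing newline does not produce an empty "record".
    lines = [
        line
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    if not lines:
        raise ValueError(f"no records in {path}")
    # The orchestrator appends one line per run, so the last line is the latest run.
    return json.loads(lines[-1])
```

Run from the repo root, `latest_result()["status"]` gives the `keep`, `discard`, or `crash` outcome of the last run, and `reason` explains the decision.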
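The pre-runner checks listed under Validation Rules can be illustrated in Python. This is a hypothetical sketch, not the orchestrator's code: `MutationLimits`, `validate_candidate`, and the allowed-suffix set are assumptions; changed lines are counted as added plus removed lines from a unified diff and summed across all changed files; and the order of the checks is chosen arbitrarily. The input maps each changed path to its `(before, after)` text.

```python
from __future__ import annotations

import difflib
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class MutationLimits:
    # Hypothetical mirror of artifacts.max_files_per_iteration,
    # mutation.max_changed_lines, and an allowed file-type list.
    max_files: int
    max_changed_lines: int
    allowed_suffixes: frozenset[str]


def changed_line_count(before: str, after: str) -> int:
    """Count added plus removed lines between two versions of a file."""
    lines = list(
        difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    )
    # When any difference exists, the first two lines are the ---/+++ headers.
    body = lines[2:]
    # Hunk headers start with "@@" and context lines with " ", so they are skipped.
    return sum(1 for line in body if line[:1] in {"+", "-"})


def validate_candidate(
    changes: dict[str, tuple[str, str]],
    artifact_paths: set[str],
    limits: MutationLimits,
) -> str | None:
    """Return a rejection reason, or None if the candidate may proceed to the runner."""
    if len(changes) > limits.max_files:
        return "changed file count exceeds limit"
    for path in changes:
        if path not in artifact_paths:
            return f"non-artifact file mutated: {path}"
        if PurePosixPath(path).suffix not in limits.allowed_suffixes:
            return f"disallowed file type: {path}"
    total = sum(changed_line_count(b, a) for b, a in changes.values())
    if total > limits.max_changed_lines:
        return "changed line count exceeds limit"
    return None
```

In this sketch a `None` result lets the candidate continue to the runner, while any returned string corresponds to an early `discard`.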
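The `cwd` rules under Path Rules, together with the crash case where a configured `cwd` does not exist in the sandbox, can be sketched as a small resolver. `resolve_cwd` is a hypothetical helper that assumes forward-slash `cwd` strings; the real orchestrator may perform these checks differently.

```python
from pathlib import Path, PurePosixPath


def resolve_cwd(repo_root: Path, cwd: str) -> Path:
    """Hypothetical helper: resolve a repo-relative cwd inside a sandbox root."""
    candidate = PurePosixPath(cwd)
    # Absolute cwd values are rejected outright.
    if candidate.is_absolute():
        raise ValueError(f"absolute cwd not allowed: {cwd}")
    # PurePosixPath keeps '..' segments as-is, so they can be detected directly.
    if ".." in candidate.parts:
        raise ValueError(f"'..' segment not allowed in cwd: {cwd}")
    resolved = repo_root.joinpath(*candidate.parts)
    # A missing directory maps to the "cwd does not exist in the sandbox" crash.
    if not resolved.is_dir():
        raise FileNotFoundError(f"cwd does not exist: {resolved}")
    return resolved
```

Rejecting `..` by segment, rather than normalizing the path first, keeps the rule strict: even a path that would normalize back inside the repo is refused.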
## Common Failure Cases

### `status = "discard"`

Usually means:

- mutation budget exceeded
- disallowed file type
- non-artifact change detected
- candidate did not improve

### `status = "crash"`

Usually means:

- mutator command failed
- runner command failed
- scorer command failed
- scorer output was not parseable
- configured `cwd` does not exist in the sandbox

## Current Scope

This implementation supports exactly one isolated optimization iteration.

It does not yet implement:

- multi-iteration search
- parallel candidate execution
- git-backed sandboxing
- branch-per-candidate workflows