Baseline-Aware Single-Iteration Orchestrator Usage

What It Does

The Artifact Loop Engine now runs one baseline-aware optimization iteration in a sandbox:

  1. Load a task spec.
  2. Snapshot the current accepted artifact baseline.
  3. Copy the repo into a temporary sandbox.
  4. Run the task mutator in the sandbox.
  5. Validate candidate changes against mutation limits.
  6. Run and score the candidate in the sandbox.
  7. Keep or discard the candidate.
  8. Sync back only allowed artifact files on keep.

The main workspace stays unchanged on discard and crash.
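
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: run_iteration, its parameters, the "candidate improved" reason string, and the "higher score is better" comparison are all hypothetical, and step 5 (mutation-limit validation) is left out for brevity.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def run_iteration(
    repo: Path,
    artifacts: List[str],
    mutate: Callable[[Path], object],
    score: Callable[[Path], float],
    baseline_score: Optional[float],
) -> dict:
    """Run one keep/discard/crash iteration against a sandbox copy of `repo`."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp) / "repo"
        shutil.copytree(repo, sandbox)  # step 3: copy the repo into a sandbox
        try:
            mutate(sandbox)             # step 4: mutate inside the sandbox only
            candidate = score(sandbox)  # step 6: run and score the candidate
        except Exception as exc:        # any failure counts as a crash
            return {"status": "crash", "reason": str(exc)}

        # Step 7: assumption, a candidate is kept only if it beats the baseline.
        if baseline_score is not None and candidate <= baseline_score:
            return {
                "status": "discard",
                "reason": "candidate did not improve",
                "candidate_score": candidate,
            }

        # Step 8: sync back only the allowed artifact files.
        for rel in artifacts:
            shutil.copy2(sandbox / rel, repo / rel)
        reason = "no baseline available" if baseline_score is None else "candidate improved"
        return {"status": "keep", "reason": reason, "candidate_score": candidate}
```

Because the sandbox lives in a temporary directory and only the listed artifact paths are copied back, a discard or crash leaves the main workspace untouched in this sketch.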

Quick Start

Run the sample task from the repo root:

uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml

Expected behavior:

  • The command prints one JSON record to stdout.
  • A matching JSON line is appended to work/results.jsonl.
  • tasks/skill-quality/fixtures/SKILL.md is updated only if the candidate is kept.

Example result:

{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}

Task Schema

A task file must include these sections:

  • id
  • description
  • artifacts
  • mutation
  • mutator
  • runner
  • scorer
  • objective
  • constraints
  • policy
  • budget
  • logging
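
A loader could check for the required sections with something like the sketch below. The helper name missing_sections is hypothetical, and the task is assumed to have already been parsed from YAML into a plain dict.

```python
from typing import List, Mapping

# Section names taken from the list above.
REQUIRED_SECTIONS = (
    "id", "description", "artifacts", "mutation", "mutator", "runner",
    "scorer", "objective", "constraints", "policy", "budget", "logging",
)


def missing_sections(task: Mapping[str, object]) -> List[str]:
    """Return the required top-level sections that are absent from `task`."""
    return [name for name in REQUIRED_SECTIONS if name not in task]
```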

Important runtime fields:

mutator:
  type: command
  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

runner:
  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

scorer:
  type: command
  command: "python scripts/score_skill_task.py --input work/skill-run.json"
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics

Path Rules

  • task.root_dir is the directory containing task.yaml.
  • artifacts.include paths are resolved relative to the task directory.
  • mutator.cwd and runner.cwd are repo-relative paths.
  • Absolute cwd values are rejected.
  • .. segments in cwd are rejected.
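
The two cwd rejection rules can be expressed as a small check. The function name validate_cwd is hypothetical, and the sketch assumes cwd values are written with forward slashes, as in the examples above.

```python
from pathlib import PurePosixPath


def validate_cwd(cwd: str) -> str:
    """Return the normalized repo-relative cwd, or raise ValueError."""
    path = PurePosixPath(cwd)
    if path.is_absolute():
        raise ValueError(f"absolute cwd is not allowed: {cwd}")
    if ".." in path.parts:
        raise ValueError(f"'..' segments are not allowed in cwd: {cwd}")
    return str(path)
```

For example, validate_cwd("tasks/skill-quality") is accepted, while "/tmp/x" and "tasks/../secrets" are both rejected.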

Keep, Discard, Crash

Keep

  • Candidate is accepted.
  • Only allowed artifact files are copied back into the main workspace.

Discard

  • Candidate is rejected.
  • Main workspace remains unchanged.

Crash

  • Mutator, runner, or scorer execution failed.
  • Main workspace remains unchanged.
  • The CLI exits with a non-zero status.

Validation Rules

The orchestrator rejects a candidate before runner execution when:

  • the changed file count exceeds artifacts.max_files_per_iteration
  • the changed line count exceeds mutation.max_changed_lines
  • a changed file has a type that is not allowed
  • a non-artifact file was mutated

The orchestrator also revalidates artifact state before sync-back on keep, so later runner or scorer edits cannot bypass mutation limits.
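
The four rejection rules could be checked along these lines. This is a sketch only: the Limits dataclass, the validation_errors helper, and the representation of a change set as a mapping from task-relative path to changed line count are assumptions, not the orchestrator's real data model.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, List, Set


@dataclass
class Limits:
    max_files: int              # stands in for artifacts.max_files_per_iteration
    max_changed_lines: int      # stands in for mutation.max_changed_lines
    allowed_suffixes: Set[str]  # hypothetical: allowed file types as suffixes
    artifact_paths: Set[str]    # hypothetical: resolved artifact paths


def validation_errors(changed: Dict[str, int], limits: Limits) -> List[str]:
    """Return a reason for every rule violated by the change set."""
    errors = []
    if len(changed) > limits.max_files:
        errors.append("changed file count exceeds limit")
    if sum(changed.values()) > limits.max_changed_lines:
        errors.append("changed line count exceeds limit")
    for path in sorted(changed):
        if PurePosixPath(path).suffix not in limits.allowed_suffixes:
            errors.append(f"disallowed file type: {path}")
        if path not in limits.artifact_paths:
            errors.append(f"non-artifact file mutated: {path}")
    return errors
```

Calling the same check once after the mutator and again just before sync-back mirrors the revalidation described above.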

Repo Directories Ignored By Sandbox State Checks

These repo-root directories are intentionally ignored during sandbox copy/hash validation:

  • work
  • .venv
  • .pytest_cache

Reason:

  • They are runtime or cache state, not accepted source artifacts.
  • Including them can distort the keep/discard validation or make real runs unnecessarily slow.
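
One way to skip these directories during the sandbox copy is an ignore callback for shutil.copytree that only applies at the repo root. The function name copy_repo is hypothetical; the sketch shows the idea, not the engine's actual copy logic.

```python
import shutil
from pathlib import Path

# Directory names taken from the list above.
IGNORED_ROOT_DIRS = {"work", ".venv", ".pytest_cache"}


def copy_repo(repo: Path, sandbox: Path) -> None:
    """Copy `repo` to `sandbox`, skipping ignored directories at the repo root."""

    def ignore(directory, names):
        # Only filter entries directly under the repo root, so a nested
        # directory that happens to be called "work" is still copied.
        if Path(directory) == repo:
            return [name for name in names if name in IGNORED_ROOT_DIRS]
        return []

    shutil.copytree(repo, sandbox, ignore=ignore)
```

shutil.ignore_patterns would be shorter, but it matches names at every depth, which is broader than the repo-root rule described here.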

Output Record

Each CLI run appends one JSON line to work/results.jsonl with:

  • task_id
  • status
  • reason
  • candidate_score
  • diff_summary

Adding a New Task

  1. Create a task directory under tasks/.
  2. Define the artifact set narrowly.
  3. Set conservative mutation limits first.
  4. Add a deterministic mutator command.
  5. Add a deterministic runner and scorer.
  6. Run scripts/run_task.py directly.
  7. Inspect the latest line in work/results.jsonl.
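
For the last step, the most recent record can be read with a few lines of Python. The helper name latest_result is hypothetical; the sketch assumes the file is UTF-8 and that each non-empty line is one JSON object.

```python
import json
from pathlib import Path
from typing import Optional


def latest_result(path: str = "work/results.jsonl") -> Optional[dict]:
    """Return the last JSON record in the results file, or None if it is empty."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    return json.loads(lines[-1])
```

A typical check after a run would be to call latest_result() from the repo root and look at the status and reason fields.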

Common Failure Cases

status = "discard"

Usually means:

  • mutation budget exceeded
  • disallowed file type
  • non-artifact change detected
  • candidate did not improve

status = "crash"

Usually means:

  • mutator command failed
  • runner command failed
  • scorer command failed
  • scorer output was not parseable
  • configured cwd does not exist in the sandbox

Current Scope

This implementation supports exactly one isolated optimization iteration.

It does not yet implement:

  • multi-iteration search
  • parallel candidate execution
  • git-backed sandboxing
  • branch-per-candidate workflows