Doni/CommonAutoRearsh

sladro 159d816d7f feat: add baseline-aware artifact loop orchestration

2026-04-02 17:30:26 +08:00

Baseline-Aware Single-Iteration Orchestrator Usage

What It Does

The Artifact Loop Engine now runs one baseline-aware optimization iteration in a sandbox:

Load a task spec.
Snapshot the current accepted artifact baseline.
Copy the repo into a temporary sandbox.
Run the task mutator in the sandbox.
Validate candidate changes against mutation limits.
Run and score the candidate in the sandbox.
Keep or discard the candidate.
Sync back only allowed artifact files on keep.

The main workspace stays unchanged on discard and crash.
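
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: `run_iteration`, its parameters, and the "candidate improved" reason are hypothetical names, the mutator and scorer are plain callables rather than shell commands, higher scores are assumed to be better, and the mutation-limit validation step is left out.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_iteration(
    repo: Path,
    artifacts: list[str],
    mutate: Callable[[Path], None],
    score: Callable[[Path], float],
    baseline_score: Optional[float],
) -> dict:
    """Run one sandboxed iteration; touch `repo` only when the candidate is kept."""
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp) / "sandbox"
        shutil.copytree(repo, sandbox)  # work on a copy, never on the repo itself
        try:
            mutate(sandbox)
            candidate = score(sandbox)
        except Exception as exc:  # any mutator/scorer failure counts as a crash
            return {"status": "crash", "reason": str(exc), "candidate_score": None}

        # Assumption: higher scores are better.
        if baseline_score is not None and candidate <= baseline_score:
            return {
                "status": "discard",
                "reason": "candidate did not improve",
                "candidate_score": candidate,
            }

        # Keep: sync back only the allowed artifact files.
        for rel in artifacts:
            src = sandbox / rel
            if src.is_file():
                dst = repo / rel
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)
        # "candidate improved" is a placeholder reason, not confirmed engine output.
        reason = "no baseline available" if baseline_score is None else "candidate improved"
        return {"status": "keep", "reason": reason, "candidate_score": candidate}
```

Because the sync-back loop only runs on the keep path and only iterates over `artifacts`, any other file the mutator touched in the sandbox is dropped when the temporary directory is cleaned up.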

Quick Start

Run the sample task from the repo root:

uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml

Expected behavior:

The command prints one JSON record to stdout.
A matching JSON line is appended to work/results.jsonl.
tasks/skill-quality/fixtures/SKILL.md is updated only if the candidate is kept.

Example result:

{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}

Task Schema

A task file must include these sections:

id
description
artifacts
mutation
mutator
runner
scorer
objective
constraints
policy
budget
logging

Important runtime fields:

mutator:
  type: command
  command: "python ../../scripts/mutate_skill_task.py --task-dir . --artifact fixtures/SKILL.md"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

runner:
  command: "python ../../scripts/evaluate_skill_task.py --task-dir . --artifact fixtures/SKILL.md --output ../../work/skill-run.json"
  cwd: "tasks/skill-quality"
  timeout_seconds: 30

scorer:
  type: command
  command: "python scripts/score_skill_task.py --input work/skill-run.json"
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
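
To illustrate what the `parse` block asks for, here is a rough sketch of reading a scorer's stdout as JSON and pulling out the configured fields. `parse_scorer_output` is a hypothetical helper, and treating a missing metrics field as an empty dict is an assumption, not documented behavior.

```python
import json


def parse_scorer_output(
    stdout: str,
    score_field: str = "score",
    metrics_field: str = "metrics",
) -> tuple[float, dict]:
    """Extract the score and metrics from a scorer's JSON stdout."""
    payload = json.loads(stdout)
    if not isinstance(payload, dict):
        raise ValueError("scorer output must be a JSON object")
    if score_field not in payload:
        raise ValueError(f"scorer output is missing {score_field!r}")
    score = float(payload[score_field])
    # Assumption: metrics are optional and default to an empty dict.
    metrics = payload.get(metrics_field) or {}
    return score, metrics
```

A scorer that prints `{"score": 4, "metrics": {"sections": 3}}` would yield the score as a float plus the metrics dict; output without the score field raises `ValueError`, which an orchestrator could report as a crash.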

Path Rules

task.root_dir is the directory containing task.yaml.
artifacts.include paths are resolved relative to the task directory.
mutator.cwd and runner.cwd are repo-relative paths.
Absolute cwd values are rejected.
.. segments in cwd are rejected.
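
The two rejection rules for cwd can be expressed as a small check. This is a sketch assuming POSIX-style, forward-slash paths like those in the task example; `validate_cwd` is a hypothetical name, not a function from the repo.

```python
from pathlib import PurePosixPath


def validate_cwd(cwd: str) -> str:
    """Return the cwd unchanged if it is repo-relative, else raise ValueError."""
    path = PurePosixPath(cwd)
    if path.is_absolute():
        raise ValueError(f"absolute cwd is not allowed: {cwd}")
    # pathlib does not collapse "..", so a parts check catches any such segment.
    if ".." in path.parts:
        raise ValueError(f"'..' segments are not allowed in cwd: {cwd}")
    return str(path)
```

With this check, `tasks/skill-quality` passes, while `/tmp/task` and `tasks/../other` are both rejected.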

Keep, Discard, Crash

Keep

Candidate is accepted.
Only allowed artifact files are copied back into the main workspace.

Discard

Candidate is rejected.
Main workspace remains unchanged.

Crash

Mutator, runner, or scorer execution failed.
Main workspace remains unchanged.
CLI exits non-zero.

Validation Rules

The orchestrator rejects a candidate before runner execution when:

changed file count exceeds artifacts.max_files_per_iteration
changed line count exceeds mutation.max_changed_lines
a changed file has a disallowed type
a non-artifact file was mutated

The orchestrator also revalidates artifact state before sync-back on keep, so later runner or scorer edits cannot bypass mutation limits.
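
A check along these lines could enforce the four rules. It is only a sketch: `Limits`, `validate_candidate`, `allowed_suffixes`, and `artifact_paths` are hypothetical names, the file type is assumed to be judged by suffix, and the reason strings echo this document's wording rather than confirmed orchestrator output.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class Limits:
    max_files_per_iteration: int
    max_changed_lines: int
    allowed_suffixes: frozenset  # hypothetical: e.g. frozenset({".md"})
    artifact_paths: frozenset    # hypothetical: repo-relative artifact files


def validate_candidate(changes: dict[str, int], limits: Limits) -> Optional[str]:
    """`changes` maps each changed path to its changed-line count.

    Returns a rejection reason, or None when the candidate passes.
    """
    for path in sorted(changes):
        if path not in limits.artifact_paths:
            return "non-artifact change detected"
        if PurePosixPath(path).suffix not in limits.allowed_suffixes:
            return "disallowed file type"
    if len(changes) > limits.max_files_per_iteration:
        return "mutation budget exceeded"
    if sum(changes.values()) > limits.max_changed_lines:
        return "mutation budget exceeded"
    return None
```

Calling the same function once after the mutator and again just before sync-back mirrors the revalidation described above.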

Repo Directories Ignored By Sandbox State Checks

These repo-root directories are intentionally ignored during sandbox copy/hash validation:

work
.venv
.pytest_cache

Reason:

They are runtime or cache state, not accepted source artifacts.
Including them can distort keepability validation or make real runs unnecessarily slow.
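
One way to skip these directories while copying is an `ignore` callable for `shutil.copytree` that filters names only at the repo root. This is a sketch under that assumption; `copy_repo_to_sandbox` and `IGNORED_ROOT_DIRS` are hypothetical names.

```python
import shutil
from pathlib import Path

IGNORED_ROOT_DIRS = {"work", ".venv", ".pytest_cache"}


def copy_repo_to_sandbox(repo: Path, sandbox: Path) -> None:
    """Copy `repo` to `sandbox`, skipping the ignored directories at the root only."""
    repo_root = Path(repo).resolve()

    def ignore_root_dirs(directory: str, names: list[str]) -> list[str]:
        # copytree calls this once per visited directory; filter only at the root
        # so a nested directory that happens to be named "work" is still copied.
        if Path(directory).resolve() == repo_root:
            return [name for name in names if name in IGNORED_ROOT_DIRS]
        return []

    shutil.copytree(repo_root, sandbox, ignore=ignore_root_dirs)
```

`shutil.ignore_patterns("work", ".venv", ".pytest_cache")` would be shorter, but it matches those names at any depth, which is broader than the repo-root rule stated here.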

Output Record

Each CLI run appends one JSON line to work/results.jsonl with these fields:

task_id
status
reason
candidate_score
diff_summary
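
For reference, appending such a record might look like the sketch below. `append_result` and `RECORD_FIELDS` are hypothetical names, and the compact separators are an assumption chosen to match the formatting of the example result earlier in this document.

```python
import json
from pathlib import Path

RECORD_FIELDS = ("task_id", "status", "reason", "candidate_score", "diff_summary")


def append_result(results_path: Path, record: dict) -> str:
    """Append one record as a single JSON line and return the line written."""
    missing = [name for name in RECORD_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    # Emit only the documented fields, in the documented order.
    ordered = {name: record[name] for name in RECORD_FIELDS}
    line = json.dumps(ordered, separators=(",", ":"))
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

Opening the file in append mode keeps earlier records intact, so the file grows by one line per run.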

Recommended Workflow For Adding A New Task

Create a task directory under tasks/.
Define the artifact set narrowly.
Set conservative mutation limits first.
Add a deterministic mutator command.
Add a deterministic runner and scorer.
Run scripts/run_task.py directly.
Inspect the latest line in work/results.jsonl.
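
For the last step, a small helper can read the most recent record instead of opening the file by hand. This is a sketch; `latest_result` is a hypothetical name and it assumes the one-JSON-object-per-line format described in Output Record.

```python
import json
from pathlib import Path
from typing import Optional


def latest_result(results_path: Path) -> Optional[dict]:
    """Return the last non-empty JSON line of the results file, or None if empty."""
    text = Path(results_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    return json.loads(lines[-1])
```

Checking the `status` and `reason` of the returned dict is usually enough to tell whether a new task was kept, discarded, or crashed.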

Common Failure Cases

`status = "discard"`

Usually means one of the following:

mutation budget exceeded
disallowed file type
non-artifact change detected
candidate did not improve

`status = "crash"`