Usage

What This Is

This project should be used as a safe single-iteration optimization engine.

It is not a free-form "let the AI edit the repo however it wants" loop. It gives the AI a controlled way to:

  1. define what may change
  2. generate one candidate in a sandbox
  3. run evaluation in isolation
  4. score the result
  5. keep or discard the candidate

If you want repeated optimization, the AI should call this workflow repeatedly, one iteration at a time.

How AI Should Use This Project

The correct mental model is:

  • the AI is the search and decision layer
  • this project is the sandboxed execution and evaluation layer

The AI should work in this loop:

  1. Pick one concrete optimization target.
  2. Create or update a task for that target.
  3. Run one iteration with scripts/run_task.py.
  4. Read the JSON result and the latest line in work/results.jsonl.
  5. Decide what to do next:
    • keep: continue from the new baseline
    • discard: adjust the mutator or task constraints and try again
    • crash: fix the task, runner, scorer, or paths before continuing
  6. Repeat until the budget or goal is reached.

Do not treat the main workspace as the experiment loop.

Quick Start

Run the sample task from the repo root:

uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml

Expected result:

  • stdout prints one JSON record
  • work/results.jsonl gets one new JSON line
  • tasks/skill-quality/fixtures/SKILL.md changes only if the candidate is kept

Example:

{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}

What A Task Must Define

Each task file must include:

  • artifacts
  • mutation
  • mutator
  • runner
  • scorer
  • objective
  • constraints
  • policy
  • budget
  • logging

Important runtime fields:

mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics

What Each Part Means

artifacts

The files that may be accepted back into the main workspace.

Keep this set narrow. Start with one file whenever possible.

mutation

The safety budget:

  • how many files may change
  • how many lines may change
  • which file types are allowed
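
The exact field names depend on the task schema; a minimal sketch of the idea, where all names are illustrative:

mutation:
  max_changed_files: 1          # how many files may change
  max_changed_lines: 40         # how many lines may change
  allowed_extensions: [".md"]   # which file types are allowed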

mutator

The command that creates one candidate in the sandbox.

This is where the AI's proposal becomes an actual candidate artifact.

runner

The command that evaluates the candidate.

Examples:

  • run a deterministic checker for a skill
  • run a short training job
  • run validation on a model checkpoint
  • run a forecasting backtest
  • run a scheduling simulation

scorer

The command that returns structured JSON for the decision step.

Examples:

  • rubric score
  • mAP50-95
  • F1 score
  • validation loss
  • RMSE
  • maintenance-event prediction precision/recall
  • production schedule cost or lateness penalty

Path Rules

  • artifact paths are task-relative
  • mutator.cwd and runner.cwd are repo-relative
  • absolute cwd values are rejected
  • .. in cwd is rejected
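
Applied to a runner block, these rules look like this (the commented-out values show rejected forms):

runner:
  cwd: "tasks/your-task"    # repo-relative: accepted
  # cwd: "/opt/tasks/x"     # absolute: rejected
  # cwd: "../other-task"    # contains "..": rejected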

Status Meaning

  • keep: candidate accepted and allowed artifacts synced back
  • discard: candidate rejected and main workspace unchanged
  • crash: execution failed, main workspace unchanged, CLI exits non-zero
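
Each run appends one record to work/results.jsonl. A hypothetical discard record, reusing the fields from the Quick Start example (the reason string and score are illustrative):

{"task_id":"skill-quality","status":"discard","reason":"candidate did not improve","candidate_score":3.0,"diff_summary":""}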

How AI Should Interpret Results

keep

Meaning:

  • the candidate is better than the current baseline
  • the main workspace now contains the accepted artifact

What the AI should do next:

  • treat the current artifact as the new baseline
  • decide whether another iteration is still worth trying

discard

Meaning:

  • the candidate was rejected during validation, or it was evaluated and did not beat the baseline; either way it must not replace the baseline

Common reasons:

  • too many changed files
  • too many changed lines
  • disallowed file type
  • non-artifact change detected
  • candidate did not improve

What the AI should do next:

  • shrink the mutator if the candidate was too aggressive
  • fix task boundaries if the candidate touched the wrong files
  • adjust strategy if the score did not improve

crash

Meaning:

  • mutator, runner, scorer, or task configuration failed

Common reasons:

  • command failed
  • scorer output was not valid JSON
  • configured sandbox cwd does not exist

What the AI should do next:

  • fix the task definition or scripts first
  • do not continue blind iteration until the crash cause is removed

What AI Must Not Do

  • do not directly use the main workspace as the experiment loop
  • do not manually copy files out of the sandbox, bypassing its accept/sync logic
  • do not treat work, .venv, or .pytest_cache as source artifacts
  • do not let runner/scorer outputs become accepted artifacts unless explicitly intended
  • do not start with a wide artifact scope if one-file optimization is possible
  • do not open up entire model codebases when a config or recipe file is enough

How To Create A New Task

When the AI wants to optimize a new problem, use this sequence:

  1. Pick the smallest artifact that can express the change.
  2. Define a strict mutation budget.
  3. Write a deterministic mutator.
  4. Write a deterministic runner.
  5. Write a scorer that emits structured JSON.
  6. Run one iteration.
  7. Inspect the result.
  8. Tighten or broaden the task only if the current boundary proves wrong.

Start from the narrowest controllable object.

Good first artifacts:

  • one SKILL.md
  • one prompt file
  • one YAML config
  • one training recipe
  • one augmentation policy
  • one scheduling policy file
  • one feature engineering config

Avoid starting with:

  • large source trees
  • multiple unrelated configs
  • code plus docs plus scripts in one task
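
Putting the sequence together, here is a sketch of the runtime core of a minimal one-file task. The mutator, runner, and scorer blocks follow the schema shown earlier; every name marked illustrative is an assumption, and the remaining required blocks (objective, constraints, policy, budget, logging) are omitted because their schemas are not shown in this guide.

artifacts:
  - "fixtures/PROMPT.md"                          # assuming a list of task-relative paths
mutation:
  max_changed_files: 1                            # illustrative field names
  max_changed_lines: 40
  allowed_extensions: [".md"]
mutator:
  type: command
  command: "python mutate.py fixtures/PROMPT.md"  # illustrative script
  cwd: "tasks/prompt-quality"
  timeout_seconds: 30
runner:
  command: "python check.py fixtures/PROMPT.md"   # illustrative script
  cwd: "tasks/prompt-quality"
  timeout_seconds: 30
scorer:
  type: command
  command: "python score.py"                      # illustrative script
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics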

General Deep Learning Use Cases

This engine is not limited to skills or YOLO.

It fits any scenario where:

  • one candidate can be generated deterministically
  • one evaluation run can be executed
  • one score can be extracted

Examples:

  • object detection tuning
  • classification tuning
  • time-series forecasting
  • predictive maintenance
  • production scheduling
  • demand prediction
  • anomaly detection
  • tabular model feature-policy search
  • inference prompt or tool-policy optimization

Pattern For Deep Learning Tasks

For most deep learning tasks, the AI should optimize one of these first:

  • training config
  • augmentation config
  • feature config
  • inference threshold config
  • loss-weight config
  • scheduling or dispatch policy config

Typical setup:

  • artifacts: one config file
  • mutator: edits config values
  • runner: runs a short fixed-budget experiment or simulation
  • scorer: emits a JSON score and metrics

This is usually better than letting the AI directly rewrite model code.

Example Mental Templates

Skill Optimization

  • artifact: one SKILL.md
  • mutator: rewrites structure or instructions
  • runner: checks required sections and formatting
  • scorer: outputs rubric-based score

YOLO Or Vision Model Tuning

  • artifact: one experiment YAML
  • mutator: changes lr, augmentation, image size, loss weights, thresholds
  • runner: runs a short training/eval job
  • scorer: outputs mAP, latency, memory, and constraint violations
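
A sketch of what that experiment YAML artifact might contain; the parameter names are illustrative, loosely echoing common detection-training configs:

lr0: 0.01             # initial learning rate
imgsz: 640            # training image size
mosaic: 0.5           # augmentation probability
box_loss_weight: 7.5  # loss weight
conf_threshold: 0.25  # inference confidence threshold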

Predictive Maintenance

  • artifact: one feature/training config
  • mutator: changes window sizes, features, thresholds, class weighting
  • runner: runs a short train + validation or backtest
  • scorer: outputs precision, recall, F1, false-alarm penalty
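
A sketch of the feature/training config such a task would mutate; all field names are illustrative:

window_hours: 24                # size of the sensor feature window
features: ["vibration_rms", "temperature_mean"]
alarm_threshold: 0.7            # probability cutoff for raising an alert
positive_class_weight: 5.0      # upweight rare failure events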

Production Scheduling

  • artifact: one scheduling policy or simulation config
  • mutator: changes dispatch heuristics, penalty weights, batching rules
  • runner: runs a deterministic simulation
  • scorer: outputs lateness, throughput, utilization, total cost
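
A sketch of a scheduling policy config; the field names are illustrative:

dispatch_rule: earliest_due_date  # heuristic the simulator applies
lateness_penalty: 10.0            # cost weight per unit of lateness
batching:
  max_batch_size: 8
  allow_mixed_products: false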

Ignored Runtime Directories

These repo-root directories are ignored by sandbox copy/hash validation:

  • work
  • .venv
  • .pytest_cache

Reason:

  • they are runtime or cache state, not accepted source artifacts

When To Extend The Project Itself

Do not default to changing the engine.

Only extend the engine when the AI cannot express the optimization problem cleanly with:

  • a task file
  • a mutator
  • a runner
  • a scorer

In most cases, the right move is to add a new task, not to modify the orchestrator.

More Detail

See the full guide:

2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md