Usage
What This Is
This project should be used as a safe single-iteration optimization engine.
It is not a free-form "let the AI edit the repo however it wants" loop. It gives the AI a controlled way to:
- define what may change
- generate one candidate in a sandbox
- run evaluation in isolation
- score the result
- keep or discard the candidate
If you want repeated optimization, the AI should call this workflow repeatedly.
How AI Should Use This Project
The correct mental model is:
- the AI is the search and decision layer
- this project is the sandboxed execution and evaluation layer
The AI should work in this loop:
- Pick one concrete optimization target.
- Create or update a task for that target.
- Run one iteration with `scripts/run_task.py`.
- Read the JSON result and the latest line in `work/results.jsonl`.
- Decide what to do next:
  - `keep`: continue from the new baseline
  - `discard`: adjust the mutator or task constraints and try again
  - `crash`: fix the task, runner, scorer, or paths before continuing
- Repeat until the budget or goal is reached.
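When the AI drives this loop from a script rather than by hand, the outer loop can stay very small. The sketch below is illustrative: it assumes only what Quick Start shows, namely that `scripts/run_task.py --task <task.yaml>` prints one JSON record to stdout and exits non-zero on a crash. The iteration budget and the decisions inside each branch are placeholders.

```python
# Hypothetical outer loop around the single-iteration engine.
# Assumes only what Quick Start shows: run_task.py prints one JSON record to stdout
# and exits non-zero on crash.
import json
import subprocess

TASK = "tasks/skill-quality/task.yaml"
MAX_ITERATIONS = 5  # illustrative budget

for i in range(MAX_ITERATIONS):
    proc = subprocess.run(
        ["uv", "run", "python", "scripts/run_task.py", "--task", TASK],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # crash: fix the task, runner, scorer, or paths before iterating further
        print("crash:", proc.stderr.strip())
        break

    result = json.loads(proc.stdout)
    print(i, result["status"], result.get("candidate_score"), result.get("reason"))

    if result["status"] == "keep":
        # the accepted artifact is the new baseline; decide whether another iteration is worth it
        continue
    if result["status"] == "discard":
        # adjust the mutator or task constraints before the next attempt
        continue
```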
Do not treat the main workspace as the experiment loop.
Quick Start
Run the sample task from the repo root:
uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml
Expected result:
- stdout prints one JSON record
- `work/results.jsonl` gets one new JSON line
- `tasks/skill-quality/fixtures/SKILL.md` changes only if the candidate is kept
Example:
{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}
What A Task Must Define
Each task file must include:
- `artifacts`
- `mutation`
- `mutator`
- `runner`
- `scorer`
- `objective`
- `constraints`
- `policy`
- `budget`
- `logging`
Important runtime fields:
mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30
runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30
scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
What Each Part Means
artifacts
The files that may be accepted back into the main workspace.
Keep this set narrow. Start with one file whenever possible.
mutation
The safety budget:
- how many files may change
- how many lines may change
- which file types are allowed
mutator
The command that creates one candidate in the sandbox.
This is where the AI's proposal becomes an actual candidate artifact.
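For illustration, a command mutator for a config-style artifact could be a small deterministic script that the engine runs inside the sandbox copy of the task directory. The artifact name `candidate.yaml`, the `learning_rate` key, and the PyYAML dependency below are all assumptions; the real mutator is whatever command the task configures.

```python
# Hypothetical deterministic mutator: bump one value in a config-style artifact.
# Run by the engine inside the sandbox with cwd set to the task directory.
import yaml  # assumes PyYAML is available in the task environment

PATH = "candidate.yaml"  # hypothetical artifact path, relative to the task directory

with open(PATH) as f:
    cfg = yaml.safe_load(f)

# One small, reviewable change per iteration keeps the mutation budget easy to respect.
cfg["learning_rate"] = round(cfg.get("learning_rate", 0.01) * 0.5, 6)

with open(PATH, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```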
runner
The command that evaluates the candidate.
Examples:
- run a deterministic checker for a skill
- run a short training job
- run validation on a model checkpoint
- run a forecasting backtest
- run a scheduling simulation
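As one concrete sketch, a skill-checking runner can be a deterministic script that fails fast when the candidate is malformed. The artifact path and the required section names below are hypothetical, and how the runner's output reaches the scorer is up to the task.

```python
# Hypothetical deterministic runner: verify the candidate skill file has required sections.
import sys
from pathlib import Path

REQUIRED_SECTIONS = ["# Overview", "# Steps", "# Examples"]  # illustrative section names
text = Path("fixtures/SKILL.md").read_text()  # hypothetical task-relative artifact path

missing = [s for s in REQUIRED_SECTIONS if s not in text]
if missing:
    print(f"missing sections: {missing}", file=sys.stderr)
    sys.exit(1)  # a failed command is reported as a crash, not a discard
```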
scorer
The command that returns structured JSON for the decision step.
Examples:
- rubric score
- `mAP50-95`
- F1 score
- validation loss
- RMSE
- maintenance-event prediction precision/recall
- production schedule cost or lateness penalty
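Whatever the metric, the scorer's contract is fixed by `parse`: it must produce JSON whose `score` field carries the scalar used for the keep/discard decision and whose `metrics` field carries anything else worth logging. A minimal sketch, assuming the engine reads the scorer's stdout; the metric values are placeholders.

```python
# Minimal scorer sketch: emit JSON matching parse.score_field and parse.metrics_field.
import json

score = 4.0                              # placeholder: compute this from the evaluation output
metrics = {"sections": 5, "words": 812}  # placeholder extra metrics

print(json.dumps({"score": score, "metrics": metrics}))
```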
Path Rules
- artifact paths are task-relative
- `mutator.cwd` and `runner.cwd` are repo-relative
- absolute `cwd` values are rejected
- `..` in `cwd` is rejected
Status Meaning
- `keep`: candidate accepted and allowed artifacts synced back
- `discard`: candidate rejected and main workspace unchanged
- `crash`: execution failed, main workspace unchanged, CLI exits non-zero
How AI Should Interpret Results
keep
Meaning:
- the candidate is better than the current baseline
- the main workspace now contains the accepted artifact
What the AI should do next:
- treat the current artifact as the new baseline
- decide whether another iteration is still worth trying
discard
Meaning:
- the candidate either failed validation or was evaluated and did not beat the baseline; either way, it does not replace the baseline
Common reasons:
- too many changed files
- too many changed lines
- disallowed file type
- non-artifact change detected
- candidate did not improve
What the AI should do next:
- shrink the mutator if the candidate was too aggressive
- fix task boundaries if the candidate touched the wrong files
- adjust strategy if the score did not improve
crash
Meaning:
- mutator, runner, scorer, or task configuration failed
Common reasons:
- command failed
- scorer output was not valid JSON
- configured sandbox `cwd` does not exist
What the AI should do next:
- fix the task definition or scripts first
- do not continue blind iteration until the crash cause is removed
What AI Must Not Do
- do not directly use the main workspace as the experiment loop
- do not manually copy files out of sandbox logic
- do not treat `work`, `.venv`, or `.pytest_cache` as source artifacts
- do not let runner/scorer outputs become accepted artifacts unless explicitly intended
- do not start with a wide artifact scope if one-file optimization is possible
- do not open up entire model codebases when a config or recipe file is enough
How To Create A New Task
When the AI wants to optimize a new problem, use this sequence:
- Pick the smallest artifact that can express the change.
- Define a strict mutation budget.
- Write a deterministic mutator.
- Write a deterministic runner.
- Write a scorer that emits structured JSON.
- Run one iteration.
- Inspect the result.
- Tighten or broaden the task only if the current boundary is too small.
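Putting the sequence together, a new task can start from a skeleton like the one below. Only the `mutator`, `runner`, and `scorer` blocks follow the runtime shapes shown earlier; the other sections are left as placeholders because their exact fields are not covered here, so copy them from an existing task such as `tasks/skill-quality/task.yaml` or from the full guide.

```yaml
# Skeleton for tasks/your-task/task.yaml (illustrative; check an existing task for exact fields)
artifacts:
  - fixtures/config.yaml        # task-relative; keep this set narrow

mutation: {}                    # safety budget: max files, max lines, allowed file types
objective: {}                   # what "better" means for this task
constraints: {}                 # hard limits a candidate must respect
policy: {}                      # keep/discard decision policy
budget: {}                      # iteration / time budget
logging: {}                     # where results are recorded

mutator:
  type: command
  command: "python mutate.py"   # hypothetical script
  cwd: "tasks/your-task"
  timeout_seconds: 30

runner:
  command: "python run_eval.py" # hypothetical script
  cwd: "tasks/your-task"
  timeout_seconds: 30

scorer:
  type: command
  command: "python score.py"    # hypothetical script
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```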
Recommended Optimization Strategy
Start from the narrowest controllable object.
Good first artifacts:
- one `SKILL.md`
- one prompt file
- one YAML config
- one training recipe
- one augmentation policy
- one scheduling policy file
- one feature engineering config
Avoid starting with:
- large source trees
- multiple unrelated configs
- code plus docs plus scripts in one task
General Deep Learning Use Cases
This engine is not limited to skills or YOLO.
It fits any scenario where:
- one candidate can be generated deterministically
- one evaluation run can be executed
- one score can be extracted
Examples:
- object detection tuning
- classification tuning
- time-series forecasting
- predictive maintenance
- production scheduling
- demand prediction
- anomaly detection
- tabular model feature-policy search
- inference prompt or tool-policy optimization
Pattern For Deep Learning Tasks
For most deep learning tasks, the AI should optimize one of these first:
- training config
- augmentation config
- feature config
- inference threshold config
- loss-weight config
- scheduling or dispatch policy config
Typical setup:
- artifacts: one config file
- mutator: edits config values
- runner: runs a short fixed-budget experiment or simulation
- scorer: emits a JSON score and metrics
This is usually better than letting the AI directly rewrite model code.
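For example, the artifact for such a task might be a single training config like the hypothetical sketch below; each iteration the mutator nudges one or two values, and the runner consumes the file for a short fixed-budget experiment. All field names are illustrative.

```yaml
# Hypothetical single-file training config artifact (field names are illustrative)
learning_rate: 0.01
batch_size: 32
epochs: 5                 # keep the evaluation budget short and fixed
augmentation:
  flip_probability: 0.5
  color_jitter: 0.2
loss_weights:
  classification: 1.0
  localization: 0.5
inference:
  confidence_threshold: 0.25
```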
Example Mental Templates
Skill Optimization
- artifact: one `SKILL.md`
- mutator: rewrites structure or instructions
- runner: checks required sections and formatting
- scorer: outputs rubric-based score
YOLO Or Vision Model Tuning
- artifact: one experiment YAML
- mutator: changes lr, augmentation, image size, loss weights, thresholds
- runner: runs a short training/eval job
- scorer: outputs mAP, latency, memory, and constraint violations
Predictive Maintenance
- artifact: one feature/training config
- mutator: changes window sizes, features, thresholds, class weighting
- runner: runs a short train + validation or backtest
- scorer: outputs precision, recall, F1, false-alarm penalty
Production Scheduling
- artifact: one scheduling policy or simulation config
- mutator: changes dispatch heuristics, penalty weights, batching rules
- runner: runs a deterministic simulation
- scorer: outputs lateness, throughput, utilization, total cost
Ignored Runtime Directories
These repo-root directories are ignored by sandbox copy/hash validation:
- `work`
- `.venv`
- `.pytest_cache`
Reason:
- they are runtime or cache state, not accepted source artifacts
When To Extend The Project Itself
Do not default to changing the engine.
Only extend the engine when the AI cannot express the optimization problem cleanly with:
- a task file
- a mutator
- a runner
- a scorer
In most cases, the right move is to add a new task, not to modify the orchestrator.
More Detail
See the full guide:
2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md