# Usage
## What This Is

This project should be used as a safe single-iteration optimization engine.

It is not a free-form "let the AI edit the repo however it wants" loop.

It gives the AI a controlled way to:

1. define what may change
2. generate one candidate in a sandbox
3. run evaluation in isolation
4. score the result
5. keep or discard the candidate

If you want repeated optimization, the AI should call this workflow repeatedly.
## How AI Should Use This Project

The correct mental model is:

- the AI is the search and decision layer
- this project is the sandboxed execution and evaluation layer

The AI should work in this loop:

1. Pick one concrete optimization target.
2. Create or update a task for that target.
3. Run one iteration with `scripts/run_task.py`.
4. Read the JSON result and the latest line in `work/results.jsonl`.
5. Decide what to do next:
   - `keep`: continue from the new baseline
   - `discard`: adjust the mutator or task constraints and try again
   - `crash`: fix the task, runner, scorer, or paths before continuing
6. Repeat until the budget or goal is reached.

Do not treat the main workspace as the experiment loop.
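The decision step above can be sketched as a small helper. `next_action` is a hypothetical name for illustration, not part of this project's API; the status values match the JSON records the engine emits.

```python
# Hypothetical helper illustrating step 5 of the loop above.
# Only the status values ("keep", "discard", "crash") come from
# this project; the function itself is an illustrative sketch.

def next_action(result: dict) -> str:
    """Map one iteration's result record to the AI's next move."""
    status = result["status"]
    if status == "keep":
        return "continue from the new baseline"
    if status == "discard":
        return "adjust the mutator or task constraints and retry"
    if status == "crash":
        return "fix the task, runner, scorer, or paths before continuing"
    raise ValueError(f"unknown status: {status}")
```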
## Quick Start

Run the sample task from the repo root:

```bash
uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml
```

Expected result:

- stdout prints one JSON record
- `work/results.jsonl` gets one new JSON line
- `tasks/skill-quality/fixtures/SKILL.md` changes only if the candidate is kept

Example:

```json
{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}
```
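One way to read back the newest record, assuming one JSON object per line in `work/results.jsonl` as described above (the helper name is illustrative):

```python
import json
from pathlib import Path

def latest_result(path: str = "work/results.jsonl") -> dict:
    """Return the most recent record from the results log.
    Assumes one JSON object per line, newest last."""
    last_line = Path(path).read_text().splitlines()[-1]
    return json.loads(last_line)
```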
## What A Task Must Define

Each task file must include:

- `artifacts`
- `mutation`
- `mutator`
- `runner`
- `scorer`
- `objective`
- `constraints`
- `policy`
- `budget`
- `logging`

Important runtime fields:

```yaml
mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```
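Given the `parse` settings above, the scorer command must print JSON containing a `score` field and a `metrics` field. A minimal hypothetical scorer might emit that shape like this (`emit_score` is an illustrative name, not project code):

```python
# Hypothetical scorer sketch: prints the JSON shape expected by
# parse.format=json, score_field=score, metrics_field=metrics.
import json

def emit_score(score: float, metrics: dict) -> str:
    """Serialize one scorer record in the expected shape."""
    return json.dumps({"score": score, "metrics": metrics})

if __name__ == "__main__":
    print(emit_score(4.0, {"sections_present": 5}))
```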
## What Each Part Means

### `artifacts`

The files that may be accepted back into the main workspace.

Keep this set narrow.
Start with one file whenever possible.

### `mutation`

The safety budget:

- how many files may change
- how many lines may change
- which file types are allowed
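The budget check can be pictured as a pure function over the diff. This is an illustrative sketch of the rule, not the engine's actual implementation; the parameter names are assumptions.

```python
def within_budget(changed_files: dict[str, int],
                  max_files: int,
                  max_lines: int,
                  allowed_suffixes: set[str]) -> bool:
    """Illustrative mutation-budget check.
    changed_files maps file path -> number of changed lines."""
    if len(changed_files) > max_files:          # too many files
        return False
    if sum(changed_files.values()) > max_lines:  # too many lines
        return False
    # every changed file must have an allowed type
    return all(any(path.endswith(sfx) for sfx in allowed_suffixes)
               for path in changed_files)
```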
### `mutator`

The command that creates one candidate in the sandbox.

This is where the AI's proposal becomes an actual candidate artifact.

### `runner`

The command that evaluates the candidate.

Examples:

- run a deterministic checker for a skill
- run a short training job
- run validation on a model checkpoint
- run a forecasting backtest
- run a scheduling simulation
### `scorer`

The command that returns structured JSON for the decision step.

Examples:

- rubric score
- `mAP50-95`
- F1 score
- validation loss
- RMSE
- maintenance-event prediction precision/recall
- production schedule cost or lateness penalty
## Path Rules

- artifact paths are task-relative
- `mutator.cwd` and `runner.cwd` are repo-relative
- absolute `cwd` values are rejected
- `..` in `cwd` is rejected
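The two rejection rules above can be sketched as a validator. This is illustrative only, not the project's actual validation code; the Windows-drive check is an assumption about how absolute paths appear on that platform.

```python
from pathlib import PurePosixPath

def validate_cwd(cwd: str) -> None:
    """Reject cwd values per the path rules above (illustrative sketch)."""
    p = PurePosixPath(cwd)
    # absolute paths: POSIX-absolute, leading backslash, or drive letter
    if p.is_absolute() or cwd.startswith("\\") or (len(cwd) > 1 and cwd[1] == ":"):
        raise ValueError(f"absolute cwd rejected: {cwd}")
    if ".." in p.parts:
        raise ValueError(f"'..' in cwd rejected: {cwd}")
```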
## Status Meaning

- `keep`: candidate accepted and allowed artifacts synced back
- `discard`: candidate rejected and main workspace unchanged
- `crash`: execution failed, main workspace unchanged, CLI exits non-zero
## How AI Should Interpret Results

### `keep`

Meaning:

- the candidate is better than the current baseline
- the main workspace now contains the accepted artifact

What the AI should do next:

- treat the current artifact as the new baseline
- decide whether another iteration is still worth trying
### `discard`

Meaning:

- the candidate was either rejected during validation or evaluated without improving, so it does not replace the baseline

Common reasons:

- too many changed files
- too many changed lines
- disallowed file type
- non-artifact change detected
- candidate did not improve

What the AI should do next:

- shrink the mutator if the candidate was too aggressive
- fix task boundaries if the candidate touched the wrong files
- adjust strategy if the score did not improve
### `crash`

Meaning:

- mutator, runner, scorer, or task configuration failed

Common reasons:

- a command failed
- the scorer output was not valid JSON
- the configured sandbox `cwd` does not exist

What the AI should do next:

- fix the task definition or scripts first
- do not continue blind iteration until the crash cause is removed
## What AI Must Not Do

- do not directly use the main workspace as the experiment loop
- do not bypass the sandbox logic by manually copying files out of it
- do not treat `work`, `.venv`, or `.pytest_cache` as source artifacts
- do not let runner/scorer outputs become accepted artifacts unless explicitly intended
- do not start with a wide artifact scope if one-file optimization is possible
- do not open up entire model codebases when a config or recipe file is enough
## How To Create A New Task

When the AI wants to optimize a new problem, use this sequence:

1. Pick the smallest artifact that can express the change.
2. Define a strict mutation budget.
3. Write a deterministic mutator.
4. Write a deterministic runner.
5. Write a scorer that emits structured JSON.
6. Run one iteration.
7. Inspect the result.
8. Tighten or broaden the task only if the current boundary is too small.
## Recommended Optimization Strategy

Start from the narrowest controllable object.

Good first artifacts:

- one `SKILL.md`
- one prompt file
- one YAML config
- one training recipe
- one augmentation policy
- one scheduling policy file
- one feature engineering config

Avoid starting with:

- large source trees
- multiple unrelated configs
- code plus docs plus scripts in one task
## General Deep Learning Use Cases

This engine is not limited to skills or YOLO.

It fits any scenario where:

- one candidate can be generated deterministically
- one evaluation run can be executed
- one score can be extracted

Examples:

- object detection tuning
- classification tuning
- time-series forecasting
- predictive maintenance
- production scheduling
- demand prediction
- anomaly detection
- tabular model feature-policy search
- inference prompt or tool-policy optimization
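What makes all of these fit the same engine is that keep/discard reduces to comparing one score against the baseline under the task's `objective`. A minimal sketch of that comparison, assuming a maximize/minimize direction (the field names here are illustrative, not the engine's actual schema):

```python
def is_improvement(candidate_score: float,
                   baseline_score: float,
                   direction: str = "maximize") -> bool:
    """Illustrative keep/discard comparison: higher is better for
    metrics like mAP or F1, lower is better for loss or RMSE."""
    if direction == "maximize":
        return candidate_score > baseline_score
    if direction == "minimize":
        return candidate_score < baseline_score
    raise ValueError(f"unknown objective direction: {direction}")
```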
## Pattern For Deep Learning Tasks

For most deep learning tasks, the AI should optimize one of these first:

- training config
- augmentation config
- feature config
- inference threshold config
- loss-weight config
- scheduling or dispatch policy config

Typical setup:

- `artifacts`: one config file
- `mutator`: edits config values
- `runner`: runs a short fixed-budget experiment or simulation
- `scorer`: emits a JSON score and metrics

This is usually better than letting the AI directly rewrite model code.
## Example Mental Templates

### Skill Optimization

- artifact: one `SKILL.md`
- mutator: rewrites structure or instructions
- runner: checks required sections and formatting
- scorer: outputs rubric-based score

### YOLO Or Vision Model Tuning

- artifact: one experiment YAML
- mutator: changes lr, augmentation, image size, loss weights, thresholds
- runner: runs a short training/eval job
- scorer: outputs `mAP`, latency, memory, and constraint violations

### Predictive Maintenance

- artifact: one feature/training config
- mutator: changes window sizes, features, thresholds, class weighting
- runner: runs a short train + validation or backtest
- scorer: outputs precision, recall, F1, false-alarm penalty

### Production Scheduling

- artifact: one scheduling policy or simulation config
- mutator: changes dispatch heuristics, penalty weights, batching rules
- runner: runs a deterministic simulation
- scorer: outputs lateness, throughput, utilization, total cost
## Ignored Runtime Directories

These repo-root directories are ignored by sandbox copy/hash validation:

- `work`
- `.venv`
- `.pytest_cache`

Reason:

- they are runtime or cache state, not accepted source artifacts
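Conceptually, the ignore rule behaves like a `shutil.copytree`-style filter that skips those directory names. This is only an illustration of the rule above, not the engine's actual copy or hashing code.

```python
# Illustrative sketch of the ignore rule; the engine's real filter
# may differ (e.g. it may only apply at the repo root).
IGNORED_DIRS = {"work", ".venv", ".pytest_cache"}

def ignore_runtime(dirpath: str, names: list[str]) -> set[str]:
    """shutil.copytree-style ignore callable: drop runtime/cache dirs."""
    return {n for n in names if n in IGNORED_DIRS}

# usage sketch:
# shutil.copytree(repo_root, sandbox_dir, ignore=ignore_runtime)
```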
## When To Extend The Project Itself

Do not default to changing the engine.

Only extend the engine when the AI cannot express the optimization problem cleanly with:

- a task file
- a mutator
- a runner
- a scorer

In most cases, the right move is to add a new task, not to modify the orchestrator.
## More Detail

See the full guide:

[2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md](D:\App\GitHub\autoresearch\docs\superpowers\usage\2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md)