# Usage
## What This Is
This project should be used as a safe single-iteration optimization engine.
It is not a free-form "let the AI edit the repo however it wants" loop.
It gives the AI a controlled way to:
1. define what may change
2. generate one candidate in a sandbox
3. run evaluation in isolation
4. score the result
5. keep or discard the candidate
For repeated optimization, the AI should call this workflow once per iteration.
## How AI Should Use This Project
The correct mental model is:
- the AI is the search and decision layer
- this project is the sandboxed execution and evaluation layer
The AI should work in this loop:
1. Pick one concrete optimization target.
2. Create or update a task for that target.
3. Run one iteration with `scripts/run_task.py`.
4. Read the JSON result and the latest line in `work/results.jsonl`.
5. Decide what to do next:
- `keep`: continue from the new baseline
- `discard`: adjust the mutator or task constraints and try again
- `crash`: fix the task, runner, scorer, or paths before continuing
6. Repeat until the budget or goal is reached.
Do not treat the main workspace as the experiment loop.
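A minimal sketch of that outer loop in Python, assuming `scripts/run_task.py` prints one JSON record to stdout and exits non-zero on `crash` (as documented below); the stopping policy and budget here are illustrative, not part of the engine:
```python
import json
import subprocess

TASK = "tasks/skill-quality/task.yaml"  # the sample task from Quick Start
MAX_ITERATIONS = 5                      # illustrative budget

for _ in range(MAX_ITERATIONS):
    # One controlled iteration: mutate, evaluate, score, keep/discard.
    proc = subprocess.run(
        ["uv", "run", "python", "scripts/run_task.py", "--task", TASK],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        # "crash": fix the task, runner, scorer, or paths before continuing.
        print("crash:", proc.stderr)
        break
    result = json.loads(proc.stdout)
    if result["status"] == "keep":
        # New baseline accepted; decide whether another try is worth it.
        print("kept, score:", result["candidate_score"])
    else:
        # "discard": adjust the mutator or task constraints before retrying.
        print("discarded:", result["reason"])
```
The three branches map directly onto the decision options in step 5.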
## Quick Start
Run the sample task from the repo root:
```bash
uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml
```
Expected result:
- stdout prints one JSON record
- `work/results.jsonl` gets one new JSON line
- `tasks/skill-quality/fixtures/SKILL.md` changes only if the candidate is kept
Example:
```json
{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}
```
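To inspect the newest appended record, a minimal sketch (assuming each line of `work/results.jsonl` is one complete JSON object, as above):
```python
import json

# Each run appends one JSON line; the last line is the newest result.
with open("work/results.jsonl") as f:
    latest = json.loads(f.readlines()[-1])

print(latest["status"], latest.get("candidate_score"))
```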
## What A Task Must Define
Each task file must include:
- `artifacts`
- `mutation`
- `mutator`
- `runner`
- `scorer`
- `objective`
- `constraints`
- `policy`
- `budget`
- `logging`
Important runtime fields:
```yaml
mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30
runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30
scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```
## What Each Part Means
### `artifacts`
The files that may be accepted back into the main workspace.
Keep this set narrow.
Start with one file whenever possible.
### `mutation`
The safety budget:
- how many files may change
- how many lines may change
- which file types are allowed
### `mutator`
The command that creates one candidate in the sandbox.
This is where the AI's proposal becomes an actual candidate artifact.
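For illustration, a deterministic mutator can be a short script that applies one bounded edit inside the sandbox. The file path, key, and step size below are hypothetical:
```python
import re
import sys

# Hypothetical mutator: bump one numeric value in the candidate config.
# Runs in the sandbox cwd, so the path is relative to that cwd.
PATH = "fixtures/config.yaml"  # illustrative artifact path

with open(PATH) as f:
    text = f.read()

# One bounded, deterministic change: raise a threshold by a fixed step.
new_text = re.sub(r"threshold: 0\.5", "threshold: 0.6", text)
if new_text == text:
    sys.exit(1)  # nothing changed; fail loudly instead of producing a no-op

with open(PATH, "w") as f:
    f.write(new_text)
```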
### `runner`
The command that evaluates the candidate.
Examples:
- run a deterministic checker for a skill
- run a short training job
- run validation on a model checkpoint
- run a forecasting backtest
- run a scheduling simulation
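As a sketch of the first example, a checker runner might verify required sections and record its findings for the scorer. The report file, section list, and runner-to-scorer handoff are assumptions about how a task could be wired, not part of the engine (and note the caution below about runner outputs becoming artifacts):
```python
import json

# Hypothetical runner: check that SKILL.md contains required sections.
REQUIRED = ["# Overview", "# Steps", "# Examples"]  # illustrative list

with open("fixtures/SKILL.md") as f:
    text = f.read()

missing = [s for s in REQUIRED if s not in text]

# Assumed handoff: the scorer reads this report file.
with open("run_report.json", "w") as f:
    json.dump({"missing_sections": missing}, f)
```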
### `scorer`
The command that returns structured JSON for the decision step.
Examples:
- rubric score
- `mAP50-95`
- F1 score
- validation loss
- RMSE
- maintenance-event prediction precision/recall
- production schedule cost or lateness penalty
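Whatever the metric, the scorer just prints one JSON object shaped by the `parse` config above (`score_field: score`, `metrics_field: metrics`). A minimal sketch, assuming the scorer can read the hypothetical report written by the runner sketch earlier and using an illustrative rubric:
```python
import json

with open("run_report.json") as f:   # assumed runner handoff file
    report = json.load(f)

missing = len(report["missing_sections"])
score = max(0.0, 5.0 - missing)      # illustrative rubric: 5 minus misses

print(json.dumps({
    "score": score,                               # matches parse.score_field
    "metrics": {"missing_sections": missing},     # matches parse.metrics_field
}))
```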
## Path Rules
- artifact paths are task-relative
- `mutator.cwd` and `runner.cwd` are repo-relative
- absolute `cwd` values are rejected
- `..` in `cwd` is rejected
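Applied to the runtime fields above:
```yaml
runner:
  cwd: "tasks/your-task"     # accepted: repo-relative
  # cwd: "/opt/your-task"    # rejected: absolute path
  # cwd: "../elsewhere"      # rejected: contains ..
```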
## Status Meaning
- `keep`: candidate accepted and allowed artifacts synced back
- `discard`: candidate rejected and main workspace unchanged
- `crash`: execution failed, main workspace unchanged, CLI exits non-zero
## How AI Should Interpret Results
### `keep`
Meaning:
- the candidate is better than the current baseline
- the main workspace now contains the accepted artifact
What the AI should do next:
- treat the current artifact as the new baseline
- decide whether another iteration is still worth trying
### `discard`
Meaning:
- the candidate was either rejected during validation or evaluated and found not to beat the baseline; in both cases the baseline is kept
Common reasons:
- too many changed files
- too many changed lines
- disallowed file type
- non-artifact change detected
- candidate did not improve
What the AI should do next:
- shrink the mutator if the candidate was too aggressive
- fix task boundaries if the candidate touched the wrong files
- adjust strategy if the score did not improve
### `crash`
Meaning:
- mutator, runner, scorer, or task configuration failed
Common reasons:
- command failed
- scorer output was not valid JSON
- configured sandbox `cwd` does not exist
What the AI should do next:
- fix the task definition or scripts first
- do not continue blind iteration until the crash cause is removed
## What AI Must Not Do
- do not directly use the main workspace as the experiment loop
- do not manually copy files out of the sandbox, bypassing the keep/discard logic
- do not treat `work`, `.venv`, or `.pytest_cache` as source artifacts
- do not let runner/scorer outputs become accepted artifacts unless explicitly intended
- do not start with a wide artifact scope if one-file optimization is possible
- do not open up entire model codebases when a config or recipe file is enough
## How To Create A New Task
When the AI wants to optimize a new problem, use this sequence:
1. Pick the smallest artifact that can express the change.
2. Define a strict mutation budget.
3. Write a deterministic mutator.
4. Write a deterministic runner.
5. Write a scorer that emits structured JSON.
6. Run one iteration.
7. Inspect the result.
8. Tighten or broaden the task only if the current boundary is too small.
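Put together, a minimal hypothetical task file might look like the sketch below. Only the `mutator`, `runner`, and `scorer` runtime fields follow the documented shape; the sub-fields of `mutation`, `objective`, `constraints`, `policy`, `budget`, and `logging` are illustrative assumptions about the schema:
```yaml
artifacts:
  - fixtures/config.yaml              # task-relative, one file
mutation:                             # assumed field names for the safety budget
  max_files: 1
  max_lines: 20
  allowed_extensions: [".yaml"]
mutator:
  type: command
  command: "uv run python mutate.py"
  cwd: "tasks/my-task"
  timeout_seconds: 30
runner:
  command: "uv run python evaluate.py"
  cwd: "tasks/my-task"
  timeout_seconds: 30
scorer:
  type: command
  command: "uv run python score.py"
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
objective: maximize                   # assumed value
constraints: {}                       # assumed: no extra constraints
policy: keep_if_better                # assumed value
budget: {max_iterations: 1}           # assumed shape
logging: {path: work/results.jsonl}   # assumed shape
```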
## Recommended Optimization Strategy
Start from the narrowest controllable object.
Good first artifacts:
- one `SKILL.md`
- one prompt file
- one YAML config
- one training recipe
- one augmentation policy
- one scheduling policy file
- one feature engineering config
Avoid starting with:
- large source trees
- multiple unrelated configs
- code plus docs plus scripts in one task
## General Deep Learning Use Cases
This engine is not limited to skills or YOLO.
It fits any scenario where:
- one candidate can be generated deterministically
- one evaluation run can be executed
- one score can be extracted
Examples:
- object detection tuning
- classification tuning
- time-series forecasting
- predictive maintenance
- production scheduling
- demand prediction
- anomaly detection
- tabular model feature-policy search
- inference prompt or tool-policy optimization
## Pattern For Deep Learning Tasks
For most deep learning tasks, the AI should optimize one of these first:
- training config
- augmentation config
- feature config
- inference threshold config
- loss-weight config
- scheduling or dispatch policy config
Typical setup:
- `artifacts`: one config file
- `mutator`: edits config values
- `runner`: runs a short fixed-budget experiment or simulation
- `scorer`: emits a JSON score and metrics
This is usually better than letting the AI directly rewrite model code.
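For instance, the single artifact for a vision-tuning task could be a config like this; every key below is a hypothetical example, not a required schema:
```yaml
# Hypothetical one-file training config as the only artifact.
lr: 0.001
image_size: 640
augmentation:
  mosaic: true
  hsv_strength: 0.5
loss_weights:
  box: 7.5
  cls: 0.5
confidence_threshold: 0.25
```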
## Example Mental Templates
### Skill Optimization
- artifact: one `SKILL.md`
- mutator: rewrites structure or instructions
- runner: checks required sections and formatting
- scorer: outputs rubric-based score
### YOLO Or Vision Model Tuning
- artifact: one experiment YAML
- mutator: changes lr, augmentation, image size, loss weights, thresholds
- runner: runs a short training/eval job
- scorer: outputs `mAP`, latency, memory, and constraint violations
### Predictive Maintenance
- artifact: one feature/training config
- mutator: changes window sizes, features, thresholds, class weighting
- runner: runs a short train + validation or backtest
- scorer: outputs precision, recall, F1, false-alarm penalty
### Production Scheduling
- artifact: one scheduling policy or simulation config
- mutator: changes dispatch heuristics, penalty weights, batching rules
- runner: runs a deterministic simulation
- scorer: outputs lateness, throughput, utilization, total cost
## Ignored Runtime Directories
These repo-root directories are ignored by sandbox copy/hash validation:
- `work`
- `.venv`
- `.pytest_cache`
Reason:
- they are runtime or cache state, not accepted source artifacts
## When To Extend The Project Itself
Do not default to changing the engine.
Only extend the engine when the AI cannot express the optimization problem cleanly with:
- a task file
- a mutator
- a runner
- a scorer
In most cases, the right move is to add a new task, not to modify the orchestrator.
## More Detail
See the full guide:
[2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md](docs/superpowers/usage/2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md)