# Usage

## What This Is

This project should be used as a safe single-iteration optimization engine. It is not a free-form "let the AI edit the repo however it wants" loop.

It gives the AI a controlled way to:

1. define what may change
2. generate one candidate in a sandbox
3. run evaluation in isolation
4. score the result
5. keep or discard the candidate

If you want repeated optimization, the AI should call this workflow repeatedly.

## How AI Should Use This Project

The correct mental model is:

- the AI is the search and decision layer
- this project is the sandboxed execution and evaluation layer

The AI should work in this loop:

1. Pick one concrete optimization target.
2. Create or update a task for that target.
3. Run one iteration with `scripts/run_task.py`.
4. Read the JSON result and the latest line in `work/results.jsonl`.
5. Decide what to do next:
   - `keep`: continue from the new baseline
   - `discard`: adjust the mutator or task constraints and try again
   - `crash`: fix the task, runner, scorer, or paths before continuing
6. Repeat until the budget or goal is reached.

Do not treat the main workspace as the experiment loop.

## Quick Start

Run the sample task from the repo root:

```bash
uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml
```

Expected result:

- stdout prints one JSON record
- `work/results.jsonl` gets one new JSON line
- `tasks/skill-quality/fixtures/SKILL.md` changes only if the candidate is kept

Example:

```json
{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}
```

## What A Task Must Define

Each task file must include:

- `artifacts`
- `mutation`
- `mutator`
- `runner`
- `scorer`
- `objective`
- `constraints`
- `policy`
- `budget`
- `logging`

Important runtime fields:

```yaml
mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```

## What Each Part Means

### `artifacts`

The files that may be accepted back into the main workspace. Keep this set narrow. Start with one file whenever possible.

### `mutation`

The safety budget:

- how many files may change
- how many lines may change
- which file types are allowed

### `mutator`

The command that creates one candidate in the sandbox. This is where the AI's proposal becomes an actual candidate artifact.

### `runner`

The command that evaluates the candidate.

Examples:

- run a deterministic checker for a skill
- run a short training job
- run validation on a model checkpoint
- run a forecasting backtest
- run a scheduling simulation

### `scorer`

The command that returns structured JSON for the decision step.

Examples:

- rubric score
- `mAP50-95`
- F1 score
- validation loss
- RMSE
- maintenance-event prediction precision/recall
- production schedule cost or lateness penalty
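For concreteness, a scorer can be very small. The sketch below is illustrative, not part of this project: it assumes the runner wrote its raw results to `work/eval_report.json` and that precision and recall are the metrics of interest. Only the output shape, a top-level `score` plus a `metrics` object, is dictated by the `parse` block shown above.

```python
#!/usr/bin/env python
"""Minimal scorer sketch: read the runner's report, emit one JSON record.

Assumptions (not fixed by this project): the runner wrote its raw results
to work/eval_report.json, and precision/recall are the metrics of interest.
Only the output shape, a top-level "score" plus a "metrics" object,
matches the parse block in the task file.
"""
import json
import sys


def main() -> int:
    with open("work/eval_report.json", encoding="utf-8") as f:
        report = json.load(f)

    metrics = {
        "precision": report["precision"],
        "recall": report["recall"],
    }
    # Collapse the metrics into the single scalar the engine compares
    # against the baseline; here, F1.
    score = (
        2 * metrics["precision"] * metrics["recall"]
        / (metrics["precision"] + metrics["recall"])
    )
    json.dump({"score": score, "metrics": metrics}, sys.stdout)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```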
## Path Rules

- artifact paths are task-relative
- `mutator.cwd` and `runner.cwd` are repo-relative
- absolute `cwd` values are rejected
- `..` in `cwd` is rejected

## Status Meaning

- `keep`: candidate accepted and allowed artifacts synced back
- `discard`: candidate rejected and main workspace unchanged
- `crash`: execution failed, main workspace unchanged, CLI exits non-zero

## How AI Should Interpret Results

### `keep`

Meaning:

- the candidate is better than the current baseline
- the main workspace now contains the accepted artifact

What the AI should do next:

- treat the current artifact as the new baseline
- decide whether another iteration is still worth trying

### `discard`

Meaning:

- the candidate was either evaluated or rejected during validation; either way, it should not replace the baseline

Common reasons:

- too many changed files
- too many changed lines
- disallowed file type
- non-artifact change detected
- candidate did not improve

What the AI should do next:

- shrink the mutator if the candidate was too aggressive
- fix task boundaries if the candidate touched the wrong files
- adjust strategy if the score did not improve

### `crash`

Meaning:

- mutator, runner, scorer, or task configuration failed

Common reasons:

- command failed
- scorer output was not valid JSON
- configured sandbox `cwd` does not exist

What the AI should do next:

- fix the task definition or scripts first
- do not continue blind iteration until the crash cause is removed

## What AI Must Not Do

- do not directly use the main workspace as the experiment loop
- do not manually copy files out of the sandbox, bypassing the engine's sync logic
- do not treat `work`, `.venv`, or `.pytest_cache` as source artifacts
- do not let runner/scorer outputs become accepted artifacts unless explicitly intended
- do not start with a wide artifact scope if one-file optimization is possible
- do not open up entire model codebases when a config or recipe file is enough

## How To Create A New Task

When the AI wants to optimize a new problem, use this sequence:

1. Pick the smallest artifact that can express the change.
2. Define a strict mutation budget.
3. Write a deterministic mutator.
4. Write a deterministic runner.
5. Write a scorer that emits structured JSON.
6. Run one iteration.
7. Inspect the result.
8. Tighten or broaden the task only if the current boundary does not fit the problem.

## Recommended Optimization Strategy

Start from the narrowest controllable object.

Good first artifacts:

- one `SKILL.md`
- one prompt file
- one YAML config
- one training recipe
- one augmentation policy
- one scheduling policy file
- one feature engineering config

Avoid starting with:

- large source trees
- multiple unrelated configs
- code plus docs plus scripts in one task

## General Deep Learning Use Cases

This engine is not limited to skills or YOLO. It fits any scenario where:

- one candidate can be generated deterministically
- one evaluation run can be executed
- one score can be extracted

Examples:

- object detection tuning
- classification tuning
- time-series forecasting
- predictive maintenance
- production scheduling
- demand prediction
- anomaly detection
- tabular model feature-policy search
- inference prompt or tool-policy optimization

## Pattern For Deep Learning Tasks

For most deep learning tasks, the AI should optimize one of these first:

- training config
- augmentation config
- feature config
- inference threshold config
- loss-weight config
- scheduling or dispatch policy config

Typical setup:

- `artifacts`: one config file
- `mutator`: edits config values
- `runner`: runs a short fixed-budget experiment or simulation
- `scorer`: emits a JSON score and metrics

This is usually better than letting the AI directly rewrite model code.
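Under this pattern, the mutator can be a few lines. The sketch below is a hypothetical example, assuming a `fixtures/train.yaml` artifact with an `lr` key; what matters is that it is deterministic and stays inside the mutation budget.

```python
#!/usr/bin/env python
"""Sketch of a deterministic mutator: nudge one value in one config file.

The file path, the "lr" key, and the halving step are illustrative
assumptions. The point is determinism: the same input always yields the
same candidate, so a keep or discard is attributable to the change itself.
"""
import re
from pathlib import Path

CONFIG = Path("fixtures/train.yaml")  # assumed task-relative artifact path


def mutate() -> None:
    text = CONFIG.read_text(encoding="utf-8")
    # Halve the learning rate. Fail loudly if the key is missing so the
    # engine reports a crash instead of evaluating a silent no-op.
    match = re.search(r"^lr:[ \t]*([0-9.eE+-]+)[ \t]*$", text, flags=re.MULTILINE)
    if match is None:
        raise SystemExit("mutator: no 'lr' key found in config")
    new_lr = float(match.group(1)) / 2
    text = text[: match.start()] + f"lr: {new_lr}" + text[match.end():]
    CONFIG.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    mutate()
```

A mutator this small also makes `discard` results easy to interpret: only one value changed, so the score difference has one cause.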
## Example Mental Templates

### Skill Optimization

- artifact: one `SKILL.md`
- mutator: rewrites structure or instructions
- runner: checks required sections and formatting
- scorer: outputs rubric-based score

### YOLO Or Vision Model Tuning

- artifact: one experiment YAML
- mutator: changes lr, augmentation, image size, loss weights, thresholds
- runner: runs a short training/eval job
- scorer: outputs `mAP`, latency, memory, and constraint violations

### Predictive Maintenance

- artifact: one feature/training config
- mutator: changes window sizes, features, thresholds, class weighting
- runner: runs a short train + validation or backtest
- scorer: outputs precision, recall, F1, false-alarm penalty

### Production Scheduling

- artifact: one scheduling policy or simulation config
- mutator: changes dispatch heuristics, penalty weights, batching rules
- runner: runs a deterministic simulation
- scorer: outputs lateness, throughput, utilization, total cost

## Ignored Runtime Directories

These repo-root directories are ignored by sandbox copy/hash validation:

- `work`
- `.venv`
- `.pytest_cache`

Reason: they are runtime or cache state, not accepted source artifacts.

## When To Extend The Project Itself

Do not default to changing the engine.

Only extend the engine when the AI cannot express the optimization problem cleanly with:

- a task file
- a mutator
- a runner
- a scorer

In most cases, the right move is to add a new task, not to modify the orchestrator.

## More Detail

See the full guide: [2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md](D:\App\GitHub\autoresearch\docs\superpowers\usage\2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md)
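## Appendix: Driving The Loop

As a closing sketch, this is one way the outer loop from "How AI Should Use This Project" could be driven programmatically. The three-iteration budget and the decision stubs are placeholders for the AI's own judgment; the CLI call and the `status` and `candidate_score` fields come from the Quick Start example above.

```python
#!/usr/bin/env python
"""Sketch of the outer search loop: the AI decides, the engine executes.

The iteration budget and the decision stubs stand in for the AI's own
judgment; the CLI invocation and the "status"/"candidate_score" fields
match the sample record shown in Quick Start.
"""
import json
import subprocess

TASK = "tasks/skill-quality/task.yaml"  # the sample task from Quick Start

for i in range(3):  # assumed budget of three iterations
    proc = subprocess.run(
        ["uv", "run", "python", "scripts/run_task.py", "--task", TASK],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # crash: fix the task, runner, scorer, or paths before retrying
        print(proc.stderr)
        break
    record = json.loads(proc.stdout)
    print(i, record["status"], record.get("candidate_score"))
    if record["status"] == "discard":
        # adjust the mutator or task constraints before the next call
        pass
```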