Usage

What This Is

This project should be used as a safe single-iteration optimization engine.

It is not a free-form "let the AI edit the repo however it wants" loop. It gives the AI a controlled way to:

  1. define what may change
  2. generate one candidate in a sandbox
  3. run evaluation in isolation
  4. score the result
  5. keep or discard the candidate

If you want repeated optimization, the AI should call this workflow repeatedly, one iteration at a time.

How AI Should Use This Project

The correct mental model is:

  • the AI is the search and decision layer
  • this project is the sandboxed execution and evaluation layer

The AI should work in this loop:

  1. Pick one concrete optimization target.
  2. Create or update a task for that target.
  3. Run one iteration with scripts/run_task.py.
  4. Read the JSON result and the latest line in work/results.jsonl.
  5. Decide what to do next:
    • keep: continue from the new baseline
    • discard: adjust the mutator or task constraints and try again
    • crash: fix the task, runner, scorer, or paths before continuing
  6. Repeat until the budget or goal is reached.

Do not treat the main workspace as the experiment loop.

Quick Start

Run the sample task from the repo root:

uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml

Expected result:

  • stdout prints one JSON record
  • work/results.jsonl gets one new JSON line
  • tasks/skill-quality/fixtures/SKILL.md changes only if the candidate is kept

Example:

{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}

What A Task Must Define

Each task file must include:

  • artifacts
  • mutation
  • mutator
  • runner
  • scorer
  • objective
  • constraints
  • policy
  • budget
  • logging

Important runtime fields:

mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics

What Each Part Means

artifacts

The files that may be accepted back into the main workspace.

Keep this set narrow. Start with one file whenever possible.

mutation

The safety budget:

  • how many files may change
  • how many lines may change
  • which file types are allowed
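
The exact field names depend on the task schema; a minimal sketch of the idea, where all names are illustrative:

mutation:
  max_changed_files: 1          # how many files may change
  max_changed_lines: 40         # how many lines may change
  allowed_extensions: [".md"]   # which file types are allowed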

mutator

The command that creates one candidate in the sandbox.

This is where the AI's proposal becomes an actual candidate artifact.

runner

The command that evaluates the candidate.

Examples:

  • run a deterministic checker for a skill
  • run a short training job
  • run validation on a model checkpoint
  • run a forecasting backtest
  • run a scheduling simulation

scorer

The command that returns structured JSON for the decision step.

Examples:

  • rubric score
  • mAP50-95
  • F1 score
  • validation loss
  • RMSE
  • maintenance-event prediction precision/recall
  • production schedule cost or lateness penalty

Path Rules

  • artifact paths are task-relative
  • mutator.cwd and runner.cwd are repo-relative
  • absolute cwd values are rejected
  • .. in cwd is rejected
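
Applied to a runner block, these rules look like this (the commented-out values show rejected forms):

runner:
  cwd: "tasks/your-task"    # repo-relative: accepted
  # cwd: "/opt/tasks/x"     # absolute: rejected
  # cwd: "../other-task"    # contains "..": rejected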

Status Meaning

  • keep: candidate accepted and allowed artifacts synced back
  • discard: candidate rejected and main workspace unchanged
  • crash: execution failed, main workspace unchanged, CLI exits non-zero
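
Each run appends one record to work/results.jsonl. A hypothetical discard record, reusing the fields from the Quick Start example (the reason string and score are illustrative):

{"task_id":"skill-quality","status":"discard","reason":"candidate did not improve","candidate_score":3.0,"diff_summary":""}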

How AI Should Interpret Results

keep

Meaning:

  • the candidate is better than the current baseline
  • the main workspace now contains the accepted artifact

What the AI should do next:

  • treat the current artifact as the new baseline
  • decide whether another iteration is still worth trying

discard

Meaning:

  • the candidate was rejected during validation, or it was evaluated and did not beat the baseline; either way it must not replace the baseline

Common reasons:

  • too many changed files
  • too many changed lines
  • disallowed file type
  • non-artifact change detected
  • candidate did not improve

What the AI should do next:

  • shrink the mutator if the candidate was too aggressive
  • fix task boundaries if the candidate touched the wrong files
  • adjust strategy if the score did not improve

crash

Meaning:

  • mutator, runner, scorer, or task configuration failed

Common reasons:

  • command failed
  • scorer output was not valid JSON
  • configured sandbox cwd does not exist

What the AI should do next:

  • fix the task definition or scripts first
  • do not continue blind iteration until the crash cause is removed

What AI Must Not Do

  • do not directly use the main workspace as the experiment loop
  • do not manually copy files out of the sandbox, bypassing its accept/sync logic
  • do not treat work, .venv, or .pytest_cache as source artifacts
  • do not let runner/scorer outputs become accepted artifacts unless explicitly intended
  • do not start with a wide artifact scope if one-file optimization is possible
  • do not open up entire model codebases when a config or recipe file is enough

How To Create A New Task

When the AI wants to optimize a new problem, use this sequence:

  1. Pick the smallest artifact that can express the change.
  2. Define a strict mutation budget.
  3. Write a deterministic mutator.
  4. Write a deterministic runner.
  5. Write a scorer that emits structured JSON.
  6. Run one iteration.
  7. Inspect the result.
  8. Tighten or broaden the task only if the current boundary proves wrong.

Start from the narrowest controllable object.

Good first artifacts:

  • one SKILL.md
  • one prompt file
  • one YAML config
  • one training recipe
  • one augmentation policy
  • one scheduling policy file
  • one feature engineering config

Avoid starting with:

  • large source trees
  • multiple unrelated configs
  • code plus docs plus scripts in one task
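
Putting the sequence together, here is a sketch of the runtime core of a minimal one-file task. The mutator, runner, and scorer blocks follow the schema shown earlier; every name marked illustrative is an assumption, and the remaining required blocks (objective, constraints, policy, budget, logging) are omitted because their schemas are not shown in this guide.

artifacts:
  - "fixtures/PROMPT.md"                          # assuming a list of task-relative paths
mutation:
  max_changed_files: 1                            # illustrative field names
  max_changed_lines: 40
  allowed_extensions: [".md"]
mutator:
  type: command
  command: "python mutate.py fixtures/PROMPT.md"  # illustrative script
  cwd: "tasks/prompt-quality"
  timeout_seconds: 30
runner:
  command: "python check.py fixtures/PROMPT.md"   # illustrative script
  cwd: "tasks/prompt-quality"
  timeout_seconds: 30
scorer:
  type: command
  command: "python score.py"                      # illustrative script
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics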

General Deep Learning Use Cases

This engine is not limited to skills or YOLO.

It fits any scenario where:

  • one candidate can be generated deterministically
  • one evaluation run can be executed
  • one score can be extracted

Examples:

  • object detection tuning
  • classification tuning
  • time-series forecasting
  • predictive maintenance
  • production scheduling
  • demand prediction
  • anomaly detection
  • tabular model feature-policy search
  • inference prompt or tool-policy optimization

Pattern For Deep Learning Tasks

For most deep learning tasks, the AI should optimize one of these first:

  • training config
  • augmentation config
  • feature config
  • inference threshold config
  • loss-weight config
  • scheduling or dispatch policy config

Typical setup:

  • artifacts: one config file
  • mutator: edits config values
  • runner: runs a short fixed-budget experiment or simulation
  • scorer: emits a JSON score and metrics

This is usually better than letting the AI directly rewrite model code.

Example Mental Templates

Skill Optimization

  • artifact: one SKILL.md
  • mutator: rewrites structure or instructions
  • runner: checks required sections and formatting
  • scorer: outputs rubric-based score

YOLO Or Vision Model Tuning

  • artifact: one experiment YAML
  • mutator: changes lr, augmentation, image size, loss weights, thresholds
  • runner: runs a short training/eval job
  • scorer: outputs mAP, latency, memory, and constraint violations
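
A sketch of what that experiment YAML artifact might contain; the parameter names are illustrative, loosely echoing common detection-training configs:

lr0: 0.01             # initial learning rate
imgsz: 640            # training image size
mosaic: 0.5           # augmentation probability
box_loss_weight: 7.5  # loss weight
conf_threshold: 0.25  # inference confidence threshold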

Predictive Maintenance

  • artifact: one feature/training config
  • mutator: changes window sizes, features, thresholds, class weighting
  • runner: runs a short train + validation or backtest
  • scorer: outputs precision, recall, F1, false-alarm penalty
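
A sketch of the feature/training config such a task would mutate; all field names are illustrative:

window_hours: 24                # size of the sensor feature window
features: ["vibration_rms", "temperature_mean"]
alarm_threshold: 0.7            # probability cutoff for raising an alert
positive_class_weight: 5.0      # upweight rare failure events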

Production Scheduling

  • artifact: one scheduling policy or simulation config
  • mutator: changes dispatch heuristics, penalty weights, batching rules
  • runner: runs a deterministic simulation
  • scorer: outputs lateness, throughput, utilization, total cost
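
A sketch of a scheduling policy config; the field names are illustrative:

dispatch_rule: earliest_due_date  # heuristic the simulator applies
lateness_penalty: 10.0            # cost weight per unit of lateness
batching:
  max_batch_size: 8
  allow_mixed_products: false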

Ignored Runtime Directories

These repo-root directories are ignored by sandbox copy/hash validation:

  • work
  • .venv
  • .pytest_cache

Reason:

  • they are runtime or cache state, not accepted source artifacts

When To Extend The Project Itself

Do not default to changing the engine.

Only extend the engine when the AI cannot express the optimization problem cleanly with:

  • a task file
  • a mutator
  • a runner
  • a scorer

In most cases, the right move is to add a new task, not to modify the orchestrator.

More Detail

See the full guide:

2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md