# Usage

## What This Is

This project should be used as a safe single-iteration optimization engine. It is not a free-form "let the AI edit the repo however it wants" loop.

It gives the AI a controlled way to:

1. define what may change
2. generate one candidate in a sandbox
3. run evaluation in isolation
4. score the result
5. keep or discard the candidate

If you want repeated optimization, the AI should call this workflow repeatedly.

## How AI Should Use This Project

The correct mental model is:

- the AI is the search and decision layer
- this project is the sandboxed execution and evaluation layer

The AI should work in this loop:

1. Pick one concrete optimization target.
2. Create or update a task for that target.
3. Run one iteration with `scripts/run_task.py`.
4. Read the JSON result and the latest line in `work/results.jsonl`.
5. Decide what to do next:
   - `keep`: continue from the new baseline
   - `discard`: adjust the mutator or task constraints and try again
   - `crash`: fix the task, runner, scorer, or paths before continuing
6. Repeat until the budget or goal is reached.

Do not treat the main workspace as the experiment loop.

## Quick Start

Run the sample task from the repo root:

```bash
uv run python scripts/run_task.py --task tasks/skill-quality/task.yaml
```

Expected result:

- stdout prints one JSON record
- `work/results.jsonl` gets one new JSON line
- `tasks/skill-quality/fixtures/SKILL.md` changes only if the candidate is kept

Example:

```json
{"task_id":"skill-quality","status":"keep","reason":"no baseline available","candidate_score":4.0,"diff_summary":""}
```

## What A Task Must Define

Each task file must include:

- `artifacts`
- `mutation`
- `mutator`
- `runner`
- `scorer`
- `objective`
- `constraints`
- `policy`
- `budget`
- `logging`

Important runtime fields:

```yaml
mutator:
  type: command
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

runner:
  command: "..."
  cwd: "tasks/your-task"
  timeout_seconds: 30

scorer:
  type: command
  command: "..."
  timeout_seconds: 30
  parse:
    format: json
    score_field: score
    metrics_field: metrics
```

## What Each Part Means

### `artifacts`

The files that may be accepted back into the main workspace. Keep this set narrow. Start with one file whenever possible.

### `mutation`

The safety budget:

- how many files may change
- how many lines may change
- which file types are allowed

### `mutator`

The command that creates one candidate in the sandbox. This is where the AI's proposal becomes an actual candidate artifact.

### `runner`

The command that evaluates the candidate.

Examples:

- run a deterministic checker for a skill
- run a short training job
- run validation on a model checkpoint
- run a forecasting backtest
- run a scheduling simulation

### `scorer`

The command that returns structured JSON for the decision step.

Examples:

- rubric score
- `mAP50-95`
- F1 score
- validation loss
- RMSE
- maintenance-event prediction precision/recall
- production schedule cost or lateness penalty
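For concreteness, a scorer can be very small. The sketch below is illustrative, not part of this project: it assumes the runner wrote its raw results to `work/eval_report.json` and that precision and recall are the metrics of interest. Only the output shape, a top-level `score` plus a `metrics` object, is dictated by the `parse` block shown above.

```python
#!/usr/bin/env python
"""Minimal scorer sketch: read the runner's report, emit one JSON record.

Assumptions (not fixed by this project): the runner wrote its raw results
to work/eval_report.json, and precision/recall are the metrics of interest.
Only the output shape, a top-level "score" plus a "metrics" object,
matches the parse block in the task file.
"""
import json
import sys


def main() -> int:
    with open("work/eval_report.json", encoding="utf-8") as f:
        report = json.load(f)

    metrics = {
        "precision": report["precision"],
        "recall": report["recall"],
    }
    # Collapse the metrics into the single scalar the engine compares
    # against the baseline; here, F1.
    score = (
        2 * metrics["precision"] * metrics["recall"]
        / (metrics["precision"] + metrics["recall"])
    )
    json.dump({"score": score, "metrics": metrics}, sys.stdout)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```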
## Path Rules

- artifact paths are task-relative
- `mutator.cwd` and `runner.cwd` are repo-relative
- absolute `cwd` values are rejected
- `..` in `cwd` is rejected

## Status Meaning

- `keep`: candidate accepted and allowed artifacts synced back
- `discard`: candidate rejected and main workspace unchanged
- `crash`: execution failed, main workspace unchanged, CLI exits non-zero

## How AI Should Interpret Results

### `keep`

Meaning:

- the candidate is better than the current baseline
- the main workspace now contains the accepted artifact

What the AI should do next:

- treat the current artifact as the new baseline
- decide whether another iteration is still worth trying

### `discard`

Meaning:

- the candidate was either evaluated or rejected during validation; either way, it should not replace the baseline

Common reasons:

- too many changed files
- too many changed lines
- disallowed file type
- non-artifact change detected
- candidate did not improve

What the AI should do next:

- shrink the mutator if the candidate was too aggressive
- fix task boundaries if the candidate touched the wrong files
- adjust strategy if the score did not improve

### `crash`

Meaning:

- mutator, runner, scorer, or task configuration failed

Common reasons:

- command failed
- scorer output was not valid JSON
- configured sandbox `cwd` does not exist

What the AI should do next:

- fix the task definition or scripts first
- do not continue blind iteration until the crash cause is removed

## What AI Must Not Do

- do not directly use the main workspace as the experiment loop
- do not manually copy files out of the sandbox, bypassing the engine's sync logic
- do not treat `work`, `.venv`, or `.pytest_cache` as source artifacts
- do not let runner/scorer outputs become accepted artifacts unless explicitly intended
- do not start with a wide artifact scope if one-file optimization is possible
- do not open up entire model codebases when a config or recipe file is enough

## How To Create A New Task

When the AI wants to optimize a new problem, use this sequence:

1. Pick the smallest artifact that can express the change.
2. Define a strict mutation budget.
3. Write a deterministic mutator.
4. Write a deterministic runner.
5. Write a scorer that emits structured JSON.
6. Run one iteration.
7. Inspect the result.
8. Tighten or broaden the task only if the current boundary does not fit the problem.

## Recommended Optimization Strategy

Start from the narrowest controllable object.

Good first artifacts:

- one `SKILL.md`
- one prompt file
- one YAML config
- one training recipe
- one augmentation policy
- one scheduling policy file
- one feature engineering config

Avoid starting with:

- large source trees
- multiple unrelated configs
- code plus docs plus scripts in one task

## General Deep Learning Use Cases

This engine is not limited to skills or YOLO. It fits any scenario where:

- one candidate can be generated deterministically
- one evaluation run can be executed
- one score can be extracted

Examples:

- object detection tuning
- classification tuning
- time-series forecasting
- predictive maintenance
- production scheduling
- demand prediction
- anomaly detection
- tabular model feature-policy search
- inference prompt or tool-policy optimization

## Pattern For Deep Learning Tasks

For most deep learning tasks, the AI should optimize one of these first:

- training config
- augmentation config
- feature config
- inference threshold config
- loss-weight config
- scheduling or dispatch policy config

Typical setup:

- `artifacts`: one config file
- `mutator`: edits config values
- `runner`: runs a short fixed-budget experiment or simulation
- `scorer`: emits a JSON score and metrics

This is usually better than letting the AI directly rewrite model code.
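Under this pattern, the mutator can be a few lines. The sketch below is a hypothetical example, assuming a `fixtures/train.yaml` artifact with an `lr` key; what matters is that it is deterministic and stays inside the mutation budget.

```python
#!/usr/bin/env python
"""Sketch of a deterministic mutator: nudge one value in one config file.

The file path, the "lr" key, and the halving step are illustrative
assumptions. The point is determinism: the same input always yields the
same candidate, so a keep or discard is attributable to the change itself.
"""
import re
from pathlib import Path

CONFIG = Path("fixtures/train.yaml")  # assumed task-relative artifact path


def mutate() -> None:
    text = CONFIG.read_text(encoding="utf-8")
    # Halve the learning rate. Fail loudly if the key is missing so the
    # engine reports a crash instead of evaluating a silent no-op.
    match = re.search(r"^lr:[ \t]*([0-9.eE+-]+)[ \t]*$", text, flags=re.MULTILINE)
    if match is None:
        raise SystemExit("mutator: no 'lr' key found in config")
    new_lr = float(match.group(1)) / 2
    text = text[: match.start()] + f"lr: {new_lr}" + text[match.end():]
    CONFIG.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    mutate()
```

A mutator this small also makes `discard` results easy to interpret: only one value changed, so the score difference has one cause.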
## Example Mental Templates

### Skill Optimization

- artifact: one `SKILL.md`
- mutator: rewrites structure or instructions
- runner: checks required sections and formatting
- scorer: outputs rubric-based score

### YOLO Or Vision Model Tuning

- artifact: one experiment YAML
- mutator: changes lr, augmentation, image size, loss weights, thresholds
- runner: runs a short training/eval job
- scorer: outputs `mAP`, latency, memory, and constraint violations

### Predictive Maintenance

- artifact: one feature/training config
- mutator: changes window sizes, features, thresholds, class weighting
- runner: runs a short train + validation or backtest
- scorer: outputs precision, recall, F1, false-alarm penalty

### Production Scheduling

- artifact: one scheduling policy or simulation config
- mutator: changes dispatch heuristics, penalty weights, batching rules
- runner: runs a deterministic simulation
- scorer: outputs lateness, throughput, utilization, total cost

## Ignored Runtime Directories

These repo-root directories are ignored by sandbox copy/hash validation:

- `work`
- `.venv`
- `.pytest_cache`

Reason: they are runtime or cache state, not accepted source artifacts.

## When To Extend The Project Itself

Do not default to changing the engine.

Only extend the engine when the AI cannot express the optimization problem cleanly with:

- a task file
- a mutator
- a runner
- a scorer

In most cases, the right move is to add a new task, not to modify the orchestrator.

## More Detail

See the full guide: [2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md](D:\App\GitHub\autoresearch\docs\superpowers\usage\2026-04-02-baseline-aware-single-iteration-orchestrator-usage.md)
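## Appendix: Driving The Loop

As a closing sketch, this is one way the outer loop from "How AI Should Use This Project" could be driven programmatically. The three-iteration budget and the decision stubs are placeholders for the AI's own judgment; the CLI call and the `status` and `candidate_score` fields come from the Quick Start example above.

```python
#!/usr/bin/env python
"""Sketch of the outer search loop: the AI decides, the engine executes.

The iteration budget and the decision stubs stand in for the AI's own
judgment; the CLI invocation and the "status"/"candidate_score" fields
match the sample record shown in Quick Start.
"""
import json
import subprocess

TASK = "tasks/skill-quality/task.yaml"  # the sample task from Quick Start

for i in range(3):  # assumed budget of three iterations
    proc = subprocess.run(
        ["uv", "run", "python", "scripts/run_task.py", "--task", TASK],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # crash: fix the task, runner, scorer, or paths before retrying
        print(proc.stderr)
        break
    record = json.loads(proc.stdout)
    print(i, record["status"], record.get("candidate_score"))
    if record["status"] == "discard":
        # adjust the mutator or task constraints before the next call
        pass
```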