From 2968ec63a98c92cbf6c69429e64bd3eff16213c4 Mon Sep 17 00:00:00 2001
From: sladro
Date: Thu, 2 Apr 2026 10:59:11 +0800
Subject: [PATCH] docs: add artifact loop engine design spec

---
 .../2026-04-02-artifact-loop-engine-design.md | 432 ++++++++++++++++++
 1 file changed, 432 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md

diff --git a/docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md b/docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md
new file mode 100644
index 0000000..1239522
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md
@@ -0,0 +1,432 @@
# Artifact Loop Engine Design

Date: 2026-04-02

## Summary

This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.

The engine does not optimize arbitrary real-world processes directly. It optimizes artifacts through a closed loop:

1. Select allowed text artifacts.
2. Generate a candidate edit.
3. Run a scripted task.
4. Score the result with a structured evaluator.
5. Keep or discard the candidate.
6. Log the outcome.

The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.

## Problem Statement

The current repository assumes a narrow workflow:

- One primary editable file: `train.py`
- One execution command: `uv run train.py`
- One primary metric: `val_bpb`
- One decision rule: lower is better

That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:

- Improving a `SKILL.md` against a rubric
- Improving a prompt template against evaluator scores
- Tuning a small config file against runtime or accuracy metrics
- Improving a small code path against a structured test score

The new system should generalize the optimization loop without becoming a full autonomous project platform.

## Goals

- Optimize editable text artifacts: code, config, prompt, and skill files
- Support any task that can be executed by a script and scored as structured metrics
- Keep edits bounded, reversible, and easy to audit
- Separate task definition from engine logic
- Make the first version usable for skill and prompt optimization

## Non-Goals

- General-purpose multi-agent software development
- Optimization of arbitrary external systems without a scripted evaluation loop
- Image, binary, database, or manual-interaction optimization
- True multi-objective Pareto optimization in V1
- Unbounded edits across an entire repository

## System Boundary

The engine optimizes text artifacts only. A task must define:

- What files may be edited
- How to run a candidate
- How to score a candidate
- What constraints must hold
- How to decide whether a candidate is better

If a task cannot define these precisely, it is outside the engine boundary.

## Core Model

Each optimization task is represented by a task specification.

The engine operates on four core abstractions:

- `artifact`: the editable files
- `runner`: the command that executes a candidate
- `scorer`: the component that converts outputs into structured metrics
- `policy`: the decision logic for keep, discard, or fail

## Experiment Lifecycle

Each iteration follows a fixed eight-stage lifecycle:

1. `load_task`
   - Parse and validate the task specification.

2. `snapshot_baseline`
   - Record current artifact hashes and the current best result.

3. `mutate_artifacts`
   - Generate a bounded candidate edit within the allowed files and change budget.

4. `run_candidate`
   - Execute the runner command and capture logs, exit code, and runtime.

5. `score_candidate`
   - Produce structured output containing a primary score and supporting metrics.

6. `validate_constraints`
   - Evaluate hard constraints before comparing the primary score.

7. `decide`
   - Apply the decision policy and classify the candidate as `keep`, `discard`, or `crash`.

8. `log_iteration`
   - Persist the result, metrics, decision, and diff summary.

This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
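
To make this control flow concrete, here is a minimal runnable sketch of one iteration with the mutation, run, and scoring stages stubbed in place. Every name in it is illustrative rather than a committed interface; the real components are specified under Core Modules below.

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    exit_code: int
    metrics: dict = field(default_factory=dict)

def run_iteration(artifacts: dict, best: dict) -> str:
    """One pass through stages 2-8; stage 1 (load_task) is the inputs here."""
    snapshot = dict(artifacts)                                   # 2. snapshot_baseline
    artifacts["SKILL.md"] += "\n(candidate edit)"                # 3. mutate_artifacts, stubbed
    result = RunResult(0, {"score": 0.8, "violation_count": 0})  # 4-5. run + score, stubbed
    if result.exit_code != 0:
        decision = "crash"
    elif result.metrics["violation_count"] > 0:                  # 6. constraints before scores
        decision = "discard"
    elif result.metrics["score"] > best["score"]:                # 7. strict improvement only
        decision = "keep"
    else:
        decision = "discard"
    if decision != "keep":
        artifacts.clear()
        artifacts.update(snapshot)                               # discarded candidates revert
    print({"decision": decision, **result.metrics})              # 8. log_iteration, as stdout
    return decision

run_iteration({"SKILL.md": "# Writing skill"}, best={"score": 0.7})
```
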
## Scoring Model

V1 uses a conservative scoring model:

- One primary metric
- Zero or more hard constraints
- Zero or more tie-breakers

### Primary Metric

The primary metric determines the main notion of improvement.

Examples:

- `score`
- `f1`
- `mAP50_95`
- `test_pass_rate`
- `proposal_rubric_score`

The objective must define whether the metric should be minimized or maximized.

### Constraints

Constraints are pass/fail conditions that block acceptance regardless of the primary score.

Examples:

- `violation_count <= 0`
- `runtime_seconds <= 300`
- `length_tokens <= 2500`
- `required_sections_present == true`

Constraints protect the system from reward hacking and invalid outputs.

### Tie-Breakers

Tie-breakers apply only when the primary metric is tied, or the difference is too small to justify a decision on the primary score alone.

Examples:

- Lower token count
- Lower runtime
- Fewer violations
- Lower cost

### Reward Hacking Risks

Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:

- Length hacking
- Keyword stuffing
- Overfitting to a single judge
- Passing the judge while violating structural requirements

For this reason, the scorer should combine:

- Rubric-based judging
- Rule-based checks
- Format validation
- Optional multi-judge averaging

The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.

## Decision Policy

V1 keeps the decision policy intentionally strict:

- If the runner crashes, mark as `crash`
- If any hard constraint fails, mark as `discard`
- If the candidate improves the primary metric, mark as `keep`
- If the primary metric is tied, keep only if tie-breakers are clearly better
- Otherwise mark as `discard`

The default philosophy is to retain only clear improvements.
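
As a sketch, this policy reduces to a small pure function over structured results. The sketch assumes a maximize objective and reuses metric names from this document; strict equality stands in for the "too close" case, which a real implementation would likely soften with a tolerance, and "clearly better" on tie-breakers is interpreted here as better on at least one and worse on none.

```python
def decide(crashed: bool,
           constraints_ok: bool,
           candidate: dict,
           baseline: dict,
           primary: str = "score",
           tie_breakers: tuple = ("violation_count", "length_tokens")) -> str:
    """Classify a candidate as keep, discard, or crash under a maximize objective."""
    if crashed:
        return "crash"
    if not constraints_ok:
        return "discard"
    if candidate[primary] > baseline[primary]:
        return "keep"
    if candidate[primary] == baseline[primary]:
        # Tied on the primary metric: keep only if strictly better on some
        # tie-breaker (lower is better) and no worse on any of the others.
        better = any(candidate[m] < baseline[m] for m in tie_breakers)
        no_worse = all(candidate[m] <= baseline[m] for m in tie_breakers)
        if better and no_worse:
            return "keep"
    return "discard"

# Tied primary score, strictly fewer violations -> "keep"
print(decide(False, True,
             {"score": 0.8, "violation_count": 0, "length_tokens": 2100},
             {"score": 0.8, "violation_count": 1, "length_tokens": 2100}))
```
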
## Task Specification

Each task lives in its own directory and includes a declarative task definition file.

Example:

```yaml
id: skill-quality
description: Optimize a skill markdown file against a rubric.

artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120

runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300

scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"

objective:
  primary_metric: score
  direction: maximize

constraints:
  - metric: violation_count
    op: "<="
    value: 0
  - metric: length_tokens
    op: "<="
    value: 2500

policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard

budget:
  max_iterations: 50
  max_failures: 10

logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"
```

## Core Modules

V1 should be implemented with six focused modules.

### `task_loader`

Responsibilities:

- Parse task files
- Validate required fields
- Produce normalized internal models

It should not execute commands or mutate files.

### `artifact_manager`

Responsibilities:

- Resolve allowed files
- Snapshot baseline contents and hashes
- Produce diff summaries
- Restore discarded candidates

It owns file-level safety and reversibility.

### `mutation_engine`

Responsibilities:

- Instruct the agent to edit artifacts
- Enforce mutation budget limits
- Reject edits outside allowed paths or file types

It is a bounded editor, not an open-ended autonomous developer.

### `runner`

Responsibilities:

- Execute the candidate task
- Capture exit code, runtime, and logs
- Emit a structured run result

It does not decide quality.

### `scorer`

Responsibilities:

- Convert raw outputs into structured metrics
- Emit the primary metric, additional metrics, and constraint data

V1 should support:

- `command` scorer
- Optional later extension to `python_plugin`

### `decision_engine`

Responsibilities:

- Compare candidate vs baseline
- Apply objective and policy rules
- Emit `keep`, `discard`, or `crash`

It should consume only structured results, not raw logs.

## Directory Layout

Suggested V1 structure:

```text
engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py

tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md

scripts/
  run_task.py
  score_task.py

work/
  runs/
  candidates/
  logs/
  results.jsonl
```

This structure separates reusable engine code from per-task resources and runtime state.

## Minimal V1 Task

The first task should prove the engine on skill and prompt optimization, because it directly matches the desired use case and has fast evaluation.

Recommended first task:

- Optimize one `SKILL.md`
- Score it against a rubric
- Enforce structural constraints
- Keep only strictly better versions

Suggested output metrics:

- `primary_score`
- `clarity`
- `coverage`
- `constraint_obedience`
- `violation_count`
- `length_tokens`

This is enough to validate the engine without requiring GPU training or long-running experiments.
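
To illustrate the `command` scorer contract for this task, here is a sketch of what a script like the `scripts/score_skill.py` referenced in the example specification might emit. The structural checks, section names, and score formula are placeholders, not a real rubric.

```python
#!/usr/bin/env python3
"""Sketch of a command scorer: reads the artifact, prints structured JSON."""
import json
import pathlib

# Artifact path taken from the example task above; the fallback string lets
# the sketch run outside the repository.
path = pathlib.Path("skills/writing/SKILL.md")
text = path.read_text() if path.exists() else "# Writing skill\n## When to use\n## Steps"

# Placeholder structural rubric: sections that must be present.
required_sections = ["## When to use", "## Steps"]
violations = sum(1 for section in required_sections if section not in text)
length_tokens = len(text.split())  # crude token proxy; a real scorer might use a tokenizer

# "score" and "metrics" match the score_field/metrics_field parse settings
# in the example task specification.
print(json.dumps({
    "score": max(0.0, 1.0 - 0.2 * violations),  # placeholder rubric score
    "metrics": {
        "violation_count": violations,
        "length_tokens": length_tokens,
    },
}))
```
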
+ +## Guardrails + +V1 should explicitly include the following guardrails: + +- Limit each iteration to a small number of files +- Limit total changed lines per iteration +- Reject edits outside declared artifact boundaries +- Require structured scorer output +- Evaluate constraints before comparing scores +- Log every accepted or discarded candidate with a diff summary +- Default to strict improvement only +- Avoid hidden cross-task memory; share only structured historical results + +These controls are necessary to keep the optimization loop safe, explainable, and reusable. + +## Recommended Implementation Strategy + +Implement the system in two phases. + +### Phase 1 + +Build the smallest complete loop: + +- Task loader +- Artifact manager +- Runner +- Scorer +- Decision engine +- JSONL logging + +Use a simple `skill-quality` task to prove the loop works. + +### Phase 2 + +Add bounded artifact mutation: + +- Agent-driven edit generation +- Mutation budget enforcement +- Candidate restoration on discard +- Better diff summaries and diagnostics + +This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation. + +## Open Questions Deferred + +The following topics are intentionally deferred from V1: + +- Multi-objective Pareto search +- Branch-per-candidate workflows +- Distributed or parallel experiment scheduling +- Task-to-task transfer memory +- Human-in-the-loop review checkpoints +- Rich plugin ecosystems + +These can be added later if V1 proves useful. + +## Recommendation + +Proceed with a minimal `Artifact Loop Engine` centered on declarative task specs and strict structured scoring. + +Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.