From 2968ec63a98c92cbf6c69429e64bd3eff16213c4 Mon Sep 17 00:00:00 2001
From: sladro
Date: Thu, 2 Apr 2026 10:59:11 +0800
Subject: [PATCH] docs: add artifact loop engine design spec

---
 .../2026-04-02-artifact-loop-engine-design.md | 432 ++++++++++++++++++
 1 file changed, 432 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md

diff --git a/docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md b/docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md
new file mode 100644
index 0000000..1239522
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-02-artifact-loop-engine-design.md
@@ -0,0 +1,432 @@
# Artifact Loop Engine Design

Date: 2026-04-02

## Summary

This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.

The engine does not optimize arbitrary real-world processes directly. It optimizes artifacts through a closed loop:

1. Select allowed text artifacts.
2. Generate a candidate edit.
3. Run a scripted task.
4. Score the result with a structured evaluator.
5. Keep or discard the candidate.
6. Log the outcome.

The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.

## Problem Statement

The current repository assumes a narrow workflow:

- One primary editable file: `train.py`
- One execution command: `uv run train.py`
- One primary metric: `val_bpb`
- One decision rule: lower is better

That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:

- Improving a `SKILL.md` against a rubric
- Improving a prompt template against evaluator scores
- Tuning a small config file against runtime or accuracy metrics
- Improving a small code path against a structured test score

The new system should generalize the optimization loop without becoming a full autonomous project platform.

## Goals

- Optimize editable text artifacts: code, config, prompt, and skill files
- Support any task that can be executed by a script and scored as structured metrics
- Keep edits bounded, reversible, and easy to audit
- Separate task definition from engine logic
- Make the first version usable for skill and prompt optimization

## Non-Goals

- General-purpose multi-agent software development
- Optimization of arbitrary external systems without a scripted evaluation loop
- Image, binary, database, or manual-interaction optimization
- True multi-objective Pareto optimization in V1
- Unbounded edits across an entire repository

## System Boundary

The engine optimizes text artifacts only. A task must define:

- What files may be edited
- How to run a candidate
- How to score a candidate
- What constraints must hold
- How to decide whether a candidate is better

If a task cannot define these precisely, it is outside the engine boundary.

## Core Model

Each optimization task is represented by a task specification.

The engine operates on four core abstractions:

- `artifact`: the editable files
- `runner`: the command that executes a candidate
- `scorer`: the component that converts outputs into structured metrics
- `policy`: the decision logic for keep, discard, or fail

## Experiment Lifecycle

Each iteration follows a fixed eight-stage lifecycle:

1. `load_task`
   - Parse and validate the task specification.

2. `snapshot_baseline`
   - Record current artifact hashes and the current best result.

3. `mutate_artifacts`
   - Generate a bounded candidate edit within the allowed files and change budget.

4. `run_candidate`
   - Execute the runner command and capture logs, exit code, and runtime.

5. `score_candidate`
   - Produce structured output containing a primary score and supporting metrics.

6. `validate_constraints`
   - Evaluate hard constraints before comparing the primary score.

7. `decide`
   - Apply the decision policy and classify the candidate as `keep`, `discard`, or `crash`.

8. `log_iteration`
   - Persist the result, metrics, decision, and diff summary.

This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
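
To make this control flow concrete, here is a minimal runnable sketch of one iteration with the mutation, run, and scoring stages stubbed in place. Every name in it is illustrative rather than a committed interface; the real components are specified under Core Modules below.

```python
from dataclasses import dataclass, field

@dataclass
class RunResult:
    exit_code: int
    metrics: dict = field(default_factory=dict)

def run_iteration(artifacts: dict, best: dict) -> str:
    """One pass through stages 2-8; stage 1 (load_task) is the inputs here."""
    snapshot = dict(artifacts)                                   # 2. snapshot_baseline
    artifacts["SKILL.md"] += "\n(candidate edit)"                # 3. mutate_artifacts, stubbed
    result = RunResult(0, {"score": 0.8, "violation_count": 0})  # 4-5. run + score, stubbed
    if result.exit_code != 0:
        decision = "crash"
    elif result.metrics["violation_count"] > 0:                  # 6. constraints before scores
        decision = "discard"
    elif result.metrics["score"] > best["score"]:                # 7. strict improvement only
        decision = "keep"
    else:
        decision = "discard"
    if decision != "keep":
        artifacts.clear()
        artifacts.update(snapshot)                               # discarded candidates revert
    print({"decision": decision, **result.metrics})              # 8. log_iteration, as stdout
    return decision

run_iteration({"SKILL.md": "# Writing skill"}, best={"score": 0.7})
```
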
## Scoring Model

V1 uses a conservative scoring model:

- One primary metric
- Zero or more hard constraints
- Zero or more tie-breakers

### Primary Metric

The primary metric determines the main notion of improvement.

Examples:

- `score`
- `f1`
- `mAP50_95`
- `test_pass_rate`
- `proposal_rubric_score`

The objective must define whether the metric should be minimized or maximized.

### Constraints

Constraints are pass/fail conditions that block acceptance regardless of the primary score.

Examples:

- `violation_count <= 0`
- `runtime_seconds <= 300`
- `length_tokens <= 2500`
- `required_sections_present == true`

Constraints protect the system from reward hacking and invalid outputs.

### Tie-Breakers

Tie-breakers apply only when the primary metric is tied, or the difference is too small to justify a decision on the primary score alone.

Examples:

- Lower token count
- Lower runtime
- Fewer violations
- Lower cost

### Reward Hacking Risks

Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:

- Length hacking
- Keyword stuffing
- Overfitting to a single judge
- Passing the judge while violating structural requirements

For this reason, the scorer should combine:

- Rubric-based judging
- Rule-based checks
- Format validation
- Optional multi-judge averaging

The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.

## Decision Policy

V1 keeps the decision policy intentionally strict:

- If the runner crashes, mark as `crash`
- If any hard constraint fails, mark as `discard`
- If the candidate improves the primary metric, mark as `keep`
- If the primary metric is tied, keep only if tie-breakers are clearly better
- Otherwise mark as `discard`

The default philosophy is to retain only clear improvements.
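
As a sketch, this policy reduces to a small pure function over structured results. The sketch assumes a maximize objective and reuses metric names from this document; strict equality stands in for the "too close" case, which a real implementation would likely soften with a tolerance, and "clearly better" on tie-breakers is interpreted here as better on at least one and worse on none.

```python
def decide(crashed: bool,
           constraints_ok: bool,
           candidate: dict,
           baseline: dict,
           primary: str = "score",
           tie_breakers: tuple = ("violation_count", "length_tokens")) -> str:
    """Classify a candidate as keep, discard, or crash under a maximize objective."""
    if crashed:
        return "crash"
    if not constraints_ok:
        return "discard"
    if candidate[primary] > baseline[primary]:
        return "keep"
    if candidate[primary] == baseline[primary]:
        # Tied on the primary metric: keep only if strictly better on some
        # tie-breaker (lower is better) and no worse on any of the others.
        better = any(candidate[m] < baseline[m] for m in tie_breakers)
        no_worse = all(candidate[m] <= baseline[m] for m in tie_breakers)
        if better and no_worse:
            return "keep"
    return "discard"

# Tied primary score, strictly fewer violations -> "keep"
print(decide(False, True,
             {"score": 0.8, "violation_count": 0, "length_tokens": 2100},
             {"score": 0.8, "violation_count": 1, "length_tokens": 2100}))
```
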
## Task Specification

Each task lives in its own directory and includes a declarative task definition file.

Example:

```yaml
id: skill-quality
description: Optimize a skill markdown file against a rubric.

artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120

runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300

scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"

objective:
  primary_metric: score
  direction: maximize

constraints:
  - metric: violation_count
    op: "<="
    value: 0
  - metric: length_tokens
    op: "<="
    value: 2500

policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard

budget:
  max_iterations: 50
  max_failures: 10

logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"
```

## Core Modules

V1 should be implemented with six focused modules.

### `task_loader`

Responsibilities:

- Parse task files
- Validate required fields
- Produce normalized internal models

It should not execute commands or mutate files.

### `artifact_manager`

Responsibilities:

- Resolve allowed files
- Snapshot baseline contents and hashes
- Produce diff summaries
- Restore discarded candidates

It owns file-level safety and reversibility.

### `mutation_engine`

Responsibilities:

- Instruct the agent to edit artifacts
- Enforce mutation budget limits
- Reject edits outside allowed paths or file types

It is a bounded editor, not an open-ended autonomous developer.

### `runner`

Responsibilities:

- Execute the candidate task
- Capture exit code, runtime, and logs
- Emit a structured run result

It does not decide quality.

### `scorer`

Responsibilities:

- Convert raw outputs into structured metrics
- Emit the primary metric, additional metrics, and constraint data

V1 should support:

- `command` scorer
- Optional later extension to `python_plugin`

### `decision_engine`

Responsibilities:

- Compare candidate vs baseline
- Apply objective and policy rules
- Emit `keep`, `discard`, or `crash`

It should consume only structured results, not raw logs.

## Directory Layout

Suggested V1 structure:

```text
engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py

tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md

scripts/
  run_task.py
  score_task.py

work/
  runs/
  candidates/
  logs/
  results.jsonl
```

This structure separates reusable engine code from per-task resources and runtime state.

## Minimal V1 Task

The first task should prove the engine on skill and prompt optimization, because it directly matches the desired use case and has fast evaluation.

Recommended first task:

- Optimize one `SKILL.md`
- Score it against a rubric
- Enforce structural constraints
- Keep only strictly better versions

Suggested output metrics:

- `primary_score`
- `clarity`
- `coverage`
- `constraint_obedience`
- `violation_count`
- `length_tokens`

This is enough to validate the engine without requiring GPU training or long-running experiments.
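
To illustrate the `command` scorer contract for this task, here is a sketch of what a script like the `scripts/score_skill.py` referenced in the example specification might emit. The structural checks, section names, and score formula are placeholders, not a real rubric.

```python
#!/usr/bin/env python3
"""Sketch of a command scorer: reads the artifact, prints structured JSON."""
import json
import pathlib

# Artifact path taken from the example task above; the fallback string lets
# the sketch run outside the repository.
path = pathlib.Path("skills/writing/SKILL.md")
text = path.read_text() if path.exists() else "# Writing skill\n## When to use\n## Steps"

# Placeholder structural rubric: sections that must be present.
required_sections = ["## When to use", "## Steps"]
violations = sum(1 for section in required_sections if section not in text)
length_tokens = len(text.split())  # crude token proxy; a real scorer might use a tokenizer

# "score" and "metrics" match the score_field/metrics_field parse settings
# in the example task specification.
print(json.dumps({
    "score": max(0.0, 1.0 - 0.2 * violations),  # placeholder rubric score
    "metrics": {
        "violation_count": violations,
        "length_tokens": length_tokens,
    },
}))
```
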
+ +## Guardrails + +V1 should explicitly include the following guardrails: + +- Limit each iteration to a small number of files +- Limit total changed lines per iteration +- Reject edits outside declared artifact boundaries +- Require structured scorer output +- Evaluate constraints before comparing scores +- Log every accepted or discarded candidate with a diff summary +- Default to strict improvement only +- Avoid hidden cross-task memory; share only structured historical results + +These controls are necessary to keep the optimization loop safe, explainable, and reusable. + +## Recommended Implementation Strategy + +Implement the system in two phases. + +### Phase 1 + +Build the smallest complete loop: + +- Task loader +- Artifact manager +- Runner +- Scorer +- Decision engine +- JSONL logging + +Use a simple `skill-quality` task to prove the loop works. + +### Phase 2 + +Add bounded artifact mutation: + +- Agent-driven edit generation +- Mutation budget enforcement +- Candidate restoration on discard +- Better diff summaries and diagnostics + +This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation. + +## Open Questions Deferred + +The following topics are intentionally deferred from V1: + +- Multi-objective Pareto search +- Branch-per-candidate workflows +- Distributed or parallel experiment scheduling +- Task-to-task transfer memory +- Human-in-the-loop review checkpoints +- Rich plugin ecosystems + +These can be added later if V1 proves useful. + +## Recommendation + +Proceed with a minimal `Artifact Loop Engine` centered on declarative task specs and strict structured scoring. + +Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.