# Artifact Loop Engine Design

Date: 2026-04-02

## Summary

This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.

The engine does not optimize arbitrary real-world processes directly. It optimizes a closed loop:

1. Select allowed text artifacts.
2. Generate a candidate edit.
3. Run a scripted task.
4. Score the result with a structured evaluator.
5. Keep or discard the candidate.
6. Log the outcome.

The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.

## Problem Statement

The current repository assumes a narrow workflow:

- One primary editable file: `train.py`
- One execution command: `uv run train.py`
- One primary metric: `val_bpb`
- One decision rule: lower is better

That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:

- Improving a `SKILL.md` against a rubric
- Improving a prompt template against evaluator scores
- Tuning a small config file against runtime or accuracy metrics
- Improving a small code path against a structured test score

The new system should generalize the optimization loop without becoming a full autonomous project platform.

## Goals

- Optimize editable text artifacts: code, config, prompt, and skill files
- Support any task that can be executed by a script and scored as structured metrics
- Keep edits bounded, reversible, and easy to audit
- Separate task definition from engine logic
- Make the first version usable for skill/prompt optimization

## Non-Goals

- General-purpose multi-agent software development
- Optimization of arbitrary external systems without a scripted evaluation loop
- Image, binary, database, or manual-interaction optimization
- True multi-objective Pareto optimization in V1
- Unbounded edits across an entire repository

## System Boundary

The engine optimizes text artifacts only. A task must define:

- What files may be edited
- How to run a candidate
- How to score a candidate
- What constraints must hold
- How to decide whether a candidate is better

If a task cannot define these precisely, it is outside the engine boundary.

## Core Model

Each optimization task is represented by a task specification.

The engine operates on four core abstractions:

- `artifact`: the editable files
- `runner`: the command that executes a candidate
- `scorer`: the component that converts outputs into structured metrics
- `policy`: the decision logic for keep, discard, or fail
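As a rough illustration, these abstractions could start life as plain dataclasses in `models.py`; the field names below mirror the task specification later in this document and are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """The editable files and their per-iteration edit budget."""
    include: list[str]
    exclude: list[str] = field(default_factory=list)
    max_files_per_iteration: int = 2


@dataclass
class Runner:
    """The command that executes a candidate."""
    command: str
    cwd: str = "."
    timeout_seconds: int = 300


@dataclass
class Scorer:
    """How raw outputs become structured metrics."""
    command: str
    score_field: str = "score"
    metrics_field: str = "metrics"


@dataclass
class Policy:
    """Decision logic for keep, discard, or crash."""
    keep_if: str = "better_primary"
    tie_breakers: list[str] = field(default_factory=list)
    on_failure: str = "discard"
```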
## Experiment Lifecycle

Each iteration follows a fixed eight-stage lifecycle:

1. `load_task`
   - Parse and validate the task specification.
2. `snapshot_baseline`
   - Record current artifact hashes and the current best result.
3. `mutate_artifacts`
   - Generate a bounded candidate edit within the allowed files and change budget.
4. `run_candidate`
   - Execute the runner command and capture logs, exit code, and runtime.
5. `score_candidate`
   - Produce structured output containing a primary score and supporting metrics.
6. `validate_constraints`
   - Evaluate hard constraints before comparing the primary score.
7. `decide`
   - Apply the decision policy and classify the candidate as `keep`, `discard`, or `crash`.
8. `log_iteration`
   - Persist the result, metrics, decision, and diff summary.

This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
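The drive loop itself can stay under a few dozen lines. A minimal sketch, with each stage injected as a callable so the engine remains task-agnostic; the signatures here are assumptions for illustration, not a committed API:

```python
from typing import Callable


def run_loop(
    load_task: Callable[[], dict],
    snapshot_baseline: Callable[[dict], dict],
    mutate_artifacts: Callable[[dict], None],
    run_candidate: Callable[[dict], dict],
    score_candidate: Callable[[dict], dict],
    validate_constraints: Callable[[dict, dict], bool],
    decide: Callable[[dict, dict], str],
    log_iteration: Callable[[dict], None],
    restore: Callable[[], None],
    max_iterations: int = 50,
) -> None:
    task = load_task()                          # 1. load_task
    best = snapshot_baseline(task)              # 2. snapshot_baseline
    for i in range(max_iterations):
        mutate_artifacts(task)                  # 3. mutate_artifacts
        run = run_candidate(task)               # 4. run_candidate
        if run["exit_code"] != 0:
            decision = "crash"
        else:
            result = score_candidate(run)       # 5. score_candidate
            if not validate_constraints(task, result):  # 6. validate_constraints
                decision = "discard"
            else:
                decision = decide(best, result)         # 7. decide
        if decision == "keep":
            best = result
        else:
            restore()  # roll artifacts back on discard or crash
        log_iteration({"iteration": i, "decision": decision})  # 8. log_iteration
```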
## Scoring Model

V1 uses a conservative scoring model:

- One primary metric
- Zero or more hard constraints
- Zero or more tie-breakers

### Primary Metric

The primary metric determines the main notion of improvement.

Examples:

- `score`
- `f1`
- `mAP50_95`
- `test_pass_rate`
- `proposal_rubric_score`

The objective must define whether the metric should be minimized or maximized.

### Constraints

Constraints are pass/fail conditions that block acceptance regardless of the primary score.

Examples:

- `violation_count <= 0`
- `runtime_seconds <= 300`
- `length_tokens <= 2500`
- `required_sections_present == true`

Constraints protect the system from reward hacking and invalid outputs.
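Constraint evaluation can be a small, fully rule-based function. A sketch, assuming constraints arrive as `{metric, op, value}` dicts as in the task specification below:

```python
import operator

# Map the operator strings used in task specs to Python comparisons.
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
}


def check_constraints(constraints: list[dict], metrics: dict) -> bool:
    """Return True only if every hard constraint passes; a missing metric fails."""
    for c in constraints:
        value = metrics.get(c["metric"])
        if value is None or not OPS[c["op"]](value, c["value"]):
            return False
    return True


# Example:
# check_constraints(
#     [{"metric": "length_tokens", "op": "<=", "value": 2500}],
#     {"length_tokens": 2100},
# )  # -> True
```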
### Tie-Breakers

Tie-breakers apply only when the primary metric is tied, or too close to justify a change on the primary score alone.

Examples:

- Lower token count
- Lower runtime
- Fewer violations
- Lower cost

### Reward Hacking Risks

Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:

- Length hacking
- Keyword stuffing
- Overfitting to a single judge
- Passing the judge while violating structural requirements

For this reason, the scorer should combine:

- Rubric-based judging
- Rule-based checks
- Format validation
- Optional multi-judge averaging

The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.

## Decision Policy

V1 keeps the decision policy intentionally strict:

- If the runner crashes, mark as `crash`
- If any hard constraint fails, mark as `discard`
- If the candidate improves the primary metric, mark as `keep`
- If the primary metric is tied, keep only if tie-breakers are clearly better
- Otherwise mark as `discard`

The default philosophy is to retain only clear improvements.
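The policy above maps almost mechanically onto a comparison function. A sketch, assuming structured metrics dicts and tie-breakers that are all lower-is-better; the epsilon tolerance is an assumption:

```python
def decide(
    baseline: dict,
    candidate: dict,
    primary: str,
    direction: str,
    tie_breakers: list[str],
    tie_epsilon: float = 1e-9,
) -> str:
    """Classify a candidate as keep or discard; crashes are handled upstream."""
    sign = 1.0 if direction == "maximize" else -1.0
    delta = sign * (candidate[primary] - baseline[primary])
    if delta > tie_epsilon:
        return "keep"
    if delta < -tie_epsilon:
        return "discard"
    # Primary metric tied: keep only if some tie-breaker strictly improves
    # and none regresses.
    improved = any(candidate[m] < baseline[m] for m in tie_breakers)
    regressed = any(candidate[m] > baseline[m] for m in tie_breakers)
    return "keep" if improved and not regressed else "discard"
```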
## Task Specification

Each task lives in its own directory and includes a declarative task definition file.

Example:

```yaml
id: skill-quality
description: Optimize a skill markdown file against a rubric.

artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120

runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300

scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"

objective:
  primary_metric: score
  direction: maximize

constraints:
  - metric: violation_count
    op: "<="
    value: 0
  - metric: length_tokens
    op: "<="
    value: 2500

policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard

budget:
  max_iterations: 50
  max_failures: 10

logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"
```

## Core Modules

V1 should be implemented with six focused modules.

### `task_loader`

Responsibilities:

- Parse task files
- Validate required fields
- Produce normalized internal models

It should not execute commands or mutate files.
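A loader sketch for the spec format above, assuming PyYAML; the validation is deliberately shallow and task-agnostic:

```python
import yaml  # PyYAML

REQUIRED_KEYS = {"id", "artifacts", "runner", "scorer", "objective", "policy"}


def load_task(path: str) -> dict:
    """Parse a task.yaml file and check that required sections exist."""
    with open(path, encoding="utf-8") as f:
        task = yaml.safe_load(f)
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        raise ValueError(f"task spec missing sections: {sorted(missing)}")
    if task["objective"].get("direction") not in ("minimize", "maximize"):
        raise ValueError("objective.direction must be minimize or maximize")
    return task
```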
### `artifact_manager`

Responsibilities:

- Resolve allowed files
- Snapshot baseline contents and hashes
- Produce diff summaries
- Restore discarded candidates

It owns file-level safety and reversibility.
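Snapshot and restore can be done with content hashes held in memory; a minimal sketch (a real implementation might spill large artifacts to disk):

```python
import hashlib
from pathlib import Path


def snapshot(paths: list[str]) -> dict[str, tuple[str, str]]:
    """Record (sha256, contents) for each allowed file."""
    snap = {}
    for p in paths:
        text = Path(p).read_text(encoding="utf-8")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        snap[p] = (digest, text)
    return snap


def restore(snap: dict[str, tuple[str, str]]) -> None:
    """Put every file back to its snapshotted contents on discard."""
    for p, (_digest, text) in snap.items():
        Path(p).write_text(text, encoding="utf-8")
```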
### `mutation_engine`

Responsibilities:

- Instruct the agent to edit artifacts
- Enforce mutation budget limits
- Reject edits outside allowed paths or file types

It is a bounded editor, not an open-ended autonomous developer.
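One way to enforce the `max_changed_lines` budget is to diff each candidate against its snapshot; a standard-library sketch:

```python
import difflib


def changed_lines(before: str, after: str) -> int:
    """Count added plus removed lines between two file versions."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""
    )
    return sum(
        1
        for line in diff
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))  # skip file headers
    )


def within_budget(before: str, after: str, max_changed_lines: int) -> bool:
    return changed_lines(before, after) <= max_changed_lines
```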
### `runner`

Responsibilities:

- Execute the candidate task
- Capture exit code, runtime, and logs
- Emit a structured run result

It does not decide quality.
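A runner sketch using the standard library; it captures exit code, runtime, and logs, and treats a timeout as a failed run:

```python
import shlex
import subprocess
import time


def run_candidate(command: str, cwd: str = ".", timeout_seconds: int = 300) -> dict:
    """Run the task command and return a structured run result."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            shlex.split(command),
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
        exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, stdout, stderr = -1, "", "timeout"
    return {
        "exit_code": exit_code,
        "runtime_seconds": time.monotonic() - start,
        "stdout": stdout,
        "stderr": stderr,
    }
```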
### `scorer`

Responsibilities:

- Convert raw outputs into structured metrics
- Emit the primary metric, additional metrics, and constraint data

V1 should support:

- `command` scorer
- Optional later extension to `python_plugin`
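A sketch of the `command` scorer, assuming the scoring script prints a single JSON object to stdout and the `parse` block names the fields to extract:

```python
import json
import shlex
import subprocess


def score_candidate(command: str, score_field: str = "score",
                    metrics_field: str = "metrics") -> dict:
    """Run the scoring command and parse its JSON stdout."""
    proc = subprocess.run(
        shlex.split(command),
        capture_output=True,
        text=True,
        check=True,  # a failing scorer is an error, not a low score
    )
    payload = json.loads(proc.stdout)
    return {
        "primary": payload[score_field],
        "metrics": payload.get(metrics_field, {}),
    }
```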
### `decision_engine`

Responsibilities:

- Compare candidate vs baseline
- Apply objective and policy rules
- Emit `keep`, `discard`, or `crash`

It should consume only structured results, not raw logs.

## Directory Layout

Suggested V1 structure:

```text
engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py

tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md

scripts/
  run_task.py
  score_task.py

work/
  runs/
  candidates/
  logs/
  results.jsonl
```

This structure separates reusable engine code from per-task resources and runtime state.

## Minimal V1 Task

The first task should prove the engine on skill/prompt optimization, because it directly matches the desired use case and has fast evaluation.

Recommended first task:

- Optimize one `SKILL.md`
- Score it against a rubric
- Enforce structural constraints
- Keep only strictly better versions

Suggested output metrics:

- `primary_score`
- `clarity`
- `coverage`
- `constraint_obedience`
- `violation_count`
- `length_tokens`

This is enough to validate the engine without requiring GPU training or long-running experiments.

## Guardrails

V1 should explicitly include the following guardrails:

- Limit each iteration to a small number of files
- Limit total changed lines per iteration
- Reject edits outside declared artifact boundaries
- Require structured scorer output
- Evaluate constraints before comparing scores
- Log every accepted or discarded candidate with a diff summary
- Default to strict improvement only
- Avoid hidden cross-task memory; share only structured historical results

These controls are necessary to keep the optimization loop safe, explainable, and reusable.

## Recommended Implementation Strategy

Implement the system in two phases.

### Phase 1

Build the smallest complete loop:

- Task loader
- Artifact manager
- Runner
- Scorer
- Decision engine
- JSONL logging

Use a simple `skill-quality` task to prove the loop works.
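The JSONL log can stay trivial; a sketch that appends one record per iteration to the configured results file:

```python
import json
from pathlib import Path


def log_iteration(record: dict, results_file: str = "work/results.jsonl") -> None:
    """Append one structured iteration record as a single JSON line."""
    path = Path(results_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
```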
### Phase 2

Add bounded artifact mutation:

- Agent-driven edit generation
- Mutation budget enforcement
- Candidate restoration on discard
- Better diff summaries and diagnostics

This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation.

## Open Questions Deferred

The following topics are intentionally deferred from V1:

- Multi-objective Pareto search
- Branch-per-candidate workflows
- Distributed or parallel experiment scheduling
- Task-to-task transfer memory
- Human-in-the-loop review checkpoints
- Rich plugin ecosystems

These can be added later if V1 proves useful.

## Recommendation

Proceed with a minimal Artifact Loop Engine centered on declarative task specs and strict structured scoring.

Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.