docs: add artifact loop engine design spec

sladro 2026-04-02 10:59:11 +08:00
parent 228791fb49
commit 2968ec63a9


@@ -0,0 +1,432 @@
# Artifact Loop Engine Design
Date: 2026-04-02
## Summary
This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.
The engine does not optimize arbitrary real-world processes directly. It optimizes a closed loop:
1. Select allowed text artifacts.
2. Generate a candidate edit.
3. Run a scripted task.
4. Score the result with a structured evaluator.
5. Keep or discard the candidate.
6. Log the outcome.
The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.
## Problem Statement
The current repository assumes a narrow workflow:
- One primary editable file: `train.py`
- One execution command: `uv run train.py`
- One primary metric: `val_bpb`
- One decision rule: lower is better
That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:
- Improving a `SKILL.md` against a rubric
- Improving a prompt template against evaluator scores
- Tuning a small config file against runtime or accuracy metrics
- Improving a small code path against a structured test score
The new system should generalize the optimization loop without becoming a full autonomous project platform.
## Goals
- Optimize editable text artifacts: code, config, prompt, and skill files
- Support any task that can be executed by a script and scored as structured metrics
- Keep edits bounded, reversible, and easy to audit
- Separate task definition from engine logic
- Make the first version usable for skill and prompt optimization
## Non-Goals
- General-purpose multi-agent software development
- Optimization of arbitrary external systems without a scripted evaluation loop
- Image, binary, database, or manual-interaction optimization
- True multi-objective Pareto optimization in V1
- Unbounded edits across an entire repository
## System Boundary
The engine optimizes text artifacts only. A task must define:
- What files may be edited
- How to run a candidate
- How to score a candidate
- What constraints must hold
- How to decide whether a candidate is better
If a task cannot define these precisely, it is outside the engine boundary.
## Core Model
Each optimization task is represented by a task specification.
The engine operates on four core abstractions (sketched in code after this list):
- `artifact`: the editable files
- `runner`: the command that executes a candidate
- `scorer`: the component that converts outputs into structured metrics
- `policy`: the decision logic for keep, discard, or fail
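A minimal sketch of how these abstractions might be expressed as Python dataclasses; the field names are illustrative defaults drawn from the task spec example later in this document, not a fixed schema.
```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Editable files a task is allowed to touch."""
    include: list[str]
    exclude: list[str] = field(default_factory=list)
    max_files_per_iteration: int = 2


@dataclass
class Runner:
    """Command that executes one candidate."""
    command: str
    cwd: str = "."
    timeout_seconds: int = 300


@dataclass
class Scorer:
    """Converts runner output into structured metrics."""
    command: str
    score_field: str = "score"
    metrics_field: str = "metrics"


@dataclass
class Policy:
    """Decision logic for keep, discard, or crash."""
    primary_metric: str
    direction: str = "maximize"  # or "minimize"
    tie_breakers: list[str] = field(default_factory=list)
```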
## Experiment Lifecycle
Each iteration follows a fixed eight-stage lifecycle:
1. `load_task`
- Parse and validate the task specification.
2. `snapshot_baseline`
- Record current artifact hashes and the current best result.
3. `mutate_artifacts`
- Generate a bounded candidate edit within the allowed files and change budget.
4. `run_candidate`
- Execute the runner command and capture logs, exit code, and runtime.
5. `score_candidate`
- Produce structured output containing a primary score and supporting metrics.
6. `validate_constraints`
- Evaluate hard constraints before comparing the primary score.
7. `decide`
- Apply the decision policy and classify the candidate as `keep`, `discard`, or `crash`.
8. `log_iteration`
- Persist the result, metrics, decision, and diff summary.
This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
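A minimal sketch of how the stages could be wired together. Every function name below is a hypothetical placeholder for the modules described later in this document, not an existing API.
```python
def run_iteration(task_path: str) -> str:
    """One pass through the fixed lifecycle; returns keep, discard, or crash."""
    task = load_task(task_path)                         # 1. load_task
    baseline = snapshot_baseline(task)                  # 2. snapshot_baseline
    candidate = mutate_artifacts(task, baseline)        # 3. mutate_artifacts
    run = run_candidate(task)                           # 4. run_candidate

    if run.exit_code != 0:
        result, decision = None, "crash"
    else:
        result = score_candidate(task, run)             # 5. score_candidate
        failed = validate_constraints(task, result)     # 6. validate_constraints
        decision = decide(task, baseline, result, failed)  # 7. decide

    if decision != "keep":
        restore_baseline(task, baseline)                # revert discarded edits

    log_iteration(task, candidate, result, decision)    # 8. log_iteration
    return decision
```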
## Scoring Model
V1 uses a conservative scoring model:
- One primary metric
- Zero or more hard constraints
- Zero or more tie-breakers
### Primary Metric
The primary metric determines the main notion of improvement.
Examples:
- `score`
- `f1`
- `mAP50_95`
- `test_pass_rate`
- `proposal_rubric_score`
The objective must define whether the metric should be minimized or maximized.
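One way to express the direction-aware comparison, sketched as a small helper; the `min_delta` margin is an assumption, not something the spec above requires.
```python
def better_primary(candidate: float, baseline: float,
                   direction: str, min_delta: float = 0.0) -> bool:
    """True if the candidate's primary metric is a clear improvement."""
    if direction == "maximize":
        return candidate > baseline + min_delta
    if direction == "minimize":
        return candidate < baseline - min_delta
    raise ValueError(f"unknown direction: {direction}")
```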
### Constraints
Constraints are pass/fail conditions that block acceptance regardless of the primary score.
Examples:
- `violation_count <= 0`
- `runtime_seconds <= 300`
- `length_tokens <= 2500`
- `required_sections_present == true`
Constraints protect the system from reward hacking and invalid outputs.
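Constraint checks of this kind can stay fully rule-based. A sketch of how the `metric` / `op` / `value` entries from the task spec might be evaluated:
```python
import operator

_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
}

def violated_constraints(metrics: dict, constraints: list[dict]) -> list[dict]:
    """Return every constraint that does not hold for the candidate's metrics."""
    failed = []
    for c in constraints:
        value = metrics.get(c["metric"])
        # A missing metric is treated as a violation rather than a pass.
        if value is None or not _OPS[c["op"]](value, c["value"]):
            failed.append(c)
    return failed
```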
### Tie-Breakers
Tie-breakers apply only when the primary metric is tied, or too close to the baseline to justify a change on the primary score alone.
Examples:
- Lower token count
- Lower runtime
- Fewer violations
- Lower cost
### Reward Hacking Risks
Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:
- Length hacking
- Keyword stuffing
- Overfitting to a single judge
- Passing the judge while violating structural requirements
For this reason, the scorer should combine:
- Rubric-based judging
- Rule-based checks
- Format validation
- Optional multi-judge averaging
The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.
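For illustration only, a sketch of a composite scorer that blends a rubric judge with two rule-based checks. The specific checks, thresholds, and averaging are assumptions, not requirements of the spec.
```python
def composite_score(text: str, judge_scores: list[float]) -> dict:
    """Blend rubric judging with scriptable checks into one structured result.

    `judge_scores` holds whatever the rubric evaluator(s) returned; the length
    and format checks below stand in for real task-specific validators.
    """
    length_tokens = len(text.split())       # crude token proxy
    has_required_sections = "## " in text   # placeholder structural check

    violations = 0
    if length_tokens > 2500:
        violations += 1
    if not has_required_sections:
        violations += 1

    return {
        "score": sum(judge_scores) / len(judge_scores),  # optional multi-judge average
        "metrics": {
            "length_tokens": length_tokens,
            "violation_count": violations,
            "required_sections_present": has_required_sections,
        },
    }
```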
## Decision Policy
V1 keeps the decision policy intentionally strict:
- If the runner crashes, mark as `crash`
- If any hard constraint fails, mark as `discard`
- If the candidate improves the primary metric, mark as `keep`
- If the primary metric is tied, keep only if tie-breakers are clearly better
- Otherwise mark as `discard`
The default philosophy is to retain only clear improvements.
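A sketch of that policy as a single function; `better_primary` and `violated_constraints` refer to the hypothetical helpers sketched in the scoring section, and exact equality as the tie test is an assumption.
```python
def decide(candidate: dict, baseline: dict, spec: dict) -> str:
    """Classify a scored candidate as keep, discard, or crash."""
    if candidate.get("crashed"):
        return "crash"
    if violated_constraints(candidate["metrics"], spec["constraints"]):
        return "discard"
    if better_primary(candidate["score"], baseline["score"], spec["direction"]):
        return "keep"

    # Tied on the primary metric: keep only if a tie-breaker is clearly better.
    if candidate["score"] == baseline["score"]:
        for tb in spec["tie_breakers"]:          # e.g. {"lower": "length_tokens"}
            metric = tb["lower"]
            if candidate["metrics"][metric] < baseline["metrics"][metric]:
                return "keep"
            if candidate["metrics"][metric] > baseline["metrics"][metric]:
                return "discard"
    return "discard"
```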
## Task Specification
Each task lives in its own directory and includes a declarative task definition file.
Example:
```yaml
id: skill-quality
description: Optimize a skill markdown file against a rubric.
artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2
mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120
runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300
scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"
objective:
  primary_metric: score
  direction: maximize
  constraints:
    - metric: violation_count
      op: "<="
      value: 0
    - metric: length_tokens
      op: "<="
      value: 2500
policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard
budget:
  max_iterations: 50
  max_failures: 10
logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"
```
## Core Modules
V1 should be implemented with six focused modules.
### `task_loader`
Responsibilities:
- Parse task files
- Validate required fields
- Produce normalized internal models
It should not execute commands or mutate files.
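A minimal loading sketch, assuming PyYAML is available as a dependency and that the returned dict is later normalized into the models in `models.py`:
```python
from pathlib import Path

import yaml  # PyYAML, assumed as a dependency

REQUIRED_KEYS = {"id", "artifacts", "runner", "scorer", "objective", "policy"}

def load_task(task_dir: str) -> dict:
    """Parse task.yaml, validate required fields, and return the raw spec."""
    raw = yaml.safe_load(Path(task_dir, "task.yaml").read_text())
    missing = REQUIRED_KEYS - raw.keys()
    if missing:
        raise ValueError(f"task spec missing required fields: {sorted(missing)}")
    return raw
```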
### `artifact_manager`
Responsibilities:
- Resolve allowed files
- Snapshot baseline contents and hashes
- Produce diff summaries
- Restore discarded candidates
It owns file-level safety and reversibility.
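Snapshot and restore can be as simple as content hashing plus saved copies; a sketch under that assumption:
```python
import hashlib
from pathlib import Path

def snapshot(paths: list[str]) -> dict[str, tuple[str, str]]:
    """Record each allowed file's SHA-256 hash and content."""
    snap = {}
    for p in paths:
        text = Path(p).read_text()
        snap[p] = (hashlib.sha256(text.encode()).hexdigest(), text)
    return snap

def restore(snap: dict[str, tuple[str, str]]) -> None:
    """Put every file back to its snapshotted content after a discard."""
    for path, (_, text) in snap.items():
        Path(path).write_text(text)
```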
### `mutation_engine`
Responsibilities:
- Instruct the agent to edit artifacts
- Enforce mutation budget limits
- Reject edits outside allowed paths or file types
It is a bounded editor, not an open-ended autonomous developer.
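One way the budget could be enforced, assuming the artifact manager's snapshot is available; a sketch using only the standard library:
```python
import difflib
from pathlib import Path

def check_mutation_budget(snapshot: dict[str, tuple[str, str]],
                          allowed_file_types: list[str],
                          max_changed_lines: int) -> None:
    """Reject candidate edits that exceed the declared mutation budget."""
    changed = 0
    for path, (_, old_text) in snapshot.items():
        new_text = Path(path).read_text()
        if new_text == old_text:
            continue
        if Path(path).suffix not in allowed_file_types:
            raise ValueError(f"edit touches disallowed file type: {path}")
        diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(), lineterm="")
        changed += sum(1 for line in diff
                       if line.startswith(("+", "-"))
                       and not line.startswith(("+++", "---")))
    if changed > max_changed_lines:
        raise ValueError(f"{changed} changed lines exceeds budget of {max_changed_lines}")
```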
### `runner`
Responsibilities:
- Execute the candidate task
- Capture exit code, runtime, and logs
- Emit a structured run result
It does not decide quality.
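A runner sketch using `subprocess`, capturing only what the lifecycle needs:
```python
import subprocess
import time
from dataclasses import dataclass

@dataclass
class RunResult:
    exit_code: int
    runtime_seconds: float
    stdout: str
    stderr: str

def run_candidate(command: str, cwd: str, timeout_seconds: int) -> RunResult:
    """Execute the task command and capture exit code, runtime, and logs."""
    start = time.monotonic()
    try:
        proc = subprocess.run(command, shell=True, cwd=cwd, capture_output=True,
                              text=True, timeout=timeout_seconds)
        exit_code, out, err = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, out, err = -1, "", f"timed out after {timeout_seconds}s"
    return RunResult(exit_code, time.monotonic() - start, out, err)
```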
### `scorer`
Responsibilities:
- Convert raw outputs into structured metrics
- Emit the primary metric, additional metrics, and constraint data
V1 should support:
- `command` scorer
- Optional later extension to `python_plugin`
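A `command` scorer can stay very small: run the scoring command, parse its JSON stdout, and pull out the configured fields. A sketch under those assumptions:
```python
import json
import subprocess

def score_candidate(command: str, score_field: str, metrics_field: str) -> dict:
    """Run the scoring command and parse its JSON output into structured metrics."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, check=True)
    payload = json.loads(proc.stdout)
    return {
        "score": payload[score_field],
        "metrics": payload.get(metrics_field, {}),
    }
```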
### `decision_engine`
Responsibilities:
- Compare candidate vs baseline
- Apply objective and policy rules
- Emit `keep`, `discard`, or `crash`
It should consume only structured results, not raw logs.
## Directory Layout
Suggested V1 structure:
```text
engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py
tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md
scripts/
  run_task.py
  score_task.py
work/
  runs/
  candidates/
  logs/
  results.jsonl
```
This structure separates reusable engine code from per-task resources and runtime state.
## Minimal V1 Task
The first task should prove the engine on skill and prompt optimization, because it directly matches the desired use case and evaluates quickly.
Recommended first task:
- Optimize one `SKILL.md`
- Score it against a rubric
- Enforce structural constraints
- Keep only strictly better versions
Suggested output metrics:
- `primary_score`
- `clarity`
- `coverage`
- `constraint_obedience`
- `violation_count`
- `length_tokens`
This is enough to validate the engine without requiring GPU training or long-running experiments.
## Guardrails
V1 should explicitly include the following guardrails:
- Limit each iteration to a small number of files
- Limit total changed lines per iteration
- Reject edits outside declared artifact boundaries
- Require structured scorer output
- Evaluate constraints before comparing scores
- Log every accepted or discarded candidate with a diff summary
- Default to strict improvement only
- Avoid hidden cross-task memory; share only structured historical results
These controls are necessary to keep the optimization loop safe, explainable, and reusable.
## Recommended Implementation Strategy
Implement the system in two phases.
### Phase 1
Build the smallest complete loop:
- Task loader
- Artifact manager
- Runner
- Scorer
- Decision engine
- JSONL logging
Use a simple `skill-quality` task to prove the loop works.
### Phase 2
Add bounded artifact mutation:
- Agent-driven edit generation
- Mutation budget enforcement
- Candidate restoration on discard
- Better diff summaries and diagnostics
This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation.
## Open Questions Deferred
The following topics are intentionally deferred from V1:
- Multi-objective Pareto search
- Branch-per-candidate workflows
- Distributed or parallel experiment scheduling
- Task-to-task transfer memory
- Human-in-the-loop review checkpoints
- Rich plugin ecosystems
These can be added later if V1 proves useful.
## Recommendation
Proceed with a minimal `Artifact Loop Engine` centered on declarative task specs and strict structured scoring.
Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.