Artifact Loop Engine Design
Date: 2026-04-02
Summary
This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.
The engine does not optimize arbitrary real-world processes directly. It optimizes a closed loop:
- Select allowed text artifacts.
- Generate a candidate edit.
- Run a scripted task.
- Score the result with a structured evaluator.
- Keep or discard the candidate.
- Log the outcome.
The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.
Problem Statement
The current repository assumes a narrow workflow:
- One primary editable file: `train.py`
- One execution command: `uv run train.py`
- One primary metric: `val_bpb`
- One decision rule: lower is better
That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:
- Improving a `SKILL.md` against a rubric
- Improving a prompt template against evaluator scores
- Tuning a small config file against runtime or accuracy metrics
- Improving a small code path against a structured test score
The new system should generalize the optimization loop without becoming a full autonomous project platform.
Goals
- Optimize editable text artifacts: code, config, prompt, and skill files
- Support any task that can be executed by a script and scored as structured metrics
- Keep edits bounded, reversible, and easy to audit
- Separate task definition from engine logic
- Make the first version usable for skill/prompt optimization
Non-Goals
- General-purpose multi-agent software development
- Optimization of arbitrary external systems without a scripted evaluation loop
- Image, binary, database, or manual-interaction optimization
- True multi-objective Pareto optimization in V1
- Unbounded edits across an entire repository
System Boundary
The engine optimizes text artifacts only. A task must define:
- What files may be edited
- How to run a candidate
- How to score a candidate
- What constraints must hold
- How to decide whether a candidate is better
If a task cannot define these precisely, it is outside the engine boundary.
Core Model
Each optimization task is represented by a task specification.
The engine operates on four core abstractions:
- `artifact`: the editable files
- `runner`: the command that executes a candidate
- `scorer`: the component that converts outputs into structured metrics
- `policy`: the decision logic for keep, discard, or crash
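One way to make these abstractions concrete is to model the data they exchange. The following Python dataclasses are a minimal sketch; the names and fields are illustrative, not a fixed API:

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """An editable file under optimization."""
    path: str
    baseline_hash: str  # content hash recorded at snapshot time


@dataclass
class RunResult:
    """Raw output of executing a candidate via the runner."""
    exit_code: int
    runtime_seconds: float
    stdout: str
    stderr: str


@dataclass
class ScoreResult:
    """Structured output of the scorer."""
    primary: float
    metrics: dict = field(default_factory=dict)


@dataclass
class Decision:
    """Outcome of the decision policy."""
    verdict: str  # "keep", "discard", or "crash"
    reason: str
```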
Experiment Lifecycle
Each iteration follows a fixed eight-stage lifecycle:
1. `load_task` - Parse and validate the task specification.
2. `snapshot_baseline` - Record current artifact hashes and the current best result.
3. `mutate_artifacts` - Generate a bounded candidate edit within the allowed files and change budget.
4. `run_candidate` - Execute the runner command and capture logs, exit code, and runtime.
5. `score_candidate` - Produce structured output containing a primary score and supporting metrics.
6. `validate_constraints` - Evaluate hard constraints before comparing the primary score.
7. `decide` - Apply the decision policy and classify the candidate as `keep`, `discard`, or `crash`.
8. `log_iteration` - Persist the result, metrics, decision, and diff summary.
This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
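As a control-flow sketch, one iteration might look like the following. The `engine` object bundles the stage implementations described later; its method names mirror the stages above and are assumptions, not a fixed interface:

```python
def run_iteration(engine, task_path, state):
    """One pass through the eight-stage lifecycle (illustrative skeleton)."""
    spec = engine.load_task(task_path)                # 1. load_task
    baseline = engine.snapshot_baseline(spec)         # 2. snapshot_baseline
    diff = engine.mutate_artifacts(spec, state)       # 3. mutate_artifacts
    run = engine.run_candidate(spec)                  # 4. run_candidate
    if run.exit_code != 0:
        verdict = "crash"
    else:
        score = engine.score_candidate(spec, run)     # 5. score_candidate
        failures = engine.validate_constraints(spec, score)  # 6. validate_constraints
        verdict = engine.decide(spec, state.best, score, failures)  # 7. decide
    if verdict != "keep":
        engine.restore(baseline)  # roll artifacts back on discard or crash
    engine.log_iteration(spec, diff, verdict)         # 8. log_iteration
    return verdict
```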
Scoring Model
V1 uses a conservative scoring model:
- One primary metric
- Zero or more hard constraints
- Zero or more tie-breakers
Primary Metric
The primary metric determines the main notion of improvement.
Examples:
- `score`
- `f1`
- `mAP50_95`
- `test_pass_rate`
- `proposal_rubric_score`
The objective must define whether the metric should be minimized or maximized.
Constraints
Constraints are pass/fail conditions that block acceptance regardless of the primary score.
Examples:
- `violation_count <= 0`
- `runtime_seconds <= 300`
- `length_tokens <= 2500`
- `required_sections_present == true`
Constraints protect the system from reward hacking and invalid outputs.
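Constraints of this shape can be evaluated mechanically before any score comparison. A small sketch, assuming the scorer returns a flat metrics dict and constraints arrive as the `metric`/`op`/`value` triples shown above:

```python
import operator

# Comparison operators keyed by the strings used in the task spec.
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def failed_constraints(constraints, metrics):
    """Return the constraints that do not hold for the candidate's metrics.

    `constraints` is a list of dicts like
    {"metric": "length_tokens", "op": "<=", "value": 2500}.
    """
    failures = []
    for c in constraints:
        value = metrics.get(c["metric"])
        # A missing metric conservatively counts as a failure.
        if value is None or not OPS[c["op"]](value, c["value"]):
            failures.append(c)
    return failures
```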
Tie-Breakers
Tie-breakers apply only when the primary metric is tied, or so close that the primary score alone cannot justify a change.
Examples:
- Lower token count
- Lower runtime
- Fewer violations
- Lower cost
Reward Hacking Risks
Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:
- Length hacking
- Keyword stuffing
- Overfitting to a single judge
- Passing the judge while violating structural requirements
For this reason, the scorer should combine:
- Rubric-based judging
- Rule-based checks
- Format validation
- Optional multi-judge averaging
The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.
Decision Policy
V1 keeps the decision policy intentionally strict:
- If the runner crashes, mark as `crash`
- If any hard constraint fails, mark as `discard`
- If the candidate improves the primary metric, mark as `keep`
- If the primary metric is tied, keep only if tie-breakers are clearly better
- Otherwise, mark as `discard`
The default philosophy is to retain only clear improvements.
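This policy is small enough to express directly. A sketch for a maximize objective, using an illustrative epsilon to define a tie:

```python
def decide(crashed, constraint_failures, candidate, baseline,
           tie_breakers, eps=1e-9):
    """Classify a candidate per the strict V1 policy.

    `candidate` and `baseline` are metrics dicts with a "primary" entry;
    `tie_breakers` is an ordered list of metric names where lower is better.
    """
    if crashed:
        return "crash"
    if constraint_failures:
        return "discard"
    delta = candidate["primary"] - baseline["primary"]  # negate for minimize
    if delta > eps:
        return "keep"
    if abs(delta) <= eps:
        # Tied on the primary metric: keep only if every tie-breaker is
        # at least as good and at least one is strictly better.
        no_worse = all(candidate[m] <= baseline[m] for m in tie_breakers)
        strictly_better = any(candidate[m] < baseline[m] for m in tie_breakers)
        if no_worse and strictly_better:
            return "keep"
    return "discard"
```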
Task Specification
Each task lives in its own directory and includes a declarative task definition file.
Example:
```yaml
id: skill-quality
description: Optimize a skill markdown file against a rubric.

artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120

runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300

scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"

objective:
  primary_metric: score
  direction: maximize
  constraints:
    - metric: violation_count
      op: "<="
      value: 0
    - metric: length_tokens
      op: "<="
      value: 2500

policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard

budget:
  max_iterations: 50
  max_failures: 10

logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"
```
Core Modules
V1 should be implemented with six focused modules.
task_loader
Responsibilities:
- Parse task files
- Validate required fields
- Produce normalized internal models
It should not execute commands or mutate files.
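A minimal sketch, assuming the task definition file is YAML parsed with PyYAML; the required-section list is illustrative:

```python
import yaml  # assumes PyYAML is installed

REQUIRED_SECTIONS = ["id", "artifacts", "runner", "scorer", "objective", "policy"]


def load_task(path):
    """Parse a task file and validate required fields.

    Deliberately performs no execution and no file mutation;
    it only produces a normalized spec.
    """
    with open(path) as f:
        spec = yaml.safe_load(f)
    missing = [k for k in REQUIRED_SECTIONS if k not in spec]
    if missing:
        raise ValueError(f"task spec missing required sections: {missing}")
    if spec["objective"].get("direction") not in ("minimize", "maximize"):
        raise ValueError("objective.direction must be 'minimize' or 'maximize'")
    return spec
```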
artifact_manager
Responsibilities:
- Resolve allowed files
- Snapshot baseline contents and hashes
- Produce diff summaries
- Restore discarded candidates
It owns file-level safety and reversibility.
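A sketch of snapshot and restore using content hashes (the structure of the baseline record is illustrative):

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Record baseline contents and hashes so a discarded candidate can be restored."""
    baseline = {}
    for p in paths:
        data = Path(p).read_bytes()
        baseline[p] = {
            "sha256": hashlib.sha256(data).hexdigest(),
            "content": data,
        }
    return baseline


def restore(baseline):
    """Roll every tracked artifact back to its snapshotted content."""
    for p, entry in baseline.items():
        Path(p).write_bytes(entry["content"])
```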
mutation_engine
Responsibilities:
- Instruct the agent to edit artifacts
- Enforce mutation budget limits
- Reject edits outside allowed paths or file types
It is a bounded editor, not an open-ended autonomous developer.
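A sketch of the boundary checks, assuming the engine has already resolved the allowed path set from the spec's include/exclude patterns and counted changed lines across the candidate's diff:

```python
from pathlib import Path


def validate_edit(edit_path, allowed_paths, allowed_types,
                  changed_lines, max_changed_lines):
    """Reject a proposed edit that leaves the declared mutation budget."""
    path = Path(edit_path)
    if str(path) not in allowed_paths:
        raise PermissionError(f"{path} is outside the declared artifact set")
    if path.suffix not in allowed_types:
        raise PermissionError(f"{path.suffix} is not an allowed file type")
    if changed_lines > max_changed_lines:
        raise ValueError(
            f"edit changes {changed_lines} lines, budget is {max_changed_lines}"
        )
```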
runner
Responsibilities:
- Execute the candidate task
- Capture exit code, runtime, and logs
- Emit a structured run result
It does not decide quality.
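A sketch using `subprocess.run` with the spec's timeout; the shape of the returned result is illustrative:

```python
import subprocess
import time


def run_candidate(command, cwd=".", timeout_seconds=300):
    """Execute the runner command and return a structured run result."""
    start = time.monotonic()
    try:
        proc = subprocess.run(command, shell=True, cwd=cwd,
                              capture_output=True, text=True,
                              timeout=timeout_seconds)
        exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, stdout, stderr = -1, "", f"timed out after {timeout_seconds}s"
    return {
        "exit_code": exit_code,
        "runtime_seconds": time.monotonic() - start,
        "stdout": stdout,
        "stderr": stderr,
    }
```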
scorer
Responsibilities:
- Convert raw outputs into structured metrics
- Emit the primary metric, additional metrics, and constraint data
V1 should support:
- A `command` scorer
- An optional later extension to a `python_plugin` scorer
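A sketch of the `command` scorer, wired to the `parse` block shown in the task-spec example (the field-name defaults come from that example):

```python
import json
import subprocess


def score_with_command(command, score_field="score", metrics_field="metrics"):
    """Run a scorer command and parse its JSON stdout into (primary, metrics)."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, check=True)
    payload = json.loads(proc.stdout)
    return payload[score_field], payload.get(metrics_field, {})
```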
decision_engine
Responsibilities:
- Compare candidate vs baseline
- Apply objective and policy rules
- Emit `keep`, `discard`, or `crash`
It should consume only structured results, not raw logs.
Directory Layout
Suggested V1 structure:
```
engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py
tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md
    scripts/
      run_task.py
      score_task.py
work/
  runs/
  candidates/
  logs/
  results.jsonl
```
This structure separates reusable engine code from per-task resources and runtime state.
Minimal V1 Task
The first task should prove the engine on skill/prompt optimization, because it directly matches the desired use case and has fast evaluation.
Recommended first task:
- Optimize one `SKILL.md`
- Score it against a rubric
- Enforce structural constraints
- Keep only strictly better versions
Suggested output metrics:
- `primary_score`
- `clarity`
- `coverage`
- `constraint_obedience`
- `violation_count`
- `length_tokens`
This is enough to validate the engine without requiring GPU training or long-running experiments.
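For concreteness, an illustrative scorer payload for this task, matching the `parse` block in the spec example (all values are made up):

```json
{
  "score": 0.82,
  "metrics": {
    "primary_score": 0.82,
    "clarity": 0.9,
    "coverage": 0.8,
    "constraint_obedience": 1.0,
    "violation_count": 0,
    "length_tokens": 1980
  }
}
```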
Guardrails
V1 should explicitly include the following guardrails:
- Limit each iteration to a small number of files
- Limit total changed lines per iteration
- Reject edits outside declared artifact boundaries
- Require structured scorer output
- Evaluate constraints before comparing scores
- Log every accepted or discarded candidate with a diff summary
- Default to strict improvement only
- Avoid hidden cross-task memory; share only structured historical results
These controls are necessary to keep the optimization loop safe, explainable, and reusable.
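As one hypothetical example of the logging guardrail, a `results.jsonl` record might look like this (every field and value here is illustrative, not a fixed schema):

```json
{"iteration": 7, "decision": "keep", "primary": 0.82, "baseline_primary": 0.79, "metrics": {"violation_count": 0, "length_tokens": 1980}, "diff_summary": "skills/writing/SKILL.md: +14 -9 lines", "runtime_seconds": 41.3}
```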
Recommended Implementation Strategy
Implement the system in two phases.
Phase 1
Build the smallest complete loop:
- Task loader
- Artifact manager
- Runner
- Scorer
- Decision engine
- JSONL logging
Use a simple skill-quality task to prove the loop works.
Phase 2
Add bounded artifact mutation:
- Agent-driven edit generation
- Mutation budget enforcement
- Candidate restoration on discard
- Better diff summaries and diagnostics
This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation.
Open Questions Deferred
The following topics are intentionally deferred from V1:
- Multi-objective Pareto search
- Branch-per-candidate workflows
- Distributed or parallel experiment scheduling
- Task-to-task transfer memory
- Human-in-the-loop review checkpoints
- Rich plugin ecosystems
These can be added later if V1 proves useful.
Recommendation
Proceed with a minimal Artifact Loop Engine centered on declarative task specs and strict structured scoring.
Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.