
Artifact Loop Engine Design

Date: 2026-04-02

Summary

This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.

The engine does not optimize arbitrary real-world processes directly. It optimizes a closed loop:

  1. Select allowed text artifacts.
  2. Generate a candidate edit.
  3. Run a scripted task.
  4. Score the result with a structured evaluator.
  5. Keep or discard the candidate.
  6. Log the outcome.

The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.

Problem Statement

The current repository assumes a narrow workflow:

  • One primary editable file: train.py
  • One execution command: uv run train.py
  • One primary metric: val_bpb
  • One decision rule: lower is better

That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:

  • Improving a SKILL.md against a rubric
  • Improving a prompt template against evaluator scores
  • Tuning a small config file against runtime or accuracy metrics
  • Improving a small code path against a structured test score

The new system should generalize the optimization loop without becoming a full autonomous project platform.

Goals

  • Optimize editable text artifacts: code, config, prompt, and skill files
  • Support any task that can be executed by a script and scored as structured metrics
  • Keep edits bounded, reversible, and easy to audit
  • Separate task definition from engine logic
  • Make the first version usable for skill/prompt optimization

Non-Goals

  • General-purpose multi-agent software development
  • Optimization of arbitrary external systems without a scripted evaluation loop
  • Image, binary, database, or manual-interaction optimization
  • True multi-objective Pareto optimization in V1
  • Unbounded edits across an entire repository

System Boundary

The engine optimizes text artifacts only. A task must define:

  • What files may be edited
  • How to run a candidate
  • How to score a candidate
  • What constraints must hold
  • How to decide whether a candidate is better

If a task cannot define these precisely, it is outside the engine boundary.

Core Model

Each optimization task is represented by a task specification.

The engine operates on four core abstractions:

  • artifact: the editable files
  • runner: the command that executes a candidate
  • scorer: the component that converts outputs into structured metrics
  • policy: the decision logic for keep, discard, or fail
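
A minimal sketch of these abstractions as Python dataclasses; the names and fields below are illustrative rather than a fixed API:

from dataclasses import dataclass, field

@dataclass
class ArtifactSet:
    include: list[str]                       # glob patterns for editable files
    exclude: list[str] = field(default_factory=list)
    max_files_per_iteration: int = 1

@dataclass
class RunnerSpec:
    command: str                             # command that executes a candidate
    cwd: str = "."
    timeout_seconds: int = 300

@dataclass
class ScorerSpec:
    command: str                             # command that emits structured metrics
    score_field: str = "score"
    metrics_field: str = "metrics"

@dataclass
class PolicySpec:
    primary_metric: str
    direction: str = "maximize"              # or "minimize"
    tie_breakers: list[str] = field(default_factory=list)
    on_failure: str = "discard"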

Experiment Lifecycle

Each iteration follows a fixed eight-stage lifecycle:

  1. load_task: Parse and validate the task specification.
  2. snapshot_baseline: Record current artifact hashes and the current best result.
  3. mutate_artifacts: Generate a bounded candidate edit within the allowed files and change budget.
  4. run_candidate: Execute the runner command and capture logs, exit code, and runtime.
  5. score_candidate: Produce structured output containing a primary score and supporting metrics.
  6. validate_constraints: Evaluate hard constraints before comparing the primary score.
  7. decide: Apply the decision policy and classify the candidate as keep, discard, or crash.
  8. log_iteration: Persist the result, metrics, decision, and diff summary.

This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
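
As an illustration only, the fixed lifecycle could be driven by a loop such as the one below, assuming a hypothetical engine object that bundles the module callables described later in this document:

def run_iteration(engine, task):
    # One pass through the eight-stage lifecycle (load_task happens once, before iterating).
    baseline = engine.snapshot_baseline(task)              # stage 2
    diff = engine.mutate_artifacts(task)                   # stage 3: bounded candidate edit
    run = engine.run_candidate(task)                       # stage 4: logs, exit code, runtime
    if run["exit_code"] != 0:
        engine.log_iteration(task, decision="crash", diff=diff, run=run)
        engine.restore(baseline)
        return "crash"
    result = engine.score_candidate(task, run)             # stage 5: structured metrics
    if not engine.validate_constraints(task, result):      # stage 6: hard constraints first
        decision = "discard"
    else:
        decision = engine.decide(task, baseline, result)   # stage 7: keep or discard
    engine.log_iteration(task, decision=decision, diff=diff, result=result)  # stage 8
    if decision != "keep":
        engine.restore(baseline)
    return decision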

Scoring Model

V1 uses a conservative scoring model:

  • One primary metric
  • Zero or more hard constraints
  • Zero or more tie-breakers

Primary Metric

The primary metric determines the main notion of improvement.

Examples:

  • score
  • f1
  • mAP50_95
  • test_pass_rate
  • proposal_rubric_score

The objective must define whether the metric should be minimized or maximized.

Constraints

Constraints are pass/fail conditions that block acceptance regardless of the primary score.

Examples:

  • violation_count <= 0
  • runtime_seconds <= 300
  • length_tokens <= 2500
  • required_sections_present == true

Constraints protect the system from reward hacking and invalid outputs.
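
A sketch of how such constraints could be checked against the scorer's metrics, using the metric/op/value form from the task specification shown later in this document:

def check_constraints(metrics: dict, constraints: list) -> list:
    # Returns the names of failed constraints; an empty list means all hard constraints pass.
    ops = {
        "<=": lambda a, b: a <= b,
        ">=": lambda a, b: a >= b,
        "==": lambda a, b: a == b,
        "<":  lambda a, b: a < b,
        ">":  lambda a, b: a > b,
    }
    failed = []
    for c in constraints:
        value = metrics.get(c["metric"])
        if value is None or not ops[c["op"]](value, c["value"]):
            failed.append(c["metric"])
    return failed

With metrics such as {"violation_count": 0, "length_tokens": 1840}, the example constraints above pass; a metric missing from the scorer output is treated as a failure.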

Tie-Breakers

Tie-breakers apply only when the primary metric is equal or too close to justify a change on primary score alone.

Examples:

  • Lower token count
  • Lower runtime
  • Fewer violations
  • Lower cost

Reward Hacking Risks

Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:

  • Length hacking
  • Keyword stuffing
  • Overfitting to a single judge
  • Passing the judge while violating structural requirements

For this reason, the scorer should combine:

  • Rubric-based judging
  • Rule-based checks
  • Format validation
  • Optional multi-judge averaging

The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.
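
As a sketch of that combination, the rule-based portion might look like the check below; the section names and token proxy are illustrative, and the primary score would still come from the rubric judge:

import re

def rule_checks(skill_text: str) -> dict:
    # Cheap, scriptable checks that back hard constraints independently of any judge model.
    required_sections = ["# ", "## When to use", "## Steps"]   # illustrative headings only
    missing = [h for h in required_sections if h not in skill_text]
    length_tokens = len(re.findall(r"\S+", skill_text))        # crude whitespace token proxy
    return {
        "violation_count": len(missing),
        "length_tokens": length_tokens,
        "required_sections_present": not missing,
    }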

Decision Policy

V1 keeps the decision policy intentionally strict:

  • If the runner crashes, mark as crash
  • If any hard constraint fails, mark as discard
  • If the candidate improves the primary metric, mark as keep
  • If the primary metric is tied, keep only if tie-breakers are clearly better
  • Otherwise mark as discard

The default philosophy is to retain only clear improvements.
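
A minimal sketch of this policy as a single function, assuming candidate and baseline results are dicts with a primary value and lower-is-better tie-breaker metrics (all names illustrative):

def decide(candidate, baseline, direction="maximize", tie_breakers=(), eps=1e-9):
    if candidate.get("crashed"):
        return "crash"
    if not candidate.get("constraints_ok", False):
        return "discard"
    sign = 1 if direction == "maximize" else -1
    delta = sign * (candidate["primary"] - baseline["primary"])
    if delta > eps:
        return "keep"                                  # clear improvement on the primary metric
    if abs(delta) <= eps and tie_breakers:
        # Tied on primary: keep only if no tie-breaker is worse and at least one is better.
        gains = [baseline[m] - candidate[m] for m in tie_breakers]
        if all(g >= 0 for g in gains) and any(g > 0 for g in gains):
            return "keep"
    return "discard"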

Task Specification

Each task lives in its own directory and includes a declarative task definition file.

Example:

id: skill-quality
description: Optimize a skill markdown file against a rubric.

artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120

runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300

scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"

objective:
  primary_metric: score
  direction: maximize

constraints:
  - metric: violation_count
    op: "<="
    value: 0
  - metric: length_tokens
    op: "<="
    value: 2500

policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard

budget:
  max_iterations: 50
  max_failures: 10

logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"

Core Modules

V1 should be implemented with six focused modules.

task_loader

Responsibilities:

  • Parse task files
  • Validate required fields
  • Produce normalized internal models

It should not execute commands or mutate files.
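
A sketch of the loader's shape, assuming task definitions are YAML files and PyYAML is available:

import yaml

REQUIRED_SECTIONS = ["id", "artifacts", "runner", "scorer", "objective", "policy"]

def load_task(path: str) -> dict:
    with open(path) as f:
        spec = yaml.safe_load(f)
    missing = [k for k in REQUIRED_SECTIONS if k not in spec]
    if missing:
        raise ValueError(f"{path}: missing required sections {missing}")
    if spec["objective"].get("direction") not in ("minimize", "maximize"):
        raise ValueError(f"{path}: objective.direction must be 'minimize' or 'maximize'")
    return spec   # normalization into internal models would follow here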

artifact_manager

Responsibilities:

  • Resolve allowed files
  • Snapshot baseline contents and hashes
  • Produce diff summaries
  • Restore discarded candidates

It owns file-level safety and reversibility.
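
A minimal snapshot/restore sketch built on content hashes; the paths would come from the resolved artifact list, and this is not a prescribed implementation:

import hashlib
from pathlib import Path

def snapshot(paths):
    # Record both content and hash so discarded candidates can be restored exactly.
    snap = {}
    for p in paths:
        data = Path(p).read_bytes()
        snap[p] = {"sha256": hashlib.sha256(data).hexdigest(), "content": data}
    return snap

def restore(snap):
    for p, entry in snap.items():
        Path(p).write_bytes(entry["content"])

def changed_files(snap):
    return [p for p, e in snap.items()
            if hashlib.sha256(Path(p).read_bytes()).hexdigest() != e["sha256"]]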

mutation_engine

Responsibilities:

  • Instruct the agent to edit artifacts
  • Enforce mutation budget limits
  • Reject edits outside allowed paths or file types

It is a bounded editor, not an open-ended autonomous developer.
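
A sketch of enforcing the mutation budget against a proposed candidate, assuming the edit is available as a unified diff plus the list of touched paths:

def check_mutation_budget(diff_text, changed_paths, allowed_paths, allowed_types,
                          max_files=2, max_changed_lines=120):
    if len(changed_paths) > max_files:
        return False, "too many files changed"
    for path in changed_paths:
        if path not in allowed_paths:
            return False, f"edit outside allowed artifacts: {path}"
        if not any(path.endswith(ext) for ext in allowed_types):
            return False, f"disallowed file type: {path}"
    # Count added/removed lines in the unified diff, ignoring the +++/--- file headers.
    changed = sum(1 for line in diff_text.splitlines()
                  if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))
    if changed > max_changed_lines:
        return False, f"{changed} changed lines exceed budget of {max_changed_lines}"
    return True, "ok"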

runner

Responsibilities:

  • Execute the candidate task
  • Capture exit code, runtime, and logs
  • Emit a structured run result

It does not decide quality.
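
A sketch of the runner built on the standard library, with the command, working directory, and timeout taken from the task spec:

import shlex
import subprocess
import time

def run_candidate(command: str, cwd: str = ".", timeout_seconds: int = 300) -> dict:
    start = time.monotonic()
    try:
        proc = subprocess.run(shlex.split(command), cwd=cwd, capture_output=True,
                              text=True, timeout=timeout_seconds)
        exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, stdout, stderr = -1, "", "timed out"
    return {
        "exit_code": exit_code,
        "runtime_seconds": round(time.monotonic() - start, 3),
        "stdout": stdout,
        "stderr": stderr,
    }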

scorer

Responsibilities:

  • Convert raw outputs into structured metrics
  • Emit the primary metric, additional metrics, and constraint data

V1 should support:

  • command scorer
  • Optional later extension to python_plugin
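
A sketch of the command scorer, assuming the scorer command prints one JSON object on stdout and the parse fields come from the task spec:

import json
import shlex
import subprocess

def score_candidate(command: str, score_field="score", metrics_field="metrics") -> dict:
    proc = subprocess.run(shlex.split(command), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"scorer failed: {proc.stderr.strip()}")
    payload = json.loads(proc.stdout)
    return {
        "primary": payload[score_field],
        "metrics": payload.get(metrics_field, {}),
    }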

decision_engine

Responsibilities:

  • Compare candidate vs baseline
  • Apply objective and policy rules
  • Emit keep, discard, or crash

It should consume only structured results, not raw logs.

Directory Layout

Suggested V1 structure:

engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py

tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md

scripts/
  run_task.py
  score_task.py

work/
  runs/
  candidates/
  logs/
  results.jsonl

This structure separates reusable engine code from per-task resources and runtime state.

Minimal V1 Task

The first task should prove the engine on skill/prompt optimization, because it directly matches the desired use case and has fast evaluation.

Recommended first task:

  • Optimize one SKILL.md
  • Score it against a rubric
  • Enforce structural constraints
  • Keep only strictly better versions

Suggested output metrics:

  • primary_score
  • clarity
  • coverage
  • constraint_obedience
  • violation_count
  • length_tokens

This is enough to validate the engine without requiring GPU training or long-running experiments.
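
To make the logging concrete, a single results.jsonl record for this task might look like the following; every value is illustrative, and a real record occupies one line:

{"iteration": 7, "decision": "keep", "primary_score": 0.84,
 "metrics": {"clarity": 0.9, "coverage": 0.8, "constraint_obedience": 1.0,
             "violation_count": 0, "length_tokens": 1910},
 "runtime_seconds": 41.2, "changed_files": ["skills/writing/SKILL.md"],
 "diff_summary": "+18/-9 lines"}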

Guardrails

V1 should explicitly include the following guardrails:

  • Limit each iteration to a small number of files
  • Limit total changed lines per iteration
  • Reject edits outside declared artifact boundaries
  • Require structured scorer output
  • Evaluate constraints before comparing scores
  • Log every accepted or discarded candidate with a diff summary
  • Default to strict improvement only
  • Avoid hidden cross-task memory; share only structured historical results

These controls are necessary to keep the optimization loop safe, explainable, and reusable.

Implementation Phases

Implement the system in two phases.

Phase 1

Build the smallest complete loop:

  • Task loader
  • Artifact manager
  • Runner
  • Scorer
  • Decision engine
  • JSONL logging

Use a simple skill-quality task to prove the loop works.

Phase 2

Add bounded artifact mutation:

  • Agent-driven edit generation
  • Mutation budget enforcement
  • Candidate restoration on discard
  • Better diff summaries and diagnostics

This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation.

Open Questions Deferred

The following topics are intentionally deferred from V1:

  • Multi-objective Pareto search
  • Branch-per-candidate workflows
  • Distributed or parallel experiment scheduling
  • Task-to-task transfer memory
  • Human-in-the-loop review checkpoints
  • Rich plugin ecosystems

These can be added later if V1 proves useful.

Recommendation

Proceed with a minimal Artifact Loop Engine centered on declarative task specs and strict structured scoring.

Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.