
Artifact Loop Engine Design

Date: 2026-04-02

Summary

This document defines a minimal, general-purpose optimization engine for editable text artifacts such as code, config, prompt, and skill files.

The engine does not optimize arbitrary real-world processes directly. It optimizes a closed loop:

  1. Select allowed text artifacts.
  2. Generate a candidate edit.
  3. Run a scripted task.
  4. Score the result with a structured evaluator.
  5. Keep or discard the candidate.
  6. Log the outcome.

The goal is to generalize the current repository from a single-file training-script optimizer into a reusable artifact optimization framework while keeping the scope bounded and auditable.

Problem Statement

The current repository assumes a narrow workflow:

  • One primary editable file: train.py
  • One execution command: uv run train.py
  • One primary metric: val_bpb
  • One decision rule: lower is better

That design is effective for fixed-budget language model experiments, but it does not generalize cleanly to tasks such as:

  • Improving a SKILL.md against a rubric
  • Improving a prompt template against evaluator scores
  • Tuning a small config file against runtime or accuracy metrics
  • Improving a small code path against a structured test score

The new system should generalize the optimization loop without becoming a full autonomous project platform.

Goals

  • Optimize editable text artifacts: code, config, prompt, and skill files
  • Support any task that can be executed by a script and scored as structured metrics
  • Keep edits bounded, reversible, and easy to audit
  • Separate task definition from engine logic
  • Make the first version usable for skill/prompt optimization

Non-Goals

  • General-purpose multi-agent software development
  • Optimization of arbitrary external systems without a scripted evaluation loop
  • Image, binary, database, or manual-interaction optimization
  • True multi-objective Pareto optimization in V1
  • Unbounded edits across an entire repository

System Boundary

The engine optimizes text artifacts only. A task must define:

  • What files may be edited
  • How to run a candidate
  • How to score a candidate
  • What constraints must hold
  • How to decide whether a candidate is better

If a task cannot define these precisely, it is outside the engine boundary.

Core Model

Each optimization task is represented by a task specification.

The engine operates on four core abstractions:

  • artifact: the editable files
  • runner: the command that executes a candidate
  • scorer: the component that converts outputs into structured metrics
  • policy: the decision logic for keep, discard, or fail
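
A minimal sketch of these abstractions as Python dataclasses; the names and fields below are illustrative rather than a fixed API:

from dataclasses import dataclass, field

@dataclass
class ArtifactSet:
    include: list[str]                       # glob patterns for editable files
    exclude: list[str] = field(default_factory=list)
    max_files_per_iteration: int = 1

@dataclass
class RunnerSpec:
    command: str                             # command that executes a candidate
    cwd: str = "."
    timeout_seconds: int = 300

@dataclass
class ScorerSpec:
    command: str                             # command that emits structured metrics
    score_field: str = "score"
    metrics_field: str = "metrics"

@dataclass
class PolicySpec:
    primary_metric: str
    direction: str = "maximize"              # or "minimize"
    tie_breakers: list[str] = field(default_factory=list)
    on_failure: str = "discard"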

Experiment Lifecycle

Each iteration follows a fixed eight-stage lifecycle:

  1. load_task: Parse and validate the task specification.
  2. snapshot_baseline: Record current artifact hashes and the current best result.
  3. mutate_artifacts: Generate a bounded candidate edit within the allowed files and change budget.
  4. run_candidate: Execute the runner command and capture logs, exit code, and runtime.
  5. score_candidate: Produce structured output containing a primary score and supporting metrics.
  6. validate_constraints: Evaluate hard constraints before comparing the primary score.
  7. decide: Apply the decision policy and classify the candidate as keep, discard, or crash.
  8. log_iteration: Persist the result, metrics, decision, and diff summary.

This lifecycle is fixed across tasks. Tasks vary only by configuration and optional task-specific runner or scorer implementations.
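
As an illustration only, the fixed lifecycle could be driven by a loop such as the one below, assuming a hypothetical engine object that bundles the module callables described later in this document:

def run_iteration(engine, task):
    # One pass through the eight-stage lifecycle (load_task happens once, before iterating).
    baseline = engine.snapshot_baseline(task)              # stage 2
    diff = engine.mutate_artifacts(task)                   # stage 3: bounded candidate edit
    run = engine.run_candidate(task)                       # stage 4: logs, exit code, runtime
    if run["exit_code"] != 0:
        engine.log_iteration(task, decision="crash", diff=diff, run=run)
        engine.restore(baseline)
        return "crash"
    result = engine.score_candidate(task, run)             # stage 5: structured metrics
    if not engine.validate_constraints(task, result):      # stage 6: hard constraints first
        decision = "discard"
    else:
        decision = engine.decide(task, baseline, result)   # stage 7: keep or discard
    engine.log_iteration(task, decision=decision, diff=diff, result=result)  # stage 8
    if decision != "keep":
        engine.restore(baseline)
    return decision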

Scoring Model

V1 uses a conservative scoring model:

  • One primary metric
  • Zero or more hard constraints
  • Zero or more tie-breakers

Primary Metric

The primary metric determines the main notion of improvement.

Examples:

  • score
  • f1
  • mAP50_95
  • test_pass_rate
  • proposal_rubric_score

The objective must define whether the metric should be minimized or maximized.

Constraints

Constraints are pass/fail conditions that block acceptance regardless of the primary score.

Examples:

  • violation_count <= 0
  • runtime_seconds <= 300
  • length_tokens <= 2500
  • required_sections_present == true

Constraints protect the system from reward hacking and invalid outputs.
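
A sketch of how such constraints could be checked against the scorer's metrics, using the metric/op/value form from the task specification shown later in this document:

def check_constraints(metrics: dict, constraints: list) -> list:
    # Returns the names of failed constraints; an empty list means all hard constraints pass.
    ops = {
        "<=": lambda a, b: a <= b,
        ">=": lambda a, b: a >= b,
        "==": lambda a, b: a == b,
        "<":  lambda a, b: a < b,
        ">":  lambda a, b: a > b,
    }
    failed = []
    for c in constraints:
        value = metrics.get(c["metric"])
        if value is None or not ops[c["op"]](value, c["value"]):
            failed.append(c["metric"])
    return failed

With metrics such as {"violation_count": 0, "length_tokens": 1840}, the example constraints above pass; a metric missing from the scorer output is treated as a failure.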

Tie-Breakers

Tie-breakers apply only when the primary metric is equal or too close to justify a change on primary score alone.

Examples:

  • Lower token count
  • Lower runtime
  • Fewer violations
  • Lower cost

Reward Hacking Risks

Tasks involving prompts, skills, proposals, and other natural-language artifacts are especially vulnerable to reward hacking. V1 must explicitly guard against:

  • Length hacking
  • Keyword stuffing
  • Overfitting to a single judge
  • Passing the judge while violating structural requirements

For this reason, the scorer should combine:

  • Rubric-based judging
  • Rule-based checks
  • Format validation
  • Optional multi-judge averaging

The primary score may come from a model-based evaluator, but constraints should be as rule-based and scriptable as possible.
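
As a sketch of that combination, the rule-based portion might look like the check below; the section names and token proxy are illustrative, and the primary score would still come from the rubric judge:

import re

def rule_checks(skill_text: str) -> dict:
    # Cheap, scriptable checks that back hard constraints independently of any judge model.
    required_sections = ["# ", "## When to use", "## Steps"]   # illustrative headings only
    missing = [h for h in required_sections if h not in skill_text]
    length_tokens = len(re.findall(r"\S+", skill_text))        # crude whitespace token proxy
    return {
        "violation_count": len(missing),
        "length_tokens": length_tokens,
        "required_sections_present": not missing,
    }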

Decision Policy

V1 keeps the decision policy intentionally strict:

  • If the runner crashes, mark as crash
  • If any hard constraint fails, mark as discard
  • If the candidate improves the primary metric, mark as keep
  • If the primary metric is tied, keep only if tie-breakers are clearly better
  • Otherwise mark as discard

The default philosophy is to retain only clear improvements.
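
A minimal sketch of this policy as a single function, assuming candidate and baseline results are dicts with a primary value and lower-is-better tie-breaker metrics (all names illustrative):

def decide(candidate, baseline, direction="maximize", tie_breakers=(), eps=1e-9):
    if candidate.get("crashed"):
        return "crash"
    if not candidate.get("constraints_ok", False):
        return "discard"
    sign = 1 if direction == "maximize" else -1
    delta = sign * (candidate["primary"] - baseline["primary"])
    if delta > eps:
        return "keep"                                  # clear improvement on the primary metric
    if abs(delta) <= eps and tie_breakers:
        # Tied on primary: keep only if no tie-breaker is worse and at least one is better.
        gains = [baseline[m] - candidate[m] for m in tie_breakers]
        if all(g >= 0 for g in gains) and any(g > 0 for g in gains):
            return "keep"
    return "discard"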

Task Specification

Each task lives in its own directory and includes a declarative task definition file.

Example:

id: skill-quality
description: Optimize a skill markdown file against a rubric.

artifacts:
  include:
    - skills/writing/SKILL.md
  exclude:
    - work/**
  max_files_per_iteration: 2

mutation:
  mode: direct_edit
  allowed_file_types:
    - .md
    - .txt
    - .yaml
    - .py
  max_changed_lines: 120

runner:
  command: "python scripts/run_skill_eval.py"
  cwd: "."
  timeout_seconds: 300

scorer:
  type: command
  command: "python scripts/score_skill.py"
  parse:
    format: json
    score_field: "score"
    metrics_field: "metrics"

objective:
  primary_metric: score
  direction: maximize

constraints:
  - metric: violation_count
    op: "<="
    value: 0
  - metric: length_tokens
    op: "<="
    value: 2500

policy:
  keep_if: "better_primary"
  tie_breakers:
    - lower: violation_count
    - lower: length_tokens
  on_failure: discard

budget:
  max_iterations: 50
  max_failures: 10

logging:
  results_file: "work/results.jsonl"
  candidate_dir: "work/candidates"

Core Modules

V1 should be implemented with six focused modules.

task_loader

Responsibilities:

  • Parse task files
  • Validate required fields
  • Produce normalized internal models

It should not execute commands or mutate files.
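
A sketch of the loader's shape, assuming task definitions are YAML files and PyYAML is available:

import yaml

REQUIRED_SECTIONS = ["id", "artifacts", "runner", "scorer", "objective", "policy"]

def load_task(path: str) -> dict:
    with open(path) as f:
        spec = yaml.safe_load(f)
    missing = [k for k in REQUIRED_SECTIONS if k not in spec]
    if missing:
        raise ValueError(f"{path}: missing required sections {missing}")
    if spec["objective"].get("direction") not in ("minimize", "maximize"):
        raise ValueError(f"{path}: objective.direction must be 'minimize' or 'maximize'")
    return spec   # normalization into internal models would follow here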

artifact_manager

Responsibilities:

  • Resolve allowed files
  • Snapshot baseline contents and hashes
  • Produce diff summaries
  • Restore discarded candidates

It owns file-level safety and reversibility.
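
A minimal snapshot/restore sketch built on content hashes; the paths would come from the resolved artifact list, and this is not a prescribed implementation:

import hashlib
from pathlib import Path

def snapshot(paths):
    # Record both content and hash so discarded candidates can be restored exactly.
    snap = {}
    for p in paths:
        data = Path(p).read_bytes()
        snap[p] = {"sha256": hashlib.sha256(data).hexdigest(), "content": data}
    return snap

def restore(snap):
    for p, entry in snap.items():
        Path(p).write_bytes(entry["content"])

def changed_files(snap):
    return [p for p, e in snap.items()
            if hashlib.sha256(Path(p).read_bytes()).hexdigest() != e["sha256"]]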

mutation_engine

Responsibilities:

  • Instruct the agent to edit artifacts
  • Enforce mutation budget limits
  • Reject edits outside allowed paths or file types

It is a bounded editor, not an open-ended autonomous developer.
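
A sketch of enforcing the mutation budget against a proposed candidate, assuming the edit is available as a unified diff plus the list of touched paths:

def check_mutation_budget(diff_text, changed_paths, allowed_paths, allowed_types,
                          max_files=2, max_changed_lines=120):
    if len(changed_paths) > max_files:
        return False, "too many files changed"
    for path in changed_paths:
        if path not in allowed_paths:
            return False, f"edit outside allowed artifacts: {path}"
        if not any(path.endswith(ext) for ext in allowed_types):
            return False, f"disallowed file type: {path}"
    # Count added/removed lines in the unified diff, ignoring the +++/--- file headers.
    changed = sum(1 for line in diff_text.splitlines()
                  if line.startswith(("+", "-")) and not line.startswith(("+++", "---")))
    if changed > max_changed_lines:
        return False, f"{changed} changed lines exceed budget of {max_changed_lines}"
    return True, "ok"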

runner

Responsibilities:

  • Execute the candidate task
  • Capture exit code, runtime, and logs
  • Emit a structured run result

It does not decide quality.
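
A sketch of the runner built on the standard library, with the command, working directory, and timeout taken from the task spec:

import shlex
import subprocess
import time

def run_candidate(command: str, cwd: str = ".", timeout_seconds: int = 300) -> dict:
    start = time.monotonic()
    try:
        proc = subprocess.run(shlex.split(command), cwd=cwd, capture_output=True,
                              text=True, timeout=timeout_seconds)
        exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, stdout, stderr = -1, "", "timed out"
    return {
        "exit_code": exit_code,
        "runtime_seconds": round(time.monotonic() - start, 3),
        "stdout": stdout,
        "stderr": stderr,
    }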

scorer

Responsibilities:

  • Convert raw outputs into structured metrics
  • Emit the primary metric, additional metrics, and constraint data

V1 should support:

  • command scorer
  • Optional later extension to python_plugin
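
A sketch of the command scorer, assuming the scorer command prints one JSON object on stdout and the parse fields come from the task spec:

import json
import shlex
import subprocess

def score_candidate(command: str, score_field="score", metrics_field="metrics") -> dict:
    proc = subprocess.run(shlex.split(command), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"scorer failed: {proc.stderr.strip()}")
    payload = json.loads(proc.stdout)
    return {
        "primary": payload[score_field],
        "metrics": payload.get(metrics_field, {}),
    }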

decision_engine

Responsibilities:

  • Compare candidate vs baseline
  • Apply objective and policy rules
  • Emit keep, discard, or crash

It should consume only structured results, not raw logs.

Directory Layout

Suggested V1 structure:

engine/
  task_loader.py
  artifact_manager.py
  mutation_engine.py
  runner.py
  scorer.py
  decision_engine.py
  models.py

tasks/
  skill-quality/
    task.yaml
    rubric.md
    prompt.md

scripts/
  run_task.py
  score_task.py

work/
  runs/
  candidates/
  logs/
  results.jsonl

This structure separates reusable engine code from per-task resources and runtime state.

Minimal V1 Task

The first task should prove the engine on skill/prompt optimization, because it directly matches the desired use case and has fast evaluation.

Recommended first task:

  • Optimize one SKILL.md
  • Score it against a rubric
  • Enforce structural constraints
  • Keep only strictly better versions

Suggested output metrics:

  • primary_score
  • clarity
  • coverage
  • constraint_obedience
  • violation_count
  • length_tokens

This is enough to validate the engine without requiring GPU training or long-running experiments.
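
To make the logging concrete, a single results.jsonl record for this task might look like the following; every value is illustrative, and a real record occupies one line:

{"iteration": 7, "decision": "keep", "primary_score": 0.84,
 "metrics": {"clarity": 0.9, "coverage": 0.8, "constraint_obedience": 1.0,
             "violation_count": 0, "length_tokens": 1910},
 "runtime_seconds": 41.2, "changed_files": ["skills/writing/SKILL.md"],
 "diff_summary": "+18/-9 lines"}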

Guardrails

V1 should explicitly include the following guardrails:

  • Limit each iteration to a small number of files
  • Limit total changed lines per iteration
  • Reject edits outside declared artifact boundaries
  • Require structured scorer output
  • Evaluate constraints before comparing scores
  • Log every accepted or discarded candidate with a diff summary
  • Default to strict improvement only
  • Avoid hidden cross-task memory; share only structured historical results

These controls are necessary to keep the optimization loop safe, explainable, and reusable.

Implementation Phases

Implement the system in two phases.

Phase 1

Build the smallest complete loop:

  • Task loader
  • Artifact manager
  • Runner
  • Scorer
  • Decision engine
  • JSONL logging

Use a simple skill-quality task to prove the loop works.

Phase 2

Add bounded artifact mutation:

  • Agent-driven edit generation
  • Mutation budget enforcement
  • Candidate restoration on discard
  • Better diff summaries and diagnostics

This sequencing reduces risk by proving the evaluation loop before adding autonomous mutation.

Open Questions Deferred

The following topics are intentionally deferred from V1:

  • Multi-objective Pareto search
  • Branch-per-candidate workflows
  • Distributed or parallel experiment scheduling
  • Task-to-task transfer memory
  • Human-in-the-loop review checkpoints
  • Rich plugin ecosystems

These can be added later if V1 proves useful.

Recommendation

Proceed with a minimal Artifact Loop Engine centered on declarative task specs and strict structured scoring.

Do not generalize immediately to full project-level autonomy. The correct first step is a bounded optimizer for editable text artifacts with a stable evaluation loop.