bunch of small changes to docs and files, and a teaser figure with a blooper :)
parent 032d203695, commit 8a5c4869bd

README.md

@@ -1,25 +1,20 @@
# autoresearch
Autonomous LLM pretraining research, driven by AI agents.
The idea: give an AI agent a small but real LLM training setup and let it run experiments overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model.
*One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026*.
This particular implementation is trying to be the least fancy baseline, but it's clear how one would adjust the `program.md` file to run more sophisticated research programs with more elaborate instructions. For example, the agent can actively run small research experiments while the job is running.
The training code here is a simplified single-GPU implementation of [nanochat](https://github.com/karpathy/nanochat).
A bit more context on this project is here in this [tweet](https://x.com/karpathy/status/2029701092347630069). This code in particular is a simpler, more self-contained version that I thought people might like to play with.
The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of [nanochat](https://github.com/karpathy/nanochat). The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the `program.md` Markdown files that provide context to the AI agents and set up your autonomous research org. The default `program.md` in this repo is intentionally kept as a bare-bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is in this [tweet](https://x.com/karpathy/status/2029701092347630069).
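To make the loop above concrete, here is a minimal sketch of the keep-or-discard cycle. This is not code from the repo: the real loop is carried out by the AI agent following `program.md`, and every helper name, the launch command, and the log format below are hypothetical stand-ins.

```python
import subprocess

def propose_edit() -> None:
    """Hypothetical: the agent edits train.py (architecture, optimizer, hyperparameters, ...)."""

def revert_edit() -> None:
    """Hypothetical: roll train.py back to the last kept version (e.g. via git)."""

def run_training() -> float:
    """Hypothetical: launch one 5-minute training run and return its val_bpb.
    The command and the log format are assumptions for illustration only."""
    out = subprocess.run(["uv", "run", "train.py"], capture_output=True, text=True).stdout
    return float(out.strip().splitlines()[-1].split()[-1])

best_bpb = float("inf")          # lower bits per byte is better
for _ in range(20):              # the number of overnight experiments is arbitrary here
    propose_edit()
    val_bpb = run_training()
    if val_bpb < best_bpb:       # improved: keep the change
        best_bpb = val_bpb
    else:                        # regressed: discard the change
        revert_edit()
```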
## How it works
The repo is deliberately small and only has a few files:
The repo is deliberately kept small and only really has three files that matter:
- **`constants.py`** — fixed rules: sequence length, time budget, eval tokens. Not modified.
- **`prepare.py`** — one-time data prep (downloads training data, trains a BPE tokenizer) and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc.
- **`program.md`** — instructions for the agent. Point your agent here and let it go.
- **`prepare.py`** — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. **This file is edited and iterated on by the agent**.
- **`program.md`** — baseline instructions for one agent. Point your agent here and let it go. **This file is edited and iterated on by the human**.
Training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation). The metric is **val_bpb** (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
By design, training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is **val_bpb** (validation bits per byte) — lower is better, and it is vocab-size-independent, so architectural changes can be compared fairly.
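For reference, bits per byte is the model's summed cross-entropy (converted from nats to bits) divided by the number of raw text bytes it was scored on. The ground-truth implementation is `evaluate_bpb` in `prepare.py`; the sketch below only illustrates the definition.

```python
import math

def bits_per_byte(sum_nll_nats: float, num_bytes: int) -> float:
    """Illustrative definition of val_bpb (the ground-truth metric is evaluate_bpb in prepare.py).

    sum_nll_nats: cross-entropy loss summed over the validation tokens, in nats.
    num_bytes:    raw UTF-8 bytes of the text those tokens decode to.
    Normalizing by bytes rather than tokens is what makes the number
    independent of the tokenizer's vocab size.
    """
    return sum_nll_nats / (math.log(2) * num_bytes)

# Made-up numbers purely for illustration: an average loss of 2.9 nats/token over
# tokens that decode to ~4.2 bytes of text each gives roughly 1 bit per byte.
tokens = 40 * 524288
print(bits_per_byte(sum_nll_nats=2.9 * tokens, num_bytes=int(4.2 * tokens)))
```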
## Quick start
@@ -55,8 +50,7 @@ The `program.md` file is essentially a super lightweight "skill".
## Project structure
```
constants.py — fixed constants (do not modify)
prepare.py — data prep + runtime utilities (do not modify)
prepare.py — constants, data prep + runtime utilities (do not modify)
train.py — model, optimizer, training loop (agent modifies this)
program.md — agent instructions
pyproject.toml — dependencies
@@ -121,17 +121,10 @@
"\n",
" ax.annotate(desc, (idx, bpb),\n",
" textcoords=\"offset points\",\n",
" xytext=(6, 6), fontsize=6.0,\n",
" xytext=(6, 6), fontsize=8.0,\n",
" color=\"#1a7a3a\", alpha=0.9,\n",
" rotation=30, ha=\"left\", va=\"bottom\")\n",
"\n",
"# Reference lines\n",
"ax.axhline(y=baseline_bpb, color=\"#e74c3c\", linewidth=1,\n",
" linestyle=\"--\", alpha=0.5, label=f\"Baseline ({baseline_bpb:.4f})\")\n",
"best = kept_bpb.min()\n",
"ax.axhline(y=best, color=\"#27ae60\", linewidth=1,\n",
" linestyle=\"--\", alpha=0.5, label=f\"Best ({best:.4f})\")\n",
"\n",
"n_total = len(df)\n",
"n_kept = len(df[df[\"status\"] == \"KEEP\"])\n",
"ax.set_xlabel(\"Experiment #\", fontsize=12)\n",
constants.py (deleted)

@@ -1,7 +0,0 @@
"""
Fixed constants for autoresearch. Do not modify.
"""
MAX_SEQ_LEN = 2048 # context length
TIME_BUDGET = 300 # training time budget in seconds (5 minutes)
EVAL_TOKENS = 40 * 524288 # number of tokens for val eval
prepare.py

@@ -23,7 +23,13 @@ import rustbpe
import tiktoken
import torch
from constants import MAX_SEQ_LEN, EVAL_TOKENS
# ---------------------------------------------------------------------------
# Constants (fixed, do not modify)
# ---------------------------------------------------------------------------
MAX_SEQ_LEN = 2048 # context length
TIME_BUDGET = 300 # training time budget in seconds (5 minutes)
EVAL_TOKENS = 40 * 524288 # number of tokens for val eval
# ---------------------------------------------------------------------------
# Configuration
program.md

@@ -9,8 +9,7 @@ To set up a new experiment, work with the user to:
1. **Agree on a run tag**: propose a tag based on today's date (e.g. `mar5`). The branch `autoresearch/<tag>` must not already exist — this is a fresh run.
2. **Create the branch**: `git checkout -b autoresearch/<tag>` from current master.
3. **Read the in-scope files**: The repo is small. Read these files for full context:
- `constants.py` — fixed constants (`MAX_SEQ_LEN`, `TIME_BUDGET`, `EVAL_TOKENS`). Do not modify.
- `prepare.py` — data prep, tokenizer, dataloader, evaluation. Do not modify.
- `prepare.py` — fixed constants, data prep, tokenizer, dataloader, evaluation. Do not modify.
- `train.py` — the file you modify. Model architecture, optimizer, training loop.
4. **Verify data exists**: Check that `~/.cache/autoresearch/` contains data shards and a tokenizer. If not, tell the human to run `uv run prepare.py`.
5. **Initialize results.tsv**: Create `results.tsv` with header row and baseline entry. The baseline results are already known from the output format section below (val_bpb: 0.997900, peak_vram_mb: 45060.2). Do NOT re-run the baseline — just record it.
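For illustration, initializing the file in step 5 might look like the sketch below. The column names are assumptions (the authoritative layout is whatever the output format section of `program.md` specifies); only the baseline numbers are taken from above.

```python
# Hypothetical sketch of step 5: record the known baseline without re-running it.
# Column names are assumed for illustration; follow program.md's output format section.
header = ["experiment", "description", "val_bpb", "peak_vram_mb", "status"]
baseline = ["0", "baseline", "0.997900", "45060.2", "KEEP"]

with open("results.tsv", "w") as f:
    f.write("\t".join(header) + "\n")
    f.write("\t".join(baseline) + "\n")
```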
@@ -26,7 +25,7 @@ Each experiment runs on a single GPU. The training script runs for a **fixed tim
- Modify `train.py` — this is the only file you edit. Everything is fair game: model architecture, optimizer, hyperparameters, training loop, batch size, model size, etc.
**What you CANNOT do:**
- Modify `constants.py` or `prepare.py`. These are read-only. They contain the fixed evaluation, data loading, tokenizer, and training constants (time budget, sequence length, etc).
- Modify `prepare.py`. It is read-only. It contains the fixed evaluation, data loading, tokenizer, and training constants (time budget, sequence length, etc).
- Install new packages or add dependencies. You can only use what's already in `pyproject.toml`.
- Modify the evaluation harness. The `evaluate_bpb` function in `prepare.py` is the ground truth metric.
progress.png

New binary file (247 KiB), not shown.
train.py

@@ -9,7 +9,6 @@ os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
import gc
import math
import time
from dataclasses import dataclass, asdict
@@ -23,8 +22,7 @@ cap = torch.cuda.get_device_capability()
repo = "varunneal/flash-attention-3" if cap == (9, 0) else "kernels-community/flash-attn3"
fa3 = get_kernel(repo).flash_attn_interface
from constants import MAX_SEQ_LEN, TIME_BUDGET
from prepare import Tokenizer, make_dataloader, evaluate_bpb
from prepare import MAX_SEQ_LEN, TIME_BUDGET, Tokenizer, make_dataloader, evaluate_bpb
# ---------------------------------------------------------------------------
# GPT Model