autoresearch

Autonomous LLM pretraining research, driven by AI agents.

The idea: give an AI agent a small but real LLM training setup and let it run experiments overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model.
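The loop described above amounts to a greedy keep-or-discard search. Here is a minimal sketch of that control flow, assuming hypothetical `propose` and `evaluate` callables: in the real setup the agent edits train.py and the evaluation is a 5-minute training run, whereas the toy functions below only stand in for those steps.

```python
import random

def run_experiment_loop(propose, evaluate, baseline, n_rounds):
    """Greedy keep-or-discard search. Lower score is better (like val_bpb)."""
    best_config, best_score = baseline, evaluate(baseline)
    log = []
    for step in range(n_rounds):
        candidate = propose(best_config)      # "modify the code"
        score = evaluate(candidate)           # "train, then check the result"
        kept = score < best_score             # keep only strict improvements
        if kept:
            best_config, best_score = candidate, score
        log.append({"step": step, "score": score, "kept": kept})
    return best_config, best_score, log

# Toy stand-ins (hypothetical, not part of the repo).
_rng = random.Random(0)

def toy_propose(config):
    # Halve or double the learning rate at random.
    return {"lr": config["lr"] * _rng.choice([0.5, 2.0])}

def toy_evaluate(config):
    # Pretend metric with its minimum at lr = 0.02.
    return 1.0 + abs(config["lr"] - 0.02)

baseline = {"lr": 0.16}
best, best_score, log = run_experiment_loop(toy_propose, toy_evaluate, baseline, n_rounds=20)
print(best, best_score, sum(entry["kept"] for entry in log))
```

Because a candidate is kept only when it strictly improves on the best score so far, the returned score can never be worse than the baseline, and the log records every attempt, including the discarded ones.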

This particular implementation tries to be the least fancy baseline, but it should be clear how one would adjust the program.md file to run more sophisticated research programs with more elaborate instructions. For example, the agent could run small side experiments of its own while a training job is in progress.

The training code here is a simplified single-GPU implementation of nanochat.

A bit more context on this project is here in this tweet. This code in particular is a simpler, more self-contained version that I thought people might like to play with.

How it works

The repo is deliberately small and only has a few files:

constants.py — fixed rules: sequence length, time budget, eval tokens. Not modified.
prepare.py — one-time data prep (downloads training data, trains a BPE tokenizer) and runtime utilities (dataloader, evaluation). Not modified.
train.py — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc.
program.md — instructions for the agent. Point your agent here and let it go.

Training runs for a fixed 5-minute time budget (wall clock, excluding startup/compilation). The metric is val_bpb (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
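To see why bits per byte does not depend on vocabulary size, here is a sketch using the standard definition: total negative log-likelihood converted from nats to bits, divided by the number of raw bytes in the evaluated text. The function name and the numbers are illustrative assumptions; the evaluation code in prepare.py may differ in its details.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed cross-entropy (in nats) into bits per byte of raw text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Two hypothetical tokenizers scoring the same 1000-byte text.
text_bytes = 1000

# Tokenizer A: larger vocab, 250 tokens, 4 bits per token on average.
nll_a = 250 * 4 * math.log(2)
# Tokenizer B: smaller vocab, 500 tokens, 2 bits per token on average.
nll_b = 500 * 2 * math.log(2)

bpb_a = bits_per_byte(nll_a, text_bytes)
bpb_b = bits_per_byte(nll_b, text_bytes)
print(round(bpb_a, 6), round(bpb_b, 6))
```

The per-token losses differ by a factor of two, yet both models spend the same number of bits per byte of text, so neither is favored by its tokenization.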

Quick start

Requirements: A single NVIDIA GPU (tested on H100), Python 3.10+, uv.

# 1. Install dependencies
uv sync

# 2. Download data and train tokenizer (one-time, ~5 min)
uv run prepare.py

# 3. Run a single training experiment (5 min + startup)
uv run train.py

Running the agent

Simply spin up Claude, Codex, or whatever agent you prefer in this repo, then prompt it with something like:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

The program.md file is essentially a super lightweight "skill".

Project structure

constants.py    — fixed constants (do not modify)
prepare.py      — data prep + runtime utilities (do not modify)
train.py        — model, optimizer, training loop (agent modifies this)
program.md      — agent instructions
pyproject.toml  — dependencies

Design choices

Single file to modify. The agent only touches train.py. This keeps the scope manageable and diffs reviewable.
Fixed time budget. Training always runs for exactly 5 minutes of wall-clock time (excluding startup/compilation). This makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc).
BPB metric. Bits per byte is independent of tokenizer vocabulary size, so the agent could in principle change the vocab size and still get a fair comparison.
Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
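The fixed time budget can be pictured as a loop that stops on elapsed wall-clock time rather than on a step count. This is a rough sketch, not the loop in train.py: `train_with_time_budget`, `warmup_steps`, and the injectable `clock` are hypothetical names introduced here so the example can run instantly with a fake clock.

```python
import time

def train_with_time_budget(train_step, budget_seconds, warmup_steps=0, clock=time.perf_counter):
    """Run train_step until the wall-clock budget is spent; return the step count."""
    for _ in range(warmup_steps):
        train_step()              # e.g. compilation; not counted against the budget
    start = clock()               # the timer starts only after warmup
    steps = 0
    while clock() - start < budget_seconds:
        train_step()
        steps += 1
    return steps

class FakeClock:
    """Deterministic clock for the demo: time advances only when told to."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now

fake = FakeClock()

def slow_step():
    fake.now += 1.0               # pretend each step takes one second

steps_done = train_with_time_budget(slow_step, budget_seconds=5.0, warmup_steps=3, clock=fake)
print(steps_done)  # 5
```

Under this scheme a bigger model simply completes fewer steps within the same budget, so the comparison is about what can be learned in a fixed amount of time.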

License

MIT