tweaks to docs for both humans and agents

Andrej Karpathy 2026-03-07 17:02:43 +00:00
parent ada84e5247
commit 47ec1ade0a
2 changed files with 18 additions and 13 deletions


@@ -26,19 +26,25 @@ Training runs for a **fixed 5-minute time budget** (wall clock, excluding startu
**Requirements:** A single NVIDIA GPU (tested on H100), Python 3.10+, [uv](https://docs.astral.sh/uv/).
```bash
-# 1. Install dependencies
+# 1. Install uv project manager (if you don't already have it)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+# 2. Install dependencies
uv sync
-# 2. Download data and train tokenizer (one-time, ~5 min)
+# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py
-# 3. Run a single training experiment (5 min + startup)
+# 4. Manually run a single training experiment (~5 min)
uv run train.py
```
If the above commands all work ok, your setup is working and you can go into autonomous research mode.
## Running the agent
-Simply spin up your Claude/Codex or whatever you want in this repo, then you can something like:
+Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permission prompts), then you can prompt something like:
```
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
@@ -59,8 +65,7 @@ pyproject.toml — dependencies
## Design choices
- **Single file to modify.** The agent only touches `train.py`. This keeps the scope manageable and diffs reviewable.
-- **Fixed time budget.** Training always runs for exactly 5 minutes. This makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc).
-- **BPB metric.** Bits per byte is independent of tokenizer vocabulary size, so the agent could in principle change the vocab size and still get a fair comparison.
+- **Fixed time budget.** Training always runs for exactly 5 minutes, regardless of your specific platform. This means you can expect approximately 12 experiments per hour and roughly 100 experiments while you sleep. There are two upsides to this design decision. First, it makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc). Second, it means that autoresearch will find the best model for your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people running on other compute platforms.
- **Self-contained.** No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
## License

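The val_bpb number referenced throughout these docs is bits per byte: validation cross-entropy normalized by raw bytes rather than tokens, which is why it stays comparable across tokenizer changes. A minimal sketch of the computation, assuming a summed loss in nats and the byte length of the evaluated text (an illustration of the metric, not necessarily how train.py implements it):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    # Cross-entropy summed over all evaluated tokens is in nats; convert to
    # bits, then normalize by the raw UTF-8 byte count of the text. Because
    # the denominator is bytes rather than tokens, the value stays comparable
    # even if the tokenizer or vocab size changes.
    return total_loss_nats / math.log(2) / total_bytes

# e.g. 1.2M nats of summed validation loss over 2,000,000 bytes of text
print(round(bits_per_byte(1.2e6, 2_000_000), 3))  # about 0.866
```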

@@ -36,6 +36,8 @@ Each experiment runs on a single GPU. The training script runs for a **fixed tim
**Simplicity criterion**: All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome — that's a simplification win. When evaluating whether to keep a change, weigh the complexity cost against the improvement magnitude. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 val_bpb improvement from deleting code? Definitely keep. An improvement of ~0 but much simpler code? Keep.
+**The first run**: Your very first run should always be to establish the baseline, so you will run the training script as is.
## Output format
Once the script finishes it prints a summary like this:
@@ -53,9 +55,7 @@ num_params_M: 50.3
depth: 8
```
This is the baseline to beat.
-You can extract the key metric from the log:
+Note that the script is configured to always stop after 5 minutes, so the numbers may look different depending on this machine's compute platform. You can extract the key metric from the log file:
```
grep "^val_bpb:" run.log
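# (illustrative addition, not part of the original docs) capture just the
# number, e.g. for pasting into the results table:
grep "^val_bpb:" run.log | awk '{print $2}'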
@@ -91,7 +91,7 @@ d4e5f6g 0.000000 0.0 crash double model width (OOM)
The experiment runs on a dedicated branch (e.g. `autoresearch/mar5` or `autoresearch/mar5-gpu0`).
-LOOP FOREVER (until I wake up and come back in the morning):
+LOOP FOREVER:
1. Look at the git state: the current branch/commit we're on
2. Tune `train.py` with an experimental idea by directly hacking the code.
@@ -105,10 +105,10 @@ LOOP FOREVER (until I wake up and come back in the morning):
The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).
-**Timeout**: Each experiment should take ~7 minutes total (5 min training + startup + eval). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).
+**Timeout**: Each experiment should take ~5 minutes total (plus a few seconds of startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).
-**Crashes**: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "CRASH" in the tsv, and move on.
+**Crashes**: If a run crashes (OOM, a bug, etc.), use your judgment: if it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.
**NEVER STOP**: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working *indefinitely* until you are manually stopped. You are autonomous. If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.
-I will usually leave this running for a number of hours, like... 10 or so. If each experiment is ~7 min, you can do ~8/hour, for a total of approx. 80. The hope is that I come back in the morning and we have some improvements.
+As an example use case, a user might leave you running while they sleep. If each experiment takes ~5 minutes, you can run roughly 12 per hour, for about 100 experiments over a typical night's sleep. The user then wakes up to experimental results, all completed by you while they slept!
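As a rough sketch of what one iteration of this keep-or-revert loop could look like mechanically (the redirection of output into run.log, the 0.920 starting baseline, and the exact log format are assumptions for illustration, not something the repo prescribes):

```python
import re
import subprocess

best_bpb = 0.920  # hypothetical current best val_bpb on this branch

# 1. Run one experiment (train.py has already been edited with the new idea),
#    capturing output into run.log; the 10-minute timeout enforces the kill rule.
with open("run.log", "w") as log:
    subprocess.run(["uv", "run", "train.py"], stdout=log, stderr=subprocess.STDOUT, timeout=600)

# 2. Extract the key metric from the log; a missing metric is treated as a crash.
match = re.search(r"^val_bpb:\s*([0-9.]+)", open("run.log").read(), re.MULTILINE)
val_bpb = float(match.group(1)) if match else float("inf")

# 3. Keep the change if it beats the current best, otherwise revert it.
if val_bpb < best_bpb:
    subprocess.run(["git", "commit", "-am", f"experiment: val_bpb {val_bpb:.4f}"])
else:
    subprocess.run(["git", "checkout", "--", "train.py"])
```

A real agent would also append a row to the results tsv and keep advancing the branch, as described in the loop above.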