Image by Nana Dua on Unsplash

Andrej Karpathy released AutoResearch on March 6, 2026. Two weeks later, it has 48,000 GitHub stars, 6,700 forks, and has produced concrete research results that beat Karpathy’s own hand-tuned submissions on ML benchmarks. The entire system is 630 lines of Python. An AI agent reads the code, proposes a change, runs a 5-minute training experiment on a single GPU, checks if the validation metric improved, keeps or discards the result, and repeats. No human input required after you press enter.

This is not a framework. It is not a platform. It is a loop: hypothesis, experiment, evaluation, repeat. The agent runs 12 experiments per hour, roughly 100 overnight. Karpathy ran 700 experiments over two days and found 20 stackable improvements that reduced GPT-2 training time from 2.02 hours to 1.80 hours, an 11% speedup.

Related: Agentic AI in R&D: How Agent-Accelerated Research Changes Competitive Advantage

How AutoResearch Works: Three Files, One Metric, Zero Handholding

The design philosophy behind AutoResearch is radical simplicity. The entire system consists of three files:

train.py (~630 lines): Contains the GPT model definition, the Muon+AdamW optimizer, and the training loop. This is the only file the agent is allowed to edit. Every experiment is a modification to this file.

prepare.py: Handles data preparation, tokenization, and defines the evaluate_bpb() function. The agent can read this file but cannot modify it. This prevents the agent from gaming the evaluation metric.

program.md: Natural language instructions that tell the agent what to do. Think of it as a system prompt for the research loop. It includes one critical directive: “Do NOT pause to ask the human if you should continue. You are autonomous. The loop runs until the human interrupts you, period.”

The single metric is val_bpb (validation bits per byte). Lower is better. Because it is vocabulary-size-independent, the agent can make architectural changes (adding or removing layers, changing attention mechanisms) and still get a fair comparison across experiments.
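For intuition, bits per byte can be computed from an ordinary cross-entropy loss. Here is a minimal sketch of the standard formula (the repo's actual evaluate_bpb() lives in prepare.py and may differ; the numbers below are illustrative):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats, over all tokens)
    into bits per byte of the underlying raw text:

        bpb = total_loss_nats / (ln 2 * total_bytes)

    Dividing by raw bytes rather than tokens is what makes the metric
    independent of vocabulary size."""
    return total_loss_nats / (math.log(2) * total_bytes)

# A model averaging 2.91 nats/token on text with ~4.2 bytes per token
# (illustrative numbers) lands near val_bpb = 1.0:
tokens, n_bytes = 1_000_000, 4_200_000
print(round(bits_per_byte(2.91 * tokens, n_bytes), 3))  # prints 1.0
```

A coarser tokenizer covers more bytes per token, so a higher per-token loss can still yield the same bpb, which is exactly why cross-architecture comparisons stay fair.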

The Eight-Step Loop

Each experiment follows an identical cycle:

  1. Analyze: Read train.py, check git state, review previous experiment results
  2. Hypothesize: Propose a specific modification (architecture tweak, hyperparameter change, optimizer adjustment)
  3. Modify: Edit train.py with the proposed change
  4. Commit: Git commit the modification
  5. Train: Execute uv run train.py > run.log 2>&1 for exactly 300 seconds
  6. Extract: Parse val_bpb from the training log
  7. Evaluate: Compare against the previous best score
  8. Decide: If improved, keep the commit. If worse or crashed, git reset HEAD~1. Return to step 1.

Every result gets logged to results.tsv with a status of keep, discard, or crash. This gives the agent a growing history of what has been tried and what worked, which informs future hypotheses.

Why 630 Lines Matters

This is not accidental minimalism. Karpathy designed the codebase to fit entirely within a single LLM context window. Any capable model (Claude Opus 4.6, GPT-4, Sonnet) can hold the full 630 lines in memory while reasoning about modifications. Once you split code across multiple files and thousands of lines, current AI agents lose track of interactions between components. By keeping everything in one file, the agent’s mental model of the system is complete and accurate.

The constraint is also what makes the results credible. The agent cannot install new packages, cannot modify the evaluation function, and cannot cheat. It has to make real improvements to real training code.

What AutoResearch Actually Found

Karpathy did not release AutoResearch as a toy demo. He ran it against his own nanochat leaderboard, a competitive benchmark for small-scale GPT training, and the agent produced results good enough to rank 5th on the board. That entry beat every previous submission Karpathy himself had made manually.

Over ~700 experiments across two days on depth-12 GPT models, the agent found approximately 20 genuine improvements that could be stacked together:

  • QKNorm was missing a scaler multiplier, making attention too diffuse. The agent added it and saw immediate validation improvement.
  • Value embeddings lacked proper regularization. The agent discovered this independently, without any prior knowledge of the research literature on the topic.
  • The banded attention window was too conservative. The agent widened it.
  • AdamW beta parameters were suboptimal. The agent ran dozens of experiments to zero in on better values.
  • Weight decay scheduling and network initialization needed tuning. Small changes, but they stacked.

Each of these individually produced marginal gains. Together, they brought the “time to GPT-2” metric from 2.02 hours down to 1.80 hours. And critically, all 20 improvements transferred perfectly from depth-12 to depth-24 models without any re-tuning.
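The QKNorm finding is easy to see numerically. If queries and keys are normalized to unit length, their dot products are confined to [-1, 1], and a softmax over such small logits is nearly uniform; a learned scaler multiplier restores sharpness. A toy illustration of the effect (not the train.py code):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Unit-norm q.k similarities for four keys stay in [-1, 1]:
sims = [0.9, 0.1, -0.3, 0.2]
print(softmax(sims))                     # near-uniform: attention is diffuse
print(softmax([8.0 * s for s in sims]))  # with an 8x scaler: sharply peaked
```

Without the multiplier the top key gets well under half the attention mass; with it, attention concentrates almost entirely on the best match.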

Related: Software Factories: When AI Agents Build Software Without Human Review

Beyond Karpathy: Tobi Lutke’s Results

Shopify CEO Tobi Lutke applied the AutoResearch pattern to two very different problems. First, he ran it on an ML task: 37 experiments overnight. A 0.8B parameter model scored 19% higher than his previous hand-tuned 1.6B model. Better performance with literally half the parameters.

Then Lutke tried something more interesting. He pointed the same pattern at Shopify’s Liquid templating engine, the rendering system behind every Shopify storefront. Ninety-three automated commits later: 53% faster rendering and 61% fewer object allocations. This was not ML research. It was performance engineering on production business code, done by an agent loop running overnight.

SkyPilot: Scaling to 16 GPUs for $300

The SkyPilot team took AutoResearch and parallelized it across 13 H100s and 3 H200s on CoreWeave Kubernetes. In 8 hours, they submitted ~910 experiments (~700 with valid results), achieving 9x throughput compared to a single-GPU sequential run.

The total cost came in under $300: about $9 in Claude API calls and $260 in GPU compute. The val_bpb improved from 1.003 to 0.974, a 2.87% gain.

One detail stood out. The agent autonomously developed a two-tier strategy: it screened ideas on the cheaper H100 GPUs first, then validated the most promising candidates on the faster H200s. Nobody told it to do this. It figured out resource optimization on its own.
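The strategy amounts to cheap screening followed by expensive validation. A hypothetical sketch of the idea, where screen_score and validate stand in for runs on the H100 and H200 tiers respectively:

```python
def two_tier(ideas, screen_score, validate, top_k=2):
    """Score every candidate on the cheap tier, then spend the fast
    hardware only on the most promising ones. Lower score = better
    (e.g. val_bpb). `screen_score` and `validate` are stand-ins for
    actual GPU runs."""
    ranked = sorted(ideas, key=screen_score)
    return {idea: validate(idea) for idea in ranked[:top_k]}

# Toy usage with dummy scoring functions:
ideas = ["widen-window", "beta2-0.99", "init-scale", "extra-norm"]
cheap = {"widen-window": 0.99, "beta2-0.99": 1.01,
         "init-scale": 0.98, "extra-norm": 1.02}
results = two_tier(ideas, cheap.get, lambda i: cheap[i] - 0.01)
print(results)  # only the two best-screened ideas get validated
```

The payoff is that the expensive tier never touches ideas the cheap tier has already ruled out.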

The Karpathy Loop as a Design Pattern

Analyst Janakiram MSV coined the term “The Karpathy Loop” to describe the pattern that makes AutoResearch work. It has three requirements:

  1. One modifiable file that the agent controls completely
  2. One objectively testable metric that determines success or failure
  3. A fixed time budget per experiment to prevent runaway costs

Any problem that can be structured this way can be autoresearched. The metric does not have to be validation loss. It could be page load time, memory usage, rendering speed, test pass rate, or conversion rate.
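The three requirements can be captured in a single generic harness. This is a hypothetical sketch, not code from the repo: `propose` suggests an edit given the history, `measure` applies it under a fixed budget and returns the metric (or None on a crash), and the loop keeps only improvements, whatever the metric is:

```python
from typing import Callable, Optional

def karpathy_loop(propose: Callable[[list], object],
                  measure: Callable[[object], Optional[float]],
                  rounds: int,
                  lower_is_better: bool = True) -> list:
    """Run the hypothesize -> experiment -> evaluate cycle for a fixed
    number of rounds. Works for val_bpb, page load time, test pass
    rate -- any single scalar metric."""
    sign = 1.0 if lower_is_better else -1.0
    best = float("inf")
    history = []
    for _ in range(rounds):
        edit = propose(history)       # hypothesize, informed by history
        score = measure(edit)         # experiment within the time budget
        if score is None:
            status = "crash"
        elif sign * score < best:     # evaluate against the best so far
            best, status = sign * score, "keep"
        else:
            status = "discard"
        history.append((edit, score, status))
    return history

# Toy usage: four rounds against a fixed table of outcomes.
scores = {0: 1.0, 1: None, 2: 0.95, 3: 0.97}
hist = karpathy_loop(propose=lambda h: len(h), measure=scores.get, rounds=4)
print([status for _, _, status in hist])  # prints ['keep', 'crash', 'keep', 'discard']
```

Swapping `lower_is_better` and the `measure` callable is all it takes to point the same loop at a latency benchmark or a test suite instead of a training run.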

Karpathy himself made this explicit: “Any metric you care about can be autoresearched by an agent swarm.” He compared the future of AutoResearch to SETI@home, where distributed agents collaborate asynchronously on a shared research frontier.

Related: AI Agent Benchmarks Explained: What SWE-bench, WebArena, and AgentBench Actually Measure

How AutoResearch Differs from AutoML

Traditional neural architecture search (NAS) and AutoML tools explore predefined parameter spaces. They vary learning rates between 1e-3 and 1e-5, try three activation functions, and test four layer counts. The search space is fixed before the process begins.

AutoResearch is fundamentally different. The LLM reads actual Python code, understands what each component does, reasons about why a change might help, and proposes novel modifications that no predefined search space would contain. The agent that discovered QKNorm was missing a scaler multiplier was not searching a grid. It was reading code, understanding a mathematical flaw, and fixing it.

Karpathy has been blunt about this distinction, calling traditional AutoML “useless” compared to LLM-driven research. The LLM has access to the internet, can look up research papers, and learns from its own experiment history. Every failed experiment teaches it something about what does not work.

Which LLM Works Best?

AutoResearch is model-agnostic. The program.md file does not specify which model to use. In practice, Latent Space reported that Claude Opus 4.6 sustained 12+ hours of continuous operation and completed 118 experiments without breaking the loop. GPT-5.4 in extended-thinking mode had trouble maintaining continuous loops over long sessions.

The SkyPilot scaling experiment used Claude Code as the agent runtime. For most single-GPU runs, any model in the Claude Sonnet/Opus family or GPT-4 class will work.

Running AutoResearch Yourself

The setup is straightforward. You need a single NVIDIA GPU (the repo targets H100 but community forks exist for RTX cards on Windows, Mac Mini M4, and AMD GPUs), Python 3.10+, and the UV package manager.

git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync
python prepare.py  # download and tokenize training data

Then point your preferred AI coding agent at program.md and let it run. The default training dataset works out of the box. For machines with less VRAM, the community recommends switching to TinyStories as the training corpus.

Expect roughly 60% of experiments to fail or crash, especially early in a run. That is by design. The agent tries aggressive modifications, learns from crashes, and gradually narrows toward productive changes. One user documented a Mac Mini M4 run where 26 of 35 experiments failed, but the 7 that succeeded revealed that “the model got better by getting simpler.”

Related: AutoResearchClaw: The 23-Stage AI Pipeline That Writes Conference Papers from a Single Idea

Frequently Asked Questions

What is Karpathy’s AutoResearch?

AutoResearch is an open-source tool by Andrej Karpathy that lets an AI agent autonomously run ML research experiments on a single GPU. The agent reads a 630-line Python training script, proposes modifications, runs 5-minute experiments, and keeps or discards changes based on a single validation metric. It hit 48,000 GitHub stars within two weeks of its March 2026 release.

What results has AutoResearch produced?

In Karpathy’s own run, 700 experiments over two days found 20 stackable improvements that reduced GPT-2 training time by 11%. Shopify CEO Tobi Lutke used the same pattern to make Liquid rendering 53% faster with 61% fewer object allocations. SkyPilot scaled it to 16 GPUs and ran 910 experiments in 8 hours for about $300.

What hardware do you need to run AutoResearch?

AutoResearch was designed for a single NVIDIA GPU, originally tested on H100. Community forks support RTX cards on Windows, Mac Mini M4, and AMD GPUs. You also need Python 3.10+ and the UV package manager. Smaller GPUs can use TinyStories instead of the default dataset to reduce VRAM requirements.

How is AutoResearch different from AutoML?

Traditional AutoML searches predefined parameter spaces (learning rates, layer counts). AutoResearch uses an LLM that reads actual Python code, understands what each component does, and proposes novel modifications no grid search would find. The agent discovered architectural bugs like a missing scaler multiplier in QKNorm, not just better hyperparameters.

Can AutoResearch be used for non-ML tasks?

Yes. The Karpathy Loop pattern works for any problem with one modifiable file, one testable metric, and a fixed time budget. Shopify used it for production Liquid template optimization. Others have applied it to page load performance, and one PM used the pattern to improve landing page conversion from 41% to 92% in four rounds.