Karpathy's autoresearch: 630 lines of Python that run 100 ML experiments overnight
Andrej Karpathy open-sourced autoresearch, a minimal Python tool letting AI agents run autonomous ML experiments on a single GPU.
TL;DR: Andrej Karpathy released autoresearch, an open-source Python tool that lets AI agents autonomously run ML experiments on a single GPU. At roughly 12 experiments per hour, a single overnight run covers about 100 hypothesis tests. The project hit 8.6 million views in two days.
Autoresearch is a Python tool that instructs an LLM agent to generate, run, and evaluate ML experiments without a human in the loop. You point it at a problem. The agent writes code, trains a model for roughly five minutes, checks the results, and decides whether to keep or discard the modification.
This is not a notebook automation script. It is a full hypothesis-test cycle. The agent treats each run as a unit of scientific inquiry: propose a change, measure the outcome, update the search direction.
The code fits in a single file with no external dependencies beyond PyTorch. That constraint is intentional. Karpathy has always favored minimal, readable implementations over abstraction-heavy frameworks.
Six hundred and thirty lines is genuinely small for what autoresearch does. For comparison, many popular ML training utilities exceed that count in their configuration parsers alone.
The codebase has three main concerns. First, it manages the agent's context: what experiments have run, what results came back, and what hypotheses remain untested. Second, it handles code execution in a sandboxed way, writing model training scripts and capturing stdout. Third, it calls an LLM to interpret results and generate the next modification.
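The second concern, sandboxed execution with stdout capture, can be sketched with the standard library alone. This is a minimal illustration of the pattern, not autoresearch's actual implementation; the function name and return shape are assumptions for the example.

```python
import subprocess
import sys

def run_experiment(script_path: str, timeout_s: int = 300) -> tuple[bool, str]:
    """Run a training script in a child process and capture its stdout.

    Hypothetical helper: returns (succeeded, output). The 300-second
    default mirrors the article's ~5-minute training window.
    """
    try:
        proc = subprocess.run(
            [sys.executable, script_path],  # isolate the run in its own process
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        # A run that exceeds the window is treated as a failed experiment
        return False, "timed out"
```

The captured stdout is what gets handed back to the LLM for interpretation, which is why training scripts in this style print their metrics rather than logging them to external services.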
There are no heavy abstractions. No plugin systems, no YAML configs, no class hierarchies ten levels deep. The entire loop is readable in an afternoon. That is the point. Karpathy said in the project's README that the goal was a tool researchers could understand completely and modify freely.
The absence of external dependencies beyond PyTorch is also significant. You can clone this repository and run it on any machine with a GPU and PyTorch installed. No pip install rabbit holes, no version conflicts, no Docker containers required.
The agent loop is the core mechanism. Each iteration follows a fixed pattern.
The agent reads the current experiment state. It proposes a code modification based on prior results. The modification gets applied to the training script. The script runs for approximately five minutes. Results are captured and fed back to the agent with context about what changed.
The agent then decides: keep the change if performance improved, discard it if not. After the decision, the loop repeats. This is a basic hill-climbing search guided by an LLM's code generation ability.
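The keep/discard cycle described above amounts to greedy hill climbing. A minimal sketch, assuming hypothetical `propose` and `evaluate` callables (autoresearch's real loop also manages agent context and code application, which this omits):

```python
def hill_climb(initial_score: float, propose, evaluate, n_iters: int = 100):
    """Greedy keep/discard loop in the spirit of the article's description.

    `propose()` returns a candidate modification; `evaluate(candidate)`
    scores it, higher is better. Names are illustrative, not the tool's API.
    """
    best_score = initial_score
    history = []
    for _ in range(n_iters):
        candidate = propose()
        score = evaluate(candidate)          # one short training run
        kept = score > best_score            # keep only strict improvements
        if kept:
            best_score = score
        history.append((candidate, score, kept))
    return best_score, history
```

The `history` list is the overnight artifact: a ranked record of what was tried, what it scored, and whether it survived.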
At 12 experiments per hour, 100 experiments take a little over 8 hours. Leave it running overnight. Come back to a ranked list of what worked and what did not.
The five-minute training window is a deliberate constraint. Short enough to iterate fast, long enough to get a meaningful signal from a real training run. It is a reasonable tradeoff for exploration-phase research.
Access to large GPU clusters sits behind cloud spend, academic compute grants, or big company employment. Single-GPU research does not. Anyone with a mid-range gaming GPU or a cheap cloud instance can run autoresearch today.
This is the barrier that gets removed. The question was never whether AI agents could automate experimentation. The question was whether you needed a fleet of A100s to do it at useful scale. Autoresearch answers that: no, you do not.
A single RTX 4090 costs around $1,800 on the consumer market. A single A100 80GB costs $10,000 to $15,000. For most researchers at smaller institutions, companies without a dedicated ML team, or independent builders, the GPU fleet has never been accessible. Autoresearch runs on what many people already own.
The throughput is real. One hundred experiments overnight is more hypothesis tests than many PhD students run in a week of manual experimentation. The bottleneck was never the training compute. It was the human loop: write code, wait, inspect results, write more code.
The project hit 8.6 million views in two days. That number reflects genuine interest from a diverse audience: ML researchers, startup engineers, hobbyists, and tech observers who may never run it themselves.
The viral spread was driven partly by Karpathy's platform. He has over 800,000 followers on X and his posts routinely reach people far outside the ML core. But the content itself earned the attention. A working tool with a clear use case, released as a single file with no setup friction, is rare.
The builders who responded immediately were not surprised by the concept. Automated experimentation has been a research direction for years, from neural architecture search to AutoML platforms. What caught attention was the minimalism. The fact that the core loop fits in a single file makes it forkable, auditable, and educational.
Several researchers started posting results within 24 hours. The phrase "Karpathy loop" appeared in posts describing teams spinning up their own versions, scaling to multiple GPUs, or adapting the approach for different problem types. That kind of organic replication is a strong signal.
| Capability | autoresearch | Traditional ML pipeline |
|---|---|---|
| Setup time | ✓ Minutes (single file) | ✗ Hours to days |
| GPU requirement | ✓ Single consumer GPU | ✗ Often multi-GPU cluster |
| External dependencies | ✓ PyTorch only | ✗ Many framework dependencies |
| Human-in-loop per experiment | ✓ No | ✗ Yes |
| Experiments per overnight run | ✓ ~100 | ✗ Typically 5-20 |
| Code auditability | ✓ Full (630 lines) | ✗ Abstracted across modules |
| LLM-guided search | ✓ Yes | ✗ Usually manual or grid search |
| Production-ready training | ✗ No | ✓ Yes |
Traditional pipelines win on production reliability, distributed training, and deployment readiness. Autoresearch wins on iteration speed during the exploration phase. These are not competing tools. They address different phases of the research process.
The appropriate mental model is: use autoresearch to find what works, then build production infrastructure around the winning approach. The tool is for discovery, not deployment.
Independent researchers get the most immediate value. If you are running ML experiments without institutional compute, autoresearch removes the manual overhead that makes iteration slow. You set up a problem, define the evaluation metric, and let the agent run.
Small ML teams at startups benefit next. A two-person ML team often cannot afford to have one engineer babysitting training runs. Autoresearch makes the overnight slot productive without human supervision.
Students and learners benefit in a different way. The codebase is educational by design. Reading 630 lines to understand how an agent loop works is a far better learning experience than reading documentation for a heavy framework. The code makes the concept concrete.
Large research institutions may find autoresearch most useful as a prototyping tool. Before committing expensive cluster time to a full hyperparameter sweep, run autoresearch on a single GPU to filter the hypothesis space. This is not a replacement for scale. It is a filter before scale.
The five-minute training window is not always sufficient signal. Some architectures and datasets need longer runs before meaningful differences appear. Using autoresearch on problems with noisy short-run metrics can send the agent in the wrong direction.
The LLM-guided search is also not guaranteed to be efficient. Current LLMs are good code generators but imperfect scientists. The agent may propose changes that are syntactically valid but semantically misguided. Humans reviewing the experiment log still catch things the agent misses.
There is also the reproducibility question. ML experiments are not deterministic by default. If two runs of the same code produce different results due to seed variation, the agent's keep/discard decision becomes noisy. The tool does not handle this automatically.
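One mitigation is to average the metric over a few seeded runs before making the keep/discard call. Autoresearch does not do this for you; the sketch below is one possible workaround, with `run_fn` standing in for a single short training run.

```python
import statistics

def noisy_keep_decision(run_fn, seeds=(0, 1, 2), baseline_mean: float = 0.0) -> bool:
    """Average a metric across seeded runs before deciding keep vs discard.

    `run_fn(seed)` is a hypothetical stand-in for one training run that
    seeds its RNGs and returns the evaluation metric. Trading 3x compute
    for a less noisy decision may be worthwhile on high-variance problems.
    """
    scores = [run_fn(seed) for seed in seeds]
    return statistics.mean(scores) > baseline_mean
```

The cost is threefold compute per hypothesis, so the overnight budget drops from ~100 experiments to ~33; whether that trade is worth it depends on how noisy the short-run metric is.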
Scaling beyond a single GPU introduces coordination complexity that the current codebase does not address. Several community forks are exploring multi-GPU versions, but none have reached the stability of the original single-GPU implementation.
The term "Karpathy loop" spread quickly because it names something researchers already understood intuitively: the cycle of propose, train, evaluate, repeat. What changed is that the loop can now run without a human at the keyboard for each iteration.
This has cultural implications. The scarce resource in ML research is not compute or data. It is researcher attention. Every hour a researcher spends watching a training run is an hour not spent thinking about the problem. If agents handle the iteration loop, researcher attention gets reallocated to problem formulation and result interpretation.
This is a meaningful shift. Not because it removes researchers from the process, but because it changes where their time goes. The best researchers will use autoresearch to explore more aggressively, not to stop thinking.
The risk is the opposite tendency: running 100 experiments and treating the top result as valid without understanding why it worked. Autoresearch amplifies iteration speed. It does not amplify understanding. Researchers who skip the interpretation step will produce brittle results.
The setup is minimal. Clone the autoresearch repository, install PyTorch for your GPU, and configure an LLM API key for the agent. The README walks through the exact steps.
The most important configuration choice is the evaluation metric. The agent optimizes toward whatever you define as success. A poorly chosen metric will produce experiments that score well on the metric and poorly on your actual goal. Spend time on this before starting a run.
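A concrete way to guard against metric gaming is to encode the real goal directly into the metric. The example below is hypothetical, not autoresearch's API: it penalizes the train/validation gap so the agent cannot "win" by overfitting.

```python
def evaluation_metric(val_accuracy: float, train_accuracy: float) -> float:
    """Illustrative metric: validation accuracy, penalized for overfitting.

    Rewarding raw train accuracy would let the agent score well by
    memorizing; subtracting half the overfit gap bakes the actual goal
    (generalization) into the number the agent optimizes.
    """
    overfit_gap = max(0.0, train_accuracy - val_accuracy)
    return val_accuracy - 0.5 * overfit_gap
```

Whatever form your metric takes, the principle is the same: if a degenerate solution can score well, the agent will eventually find it.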
Start with a problem you already understand. Running autoresearch on a familiar benchmark lets you evaluate whether the agent's choices make sense. Once you trust the loop on known problems, apply it to the open questions in your work.
The community is active. GitHub issues and forks are moving fast as people adapt the tool for different domains. The VentureBeat coverage and MarkTechPost writeup both have good context on early use cases.
Autoresearch fits into a larger pattern. AI agents are moving from assistants that help humans do work to systems that do work and report results. The distinction matters.
An assistant waits for instruction. An agent with autoresearch-style infrastructure proposes hypotheses, runs experiments, and returns a ranked list of findings. The human decides what questions to ask and what the results mean. The agent decides how to explore the answer space.
This is not science fiction. It is already running on consumer GPUs overnight. The interesting question is not whether this will scale. It will. The question is how research institutions, funding bodies, and peer review will adapt to work where the experiment loop runs autonomously.
Karpathy's contribution here is not just the tool. It is making the concept legible. A 630-line file that anyone can read, fork, and run is a better argument for autonomous ML research than any white paper. The code is the proof of concept.
Autoresearch is an open-source Python tool released by Andrej Karpathy that enables AI agents to run ML experiments autonomously on a single GPU. The agent proposes code modifications, runs training, evaluates results, and repeats without human intervention between iterations.
The tool runs approximately 12 experiments per hour, yielding around 100 experiments in an overnight session of roughly 8 hours. Each experiment involves a ~5-minute training run followed by evaluation.
You do not need multiple GPUs. Autoresearch is explicitly designed for single-GPU use. A consumer gaming GPU or an affordable cloud GPU instance is sufficient to run the full experiment loop.
Beyond PyTorch, autoresearch has no external dependencies. This keeps setup to a minimum and makes the codebase portable across different environments.
Each training run is approximately five minutes. This is a deliberate tradeoff: short enough to iterate quickly, long enough to produce a meaningful signal from a real training run.
Andrej Karpathy, former director of AI at Tesla and former OpenAI researcher. He is also known for creating the micrograd and nanoGPT projects.
The release received 8.6 million views in two days. Researchers and builders began posting results and forking the project almost immediately, with several groups exploring multi-GPU extensions.
Autoresearch does not replace a traditional pipeline. It is a discovery tool for the exploration phase and does not handle production training, distributed compute, or deployment. The appropriate use is to identify what works before committing to a full production pipeline.
The term emerged organically after the release. It refers to the agent-driven cycle of: propose modification, run short training, evaluate result, keep or discard. Teams began applying the term to their own automated experiment workflows inspired by autoresearch.
The codebase is approachable for newcomers. At 630 lines with no deep abstractions, it is designed to be understood completely. Reading it is a practical way to learn how an agent experiment loop works.
Any ML problem where a five-minute training run produces a meaningful evaluation signal. Classification, regression, and small-scale generative tasks are natural fits. Problems that require long training runs to show meaningful differences are harder to explore with the default configuration.
The agent does not deeply understand why a change works. It optimizes toward the metric you define and makes code changes based on prior results, but it does not reason about causal mechanisms. Understanding why something works remains a human responsibility.
Running on CPU is technically possible but impractical. A five-minute GPU training run could take hours on CPU, making the iteration loop too slow to be useful. A GPU is effectively required.
The agent uses an LLM to interpret prior experiment results and propose the next code modification. The LLM generates Python code that gets executed against the training script. The choice of LLM affects the quality of proposals.
Autoresearch is not production-ready, and it does not aim to be. It is an experimental tool for research exploration. The codebase is intentionally minimal and does not include the error handling, logging infrastructure, or distributed training support that production systems need.
You configure an evaluation metric before starting a run. The agent treats improvements on this metric as success. Choosing the right metric is the most important setup decision. A misaligned metric produces experiments that look good on paper but miss the actual goal.
Community forks exploring multi-GPU support appeared quickly after the release. None have reached the stability of the original single-GPU implementation as of the release date. The original repository remains the reference implementation.
The repository is at github.com/karpathy/autoresearch. The README includes setup instructions and examples.
The tool is designed to work with LLM APIs. The specific model is configurable. Higher-capability models produce better code suggestions. The README provides guidance on configuration options.
Running many experiments and accepting the top result without understanding why it worked. Autoresearch accelerates iteration, not comprehension. Treating output rankings as ground truth without analysis leads to results that fail to generalize.