‘The Karpathy Loop’: Ex-OpenAI researcher’s autonomous agents conducted 700 experiments over two days—and offered a glimpse of AI’s future direction

Earlier this month, Andrej Karpathy, a prominent AI researcher who was among OpenAI’s first employees and later led AI initiatives at Tesla, sparked widespread attention on X. This isn’t particularly surprising. Karpathy—currently an independent AI researcher and founder of Eureka Labs, an organization aiming to establish a novel educational model for the AI age—commands 1.9 million followers on X, and his standing in the field means his AI commentary is often regarded as authoritative or predictive.

However, this particular post detailed an experiment in which he deployed an AI coding agent to conduct a series of tests aimed at enhancing the training process for a small language model. He allowed the AI agent to operate nonstop for 48 hours, during which it performed 700 distinct experiments. Throughout these trials, it identified 20 optimizations that shortened training time.
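The core of this setup can be pictured as a simple accept-or-reject loop: propose a change, measure it, and keep it only if the metric improves. The sketch below is a minimal illustration of that idea, not Karpathy's actual code; `propose`, `evaluate`, and the noisy "training time" stand-in are all hypothetical.

```python
import random

def autoresearch_loop(baseline_metric, propose, evaluate, budget):
    """Greedy improvement loop in the spirit of the experiment described:
    propose a change, measure it, keep it only if the metric improves."""
    best = baseline_metric
    accepted = []
    for step in range(budget):
        change = propose(step)
        metric = evaluate(change, best)
        if metric < best:  # lower = faster training
            best = metric
            accepted.append(change)
    return best, accepted

# Toy stand-ins (hypothetical), seeded for reproducibility.
random.seed(0)

def propose(step):
    return f"tweak-{step}"  # placeholder for a concrete code edit

def evaluate(change, current):
    # Stand-in for a real training run: a noisy measurement
    # within a few percent of the current best time.
    return current * random.uniform(0.97, 1.03)

best, kept = autoresearch_loop(100.0, propose, evaluate, 50)
```

In a real run each `evaluate` call would be a full (if small-scale) training job, which is why the agent needed 48 hours for 700 trials; the loop structure itself is this simple.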

Karpathy discovered that implementing those same 20 adjustments on a bigger—though still relatively modest—language model yielded an 11% reduction in training time. He dubbed the framework he developed for this experiment “autoresearch.”

Tobias Lütke, Shopify’s cofounder and CEO, shared on X that he tested autoresearch to enhance an AI model using internal corporate data, directing the agent to boost both the model’s quality and efficiency. Lütke noted that after running autoresearch overnight, it executed 37 experiments and achieved a 19% performance improvement.

What drew considerable attention was how closely autoresearch resembles the concept of self-improving AI systems—an idea first explored in science fiction and now either eagerly anticipated or deeply dreaded by some AI researchers. The worry centers on “recursive self-improvement,” wherein an AI perpetually refines its own code and training through a cyclical process, potentially triggering what AI safety experts refer to as a “hard takeoff” or “intelligence explosion.” Under such circumstances, an AI system would rapidly elevate its own capabilities, surpassing human intellectual capacity and evading human oversight.

Karpathy’s experiment didn’t exactly fit this scenario. The AI agent central to the autoresearch setup wasn’t enhancing its own training configuration; rather, it was modifying the training code and initial neural network parameters for a separate, significantly smaller and less advanced AI model. Nevertheless, Karpathy correctly observed that his experiment carries major implications for the future of research methodologies in AI labs, potentially hastening their advancement.

“Every leading LLM laboratory will adopt this approach. It’s the ultimate challenge,” Karpathy posted on X. He conceded that “the complexity increases dramatically at scale,” given that his autoresearcher merely needed to adjust a model and training procedure comprising just 630 lines of Python code, while the training infrastructure for cutting-edge AI models is orders of magnitude larger. “But implementing this is merely an engineering problem, and it will succeed,” he added. “You deploy a fleet of agents, have them work together to fine-tune smaller models, elevate the most viable concepts to progressively larger scales, and humans can (optionally) provide input at the periphery.”

He explained that although his current autoresearch system was engineered for a solitary agent to progressively enhance code along one trajectory, he envisions a future where numerous AI agents can simultaneously investigate various optimizations and conduct parallel experiments. “The forthcoming evolution of autoresearch requires asynchronous, massively collaborative agent networks,” he wrote. “The objective isn’t to replicate one PhD student, but to replicate an entire research community of them.”

Karpathy also made another statement about autoresearch that generated considerable excitement. “*Any* metric you prioritize that can be assessed with reasonable efficiency (or that possesses more efficient proxy measures, such as training a smaller network) can be subjected to autoresearch by an agent swarm,” he wrote. “It’s worthwhile to consider whether your particular challenge fits this category as well.”

Some observers noted that the fundamental elements of autoresearch could be applied to various other agent-based systems for process optimization. Janakiram MSV, principal analyst at Janakiram & Associates, writing in the technology publication The New Stack, labeled this approach “the Karpathy Loop.” It consists of three parts: an agent with the ability to access and modify a single file; a solitary, objectively measurable metric that the agent can strive to optimize; and a predetermined time limit for each experiment. He also emphasized that the directives Karpathy provided to the AI agent in autoresearch serve as excellent templates for anyone engaging with AI agents. The plain text file Karpathy employed contained explicit instructions regarding the agent’s tasks, limitations outlining what the agent must avoid or not alter, and termination criteria specifying the duration of each cycle and when the agent should cease looping and deliver its findings.
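Those three parts map directly onto code: a read/write handle on one file, a scalar score, and a wall-clock deadline. The sketch below is a toy rendering of that structure under stated assumptions; the "file" is an in-memory string and the "metric" is just its length, so every name here (`karpathy_loop`, `propose_edit`, `measure`) is illustrative, not from the article.

```python
import time

def karpathy_loop(read_file, write_file, propose_edit, measure, budget_s):
    """Minimal sketch of the three-part loop: an agent that can rewrite
    one file, a single scalar metric, and a hard wall-clock budget."""
    best = read_file()
    best_score = measure(best)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        candidate = propose_edit(best)
        if measure(candidate) < best_score:  # lower is better
            best, best_score = candidate, measure(candidate)
            write_file(best)  # persist only improvements
    return best_score

# Toy stand-ins (hypothetical): the "file" is a string, the "metric" its length.
doc = {"file": "x" * 10}
score = karpathy_loop(
    read_file=lambda: doc["file"],
    write_file=lambda c: doc.__setitem__("file", c),
    propose_edit=lambda c: c[:-1] if len(c) > 3 else c,
    measure=len,
    budget_s=0.05,
)
```

The termination criteria Karpathy put in his instruction file play the role of `budget_s` here: the loop stops on the clock, not when the agent decides it is done.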

However, some critics argued that Karpathy had essentially just rediscovered aspects of a methodology known as AutoML, which researchers at Google, Microsoft, and other AI laboratories have utilized for years. AutoML likewise employs an optimization cycle and a sequence of experiments to identify the best training data and model architectures for AI applications, and to refine those architectures. Yet it doesn’t utilize an AI agent capable of reading AI research publications and formulating hypotheses about which enhancements to pursue. Instead, AutoML systems typically rely on random mutations or diverse evolutionary algorithms to determine which modifications to attempt.
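The contrast is easiest to see in code: a classic AutoML baseline samples configurations blindly rather than reasoning about which change to try next. The sketch below shows random search over a hypothetical hyperparameter space with a synthetic scoring function; it is a generic illustration of the technique critics pointed to, not any specific lab's system.

```python
import random

def random_search(space, evaluate, trials, seed=0):
    """AutoML-style baseline: no hypotheses, just random sampling of a
    configuration space, keeping the best-scoring configuration seen."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(cfg)
        if score < best_score:  # lower is better
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space and a synthetic "training cost" to minimize.
space = {"lr": [0.1, 0.01, 0.001], "batch": [32, 64, 128]}

def evaluate(cfg):
    # Stand-in objective: penalize distance from an arbitrary optimum.
    return abs(cfg["lr"] - 0.01) * 100 + abs(cfg["batch"] - 64) / 64

best_cfg, best_score = random_search(space, evaluate, trials=200)
```

An agent-driven loop like autoresearch replaces the `rng.choice` line with a model that reads prior results (and, per Karpathy, the internet) and writes arbitrary code, which is the distinction he drew in his reply.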

Karpathy responded to some of these remarks, asserting that certain AutoML techniques, such as neural architecture search—an automated method for optimizing AI model design—were far less potent than his autoresearch. “Neural architecture search in its previous form represents such an inferior version of this concept that it belongs in an entirely separate category of complete uselessness by comparison,” he wrote. “This involves an *actual* LLM generating arbitrary code, learning from prior experiments, and accessing the internet. There’s simply no comparison.”