Andrej Karpathy on Software 3.0: Programming LLMs in English and Building for Agents
Y Combinator
Summary:
Andrej Karpathy discusses the profound shifts in software, introducing "Software 3.0" where Large Language Models (LLMs) are programmed in natural language, fundamentally changing how software is developed and interacted with.
- Software has evolved from traditional code [1] to neural network weights [2] and now to natural language prompts [3], making programming accessible to everyone ("vibe coding").
- LLMs behave like utilities, fabs, and especially operating systems, centralizing compute but enabling widespread access, similar to computing in the 1960s.
- LLMs possess emergent human-like "psychology," offering superhuman knowledge but also exhibiting cognitive deficits like hallucinations and amnesia.
- The future of software involves building "partial autonomy" LLM applications that integrate AI for generation and humans for verification, emphasizing fast human-AI collaboration loops and configurable "autonomy sliders."
- To support this shift, digital infrastructure must be redesigned for agents, including LLM-friendly documentation and APIs that allow agents to interact directly.
Software is undergoing a fundamental change, arguably the most significant in 70 years, and it has now shifted twice in quick succession. This transformation presents immense opportunities for those entering the industry, who will need to write, and rewrite, vast amounts of software.
Software Evolution: From 1.0 to 3.0 [01:25]
The talk outlines a progression in software paradigms:
- Software 1.0 = Code [01:42]
- This refers to traditional computer code written by humans (e.g., C++, Python).
- It programs a "classical computer."
- Software 2.0 = Weights [01:46]
- This refers to neural networks, where the "software" is the learned weights of the network.
- Instead of writing explicit code, developers curate datasets and run optimizers to generate these weights.
- The "Map of GitHub" for Software 1.0 finds its equivalent in "HuggingFace Model Atlas" for Software 2.0.
- Software 3.0 = Prompts [03:09]
- Large Language Models (LLMs) are a new kind of computer that are programmable.
- Programming LLMs is done via natural language prompts, making English a new, unique programming language.
- Example: Sentiment classification can be done with Python code [1], a trained neural net [2], or a few-shot prompt to an LLM [3] (a side-by-side sketch follows below).
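As a side-by-side illustration of the three paradigms, here is a minimal sketch; the `llm_call` argument below is an assumed stand-in for any completion API, not a specific SDK:

```python
# Sentiment classification under each paradigm (illustrative sketch).

# Software 1.0: explicit rules written by hand.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "hate", "terrible"}

def classify_1_0(text: str) -> str:
    words = set(text.lower().split())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"

# Software 2.0: the "program" is learned weights; in practice one would
# curate a labeled dataset and run an optimizer, roughly:
#   model = train(labeled_reviews); label = model(text)

# Software 3.0: the program is an English few-shot prompt.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "I loved every minute." -> positive
Review: "Total waste of time." -> negative
Review: "{review}" ->"""

def classify_3_0(review: str, llm_call) -> str:
    # llm_call is an assumed stand-in for any LLM completion function.
    return llm_call(FEW_SHOT_PROMPT.format(review=review)).strip()
```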
- Analogy: Software 2.0 "Eating" 1.0 (Tesla Autopilot) [04:40]
- In Tesla Autopilot, early functionality was implemented in Software 1.0 (C++ code).
- Over time, as neural networks (Software 2.0) grew in capability, much of the C++ code was replaced and functionalities migrated to the neural net stack.
- This demonstrated Software 2.0 literally "eating through" the Software 1.0 stack.
- Current State: [05:37]
- We now have three distinct programming paradigms: 1.0 (code), 2.0 (weights), and 3.0 (prompts).
- Developers entering the industry should be fluent in all three, as each has pros and cons, requiring fluid transitions between them for different functionalities.
- The "Map of GitHub" now includes an emerging area for Software 3.0 (LLM prompts in English).
LLMs as Utilities, Fabs, and Operating Systems [06:10]
Andrej Karpathy proposes analogies to understand LLMs:
- Properties of Utilities [06:30]
- LLM labs (e.g., OpenAI, Gemini, Anthropic) incur significant Capital Expenditure (CAPEX) to train LLMs (analogous to building an electricity grid).
- Operational Expenditure (OPEX) is used to serve intelligence via homogeneous APIs (prompt, image, tools, etc.).
- Access is metered (e.g., per million tokens), similar to paying for electricity.
- Users demand low latency, high uptime, and consistent quality, much like demanding consistent voltage from a grid.
- Tools like OpenRouter act as "transfer switches," allowing users to switch between different LLM providers (analogous to switching between grid, solar, or generator; a minimal routing sketch follows this list).
- "Intelligence brownouts" occur when state-of-the-art LLMs go down, meaning the "planet gets dumber" due to increasing reliance.
- Properties of Fabs [08:00]
- Building LLMs involves huge CAPEX, akin to building semiconductor fabrication plants.
- Deep tech-tree R&D and valuable trade secrets are centralized within LLM labs.
- However, LLMs are software, making them more malleable and less defensible than physical fabs.
- Analogy: Training on NVIDIA GPUs is like a "fabless" model, while training on Google's TPUs is like owning the "fab" (Intel model).
- Properties of Operating Systems [09:09]
- LLMs are increasingly complex software ecosystems, not simple commodities.
- The ecosystem resembles traditional OS: a few closed-source providers (like Windows, macOS) and an open-source alternative (like Linux, with Llama ecosystem as an early approximation).
- An "LLM OS" vision shows the LLM as a central CPU, interacting with peripheral devices, classical computers (for tools), file systems, browsers, and other LLMs.
- Just as apps like VS Code run on Windows, Linux, or Mac, LLM apps like Cursor can run on different LLMs (GPT, Claude, Gemini).
- Historical Computing Analogies (1960s) [11:04]
- LLM compute is currently very expensive, forcing centralization in the cloud.
- Users are "thin clients" interacting over the network, with compute often batched (time-sharing era of mainframe computing).
- Personal computing for LLMs hasn't fully happened yet, though early hints exist (e.g., Mac Minis for LLM inference).
- Direct text chat with LLMs feels like interacting with an operating system through a terminal; a general GUI for LLMs is yet to be invented.
LLMs Flip Technology Diffusion [12:49]
- Traditionally, transformative technologies diffuse from government and corporations to consumers (e.g., electricity, cryptography, computing).
- LLMs have flipped this script: initial widespread use is by consumers (e.g., asking "how to boil an egg"), with governments and corporations lagging in adoption.
LLM Psychology: People Spirits and Cognitive Quirks [14:39]
To effectively program LLMs, one must understand their "psychology."
- LLMs as "People Spirits" [14:48]
- LLMs are stochastic simulations of people, with the simulator being an autoregressive Transformer.
- Trained on human data, they exhibit an emergent, human-like psychology.
- Superhuman Abilities [15:28]
- They possess encyclopedic knowledge and memory far exceeding any single human's (likened to the savant in the film Rain Man).
- Cognitive Deficits [16:06]
- Hallucinations: LLMs frequently make up information and lack sufficient internal self-knowledge.
- Jagged Intelligence: They can be superhuman in some problem-solving domains but make elementary mistakes in others.
- Anterograde Amnesia: LLMs do not learn continually; unlike humans, they never consolidate knowledge or build expertise "while sleeping." Context windows are only working memory that gets wiped between sessions, so knowledge retention must be programmed explicitly (likened to the films Memento and 50 First Dates; a persistence sketch follows this section's summary).
- Gullibility: LLMs are susceptible to prompt injection risks and might leak private data due to their trusting nature.
- Summary of LLM Psychology: [17:55]
- LLMs are a "lossy simulation of a savant with cognitive issues."
- The challenge is to program them effectively by working around their deficits while leveraging their superhuman capabilities.
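One common workaround for the amnesia deficit is to persist important facts outside the model and re-inject them into the context on every call. A minimal sketch, assuming a plain JSON file as the store (the file name and format are illustrative):

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # hypothetical persistent store

def load_facts() -> list[str]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def remember(fact: str) -> None:
    facts = load_facts()
    facts.append(fact)
    MEMORY_FILE.write_text(json.dumps(facts, indent=2))

def build_context(user_prompt: str) -> str:
    # The model consolidates nothing between sessions, so saved facts
    # must be prepended to its working memory (the context) every time.
    facts = "\n".join(f"- {f}" for f in load_facts())
    return f"Known facts:\n{facts}\n\nUser: {user_prompt}"
```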
Designing LLM Apps with Partial Autonomy [18:22]
There are significant opportunities in building LLM-powered applications.
- Partial Autonomy Apps: [18:26]
- Instead of directly interacting with LLMs like a terminal (e.g., copy-pasting code to ChatGPT), it's more effective to use dedicated LLM applications.
- Example: Anatomy of Cursor (coding assistant) [18:47]
- Traditional Interface + LLM Integration: Cursor combines a standard code editor with an LLM chat sidebar.
- Context Management: The app packages relevant state into the context window before each LLM call.
- Orchestration of Multiple LLM Models: Cursor orchestrates various models (embedding, chat, diff application) seamlessly.
- Application-Specific GUI: A graphical user interface is crucial for auditing LLM work visually (e.g., seeing red/green diffs) and taking quick actions (e.g., Cmd+Y to accept), speeding up the human-AI loop.
- Autonomy Slider: Cursor offers varying levels of LLM autonomy, from simple tap completion to modifying entire files or repositories, allowing users to tune control based on task complexity.
- Example: Anatomy of Perplexity (search engine) [21:03]
- Perplexity embodies the same principles: it packages information, orchestrates multiple LLMs, provides a GUI for auditing cited sources, and offers an autonomy slider for different search depths (quick search, research, deep research; an autonomy-slider sketch follows below).
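The autonomy slider can be thought of as app-level configuration that bounds how much the LLM may change per step. A sketch with illustrative level names (assumptions, not Cursor's or Perplexity's actual internals):

```python
from enum import Enum

class Autonomy(Enum):
    COMPLETION = 1  # suggest a few tokens (tab completion)
    SELECTION = 2   # rewrite only the highlighted span
    FILE = 3        # modify one whole file
    REPO = 4        # agentic changes across the repository

def within_scope(level: Autonomy, files_touched: int, spans_touched: int) -> bool:
    """Reject proposed edits that exceed the user's chosen autonomy level."""
    if level in (Autonomy.COMPLETION, Autonomy.SELECTION):
        return files_touched <= 1 and spans_touched <= 1
    if level is Autonomy.FILE:
        return files_touched <= 1
    return True  # REPO: no structural cap; human verification still applies
```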
- Designing Software for Partial Autonomy [21:30]
- Software like Photoshop or Unreal Engine needs to consider how LLMs can "see" and "act" in the same ways a human can, and how humans can effectively supervise and stay in the loop with AI actions.
- Human-AI Collaboration Loops [23:40]
- In LLM apps, AI typically handles generation, while humans handle verification.
- The goal is to speed up this generation-verification loop for higher productivity (a minimal loop sketch follows this list).
- 1. Speed up Verification: This is achieved through effective GUIs and visual representations that leverage human computer vision, making it easier and faster to audit AI output compared to reading raw text.
- 2. Keep AI on the Leash: Avoid giving LLMs too much autonomy prematurely. Large, unverified diffs are counterproductive. It's better to work in small, incremental chunks and provide concrete prompts to ensure successful verification and prevent the AI from "getting lost in the woods."
- AI Education Example: Rather than asking an LLM to "teach me physics" broadly, separate apps can be built: one for a teacher to create auditable courses with AI, and another for serving these courses to students, keeping the AI focused within a defined syllabus.
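A minimal sketch of the generation-verification loop with the AI kept on a leash; `generate` and `apply` are assumed stand-ins for a real tool's propose and commit steps:

```python
def collaborate(task: str, generate, apply, max_steps: int = 20) -> None:
    """AI proposes small diffs; a human verifies each one before it lands."""
    for _ in range(max_steps):
        diff = generate(task)  # AI: one small, concrete chunk of work
        if diff is None:
            break              # nothing left to propose
        print(diff)            # human: fast audit (a visual GUI in practice)
        if input("accept? [y/N] ").strip().lower() == "y":
            apply(diff)        # only verified chunks are applied
        else:
            task += f"\n\nThis diff was rejected; try a different approach:\n{diff}"
```

Keeping each diff small is what makes the verification step fast enough for the loop to pay off.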
- Lessons from Tesla Autopilot & Autonomy Sliders [26:00]
- Andrej's experience with Tesla Autopilot, a partial autonomy product, highlights the importance of GUIs (instrument panel showing what the neural network "sees") and autonomy sliders (gradually increasing autonomous tasks over time).
- The journey to full self-driving (driving agents) has been long: more than a decade since early demos in 2013, and human intervention (e.g., teleoperation) is still involved.
- This suggests caution against premature declarations of AGI or "the year of agents," emphasizing that humans will remain in the loop for complex software for some time.
- The Iron Man Analogy: Augmentation vs. Agents [27:52]
- The Iron Man suit exemplifies both augmentation (Tony Stark directly piloting) and agents (the suit acting autonomously).
- Currently, the focus should be on building "Iron Man suits" (augmentations, partial autonomy products with custom GUIs and autonomy sliders), not "Iron Man robots" (flashy, fully autonomous agents that are still fallible).
- Products should integrate an "autonomy slider" to enable a gradual shift from augmentation towards higher levels of agentic behavior over time.
Vibe Coding: Everyone is Now a Programmer [29:06]
The ability to program LLMs in English has made software highly accessible, leading to a new phenomenon called "vibe coding."
- Natural Language Interface: English as a programming language means anyone who speaks English can now "program" computers. This is unprecedented, as traditional programming required years of study.
- "Vibe Coding" Concept: A term coined by Andrej Karpathy to describe a mode of software development where one "gives in to the vibes" and lets LLMs (like Cursor Composer) generate code, focusing on broad descriptions rather than intricate details, and accepting changes without deep auditing.
- Real-World Examples of Vibe Coding:
- Kids vibe coding, creating simple applications, suggesting it could be a "gateway drug" to software development for a new generation.
- Andrej built a basic iOS app in a day without knowing Swift, highlighting how LLMs abstract away language-specific complexities.
- Developed "MenuGen" (menugen.app), an app that takes a picture of a restaurant menu and generates images for each item.
- A live demonstration of MenuGen shows users taking a picture of a menu and receiving a digital version with images for each item.
- Challenge: DevOps is the New Bottleneck [32:21]
- While vibe coding makes generating code easy (the "easy part"), making the application "real" (e.g., adding authentication, payments, deployment, domain names) is still hard.
- These tasks primarily involve manual "clicking things" in browser-based devops interfaces designed for humans, which is slow and frustrating for LLMs/automation.
Building for Agents: Future-Ready Digital Infrastructure [33:39]
The rise of human-like AI agents necessitates a fundamental shift in how digital infrastructure is designed.
- New Category of Digital Consumer/Manipulator: Besides humans (via GUIs) and traditional computers (via APIs), there's now a new category: agents, which are computers but human-like in their interaction patterns. They need to interact with software infrastructure.
- Building for Agents Requires New Approaches:
- llms.txt for LLMs: Analogous to robots.txt for web crawlers, an llms.txt file (simple markdown) could directly inform LLMs about a domain's purpose, making it far easier for them to understand than parsing complex HTML (a fetch sketch closes this section).
- Docs for LLMs: Documentation traditionally written for humans (with lists, bold text, images) needs to be adapted for LLMs, ideally in markdown format for easy understanding.
- Actions for LLMs: Instructions like "click this button" in documentation are problematic for LLMs. Instead, docs should provide machine-executable equivalents, such as curl commands, so agents can perform tasks directly.
- Context Builders (e.g., Gitingest, Devin DeepWiki): Tools that help ingest and summarize information (like GitHub repositories) into LLM-friendly formats (e.g., concatenated text, directory structures, analytical summaries), making it easier for LLMs to consume and reason about complex data.
- Meeting LLMs Halfway: While LLMs are developing capabilities to interact with traditional GUIs (e.g., "clicking stuff"), it is still more efficient and less expensive to adapt our digital infrastructure to meet them halfway by providing machine-readable formats and executable instructions. This is particularly relevant for the "long tail" of software that may not be actively updated for full agent compatibility.
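As a sketch of meeting LLMs halfway, an agent might check for an agent-facing llms.txt before falling back to HTML scraping; the domain below is a placeholder:

```python
import urllib.request

def fetch_llms_txt(domain: str) -> str | None:
    """Return a site's llms.txt, or None if it doesn't publish one."""
    url = f"https://{domain}/llms.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8")
    except OSError:
        return None  # no agent-facing description; fall back to HTML parsing

context = fetch_llms_txt("example.com")  # placeholder domain
```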
Summary: We’re in the 1960s of LLMs — Time to Build [38:14]
- Software is fundamentally changing again, moving through Software 1.0 (code), 2.0 (weights), and now 3.0 (prompts in English).
- LLMs combine properties of utilities, fabs, and especially operating systems, echoing the early 1960s era of computing where systems were centralized and accessed via time-sharing.
- A key new aspect is the unprecedented, sudden access billions of people have to these powerful models, making it a critical time to program them.
- LLMs, while superhuman in some ways, also possess human-like cognitive deficits (hallucinations, amnesia, gullibility) that require careful consideration in design.
- The industry should focus on building "partial autonomy" LLM applications with custom GUIs and "autonomy sliders" that enable a fast generation-verification loop between AI and humans.
- Digital infrastructure must evolve to "build for agents," providing LLM-legible documentation and actions.
- The next decade will see a significant shift along the "Iron Man suit" autonomy slider, from augmentation to more capable agents. It's an exciting time to build in this evolving landscape.