The Dawn of Software 3.0: Programming with Natural Language
Hello everyone, I'm Wang Lijie. Today we're going to explore a profound transformation that is fundamentally reshaping our digital world. The software we know is undergoing its most fundamental evolution in decades. This is not merely a technological iteration; it is the birth of an entirely new programming paradigm.
The Evolution of Software: From 1.0 to 3.0
In the past, we had Software 1.0: traditional code written by hand to give computers precise instructions. Then came Software 2.0, represented by neural networks. Instead of writing instructions directly, we "train" models such as image recognizers by curating datasets and running an optimizer to produce the network's weights. Now we're entering the era of Software 3.0, centered on large language models (LLMs).
The most striking aspect is that we're starting to "program" these models using natural language, such as Chinese or English. A single prompt can become a powerful program.
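To make "a prompt is a program" concrete, here is a minimal sketch. It assumes the OpenAI Python SDK and uses a placeholder model name; neither is prescribed by the article, and any chat-style LLM API would work. The point is that the entire "program logic" lives in the natural-language instructions:

```python
# A minimal "Software 3.0 program": the behaviour is specified in English and
# executed by an LLM rather than hand-coded in Python.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key in the
# OPENAI_API_KEY environment variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

PROGRAM = (
    "You are a meeting summarizer. Given raw notes, reply with exactly three "
    "bullet points: decisions made, open questions, and action items."
)

def summarize(notes: str) -> str:
    # The prompt plays the role a hand-written parser/formatter would have
    # played in Software 1.0.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": PROGRAM},
            {"role": "user", "content": notes},
        ],
    )
    return response.choices[0].message.content

print(summarize("Alice: ship v2 Friday. Bob: blocked on API keys. Carol reviews docs."))
```

Changing the English instructions changes the program's behavior, with no Python edits required.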
LLMs as a New Operating System
How do we understand this new species, the LLM? A helpful analogy is that LLMs are the next-generation operating system. Think of it as a new computer:
- Its context window is the computer's memory.
- The model itself is the CPU.
We interact with them through APIs, reminiscent of the "time-sharing systems" of the 1960s, when computing resources were expensive and centralized. The industry landscape is strikingly similar, with a few powerful closed-source systems alongside a thriving open-source ecosystem.
Understanding the "Mind" of LLMs
To navigate this new system, we must understand its "mindset". LLMs are like "digital minds" with superpowers. They possess:
- Vast knowledge: They remember enormous amounts of information from the internet.
- Impressive memory: They can recall details that would be impossible for a human to retain.
However, they also have cognitive flaws:
- Hallucinations: They can confidently fabricate information.
- Inconsistent intelligence: They can surpass humans in some areas but make elementary mistakes in others.
- Anterograde amnesia: They forget everything after each conversation, lacking the ability to learn and grow from experience.
The Opportunity: Building Partially Autonomous Applications
With these understandings, we realize the biggest opportunity lies not in creating uncontrollable super-intelligences but in building "partially autonomous applications".
Imagine an advanced code editor that doesn't rewrite your entire project but helps manage context and precisely modify code based on your instructions. Or a smart search tool that not only provides answers but also lists sources for quick verification.
These applications share several characteristics:
- They manage complex context in the background.
- They have GUIs designed for specific tasks, allowing you to visually review the AI's output.
- They offer an "autonomy slider", letting you adjust the AI's level of involvement based on the task's complexity.
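As a rough sketch of the "autonomy slider" idea (hypothetical, not taken from any particular product), the user chooses how large a unit of work the AI may touch, and everything outside that scope stays under manual control:

```python
# Sketch of an "autonomy slider": the user picks how much scope the AI gets.
# The edit dispatch below is a hypothetical stub, not a real editor API.
from enum import Enum

class Autonomy(Enum):
    COMPLETION = 1   # suggest the next few tokens only
    SELECTION = 2    # rewrite just the highlighted chunk
    FILE = 3         # rewrite one file
    REPO = 4         # let the agent roam the whole repository

def run_edit(task: str, level: Autonomy) -> str:
    """Send a request to an AI assistant with an explicitly bounded scope."""
    scope = {
        Autonomy.COMPLETION: "complete the current line",
        Autonomy.SELECTION: "modify only the selected lines",
        Autonomy.FILE: "modify only the current file",
        Autonomy.REPO: "modify any file in the repository",
    }[level]
    # In a real app this would call an LLM; here we just show the bounded instruction.
    return f"AI instruction: {task} ({scope})"

print(run_edit("rename variable `tmp` to `total`", Autonomy.SELECTION))
```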
Human-AI Collaboration: The "Generate-Verify" Loop
This leads to the core of human-AI collaboration: accelerating the "generate-verify" loop. AI generates, and humans quickly verify.
This requires:
- More intuitive interfaces to reduce review costs.
- "Reins" on the AI, limiting its work to manageable areas.
This approach is similar to the development of self-driving technology. We're at an exciting starting point, learning to communicate with this new "operating system" and writing programs for the Software 3.0 era.
Andrej Karpathy: Software in the Era of AI
Please welcome the former Director of AI at Tesla, Andrej Karpathy.

Hello. Wow, a lot of people here. Okay, so I'm excited to be here today to talk to you about software in the era of AI. I'm told that many of you are students, Bachelor's, Master's, Ph.D. and so on, and you're about to enter the industry. I think it's actually an extremely unique and very interesting time to enter the industry right now, and fundamentally the reason for that is that software is changing again. I say again because I actually gave this talk already, but the problem is that software keeps changing, so I actually have a lot of material to create new talks, and I think it's changing quite fundamentally. Roughly speaking, software has not changed much on such a fundamental level for 70 years, and then it's changed, I think, about twice quite rapidly in the last few years. And so there's just a huge amount of work to do, a huge amount of software to write and rewrite.

So let's take a look at the realm of software. If we think of this as the map of software, there's a really cool tool called Map of GitHub. This is kind of like all the software that's written: instructions to the computer for carrying out tasks in the digital space. If you zoom in here, these are all different kinds of repositories, and this is all the code that has been written. A few years ago, I observed that software was changing and there was a new type of software around, and I called this Software 2.0 at the time. The idea was that Software 1.0 is the code you write for the computer. Software 2.0 is basically neural networks, and in particular the weights of a neural network. You're not writing this code directly; you are more kind of tuning the datasets and then running an optimizer to create the parameters of this neural net. At the time, neural nets were seen as just a different kind of classifier, like a decision tree or something like that, and I think this framing was a lot more appropriate.

And now what we have is an equivalent of GitHub in the realm of Software 2.0. I think Hugging Face is basically the equivalent of GitHub for Software 2.0, and there's also Model Atlas, where you can visualize all the code written there. In case you're curious, by the way, the giant circle, the point in the middle, these are the parameters of Flux, the image generator. So anytime someone tunes a LoRA on top of a Flux model, you basically create a git commit in this space and you create a different kind of image generator.

So basically: Software 1.0 is the computer code that programs a computer; Software 2.0 is the weights which program neural networks. Here's an example: the AlexNet image-recognizer neural network. Now, so far, all of the neural networks that we've been familiar with until recently were fixed-function computers, image to categories or something like that. What's changed, and I think it's a quite fundamental change, is that neural networks became programmable with large language models. I see this as quite new and unique; it's a new kind of computer, and so in my mind it's worth giving it a new designation of Software 3.0. And basically your prompts are now programs that program the LLM.
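To ground that last point, here is an illustrative sketch (not the code or prompt from the talk's slide) of the same task written as Software 1.0 and as Software 3.0; the keyword lists and the few-shot prompt are invented for illustration:

```python
# Software 1.0: explicit, hand-written rules in Python.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"terrible", "hate", "awful", "bad"}

def sentiment_1_0(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Software 3.0: the "program" is a few-shot prompt in English,
# sent to any chat-style LLM.
FEW_SHOT_PROMPT = """Classify the sentiment of the review as positive, negative, or neutral.

Review: "I love this phone, the battery lasts forever."
Sentiment: positive

Review: "The screen cracked after one day. Terrible."
Sentiment: negative

Review: "{review}"
Sentiment:"""

print(sentiment_1_0("I love this great phone"))
print(FEW_SHOT_PROMPT.format(review="Arrived on time, does what it says."))
```

In the 3.0 version the program is plain English: editing the instructions or the examples changes its behavior, which is the sense in which prompts program the LLM.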
And remarkably, these prompts are written in English. So it's a very interesting programming language. Maybe to summarize the difference: if you're doing sentiment classification, for example, you can imagine writing some amount of Python to basically do sentiment classification, or you can train a neural net, or you can prompt a large language model. Here, this is a few-shot prompt, and you can imagine changing it and programming the computer in a slightly different way. So basically we have Software 1.0 and Software 2.0, and I think we're seeing, maybe you've seen, that a lot of GitHub code is not just code anymore; there's a bunch of English interspersed with code. So I think there's a growing category of a new kind of code. Not only is it a new programming paradigm, it's also remarkable to me that it's in our native language of English. When this blew my mind, a few years ago now I guess, I tweeted this, and I think it captured the attention of a lot of people. This is my currently pinned tweet: remarkably, we're now programming computers in English.

Now, when I was at Tesla, we were working on the Autopilot, and we were trying to get the car to drive. I showed this slide at the time where you can imagine that the inputs to the car are on the bottom and they're going through a software stack to produce the steering and acceleration. And I made the observation at the time that there was a ton of C code around in the Autopilot, which was the Software 1.0 code, and then there were some neural nets in there doing image recognition. I observed that over time, as we made the Autopilot better, the neural network grew in capability and size, and in addition to that, all the C code was being deleted. A lot of the capabilities and functionality that was originally written in 1.0 was migrated to 2.0. As an example, a lot of the stitching up of information across images from the different cameras and across time was done by a neural network, and we were able to delete a lot of code. And so the Software 2.0 stack quite literally ate through the software stack of the Autopilot. I thought this was really remarkable at the time, and I think we're seeing the same thing again, where basically we have a new kind of software and it's eating through the stack.

We have three completely different programming paradigms, and I think if you're entering the industry, it's a very good idea to be fluent in all of them, because they all have slight pros and cons. You may want to program some functionality in 1.0 or 2.0 or 3.0. Are you going to train a neural net? Are you going to just prompt an LLM? Should this be a piece of code that's explicit? Et cetera. So we all have to make these decisions and actually potentially fluidly transition between these paradigms.

So what I wanted to get into now, in the first part, is to talk about LLMs: how to think of this new paradigm and the ecosystem, and what that looks like. What is this new computer? What does it look like, and what does the ecosystem look like? I was struck by this quote from Andrew Ng actually many years ago now, I think, and I think Andrew is going to be speaking right after me. He said at the time, "AI is the new electricity," and I do think that captures something very interesting, in that LLMs certainly feel like they have properties of utilities right now.
LLM labs like OpenAI, Gemini, Anthropic, et cetera, spend capex to train the LLMs, and this is kind of equivalent to building out a grid. Then there's opex to serve that intelligence over APIs to all of us, and this is done through metered access, where we pay per million tokens or something like that. And we have a lot of demands of this API that are very utility-like demands: we demand low latency, high uptime, consistent quality, et cetera. In electricity you would have a transfer switch so you can switch your electricity source between the grid, solar, a battery, or a generator. In LLMs, we have maybe OpenRouter to easily switch between the different types of LLMs that exist. Because the LLMs are software, they don't compete for physical space, so it's okay to have basically six electricity providers and you can switch between them, right, because they don't compete in such a direct way.

What's also a little fascinating, and we saw this in the last few days actually: a lot of the LLMs went down and people were stuck and unable to work. It's fascinating to me that when the state-of-the-art LLMs go down, it's actually kind of like an intelligence brownout in the world. It's kind of like when the voltage is unreliable on the grid, and the planet just gets dumber the more reliance we have on these models, which already is really dramatic and I think will continue to grow.

But LLMs don't only have properties of utilities; I think it's also fair to say that they have some properties of fabs. The reason for this is that the capex required for building LLMs is actually quite large. It's not just like building some power station or something like that; you're investing a huge amount of money, and I think the tech tree for the technology is growing quite rapidly. So we're in a world where we have deep tech trees, research and development secrets that are centralizing inside the LLM labs. But I think the analogy muddies a little bit, also because, as I mentioned, this is software, and software is a bit less defensible because it is so malleable. So I think it's just an interesting thing to think about. Potentially there are many analogies you can make: a 4-nanometer process node maybe is something like a cluster with a certain max flops. When you're using Nvidia GPUs and you're only doing the software and not the hardware, that's kind of like the fabless model. But if you're actually also building your own hardware and you're training on TPUs, if you're Google, that's kind of like the Intel model, where you own your fab. So I think there are some analogies here that make sense.

But actually, I think the analogy that makes the most sense is that, in my mind, LLMs have very strong analogies to operating systems, in that this is not just electricity or water; it's not something that comes out of the tap as a commodity. These are now increasingly complex software ecosystems, right? So they're not just simple commodities like electricity. And it's interesting to me that the ecosystem is shaping up in a very similar way, where you have a few closed-source providers like Windows or macOS, and then you have an open-source alternative like Linux.
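Looping back to the "transfer switch" point for a moment, here is a rough sketch of provider failover in application code; the provider functions are hypothetical stubs rather than real SDK calls:

```python
# Sketch of a "transfer switch" for LLM providers: if one provider is down
# (an "intelligence brownout"), fail over to the next.
# The provider functions are hypothetical stubs, not real SDK calls.
from typing import Callable

def ask_provider_a(prompt: str) -> str:
    raise TimeoutError("provider A is having an outage")  # simulate a brownout

def ask_provider_b(prompt: str) -> str:
    return f"answer from provider B to: {prompt}"

PROVIDERS: list[Callable[[str], str]] = [ask_provider_a, ask_provider_b]

def ask_with_failover(prompt: str) -> str:
    last_error: Exception | None = None
    for provider in PROVIDERS:       # the software equivalent of a transfer switch
        try:
            return provider(prompt)
        except Exception as exc:     # timeouts, rate limits, outages
            last_error = exc
    raise RuntimeError("all providers unavailable") from last_error

print(ask_with_failover("Summarize today's incident report."))
```

A router such as OpenRouter plays a similar role at the API layer, so switching models is closer to changing a dropdown than rewriting application code.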
To continue the operating-system analogy: for LLMs as well, we have a few competing closed-source providers, and then maybe the Llama ecosystem is currently a close approximation to something that may grow into something like Linux. I think it's still very early, because these are just simple LLMs, but we're starting to see that these are going to get a lot more complicated. It's not just about the LLM itself; it's about all the tool use and multimodality and how all of that works. When I had this realization a while back, I tried to sketch it out, and it seems to me like LLMs are kind of like a new operating system, right? The LLM is a new kind of computer; it's kind of like the CPU equivalent. The context windows are kind of like the memory, and the LLM is orchestrating memory and compute for problem solving using all of these capabilities. So if you look at it, it looks very much like an operating system from that perspective.

A few more analogies. For example, if you want to download an app, say I go to VS Code and I go to download: you can download VS Code and you can run it on Windows, Linux, or Mac. In the same way, you can take an LLM app like Cursor and run it on the GPT, Claude, or Gemini series, right? It's just a dropdown. So it's similar in that way as well.

Another analogy that strikes me is that we're in this 1960s-ish era where LLM compute is still very expensive for this new kind of computer, and that forces the LLMs to be centralized in the cloud. We're all just sort of thin clients that interact with it over the network, and none of us has full utilization of these computers. Therefore it makes sense to use time-sharing, where we're all just a dimension of the batch when they're running the computer in the cloud. This is very much what computers used to look like during that time: the operating systems were in the cloud, everything was streamed around, and there was batching. And so the personal computing revolution hasn't happened yet, because it's just not economical; it doesn't make sense. But I think some people are trying, and it turns out that Mac Minis, for example, are a very good fit for some of the LLMs, because if you're doing batch-one inference, this is all super memory-bound, so this actually works. I think these are some early indications maybe of personal computing, but this hasn't really happened yet. It's not clear what this looks like. Maybe some of you get to invent what this is or how it works or what this should be.

One more analogy I'll mention: whenever I talk to ChatGPT or some LLM directly in text, I feel like I'm talking to an operating system through the terminal. It's text; it's direct access to the operating system. And I think a GUI hasn't yet really been invented in a general way. Should ChatGPT have a GUI, different from just the text bubbles? Certainly some of the apps that we're going to go into in a bit have GUIs, but there's no GUI across all the tasks, if that makes sense.

There are some ways in which LLMs are different from operating systems and from early computing in some fairly unique ways. I wrote about one particular property that strikes me as very different this time around: LLMs flip the direction of technology diffusion that is usually present in technology.
So for example, with electricity, cryptography, computing, flight, the Internet, GPS, lots of new transformative technologies that had not been around, typically it is the government and corporations that are the first users, because it's new and expensive, et cetera, and it only later diffuses to consumers. But I feel like LLMs are flipped around. Maybe with early computers it was all about ballistics and military use, but with LLMs it's all about how you boil an egg or something like that. This is certainly a lot of my use. And so it's really fascinating to me that we have a new magical computer and it's helping me boil an egg. It's not helping the government do something really crazy like military ballistics or some special technology. Indeed, corporations and governments are lagging behind all of us in the adoption of these technologies. It's just backwards, and I think it informs some of the uses of this technology, or what some of the first apps are, and so on.

So in summary so far: LLM labs fab LLMs, and I think that's accurate language to use. But LLMs are complicated operating systems. They're circa-1960s in computing, and we're redoing computing all over again. They're currently available via time-sharing and distributed like a utility. What is new and unprecedented is that they're not in the hands of a few governments and corporations; they're in the hands of all of us, because we all have a computer and it's all just software. ChatGPT was beamed down to our computers, to billions of people, instantly and overnight. This is insane, and it's kind of insane to me that this is the case. And now it is our time to enter the industry and program these computers. This is crazy. So I think this is quite remarkable.

Before we program LLMs, we have to spend some time thinking about what these things are. I especially like to talk about their psychology. The way I like to think about LLMs is that they're kind of like people spirits: they are stochastic simulations of people, and the simulator in this case happens to be an autoregressive transformer. A transformer is a neural net, and it just goes on the level of tokens: it goes chunk, chunk, chunk, chunk, and there's an almost equal amount of compute for every single chunk. This simulator, of course, is basically some weights, and we fit it to all of the text that we have on the Internet and so on, and you end up with this kind of simulator. Because it is trained on humans, it's got this emergent psychology that is human-like.

The first thing you'll notice is, of course, that LLMs have encyclopedic knowledge and memory, and they can remember lots of things, a lot more than any single individual human can, because they read so many things. It actually reminds me of the movie Rain Man, which I really recommend people watch. It's an amazing movie; I love this movie. Dustin Hoffman here is an autistic savant who has almost perfect memory, so he can read a phone book and remember all of the names and phone numbers. And I feel like LLMs are very similar: they can remember SHA hashes and lots of different kinds of things very, very easily. So they certainly have superpowers in some respects, but they also have a bunch of, I would say, cognitive deficits.
They hallucinate quite a bit: they make up stuff and don't have a very good internal model of self-knowledge, not sufficient at least. This has gotten better, but it's not perfect. They display jagged intelligence: they're going to be superhuman in some problem-solving domains, and then they're going to make mistakes that basically no human would make. You know, they will insist that 9.11 is greater than 9.9 or that there are two Rs in "strawberry." These are some famous examples, but basically there are rough edges that you can trip on. So that's, I think, also kind of unique.

They also kind of suffer from anterograde amnesia. I'm alluding to the fact that if you have a coworker who joins your organization, this coworker will over time learn your organization; they will understand and gain a huge amount of context on the organization, and they go home and they sleep and they consolidate knowledge and they develop expertise over time. LLMs don't natively do this, and this is not something that has really been solved in the R&D of LLMs, I think. And so context windows are really kind of like working memory, and you have to program the working memory quite directly, because they don't just get smarter by default. I think a lot of people get tripped up by the analogies in this way. In popular culture, I recommend people watch these two movies: Memento and 50 First Dates. In both of these movies, the protagonists' weights are fixed and their context windows get wiped every single morning, and it's really problematic to go to work or have relationships when this happens. And this happens to LLMs all the time.

One more thing I would point to is security-related limitations of the use of LLMs. For example, LLMs are quite gullible: they are susceptible to prompt injection risks, they might leak your data, et cetera, and there are many other security-related considerations. So basically, long story short, you have to simultaneously think of this superhuman thing that has a bunch of cognitive deficits and issues, and yet is extremely useful. So how do we program them, and how do we work around their deficits while enjoying their superhuman powers?

What I want to switch to now is talking about the opportunities: how do we use these models, and what are some of the biggest opportunities? This is not a comprehensive list, just some of the things that I thought were interesting for this talk.

The first thing I'm excited about is what I would call partial autonomy apps. For example, let's work with the example of coding. You can certainly go to ChatGPT directly and start copy-pasting code around, copy-pasting bug reports, and getting code and copy-pasting everything around. But why would you do that? Why would you go directly to the operating system? It makes a lot more sense to have an app dedicated for this. And so I think many of you use Cursor; I do as well. Cursor is kind of the thing you want instead; you don't want to just go directly to ChatGPT. I think Cursor is a very good example of an early LLM app that has a bunch of properties that I think are useful across all the LLM apps. In particular, you will notice that we have a traditional interface that allows a human to go in and do all the work manually, just as before.
But in addition to that, we now have this LLM integration that allows us to go in bigger chunks. So, some of the properties of LLM apps that I think are shared and useful to point out. Number one, the LLMs basically do a ton of the context management. Number two, they orchestrate multiple calls to LLMs, right? So in the case of Cursor, there are, under the hood, embedding models for all your files, the actual chat models, and models that apply diffs to the code, and this is all orchestrated for you. A really big one that I think is maybe not always fully appreciated is the application-specific GUI and the importance of it, because you don't just want to talk to the operating system directly in text. Text is very hard to read, interpret, and understand, and also you don't want to take some of these actions natively in text. It's much better to just see a diff as a red and green change so you can see what's being added and subtracted. It's much easier to just hit Command+Y to accept or Command+N to reject; I shouldn't have to type it in text, right? So a GUI allows a human to audit the work of these fallible systems and to go faster. I'm going to come back to this point a little bit later as well.

The last feature I want to point out is what I call the autonomy slider. For example, in Cursor you can just do tab completion, where you're mostly in charge. You can select a chunk of code and use Command+K to change just that chunk of code. You can use Command+L to change the entire file, or you can use Command+I, which just lets it rip and do whatever it wants in the entire repo, and that's the sort of full-autonomy, agentic version. So you are in charge of the autonomy slider, and depending on the complexity of the task at hand, you can tune the amount of autonomy that you're willing to give up for that task.

To show one more example of a fairly successful LLM app: Perplexity. It also has very similar features to what I've just pointed out in Cursor. It packages up a lot of the information; it orchestrates multiple LLMs; it's got a GUI that allows you to audit some of its work. For example, it will cite sources, and you can imagine inspecting them. And it's got an autonomy slider: you can either just do a quick search, or you can do research, or you can do deep research and come back 10 minutes later. So this is all just varying levels of autonomy that you give up to the tool.

So I guess my question is: I feel like a lot of software will become partially autonomous, and I'm trying to think through what that looks like. For many of you who maintain products and services, how are you going to make your products and services partially autonomous? Can an LLM see everything that a human can see? Can an LLM act in all the ways that a human could act? And can humans supervise and stay in the loop of this activity? Because, again, these are fallible systems that aren't yet perfect. What does a diff look like in Photoshop, or something like that? And also, a lot of the traditional software right now has all these switches and all this kind of stuff that's all designed for humans; all of this has to change and become accessible to LLMs.

One thing I want to stress with a lot of these LLM apps, and I'm not sure it gets as much attention as it should, is that we're now cooperating with AIs. Usually they are doing the generation, and we as humans are doing the verification. It is in our interest to make this loop go as fast as possible, so that we get a lot of work done.
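As a rough sketch of that orchestration pattern (hypothetical stubs, not Cursor's actual implementation), a single request can fan out to an embedding step, a chat-model step, and a diff-apply step, with the GUI keeping the human in the verification loop:

```python
# Sketch of a Cursor-style pipeline: context retrieval, edit proposal, and
# human review before anything is applied. All model calls are hypothetical stubs.
from dataclasses import dataclass

@dataclass
class Edit:
    file: str
    diff: str

def embed_and_retrieve(query: str, repo_files: list[str]) -> list[str]:
    """Stub: an embedding model would rank files by relevance to the query."""
    return repo_files[:3]

def propose_edit(query: str, context_files: list[str]) -> Edit:
    """Stub: a chat model would draft the change given the retrieved context."""
    return Edit(file=context_files[0], diff="- old line\n+ new line")

def apply_if_approved(edit: Edit, approved: bool) -> bool:
    """Stub: apply the diff only after the human accepts it in the GUI."""
    return approved

def handle_request(query: str, repo_files: list[str]) -> bool:
    context = embed_and_retrieve(query, repo_files)        # context management
    edit = propose_edit(query, context)                    # LLM call(s)
    print(f"Proposed diff for {edit.file}:\n{edit.diff}")  # GUI: red/green diff
    return apply_if_approved(edit, approved=True)          # human stays in the loop

handle_request("rename config flag", ["settings.py", "app.py", "README.md"])
```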
There are two major ways that I think this can be done. Number one, you can speed up verification a lot. I think GUIs, for example, are extremely important for this, because a GUI utilizes the computer vision GPU in all of our heads. Reading text is effortful and it's not fun, but looking at stuff is fun, and it's kind of a highway to your brain. So I think GUIs are very useful for auditing systems, and visual representations are in general.

Number two, I would say, is that we have to keep the AI on the leash. I think a lot of people are getting way overexcited with AI agents, and it's not useful to me to get a diff of 1,000 lines of code to my repo. I'm still the bottleneck, right? Even though those 1,000 lines come out instantly, I have to make sure that this thing is not introducing bugs, that it's doing the correct thing, and that there are no security issues, and so on. So basically, it's in our interest to make the flow of these two go very, very fast, and we have to somehow keep the AI on the leash, because it gets way too overreactive. This is how I feel when I do AI-assisted coding: if I'm just vibe coding, everything is nice and great, but if I'm actually trying to get work done, it's not so great to have an overreactive agent doing all this kind of stuff. This slide is not very good, I'm sorry, but I guess I'm trying to develop, like many of you, some ways of utilizing these agents in my coding workflow and doing AI-assisted coding. In my own work, I'm always scared of way too big diffs. I always go in small incremental chunks; I want to make sure that everything is good; I want to spin this loop very, very fast; and I work on small chunks of a single concrete thing. I think many of you are probably developing similar ways of working with LLMs.

I also saw a number of blog posts that try to develop these best practices for working with LLMs. Here's one that I read recently and thought was quite good. It discussed some techniques, and some of them have to do with how you keep the AI on the leash. As an example, if your prompt is vague, then the AI might not do exactly what you wanted, and in that case verification will fail; you're going to ask for something else, and if verification fails, you're going to start spinning. So it makes a lot more sense to spend a bit more time being more concrete in your prompts, which increases the probability of successful verification so you can move forward. I think a lot of us are going to end up finding techniques like this.

In my own work as well, I'm currently interested in what education looks like now that we have AI and LLMs. A large amount of thought for me goes into how we keep the AI on the leash. I don't think it just works to go to ChatGPT and say, "Hey, teach me physics." I don't think this works, because the AI gets lost in the woods. So for me, this is actually two separate apps: for example, there's an app for a teacher that creates courses, and then there's an app that takes courses and serves them to students.
In both cases, we now have this intermediate artifact of a course that is auditable: we can make sure it's good, we can make sure it's consistent, and the AI is kept on the leash with respect to a certain syllabus, a certain progression of projects, and so on. So this is one way of keeping the AI on the leash, and I think it has a much higher likelihood of working, and the AI is not getting lost in the woods.

One more analogy I wanted to allude to: I'm no stranger to partial autonomy; I worked on this for, I think, five years at Tesla. The Autopilot is also a partial autonomy product and shares a lot of these features. For example, right there in the instrument panel is the GUI of the Autopilot, so it's showing me what the neural network sees and so on. And we have the autonomy slider, where over the course of my tenure there, we did more and more autonomous tasks for the user.

Maybe the story that I wanted to tell very briefly: the first time I drove a self-driving vehicle was in 2013. I had a friend who worked at Waymo, and he offered to give me a drive around Palo Alto. I took this picture using Google Glass at the time; many of you are so young that you might not even know what that is, but yeah, this was all the rage at the time. We got into this car and went for about a 30-minute drive around Palo Alto, highways, streets, and so on, and this drive was perfect. There were zero interventions. And this was 2013, which is now 12 years ago. It kind of struck me, because at the time, when I had this perfect drive, this perfect demo, I felt like, wow, self-driving is imminent, because this just worked; this is incredible. But here we are, 12 years later, and we are still working on autonomy. We are still working on driving agents, and even now we haven't actually fully solved the problem. You may see Waymos going around and they look driverless, but there's still a lot of teleoperation and a lot of humans in the loop of a lot of this driving. So we still haven't even declared success, but I think it's definitely going to succeed at this point; it just took a long time. And so I think software is really tricky, in the same way that driving is tricky.