The incredible progress in AI over the past five years can be summarized in one word: scale. Yes, there have been algorithmic advances, but the frontier models of today are still based on the same transformer architecture that was introduced in 2017, and they are trained in much the same way as the models of 2019. The main difference is the scale of the data and compute that go into these models.
In 2019, GPT-2 cost about 5,000 dollars to train. Every year since then, the models have gotten bigger and been trained for longer on more data. And every year they've gotten better. But today's frontier models can cost hundreds of millions of dollars to train, and some have understandable concerns that AI will soon plateau or hit a wall. After all, are we really going to train models that cost hundreds of billions of dollars? What about trillions of dollars? At some point, the scaling paradigm breaks down. This is, in my opinion, a reasonable concern, and in fact it's one that I used to share.
But today I am more confident than ever that AI will not plateau. And in fact, I believe that we will see AI progress accelerate in the coming months. To explain why, I want to tell a story from my time as a PhD student.
I started my PhD in 2012, and I was lucky to be able to work on the most exciting project I could imagine: developing AIs that could learn, on their own, how to play poker. Now, I had played a lot of poker when I was in high school and college, so for me, this was basically my childhood dream job. Contrary to its reputation, poker is not just a game of luck; it's also a game of deep strategy. You can kind of think of it as chess with a deck of cards. When I started my PhD, there had already been several years of research on how to make AIs that play poker, and the general feeling among the research community was that we had figured out the paradigm, and now all we needed to do was scale it. So every year we would train larger poker AIs for longer on more data. And every year they would get better. Just like today's frontier language models.
By 2015, they had gotten so good that we thought they might be able to rival the top human experts. So we challenged four of the world's top poker players to an 80,000-hand poker competition with 120,000 dollars in prize money to incentivize them to play their best. And unfortunately, our bot lost by a wide margin. In fact, it was clear even on day one that our bot was outmatched. But during this competition, I noticed something interesting. Leading up to it, our bot had played almost a trillion hands of poker across thousands of CPUs over about three months. But when it came time to actually play against these human experts, the bot acted instantly. It took about ten milliseconds to make a decision, no matter how difficult it was. Meanwhile, the human experts had only played maybe 10 million hands of poker in their lifetimes. But when they were faced with a difficult decision, they would take the time to think. If it was an easy decision, they might think for only a couple of seconds. If it was a difficult decision, they might think for a few minutes. They took full advantage of the time they had to think through their decisions.
In Daniel Kahneman's book, "Thinking, Fast and Slow," he describes this as the difference between System 1 thinking and System 2 thinking. System 1 thinking is the faster, more intuitive kind of thinking that you might use, for example, to recognize a friendly face or laugh at a funny joke. System 2 thinking is the slower, more methodical thinking that you might use for things like planning a vacation, writing an essay or solving a hard math problem. After this competition, I wondered whether this System 2 thinking might be what was missing from our bot, and might explain the difference in performance between our bot and the human experts. So I ran some experiments to see just how much of a difference System 2 thinking makes in poker. And the results I got blew me away. It turned out that having the bot think for just 20 seconds in a hand of poker got the same boost in performance as scaling up the model by 100,000x and training it for 100,000 times longer. Let me say that again: spending 20 seconds thinking in a hand of poker got the same boost in performance as scaling up the size of the model and the training by 100,000x. When I got this result, I literally thought it was a bug. For the first three years of my PhD, I had managed to scale up these models by 100x. I was proud of that work. I had written multiple papers on how to do that scaling. But I knew pretty quickly that all of that would be a footnote compared to scaling up System 2 thinking.
So based on these results, we redesigned the poker AI from the ground up. Now we were focused on scaling up System 2 thinking in addition to System 1. And in 2017, we again challenged four of the world's top poker pros to a 120,000-hand poker competition, this time with 200,000 dollars in prize money. And this time we beat all of them by a huge margin. This was a huge surprise to everybody involved. It was a huge surprise to the poker community, it was a huge surprise to the AI community, and honestly, even a huge surprise to us. I literally did not think it was possible to win by the kind of margin that we won by. In fact, I think what really highlights just how surprising this result was is that when we announced the competition, the poker community decided to do what they do best and gamble on who would win.
(Laughter)
When we announced the competition, the betting odds were about four to one against us. After we had won for the first three days, the betting odds were still about fifty-fifty. But by the eighth day of the competition, you could no longer gamble on which side would win. You could only gamble on which human would lose the least by the end.

This pattern of AI benefiting from thinking for longer is not unique to poker; in fact, we've seen it in multiple other games as well. For example, in 1997, IBM's Deep Blue, an AI that plays chess, challenged the world champion Garry Kasparov to a match and beat him in a landmark achievement for AI. But Deep Blue didn't act instantly. Deep Blue thought for a couple of minutes before making each move.
Similarly, in 2016, DeepMind's AlphaGo, an AI that plays the game of Go, which is even more complicated than chess, challenged a world champion, Lee Sedol, and beat him in a landmark achievement for AI. But AlphaGo also didn't act instantly. AlphaGo took the time to think for a couple of minutes before making each move. In fact, the authors of AlphaGo later published a paper in which they measured just how much of a difference this thinking time makes for the strongest version of AlphaGo. What they found is that when AlphaGo had a couple of minutes to think, it would beat any human alive by a huge margin. But when it had to act instantly, it would do much worse than top humans. In 2021, a paper was published that tried to measure the effect of this thinking time a bit more scientifically. In it, the authors found that in these games, scaling up thinking time by 10x was roughly equivalent to scaling up the model size and training by 10x. So you have a very clear, clean relationship between scaling up System 2 thinking time and scaling up System 1 training.
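One rough way to write that relationship down (the notation here is my own shorthand, not taken from the paper): if playing strength is a function of the compute spent on training, $C_{\text{train}}$, and the compute spent thinking at test time, $C_{\text{test}}$, then over the range the authors measured,

$$\mathrm{strength}(10 \cdot C_{\text{train}},\ C_{\text{test}}) \;\approx\; \mathrm{strength}(C_{\text{train}},\ 10 \cdot C_{\text{test}})$$

That is, strength stays roughly constant along curves of constant product $C_{\text{train}} \cdot C_{\text{test}}$: within limits, the two kinds of compute can substitute for each other.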
Now why does this matter? Well, remember I mentioned at the start of this talk that today's frontier models cost hundreds of millions of dollars to train, but the cost of querying them, the cost of asking a question and getting an answer, is fractions of a penny. So this result says that if you want an even better model, there are two ways to get it. One is to keep doing what we've been doing for the past five years and scale up System 1 training: go from spending hundreds of millions of dollars on a model to billions of dollars. The other is to scale up System 2 thinking: go from spending a penny per query to 10 cents per query. At a certain point, that trade-off becomes well worth it.
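To make the shape of that trade-off concrete, here is a toy back-of-the-envelope calculation. Every number in it is an illustrative assumption of mine, not a real figure, and it ignores plenty of real-world detail; it only shows why paying at query time can beat paying up front.

```python
# Toy comparison of the two scaling options described above.
# All numbers are made-up, illustrative assumptions, not real figures.

TRAIN_COST = 300e6   # assumed one-time training cost: $300 million
QUERY_COST = 0.01    # assumed inference cost: one cent per query
NUM_QUERIES = 1e9    # assumed queries served over the model's lifetime
SCALE = 10           # how much extra compute we buy, either way

# Option 1: scale up System 1 training by 10x.
# The bigger training bill is amortized across every query served.
cost_per_query_train = (TRAIN_COST * SCALE) / NUM_QUERIES + QUERY_COST

# Option 2: scale up System 2 thinking by 10x, paid only at query time.
cost_per_query_think = TRAIN_COST / NUM_QUERIES + QUERY_COST * SCALE

print(f"10x System 1 training: ${cost_per_query_train:.2f} per query")  # $3.01
print(f"10x System 2 thinking: ${cost_per_query_think:.2f} per query")  # $0.40
```

Under these made-up numbers, buying the extra capability at query time is far cheaper per query, and, unlike retraining, you can choose to spend it only on the questions that actually need it.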
Now of course, all of these results are in the domain of games, and there was a reasonable question about whether they would extend to a more complicated setting, like language. But recently, my colleagues and I at OpenAI released o1, a new series of language models that think before responding. If it's an easy question, o1 might think for only a few seconds. If it's a difficult question, it might think for a few minutes. But just like the AIs for chess, Go and poker, o1 benefits from being able to think for longer. This opens up a completely new dimension for scaling. We're no longer constrained to just scaling up System 1 training. Now we can scale up System 2 thinking as well. And the beautiful thing about scaling up in this direction is that it's largely untapped. Remember, I mentioned that the frontier models of today cost fractions of a penny to query.
Now, when I mention this to people, a frequent response is that they might not be willing to wait around for a few minutes to get a response from a model, or to pay a few dollars to get an answer to their question. And it's true that o1 takes longer and costs more than other models out there. But I would argue that for some of the most important problems we care about, that cost is well worth it.
So let's do an experiment and see. Raise your hand if you would be willing to pay more than a dollar for a new cancer treatment. Alright. Basically everybody in the audience. Keep your hand up. How about 1,000 dollars? How about a million dollars? What about for more efficient solar panels? Or for a proof of the Riemann hypothesis?
The common conception of AI today is chatbots, but it doesn't have to be that way. This isn't a revolution that's 10 years away or even two years away. It's a revolution that's happening now. My colleagues and I have already released o1-preview, and I have had people, including researchers at top universities, come to me and say that it has saved them days' worth of work. And that's just the preview.
I mentioned at the start of this talk that the history of AI progress over the past five years can be summarized in one word: scale. So far, that has meant scaling up the System 1 training of these models. Now we have a new paradigm, one where we can scale up System 2 thinking as well. And we are just at the very beginning of scaling up in this direction.
Now I know that there are some people who will still say that AI is going to plateau or hit a wall. And to them I say, want to bet?
(Laughter)
Thank you.
(Applause)