So I'm excited to share a few spicy thoughts on artificial intelligence. But first, let's get philosophical by starting with this quote by Voltaire, an 18th century Enlightenment philosopher, who said, "Common sense is not so common." Turns out this quote couldn't be more relevant to artificial intelligence today. Despite that, AI is an undeniably powerful tool, beating the world-class "Go" champion, acing college admission tests and even passing the bar exam.
I’ve been a computer scientist for 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrate sparks of AGI, artificial general intelligence. Except when they make small, silly mistakes, which they often do. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more resources. What possibly could go wrong?
So there are three immediate challenges we face already at the societal level. First, extreme-scale AI models are so expensive to train, and only a few tech companies can afford to do so. So we already see the concentration of power. But what's worse for AI safety, we are now at the mercy of those few tech companies because researchers in the larger community do not have the means to truly inspect and dissect these models. And let's not forget their massive carbon footprint and the environmental impact.
And then there are these additional intellectual questions. Can AI, without robust common sense, be truly safe for humanity? And is brute-force scale really the only way and even the correct way to teach AI?
So I’m often asked these days whether it's even feasible to do any meaningful research without extreme-scale compute. And I work at a university and nonprofit research institute, so I cannot afford a massive GPU farm to create enormous language models. Nevertheless, I believe that there's so much we need to do and can do to make AI sustainable and humanistic. We need to make AI smaller, to democratize it. And we need to make AI safer by teaching human norms and values. Perhaps we can draw an analogy from "David and Goliath," here, Goliath being the extreme-scale language models, and seek inspiration from an old-time classic, "The Art of War," which tells us, in my interpretation, know your enemy, choose your battles, and innovate your weapons.
Let's start with the first, know your enemy, which means we need to evaluate AI with scrutiny. AI is passing the bar exam. Does that mean that AI is robust at common sense? You might assume so, but you never know.
So suppose I left five clothes to dry out in the sun, and it took them five hours to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI system, says 30 hours. Not good. A different one. I have a 12-liter jug and a six-liter jug, and I want to measure six liters. How do I do it? Just use the six-liter jug, right? GPT-4 spits out some very elaborate nonsense.
(Laughter)
Step one, fill the six-liter jug. Step two, pour the water from the six-liter jug into the 12-liter jug. Step three, fill the six-liter jug again. Step four, very carefully, pour the water from the six-liter jug into the 12-liter jug. And finally you have six liters of water in the six-liter jug, which should be empty by now.
(Laughter)
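For the record, the commonsense answers those two puzzles call for come down to plain arithmetic. Here is a toy sketch of the correct reasoning, not a model of anything:

```python
# Toy check of the two puzzles above, using ordinary arithmetic.

def drying_time(n_clothes: int, hours_for_five: float = 5.0) -> float:
    """Clothes laid out in the sun dry in parallel, so the time
    does not grow with the number of garments (space permitting)."""
    return hours_for_five  # independent of n_clothes

def measure_six_liters() -> list[str]:
    """With a 12-liter jug and a six-liter jug on hand, measuring
    six liters takes exactly one step."""
    return ["fill the six-liter jug"]

print(drying_time(30))       # 5.0 hours, not 30
print(measure_six_liters())  # one step, not four
```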
OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
(Laughter)
It is an unavoidable side effect of teaching AI through brute-force scale. Some scale optimists might say, “Don’t worry about this. All of these can be easily fixed by adding similar examples as yet more training data for AI." But the real question is this. Why should we even do that? You are able to get the correct answers right away without having to train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
So this observation leads us to the next wisdom, choose your battles. So what fundamental questions should we ask right now and tackle today in order to overcome this status quo with extreme-scale AI? I'll say common sense is among the top priorities.
So common sense has been a long-standing challenge in AI. To explain why, let me draw an analogy to dark matter. So only five percent of the universe is normal matter that you can see and interact with, and the remaining 95 percent is dark matter and dark energy. Dark matter is completely invisible, but scientists speculate that it's there because it influences the visible world, even including the trajectory of light. So for language, the normal matter is the visible text, and the dark matter is the unspoken rules about how the world works, including naive physics and folk psychology, which influence the way people use and interpret language.
So why is this common sense even important? Well, in a famous thought experiment proposed by Nick Bostrom, an AI was asked to produce and maximize paper clips. And that AI decided to kill humans to utilize them as additional resources, to turn you into paper clips, because the AI didn't have a basic understanding of human values. Now, writing a better objective and equation that explicitly states “Do not kill humans” will not work either, because the AI might go ahead and kill all the trees, thinking that's a perfectly OK thing to do. And in fact, there are endless other things that AI obviously shouldn’t do while maximizing paper clips, including “Don’t spread fake news,” “Don’t steal,” “Don’t lie,” which are all part of our commonsense understanding about how the world works.
However, the AI field has for decades considered common sense a nearly impossible challenge. So much so that when my students, colleagues and I started working on it several years ago, we were very much discouraged. We were told that it's a research topic of the '70s and '80s; that we shouldn't work on it because it will never work; in fact, don't even say the word if you want to be taken seriously. Now fast forward to this year, and I'm hearing: “Don’t work on it because ChatGPT has almost solved it.” And: “Just scale things up and magic will arise, and nothing else matters.”
So my position is that giving true, human-like common sense to AI is still a moonshot. And you don't reach the Moon by making the tallest building in the world one inch taller at a time. Extreme-scale AI models do acquire an ever-increasing amount of commonsense knowledge, I'll give you that. But remember, they still stumble on trivial problems that even children can solve.
So AI today is awfully inefficient. And what if there is an alternative path, or a path yet to be found? A path that can build on the advancements of deep neural networks, but without going to such extremes with scale.
So this leads us to our final wisdom: innovate your weapons. In the modern-day AI context, that means innovate your data and algorithms. OK, so there are, roughly speaking, three types of data that modern AI is trained on: raw web data; crafted examples custom-developed for AI training; and then human judgments, also known as human feedback on AI performance. If the AI is only trained on the first type, raw web data, which is freely available, it's not good, because this data is loaded with racism and sexism and misinformation. So no matter how much of it you use, garbage in, garbage out. So the newest, greatest AI systems are now powered with the second and third types of data, which are crafted and judged by human workers. It's analogous to writing specialized textbooks for AI to study from and then hiring human tutors to give constant feedback to AI. These are proprietary data, by and large, speculated to cost tens of millions of dollars. We don't know what's in them, but they should be open and publicly available so that we can inspect them and ensure that they support diverse norms and values. So for this reason, my teams at UW and AI2 have been working on commonsense knowledge graphs as well as moral norm repositories to teach AI basic commonsense norms and morals. Our data is fully open so that anybody can inspect the content and make corrections as needed, because transparency is key for such an important research topic.
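The three data types can be caricatured in a few lines. This is a hypothetical sketch with made-up strings and a crude stand-in filter; none of these names refer to any real dataset or pipeline:

```python
# Hypothetical sketch of the three training-data types described above.

raw_web_data = [
    "the sky is blue",
    "CLICK HERE to win $$$",          # spam: garbage in, garbage out
    "water boils at 100 C at sea level",
]

def looks_like_garbage(text: str) -> bool:
    """Crude stand-in for real content filtering."""
    return "$$$" in text or text.isupper()

# Type 1: raw web data, filtered as best we can.
filtered = [t for t in raw_web_data if not looks_like_garbage(t)]

# Type 2: crafted examples, written specifically for AI training.
crafted_examples = [("How do I measure six liters?", "Fill the six-liter jug.")]

# Type 3: human judgments of model outputs (feedback).
human_feedback = [{"output": "30 hours", "rating": "bad"}]

print(len(filtered))  # 2 of the 3 raw strings survive the filter
```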
Now let's think about learning algorithms. No matter how amazing large language models are, by design they may not be best suited to serve as reliable knowledge models. These language models do acquire a vast amount of knowledge, but they do so as a byproduct, as opposed to a direct learning objective, resulting in unwanted side effects such as hallucinations and a lack of common sense. Now, in contrast, human learning is never about predicting which word comes next; it's really about making sense of the world and learning how the world works. Maybe AI should be taught that way as well.
So as a quest toward more direct commonsense knowledge acquisition, my team has been investigating potential new algorithms, including symbolic knowledge distillation, which can take a very large language model, like the one shown here that I couldn't fit onto the screen because it's too large, and crunch it down to much smaller commonsense models using deep neural networks. And in doing so, we also generate, algorithmically, a human-inspectable, symbolic commonsense knowledge representation, so that people can inspect it, make corrections, and even use it to train other neural commonsense models.
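At a very high level, the distillation loop described here has a generate-then-filter shape: a large teacher model proposes commonsense statements, a critic filters out the implausible ones, and what survives is a symbolic knowledge graph a smaller student model can learn from. The sketch below is a toy caricature under that assumption; `teacher_generate` and `critic_score` are hypothetical stand-ins, not real APIs:

```python
# Toy caricature of a generate-then-filter distillation loop.
# teacher_generate and critic_score are hypothetical stand-ins for
# a large teacher language model and a trained critic model.

def teacher_generate(event: str) -> list[str]:
    """Stand-in for sampling candidate inferences from a teacher LM."""
    samples = {
        "X puts on a raincoat": [
            "X wants to stay dry",       # plausible
            "X is about to go outside",  # plausible
            "X becomes a raincoat",      # nonsense the critic should drop
        ],
    }
    return samples.get(event, [])

def critic_score(event: str, inference: str) -> float:
    """Stand-in for a critic scoring plausibility in [0, 1]."""
    return 0.1 if "becomes" in inference else 0.9

def distill(events: list[str], threshold: float = 0.5) -> dict[str, list[str]]:
    """Keep only inferences the critic accepts. The result is a
    human-inspectable symbolic knowledge graph that a much smaller
    student model could then be trained on."""
    graph: dict[str, list[str]] = {}
    for event in events:
        kept = [inf for inf in teacher_generate(event)
                if critic_score(event, inf) >= threshold]
        if kept:
            graph[event] = kept
    return graph

kg = distill(["X puts on a raincoat"])
print(kg)  # the two plausible inferences survive; the nonsense is filtered
```

Because the output is symbolic text rather than opaque weights, a person can read the graph, correct it, and feed the corrected version back into training, which is the transparency point made above.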
More broadly, we have been tackling this seemingly impossible giant puzzle of common sense, ranging from physical, social and visual common sense to theory of minds, norms and morals. Each individual piece may seem quirky and incomplete, but when you step back, it's almost as if these pieces weave together into a tapestry that we call human experience and common sense.
We're now entering a new era in which AI is almost like a new intellectual species with unique strengths and weaknesses compared to humans. In order to make this powerful AI sustainable and humanistic, we need to teach AI common sense, norms and values.
Thank you.
(Applause)
Chris Anderson: Look at that. Yejin, please stay one sec. This is so interesting, this idea of common sense. We obviously all really want this from whatever's coming. But help me understand. Like, so we've had this model of a child learning. How does a child gain common sense apart from the accumulation of more input and some, you know, human feedback? What else is there?
Yejin Choi: So fundamentally, there are several things missing, but one of them is, for example, the ability to make hypotheses and run experiments, to interact with the world and refine those hypotheses. That's how we abstract away concepts about how the world works, and it's how we truly learn, as opposed to today's language models. Some of that is really not quite there yet.
CA: You use the analogy that we can’t get to the Moon by extending a building a foot at a time. But the experience that most of us have had of these language models is not a foot at a time. It's, sort of, a breathtaking acceleration. Are you sure, given the pace at which those things are going? Each next level seems to bring with it what feels kind of like wisdom and knowledge.
YC: I totally agree that it's remarkable how much this scaling things up really enhances the performance across the board. So there's real learning happening due to the scale of the compute and data.
However, there's a quality of learning that is still not quite there. And the thing is, we don't yet know whether we can fully get there or not just by scaling things up. And if we cannot, then there's this question of what else? And then even if we could, do we like this idea of having very, very extreme-scale AI models that only a few can create and own?
CA: I mean, if OpenAI said, you know, "We're interested in your work, we would like you to help improve our model," can you see any way of combining what you're doing with what they have built?
YC: Certainly what I envision will need to build on the advancements of deep neural networks. And it might be that there’s some scale Goldilocks zone, such that ... I'm not imagining that smaller is better either, by the way. It's likely that there's a right amount of scale, but beyond that, the winning recipe might be something else. So some synthesis of ideas will be critical here.
CA: Yejin Choi, thank you so much for your talk.
(Applause)