Fei-Fei Li: How we're teaching computers to understand pictures

Let me show you something.

容我為各位呈現一些照片

(Video) Girl: Okay, that's a cat sitting in a bed. The boy is petting the elephant. Those are people that are going on an airplane. That's a big airplane.

(影片)女孩：嗯，這是一隻貓，坐在床上。這男孩在拍撫一隻象。這些人要去搭飛機。好大的飛機。

Fei-Fei Li: This is a three-year-old child describing what she sees in a series of photos. She might still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that can play only music we like. Yet, our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most frontier and potentially revolutionary technologies in computer science.

主講人：這是由一位三歲的小孩所描述她看到的一系列照片雖然對於這世界她還有更多要學習的地方，但是她已經是其中一項重要技能的專家-- 為所見之聞賦予意義。科技在我們的社會已進展到前所未有的程度：我們把人送上月球、發明可以與人交談的電話，或是客製一個電台，只播放個人喜歡的音樂。然而這台無比聰明的機器和電腦仍然無法發展這項技能，因此今天我來到這裡向各位報告我們在電腦視覺的最新研究進展，這是現階段在資訊業領域中，最先進、最具潛力的革命性技術。

Yes, we have prototyped cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over massive land, but don't have enough vision technology to help us to track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view, and you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.

是的，目前我們已經有自動駕駛的原型車，但若不具備視覺辨識技術，它將無法分辨同樣出現在馬路中，一團它其實輾過也無妨的破紙袋，以及一個大到它必須閃避的石塊，兩者有何不同。我們製造出畫素極高的相機，但我們卻無法賦予盲人視覺；無人機可以翻山越嶺，卻沒有足夠的視覺技術可以讓我們追蹤雨林的變化；監視器滿佈在各個角落，卻無法在看到一個孩子將溺斃在泳池之際，對我們發出警訊。靜態及動態影像已逐漸與全世界的生活密不可分，它們發展的步伐已經遠遠超越人類及其群體所相信的，在座各位以及我自己都是TED這個活動裡頭的推手。然而，目前最先進的軟體卻仍在其中苦苦掙扎，無法理解與應用這龐大的資料體。換而言之，在這整個社會裡，大家都有如盲人在運作，因為連我們最聰明的機器都還看不見。

"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting lights into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding. In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.

或許有人會問：這到底有什麼困難？任何相機都可以產生像這樣的照片，它是藉由將有色光轉換成2D的數字陣列，也就是大家熟知的像素。但這些數字是死的，並沒有被賦予意義。就好像有「聽」，不代表有「到」。同樣地，攝取到影像不等於看見，我們所認知的看到，應包含著了解其中的意義。事實上，這樣的成果，是大自然花了五億四千萬年的光陰才得到的。這其中的努力，泰半是耗費在發展腦部的視覺處理這個區塊，而不是眼睛的部分。也就是說，視覺始於眼睛，但真正使它有用的，卻是大腦。
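To make the "lifeless numbers" point concrete, here is a minimal sketch of what a camera hands to a program. The 4x4 grayscale grid and the `describe` helper are illustrative assumptions, not anything shown in the talk.

```python
# A tiny 4x4 grayscale "image": each number is one pixel's brightness (0-255).
# The values are made up purely for illustration.
image = [
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

def describe(pixels):
    """Return the facts a program can read directly off the numbers."""
    height = len(pixels)
    width = len(pixels[0])
    total = sum(sum(row) for row in pixels)
    return {
        "height": height,
        "width": width,
        "mean_brightness": total / (height * width),
    }

summary = describe(image)
print(summary)
```

The program can report the grid's size and average brightness, but nothing in these numbers says what the picture depicts. That gap is the difference between taking a picture and seeing.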

So for 15 years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence. So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them.

十五年來，從在加州理工學院攻讀博士開始，到領導史丹佛的視覺實驗室，我和指導教授、同事及學生們，試圖讓電腦擁有智能之眼，我們研究的領域稱之為電腦視覺與機器學習，這是人工智慧其中一環。我們的終極目標就是教導機器能夠像人一樣理解所見之物，像是識別物品、辨認人臉、推論物體的幾何形態，進而理解其中的關聯、情緒、動作及意圖。在座每一位和我，都可以在匆匆一瞥的瞬間，理解到人事、地、物所交織而成的網絡，

The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In its simplest terms, imagine this teaching process as showing the computers some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? After all, a cat is just a collection of shapes and colors, and this is what we did in the early days of object modeling. We'd tell the computer algorithm in a mathematical language that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It's all curled up. Now you have to add another shape and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple as a household pet can present an infinite number of variations to the object model, and that's just one object.

要電腦達成這個目標的第一步，就是教導它辨別物品，這是視覺的基石。簡單來說，我們教導的方法就是給電腦看一些特定物體的影像，例如貓咪。我們設計了一個程式讓電腦利用這些影像來學習這有啥困難？貓咪不就是由一些幾何圖形和顏色所組成的嘛，這就是我們初期所做的物體模型。我們用數學語言來告知電腦演繹方法，貓就是有圓圓的臉、胖胖的身體，兩個尖尖的耳朵和一條長尾巴。看起來很好啊，但如果貓咪長這樣呢？ (觀眾笑) 全身都捲起來了。這下子我們又得在原來的模型加上新的形狀和不同的視野角度。又，如果貓咪是躲著的呢？像這群傻貓？這樣各位了解我的意思嗎？即使簡單如貓這樣的家庭寵物，也會有相對於原型以外，無數的其他形態表徵，而這只是其中一樣。
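The brittleness of a hand-written object model can be sketched in a few lines. This is only a toy under stated assumptions: the feature names and the `looks_like_cat` rule are hypothetical, and early object models were expressed mathematically rather than as a set of labels.

```python
# Hypothetical hand-written "cat" rule: every listed feature must be visible.
REQUIRED_FEATURES = {"round_face", "chubby_body", "two_pointy_ears", "long_tail"}

def looks_like_cat(visible_features):
    """Return True only if all required features are present."""
    return REQUIRED_FEATURES <= set(visible_features)

sitting_cat = {"round_face", "chubby_body", "two_pointy_ears", "long_tail"}
curled_up_cat = {"chubby_body", "two_pointy_ears"}  # face and tail tucked away
hidden_cat = {"two_pointy_ears"}                    # mostly out of view

results = {
    "sitting": looks_like_cat(sitting_cat),
    "curled_up": looks_like_cat(curled_up_cat),
    "hidden": looks_like_cat(hidden_cat),
}
print(results)
```

Only the textbook pose passes. Every new pose, viewpoint, or occlusion would need another hand-written rule, which is why this approach does not scale.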

So about eight years ago, a very simple and profound observation changed my thinking. No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples. If you consider a child's eyes as a pair of biological cameras, they take one picture about every 200 milliseconds, the average time an eye movement is made. So by age three, a child would have seen hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.

因此八年前，一項極其簡單和深刻的觀察，改變了我的想法，沒有人教導孩子如何去「看」，特別是在早期發育階段，他們是從真實世界的經驗中學習。如果你把孩童的眼睛當成生物相機的概念，就如同每200毫秒就拍一張照片一樣，這是眼球移動的平均時間。年紀到了三歲時，孩子們已經看過了真實世界中數以億計的照片，這樣的訓練範例是很大量的。因此，我的直覺告訴我應該以孩童的學習經驗法則，並兼以質與量，提供訓練的資料給電腦，而非一味追求更好的程式演算。
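The "hundreds of millions of pictures" figure can be checked with quick arithmetic. The rate of one picture per 200 milliseconds comes from the talk. The 12 waking hours per day is an assumption added here to make the estimate concrete.

```python
PICTURES_PER_SECOND = 5      # one picture every 200 ms (from the talk)
WAKING_HOURS_PER_DAY = 12    # assumption, not stated in the talk
DAYS_PER_YEAR = 365
YEARS = 3

def pictures_seen(years, waking_hours_per_day, pictures_per_second):
    """Estimate how many 'pictures' the eyes take over the given years."""
    waking_seconds = years * DAYS_PER_YEAR * waking_hours_per_day * 3600
    return waking_seconds * pictures_per_second

estimate = pictures_seen(YEARS, WAKING_HOURS_PER_DAY, PICTURES_PER_SECOND)
print(f"{estimate:,}")  # 236,520,000
```

Under these assumptions the count lands in the hundreds of millions, and changing the waking hours to anything between 8 and 16 keeps it in that range.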

Once we knew this, we knew we needed to collect a data set that has far more images than we have ever had before, perhaps thousands of times more, and together with Professor Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction of the imagery a child's mind takes in in the early developmental years.

有了上述的洞見，我們接下來必須要收集前所未有的大量資料群，甚至於是千倍以上的。於是我與普林斯頓大學的李凱教授共同於2007年開始了我們稱之為 ImageNet 的專案。很幸運地，我們不必在頭上綁一個相機，然後花費數年收集影像，而是轉而由網際網路，這個由人類所創造出來龐大的影像寶窟，我們下載了將近十億幅影像，並且使用如Amazon Mechanical Turk 這樣的群眾外包平台，來協助我們處理及分類這些照片。在高峰期，ImageNet 甚至是整個亞馬遜平台最大的雇主之一，我們一共聘請了來自167個國家，約5萬個工作者，來協助我們分類處理並標示將近10億幅影像，花費了這麼多的資源，就是為了捕捉那一絲絲孩童在早期心智發展的浮光掠影。

In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey for quite a while. Some very friendly colleagues advised me to do something more useful for my tenure, and we were constantly struggling for research funding. Once, I even joked to my graduate students that I would just reopen my dry cleaner's shop to fund ImageNet. After all, that's how I funded my college years.

用現在眼光看來，使用大量的資料來訓練電腦演算是明顯合理的，然而在2007年的世界卻非如此。有好長一段時間，我們在這個旅途中孤獨地踽踽而行，有些同事好心地建議我，與其苦苦掙扎於研究經費的募集，還不如轉而先做些比較好拿到終身聘的研究，我還曾跟我的研究生開玩笑說我乾脆再開一間乾洗店來資助ImageNet 好了，畢竟那就是我用以支付大學學費的方法。

So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species of domestic and wild cats. We were thrilled to have put together ImageNet, and we wanted the whole research world to benefit from it, so in the TED fashion, we opened up the entire data set to the worldwide research community for free. (Applause)

就這樣我們還是繼續往前走， 2009年起，ImageNet 已經是個擁有涵蓋了兩萬兩千種不同類別，多達1500萬幅圖像的資料庫，並組織以英語日常生活用字為主，這樣的規模，不論是「質」或「量」都是史無前例的。用貓來舉個例子說明，我們有超過六萬兩千隻外觀和姿勢各異的貓咪，橫跨不同的種類，有家貓，也有野貓。 ImageNet 的成果讓我們非常激動，我們希望它有助於全世界的研究，就如同 TED 的貢獻，我們免費提供整個資料庫給全世界的研究單位。 (觀眾鼓掌)

Now that we have the data to nourish our computer brain, we're ready to come back to the algorithms themselves. As it turned out, the wealth of information provided by ImageNet was a perfect match to a particular class of machine learning algorithms called convolutional neural networks, pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun back in the 1970s and '80s. Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands or even millions of nodes are organized in hierarchical layers, also similar to the brain. In a typical neural network we use to train our object recognition model, it has 24 million nodes, 140 million parameters, and 15 billion connections. That's an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here's a computer algorithm telling us the picture contains a boy and a teddy bear; a dog, a person, and a small kite in the background; or a picture of very busy things like a man, a skateboard, railings, a lamppost, and so on. Sometimes, when the computer is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer instead of committing too much, just like we would do, but other times our computer algorithm is remarkable at telling us what exactly the objects are, like the make, model, and year of the cars.

有了這些資料，我們可以教育我們的電腦，下一步就是回到程式演算的部分了。結果我們發現，ImageNet 所提供的豐富資訊恰巧與機器學習演算的其中一門特定領域不謀而合，我們稱它為「卷積神經網絡」，在七零及八零年代，福島邦彥、Geoff Hinton 和 Yann LeCun 等學者為該領域的先驅。正如同大腦是由無數個緊密連結的神經元所組成，神經網絡的基本運作單位也是一個類神經元的節點。它的運作方式是從別的節點得到資料，然後再傳給其他的節點。而且這些數不清的節點擁有層層的組織架構，就好像我們的大腦一樣。在一般的神經網絡中，我們用作訓練的物品辨識模型就有兩千四百萬個節點、一億四千萬個參數，以及一百五十億個連結。這是一個大的不得了的模型。由ImageNet 提供巨大的資料群、並使用先進的核心處理器及圖型處理器來訓練這個龐然大物，卷積神經網絡就在眾人的意料外開花結果了。在物品辨識領域中，這樣的架構以令人興奮的嶄新成果，傲視群雄。電腦告訴我們這張圖中有隻貓，還告訴我們貓在哪裡。當然，這世界不會只有貓，電腦的演算告訴我們這張圖中有一個男孩和一隻泰迪熊；有狗，一個人，以及背景中的一支小風箏；或這一張令人眼花撩亂的圖，有人、滑板、欄杆、路燈，等等。有時候，如果電腦不確定自己所見到的東西時，我們已經將它教到可以聰明地給一個安全的答案，而非莽撞地回答，就像一般人會做的。更有些時候，電腦的運算竟能夠精準地辨別物體品項例如製造商、型號、車子的年份。
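The neuron-like node described above, and the way a convolutional layer reuses it across an image, can be sketched in plain Python. This is a toy under stated assumptions, not the model from the talk: the `node` and `conv2d` helpers, the ReLU nonlinearity, and the hand-picked edge kernel are all illustrative, and in a real network the weights are learned from training data rather than written by hand.

```python
def relu(x):
    """Nonlinearity: pass positive values through, clamp the rest to zero."""
    return x if x > 0 else 0.0

def node(inputs, weights, bias):
    """A neuron-like node: weighted sum of its inputs, then a nonlinearity."""
    return relu(sum(i * w for i, w in zip(inputs, weights)) + bias)

def conv2d(image, kernel, bias=0.0):
    """Apply the same node to every kernel-sized patch of the image."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [w for row in kernel for w in row]
    out = []
    for r in range(len(image) - kh + 1):
        out_row = []
        for c in range(len(image[0]) - kw + 1):
            patch = [image[r + i][c + j] for i in range(kh) for j in range(kw)]
            out_row.append(node(patch, flat_kernel, bias))
        out.append(out_row)
    return out

# A 3x4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked weights that respond to a dark-to-bright vertical edge.
edge_kernel = [[-1, 1],
               [-1, 1]]

feature_map = conv2d(image, edge_kernel)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The output is non-zero only where the patch straddles the edge. Stacking many such layers, each feeding the next, gives the hierarchical organization the talk describes.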

We applied this algorithm to millions of Google Street View images across hundreds of American cities, and we have learned something really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices also correlate well with crime rates in cities, or voting patterns by zip codes.

我們將這個演算法運用在數百個美國城市、數百萬張 Google 街景影像上，也因此我們從中得到了一些有趣的概念。首先，它證實了一項廣為人知的說法，也就是汽車價格和家庭收入是息息相關的。然而令人驚訝的是，汽車價格也和城市中的犯罪率以及區域選舉模式，有相當的關係。

So wait a minute. Is that it? Has the computer already matched or even surpassed human capabilities? Not so fast. So far, we have just taught the computer to see objects. This is like a small child learning to utter a few nouns. It's an incredible accomplishment, but it's only the first step. Soon, another developmental milestone will be hit, and children begin to communicate in sentences. So instead of saying this is a cat in the picture, you already heard the little girl telling us this is a cat lying on a bed.

等等，難道說我今天就是來告訴各位電腦已經趕上甚至超越人類了嗎？還早得很呢。到目前為止，我們只是教導電腦識別物品，就像小孩子牙牙學語一樣，雖然這是個傲人的進展，但它不過是第一步而已，很快地，下一波具指標性的後浪就會打上來了，小孩子開始進展到用句子來溝通。因此，他已經不會用「這是貓」來描述圖片，而是會聽到這個小女孩說「這是躺在床上的貓」。

So to teach a computer to see a picture and generate sentences, the marriage between big data and machine learning algorithm has to take another step. Now, the computer has to learn from both pictures as well as natural language sentences generated by humans. Just like the brain integrates vision and language, we developed a model that connects parts of visual things like visual snippets with words and phrases in sentences.

因此，要教導電腦看到圖並說出句子，必須進一步地仰賴龐大資料群以及機器的學習演算。現在，電腦不僅要學習圖片識別，還要學習人類自然的說話方式。就如同大腦要結合視覺和語言一樣，我們做出了一個模型，它可以連結不同的可視物體，就像視覺片段一樣，並附上句子用的字詞和片語。

About four months ago, we finally tied all this together and produced one of the first computer vision models that is capable of generating a human-like sentence when it sees a picture for the first time. Now, I'm ready to show you what the computer says when it sees the picture that the little girl saw at the beginning of this talk.

約四個月前，我們終於把所有的元素全部兜起來了，做出了第一個電腦版的模型，它有辦法在初次看到照片時說出像人類般自然的句子，好，現在我要給各位看看電腦對於演講一開頭那位小女孩所看到的影像，它又是如何理解的。

(Video) Computer: A man is standing next to an elephant. A large airplane sitting on top of an airport runway.

(電腦) 有個人站在大象旁邊。一架大飛機停在機場跑道上。

FFL: Of course, we're still working hard to improve our algorithms, and it still has a lot to learn. (Applause)

(主講人) 當然，我們仍戮力於改善這電腦程式，它還有很多要學。 (觀眾鼓掌)

And the computer still makes mistakes.

電腦還是會犯錯。

(Video) Computer: A cat lying on a bed in a blanket.

(電腦) 一隻貓包著毯子躺在床上。

FFL: So of course, when it sees too many cats, it thinks everything might look like a cat.

(主講人) 因為它看了太多貓了，以至於它見到了什麼都像貓咪。

(Video) Computer: A young boy is holding a baseball bat. (Laughter)

(電腦) 一位小男孩握著一支球棒。 (觀眾笑)

FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.

(主講人) 或者，如果電腦是第一次看到牙刷，會把它與球棒混淆。

(Video) Computer: A man riding a horse down a street next to a building. (Laughter)

(電腦) 一個人在建築物旁的街道上騎馬。 (觀眾笑)

FFL: We haven't taught Art 101 to the computers.

(主講人) 我們還沒讓電腦上基礎美術課。

(Video) Computer: A zebra standing in a field of grass.

(電腦) 一匹斑馬站在原野中。

FFL: And it hasn't learned to appreciate the stunning beauty of nature like you and I do.

(主講人) 電腦還沒辦法像人類一樣，學會欣賞大自然的美景。

So it has been a long journey. To get from age zero to three was hard. The real challenge is to go from three to 13 and far beyond. Let me remind you with this picture of the boy and the cake again. So far, we have taught the computer to see objects or even tell us a simple story when seeing a picture.

這是條漫漫長路，要從零歲發展到三歲是很難的，更艱深的挑戰在於從三歲發展到十三歲，甚至到更遠的階段。讓我用這張男孩與蛋糕的圖片來進一步說明，直到今日，我們已經教會了電腦識別物品，甚至於在看到一張圖後，可以簡單地敘述。

(Video) Computer: A person sitting at a table with a cake.

(電腦) 一個人和蛋糕坐在桌旁。

FFL: But there's so much more to this picture than just a person and a cake. What the computer doesn't see is that this is a special Italian cake that's only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father after a trip to Sydney, and you and I can all tell how happy he is and what's exactly on his mind at that moment.

(主講人) 這張照片其實蘊涵著更多的東西，不僅只有人和蛋糕。電腦看不出這是種特別的義式蛋糕，人們只有在復活節時才會做。這個男孩穿著他最心愛的T恤，是去雪梨玩的時候，他的父親送的，各位和我都可以看得出他有多快樂，以及當時他的心裡在想什麼。

This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have extra pairs of tireless eyes to help them to diagnose and take care of patients. Cars will run smarter and safer on the road. Robots, not just humans, will help us to brave the disaster zones to save the trapped and wounded. We will discover new species, better materials, and explore unseen frontiers with the help of the machines.

這是我兒子，李奧。在探索智能視覺的旅途上，我不斷地想到他，以及他在將來生活的世界，當未來，機器有了視覺，醫生和護士就多了雙永不倦怠的眼睛，幫助他們診斷及照顧病人；行駛在路上的車子可以更聰明、更安全；人類與機器人能一起共同投入災區的救援工作，拯救受困人員及傷者；我們還可以發現新品種與更好的材料，探索未知的疆界，這一切都可仰賴機器的協助。

Little by little, we're giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won't be the only ones pondering and exploring our world. We will not only use the machines for their intelligence, we will also collaborate with them in ways that we cannot even imagine.

一步一步地，我們賦予機器視覺，先教他們識別物品，然後它們也讓我們看得更清楚，這是第一次人類的眼睛不是唯一可以用來思考和探索世界的工具，我們不僅可以利用機器的智能，更可以運用更多你想像不到的方式攜手合作。

This is my quest: to give computers visual intelligence and to create a better future for Leo and for the world.

這是我想追求的目標：給予機器智慧之眼，為李奧和整個世界創造更美好的未來。

Thank you.

謝謝各位。

(Applause)

(觀眾鼓掌)

Related talks

Jeremy Howard: The wonderful and terrifying implications of computers that can learn

Pawan Sinha: How brains learn to see

Patricia Kuhl: The linguistic genius of babies

Joseph Redmon: How computers learn to recognize objects instantly

Sebastian Thrun and Chris Anderson: What AI is -- and isn't

Linda Liukas: A delightful way to teach kids about computers
