Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

10 年前，電腦視覺研究人員認為，要讓電腦辨別貓與狗的差別，幾乎是比登天還難，即使用了相當先進的人工智慧都很難辦到。現在我們可以把辨別的準確度提升到 99% 以上。這技術叫做圖像分類—— 給電腦看圖片，並給圖片貼上標籤—— 電腦還可以識別出許多其它類別的東西。

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

我目前是華盛頓大學的研究生，我正在做一個專題叫做「暗黑網路」，它是一個用來訓練及測試電腦視覺模型的神經網路架構。所以，讓我們來瞧瞧暗黑網路對我們照片識別能力的狀況。當我們在這張照片上開啟我們的分類器，可以看到電腦現在不只在預測這是狗或貓，它實際上正在擷取特定品種的預測。這就是現在我們電腦的粒度等級。辨別正確。我的狗的確是隻雪橇犬。

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

所以，我們在圖像識別上已經有了很大的進步，但如果我們用識別器來辨別這樣的照片呢？嗯…… 可以看到從分類器得到的預測也相當類似。沒錯，圖片中有一隻雪橇狗，但它只給出一個標籤，我們對這張照片的理解還不是很完整。我們需要更強的東西。我正在研究一個問題，叫做「物件偵測」，我們把一張照片中的所有物體都找出來，用邊界框把它們框起來，然後標示它們是那些東西。我們來看一下當我們在這一張圖片上執行偵測軟體時，會發生甚麼事。
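As a rough illustration of what "find all of the objects, put bounding boxes around them and say what those objects are" produces, a detector's output can be pictured as a list of labeled boxes. The sketch below is not Darknet's data format; the `Detection` class, its field names, and the sample numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical structure, not Darknet's)."""
    label: str         # what the object is
    confidence: float  # detector score between 0 and 1
    x: float           # left edge of the bounding box, in pixels
    y: float           # top edge of the bounding box, in pixels
    width: float
    height: float

    def area(self) -> float:
        # Box size in square pixels.
        return self.width * self.height

    def center(self) -> Tuple[float, float]:
        # Midpoint of the box, useful for comparing locations.
        return (self.x + self.width / 2, self.y + self.height / 2)


def left_to_right(dets: List[Detection]) -> List[str]:
    """Order labels by the horizontal position of each box center."""
    return [d.label for d in sorted(dets, key=lambda d: d.center()[0])]


# Made-up detections for an image containing a dog and a cat.
detections = [
    Detection("dog", 0.92, 40, 60, 200, 300),
    Detection("cat", 0.88, 300, 150, 120, 160),
]

for d in detections:
    print(d.label, d.confidence, d.area(), d.center())
print(left_to_right(detections))
```

Unlike a single classification label, each entry carries a position and a size, which is what lets a downstream system reason about where things are relative to each other.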

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

現在，有了這類的結果，我們就可以利用電腦視覺演算法，幫我們做更多的事。我們可以看到，電腦知道圖片中有一隻貓和狗。它知道牠們彼此的相對位置、大小。電腦甚至可能知道其它的資訊。它也看到了背景中有一本書。如果你想要建立一個基於電腦視覺系統的實用系統，比如說，自動駕駛車或機械人系統，這類就會是你想要的資訊。你會想要一個可以與實體世界互動的東西。當我開始做物件偵測時，它要花 20 秒才能處理一張圖片。為了讓各位體會為什麼這個領域這麼講究速度，我這邊做個執行物件偵測器的示範，一張照片只要 2 秒的處理時間。所以，比 20 秒一張的偵測器快了 10 倍，各位可以看到，在它識別圖像的過程中，周圍環境已經發生了變化，但對一個應用軟體而言，這樣的速度是很鷄肋的。

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.

如果我們把另一個參數調升到 10 ，這個偵測器每秒就可以識別 5 張圖片。這樣好多了，但，假如，移動很快的時候…… 我可不想在我車上裝這樣慢的系統。
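The speed figures quoted so far follow from simple arithmetic: each step divides the per-image time by 10, and frames per second is the reciprocal of seconds per image. A minimal sketch, using only the numbers from the talk (the variable names are illustrative):

```python
def frames_per_second(seconds_per_image):
    """Convert per-image processing time into throughput."""
    return 1.0 / seconds_per_image


first_detector = 20.0                    # seconds per image, as quoted
second_detector = first_detector / 10    # "10 times faster"
third_detector = second_detector / 10    # "another factor of 10"

for name, seconds in [
    ("first", first_detector),
    ("second", second_detector),
    ("third", third_detector),
]:
    print(f"{name}: {seconds} s/image = {frames_per_second(seconds)} frames/s")
```

The third line of output corresponds to the five-frames-per-second detector described above.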

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

這是在我筆電上運行的即時偵測系統。我在框框附近移動的時候，它可以很順暢地追蹤著我，而且，它可以根據不同的大小、姿勢、前、後來做調整。太棒了。如果我們要建立一個基於電腦視覺系統的實用系統，這個才會是我真正想要的。

(Applause)

（掌聲）

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

所以，才幾年的時間，我們從每 20 秒處理一張照片，進步到每張照片只要 20 毫秒，快了 1000 倍。我們是如何辦到的？過去，物件偵測系統，會把一張像這樣的照片，分割成好幾個小區塊，然後在每一個小區塊運行分類器軟體，相似度得分如果比較高會被識別器認為照片偵測成功。但這樣一張圖片要執行好幾千次的識別指令、經過好幾千次的神經網路評估才有辦法偵測出來。但我們不是這樣做，我們訓練了一個網路模型來幫我們完成所有的偵測。它可以同時產出邊界框並同時對可能的結果進行評估。有了我們的系統，你就不用一張圖片看了好幾千遍才能偵測出來。你只要看一眼 (YOLO)，所以我們簡稱這個物件偵測技術為「YOLO」。所以，有了這樣的辨識速度，我們不只可以偵測圖片；還可以處理即時的影片。現在各位看到的不是貓、狗的靜態圖片，而是有牠們在移動、互動的動態影片。
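The difference between the two approaches can be sketched by counting how many times the network is evaluated. This is a toy comparison, not the actual pipeline of either method: `CountingModel` is a stand-in that returns dummy scores, and the image size, window size, and stride are assumed values chosen only so that the region count lands in the thousands.

```python
class CountingModel:
    """Stand-in for a neural network; it only counts forward passes."""

    def __init__(self):
        self.evaluations = 0

    def classify_region(self, region):
        # Older approach: one network evaluation per candidate region.
        self.evaluations += 1
        return ("object", 0.5)  # dummy label and score

    def detect_whole_image(self, width, height):
        # Single-network approach: one evaluation yields all boxes
        # and class scores together.
        self.evaluations += 1
        return [("object", 0.5, (0, 0, width, height))]  # dummy output


def sliding_windows(width, height, window, stride):
    """Yield (left, top, w, h) for every window position in the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, window, window)


def region_based_detect(model, width, height, window=64, stride=8, threshold=0.4):
    """Classify every region and keep the high-scoring ones."""
    hits = []
    for region in sliding_windows(width, height, window, stride):
        label, score = model.classify_region(region)
        if score >= threshold:
            hits.append((label, score, region))
    return hits


old_style = CountingModel()
region_based_detect(old_style, 448, 448)

single_pass = CountingModel()
single_pass.detect_whole_image(448, 448)

print("region-based evaluations:", old_style.evaluations)
print("single-pass evaluations:", single_pass.evaluations)
```

With these assumed settings the region-based loop evaluates the network 2401 times for one image, while the single-pass version evaluates it once, which is the source of the speedup described above.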

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.

這是我們用微軟 COCO 資料集裡 80 種不同的類別訓練出來的辨識器。它包含各種東西，像是湯匙、叉子、碗這類的日常用品。它還有很多奇妙的東西：動物、車子、斑馬、長頸鹿。現在我們要進行一件好玩的事。我們會進到觀眾席，去看看能辨識到哪些東西。有誰要填充娃娃？這邊還有一些泰迪熊。我們現在降低一下對偵測結果的精確度的要求，這樣我們可以在觀眾席中找到更多東西。我們來看看能不能偵測到停止標誌。我們有偵測到一些背包。現在把鏡頭拉近一點。這真的很厲害。所有的偵測流程都可以在筆電裡即時呈現。
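Turning down the detection threshold, as in the demo above, amounts to keeping boxes with lower confidence scores. A minimal sketch, assuming each detection is a `(label, confidence)` pair; the scores are invented for illustration and `filter_detections` is a hypothetical helper, not a Darknet function:

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d[1] >= threshold]


# Invented raw detector output: (label, confidence).
raw = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("backpack", 0.34),
    ("stop sign", 0.22),
]

strict = filter_detections(raw, 0.5)   # higher threshold, fewer boxes
relaxed = filter_detections(raw, 0.2)  # lower threshold, more boxes

print([label for label, _ in strict])
print([label for label, _ in relaxed])
```

A lower threshold surfaces more objects at the cost of admitting less certain predictions.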

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

更重要的是，這只是一個一般用的物件偵測系統，我們還可以訓練它辨別任何領域的照片。同樣的程式碼，放在自動駕駛車裡，可以偵測到停止標誌、行人、腳踏車，但放到組織切片就可以偵測出癌症細胞。現在全球有很多研究人員已經開始在使用這項技術做進一步的研究，像是醫藥、機械人領域。今天早上，我讀到一篇文章，在奈洛比國家公園裡，他們要對動物們進行統計調查， YOLO 就是其使用的偵測系統的一部分。而這一切都是因為暗黑網路是開放原始碼，在公眾領域，任何人都可以免費使用。

(Applause)

（掌聲）

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.

但我們希望偵測系統可以更親民、更好用，所以在經過模型優化、網路二值化及近似度化的整合後，我們終於可以在手機上偵測物件。
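The talk does not say which binarization scheme was used. As one common way to picture the idea, the sketch below assumes each weight is replaced by its sign and a single shared scale factor, the mean absolute value of the weights. The function names and numbers are illustrative only.

```python
def binarize(weights):
    """Approximate real-valued weights as scale * sign(weight).

    Assumed scheme: the scale is the mean absolute value of the weights.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def exact_dot(weights, inputs):
    """Full-precision dot product, for comparison."""
    return sum(w * x for w, x in zip(weights, inputs))


def binary_dot(scale, signs, inputs):
    """Dot product using only additions, subtractions and one multiply."""
    return scale * sum(s * x for s, x in zip(signs, inputs))


weights = [0.5, -0.25, 0.75, -0.5]
inputs = [2.0, 1.0, 2.0, 1.0]

scale, signs = binarize(weights)
print("scale:", scale, "signs:", signs)
print("exact:", exact_dot(weights, inputs))
print("binarized:", binary_dot(scale, signs, inputs))
```

Storing one sign per weight plus a single scale takes far less memory than full-precision weights, and the result is only an approximation of the exact dot product. That is the kind of trade-off that can make a model small and cheap enough for a phone.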

(Applause)

（掌聲）

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

而我真的相當興奮，因為我們現在在低階的電腦影像處理問題上有了相當強力的解決方式，任何人都可以拿去並創造一些東西。所以，接下來就看各位以及全世界所有人用這個軟體大展身手了，我真的等不及想看看你們用這項科技所做出來的產品。

Thank you.

謝謝。

(Applause)

（掌聲）
