Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advances in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label on that image -- and computers know thousands of other categories as well.
I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.
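The breed-level predictions described above come from the last step of any image classifier: the network emits one raw score per category, and a softmax turns those scores into a ranked list of probabilities. Here is a minimal sketch of that step; the class names and scores are made up for illustration, not output from Darknet, which ranks thousands of categories this way.

```python
import math

def softmax(logits):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_predictions(class_names, logits, k=3):
    """Return the k most likely (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: -pair[1])
    return ranked[:k]

# Illustrative scores only -- a real network produces these from the image.
names = ["malamute", "Siberian husky", "Eskimo dog", "tabby cat"]
logits = [6.2, 3.1, 2.5, -1.0]
for name, p in top_predictions(names, logits):
    print(f"{name}: {p:.2%}")
```

With these made-up scores, "malamute" comes out on top -- the fine-grained, breed-level granularity the talk describes is just a softmax over a very long list of categories.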
So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
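A detector's output -- boxes around objects, plus what each object is -- can be sketched as a small data structure. The labels and pixel coordinates below are invented for illustration; the `iou` function shows the standard intersection-over-union measure detectors use to compare two boxes.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # what the object is
    confidence: float   # how sure the detector is, 0..1
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels

def iou(a, b):
    """Intersection-over-union: overlap area divided by combined area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# Hypothetical detections for an image with a dog and a cat.
dets = [Detection("malamute", 0.97, (40, 60, 300, 420)),
        Detection("cat", 0.91, (320, 200, 480, 400))]
print(iou(dets[0].box, dets[1].box))   # 0.0 -- these two boxes don't overlap
```

Compared with a single whole-image label, each detection carries location and size, which is what lets downstream systems reason about where things are.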
Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.
If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.
(Applause)
So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of the detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
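The contrast above can be sketched in a few lines: instead of evaluating a classifier once per region, one forward pass emits a fixed grid of predictions, each containing a box, an objectness score, and class probabilities, and decoding that grid yields every detection at once. This is a toy illustration of the idea, not YOLO's actual architecture; the "network output" is just random numbers standing in for a real forward pass.

```python
import random

S = 7                                  # the output is an S x S grid of cells
CLASSES = ["dog", "cat", "book"]       # illustrative class list

random.seed(0)

def fake_network_output():
    """Stand-in for one forward pass: each cell holds (x, y, w, h, conf) + class probs."""
    return [[[random.random() for _ in range(5 + len(CLASSES))]
             for _ in range(S)] for _ in range(S)]

def decode(output, threshold=0.6):
    """A single sweep over the grid produces all detections simultaneously."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row][col]
            x, y, w, h, conf = cell[:5]
            probs = cell[5:]
            best = max(range(len(probs)), key=lambda i: probs[i])
            score = conf * probs[best]   # objectness times class probability
            if score > threshold:
                detections.append((CLASSES[best], score, (x, y, w, h)))
    return detections

dets = decode(fake_network_output())
print(f"{len(dets)} detections from one pass over {S * S} grid cells")
```

The speedup comes from the structure: one network evaluation for the whole image, rather than thousands of classifier evaluations over cropped regions.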
This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
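"Turning down the threshold" in the demo means keeping detections whose confidence scores would otherwise be discarded, trading some precision for finding more objects. A minimal sketch, with made-up labels and scores:

```python
# Hypothetical detections with confidence scores, as a detector might emit
# when pointed at an audience. None of these values come from a real run.
detections = [("person", 0.92), ("backpack", 0.81), ("stop sign", 0.64),
              ("teddy bear", 0.41), ("person", 0.28)]

def keep(dets, threshold):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in dets if d[1] >= threshold]

print(len(keep(detections, 0.60)))   # 3 -- only the confident detections
print(len(keep(detections, 0.25)))   # 5 -- a lower threshold surfaces more
```

Lowering the threshold is how the demo "finds more of you guys out in the audience": borderline detections that a strict threshold would suppress get drawn on screen.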
And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine and robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of their detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.
(Applause)
But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
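One of the tricks mentioned, network binarization, replaces full-precision weights with ±1 values plus a single scaling factor, shrinking the model and turning expensive multiplies into cheap sign flips. Below is a sketch of the standard sign-times-mean-magnitude approximation (the scheme popularized by XNOR-Net); the exact optimizations used in the phone demo aren't specified in the talk, so treat this purely as an illustration of the idea.

```python
def binarize(weights):
    """Approximate a weight vector as alpha * sign(w), where alpha is
    the mean absolute value of the original weights."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs

# Illustrative weights; a real layer has thousands or millions of them.
weights = [0.4, -0.2, 0.1, -0.5]
alpha, signs = binarize(weights)
approx = [alpha * s for s in signs]
print(signs)    # [1.0, -1.0, 1.0, -1.0]
print(approx)
```

Each weight now needs one bit for its sign plus one shared float per layer, instead of 32 bits apiece, which is what makes detection feasible on phone-class hardware.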
(Applause)
And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.
Thank you.
(Applause)