Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
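
The speed figures quoted so far are per-image latencies, and a frame rate is just the reciprocal of the latency. A minimal sketch of that arithmetic, using only the numbers from the talk (the function name `frames_per_second` is made up for illustration):

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert a per-image processing time into a frame rate."""
    return 1.0 / seconds_per_image


# Latencies mentioned in the talk: 20 s, then 10x faster, then 10x faster again.
latencies = [20.0, 2.0, 0.2]

for seconds in latencies:
    fps = frames_per_second(seconds)
    print(f"{seconds:>5} s per image = {fps:.2f} frames per second")
```

At 0.2 seconds per image the detector only sees the scene five times a second, which is why fast movement is still a problem at that speed.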

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
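
To make "all of the bounding boxes and class probabilities simultaneously" concrete, here is a rough sketch of how the output of one network pass could be turned into detections. The talk does not describe the output layout, so the grid format below (one box per cell, stored as `[x, y, w, h, objectness, class probabilities...]`) is an assumption in the spirit of YOLO, not Darknet's actual code. The names `Detection` and `decode_grid` and all the numbers are hypothetical.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    label: str
    score: float
    x: float  # box centre as a fraction of image width
    y: float  # box centre as a fraction of image height
    w: float  # box width as a fraction of image width
    h: float  # box height as a fraction of image height


def decode_grid(output: List[List[List[float]]],
                labels: List[str],
                threshold: float = 0.5) -> List[Detection]:
    """Turn one network output grid into a list of detections.

    Assumed layout: output[row][col] = [x, y, w, h, objectness, p_0, p_1, ...],
    where x and y are offsets inside the grid cell.
    """
    grid = len(output)
    detections = []
    for row, cells in enumerate(output):
        for col, cell in enumerate(cells):
            x_off, y_off, w, h, objectness = cell[:5]
            class_probs = cell[5:]
            best = max(range(len(class_probs)), key=class_probs.__getitem__)
            # Score = "is there an object here" times "which class is it".
            score = objectness * class_probs[best]
            if score >= threshold:
                detections.append(Detection(
                    labels[best], score,
                    (col + x_off) / grid, (row + y_off) / grid, w, h))
    return detections


# A made-up 2x2 output standing in for a single pass over one image.
labels = ["cat", "dog"]
empty = [0.5, 0.5, 0.1, 0.1, 0.0, 0.5, 0.5]
output = [
    [[0.5, 0.5, 0.4, 0.6, 0.9, 0.1, 0.9], empty],
    [empty, [0.5, 0.5, 0.3, 0.3, 0.8, 0.95, 0.05]],
]

for det in decode_grid(output, labels):
    print(det)
```

The point of the sketch is that the loop only reads numbers the network has already produced; there is no second, third, or thousandth network evaluation per image.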

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
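
Turning down the detection threshold simply means keeping boxes the detector is less sure about. A small sketch, with invented labels and scores (nothing here comes from the actual demo, and `filter_detections` is a hypothetical helper):

```python
from typing import List, Tuple

# (label, confidence) pairs; the values are made up for illustration.
candidates: List[Tuple[str, float]] = [
    ("person", 0.92),
    ("teddy bear", 0.61),
    ("backpack", 0.34),
    ("stop sign", 0.27),
    ("person", 0.18),
]


def filter_detections(detections: List[Tuple[str, float]],
                      threshold: float) -> List[Tuple[str, float]]:
    """Keep only detections whose confidence is at least the threshold."""
    return [d for d in detections if d[1] >= threshold]


for threshold in (0.5, 0.25):
    kept = filter_detections(candidates, threshold)
    print(f"threshold {threshold}: {[label for label, _ in kept]}")
```

A lower threshold finds more objects, at the cost of letting through more boxes that may be wrong.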

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
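
The talk does not say how the binarization is done. One common form, shown here purely as an assumption, replaces each real-valued weight with its sign and keeps a single scale factor (the mean absolute weight), so most of the arithmetic reduces to additions and subtractions. The function names and numbers below are illustrative only.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


weights = [0.8, -0.6, 0.7, -0.9]   # made-up full-precision weights
inputs = [1.0, 2.0, 3.0, 4.0]      # made-up activations

alpha, signs = binarize(weights)
exact = dot(weights, inputs)
approx = alpha * dot(signs, inputs)

print(f"alpha = {alpha:.2f}, signs = {signs}")
print(f"exact = {exact:.2f}, binarized approximation = {approx:.2f}")
```

The approximation is not exact, which is the trade: a smaller, cheaper network in exchange for some accuracy, making it practical on hardware like a phone.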

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
