Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
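
To make the difference between the two tasks concrete, here is a minimal Python sketch of what each one returns. The `Detection` type, its field names, and every coordinate and score are hypothetical values chosen for illustration; this is not Darknet's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a label, a score, and a bounding box in pixels."""
    label: str
    confidence: float
    x: float  # left edge of the box
    y: float  # top edge of the box
    w: float  # box width
    h: float  # box height

    def area(self) -> float:
        return self.w * self.h

    def center(self) -> tuple:
        return (self.x + self.w / 2, self.y + self.h / 2)


# Image classification: a single label for the whole image.
classification = "malamute"

# Object detection: one entry per object, each with its own box.
# These numbers are made up for the example.
detections = [
    Detection("dog", 0.92, x=40, y=120, w=300, h=260),
    Detection("cat", 0.88, x=380, y=200, w=180, h=150),
]


def left_to_right(dets):
    """Order labels by the horizontal position of each box center."""
    return [d.label for d in sorted(dets, key=lambda d: d.center()[0])]


print(classification)
print(left_to_right(detections))
print([(d.label, d.area()) for d in detections])
```

Because every object carries its own box, questions such as "which object is larger?" or "which one is on the left?" become simple arithmetic on the detector's output, which a single image-level label cannot answer.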

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world.

Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
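
The arithmetic behind these speeds can be sketched in a few lines of Python. The per-image times (20 s, 2 s, 0.2 s) come from the talk; the vehicle speed is a hypothetical figure added only to show how far a car would travel between two predictions.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert processing time per image into a frame rate."""
    return 1.0 / seconds_per_image


def distance_travelled(speed_m_per_s: float, seconds_per_image: float) -> float:
    """Distance covered while the detector is busy with one image."""
    return speed_m_per_s * seconds_per_image


# Assumption for illustration only: a car moving at 15 m/s (about 54 km/h).
ASSUMED_SPEED_M_PER_S = 15.0

for latency in (20.0, 2.0, 0.2):
    fps = frames_per_second(latency)
    gap = distance_travelled(ASSUMED_SPEED_M_PER_S, latency)
    print(f"{latency:5.1f} s/image = {fps:5.2f} fps, car moves {gap:6.1f} m per prediction")
```

At the assumed speed the car covers roughly 300 m, 30 m and 3 m between predictions, which is one way to see why even five frames per second is uncomfortable for driving.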

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of the detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
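
A rough way to see the difference in cost is to count network evaluations. The Python sketch below assumes a simple sliding-window scheme for the older region-based approach, and a grid-shaped output for the single-pass approach. The image size, window sizes, stride, grid size, and box layout are all illustrative assumptions, not the exact settings of any system mentioned in the talk.

```python
def sliding_window_regions(width, height, window, stride):
    """Yield (left, top, w, h) square regions covering the image at one scale."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, window, window)


def region_based_evaluations(width, height, windows, stride):
    """Region-based detection: one classifier run per candidate region."""
    return sum(
        1
        for window in windows
        for _ in sliding_window_regions(width, height, window, stride)
    )


def single_pass_evaluations():
    """Single-network detection: the whole image goes through the network once."""
    return 1


def single_pass_output_size(grid, boxes_per_cell, num_classes):
    """Numbers produced by one pass, assuming each grid cell predicts a few
    boxes (x, y, w, h, confidence) plus one set of class probabilities."""
    return grid * grid * (boxes_per_cell * 5 + num_classes)


# Hypothetical settings: a 448x448 image, three window sizes, stride of 8 pixels.
many = region_based_evaluations(448, 448, windows=(64, 128, 256), stride=8)
once = single_pass_evaluations()
print(f"region-based: {many} evaluations, single pass: {once} evaluation")
print("values from one pass:", single_pass_output_size(grid=7, boxes_per_cell=2, num_classes=20))
```

Under these assumptions the region-based scheme needs several thousand classifier runs per image, while the single pass emits every box and class probability from one evaluation.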

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
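
Turning down the detection threshold, as done in the demo, can be sketched as a simple filter on confidence scores. The labels and scores below are invented for illustration, and `filter_detections` is a hypothetical helper, not a Darknet function.

```python
def filter_detections(detections, threshold):
    """Keep only the (label, score) pairs whose score reaches the threshold."""
    return [(label, score) for label, score in detections if score >= threshold]


# Hypothetical raw detector output: (label, confidence score).
raw = [
    ("person", 0.91),
    ("teddy bear", 0.64),
    ("backpack", 0.38),
    ("stop sign", 0.27),
    ("person", 0.22),
]

strict = filter_detections(raw, threshold=0.5)
relaxed = filter_detections(raw, threshold=0.25)

print(len(strict), "detections at threshold 0.5:", strict)
print(len(relaxed), "detections at threshold 0.25:", relaxed)
```

A lower threshold surfaces more objects, at the cost of letting through predictions the network is less sure about.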

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
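
One widely described form of network binarization replaces each floating-point weight with its sign and keeps a single scaling factor for the group. The Python sketch below illustrates that general idea only. It is an assumption about the technique, not the specific optimization behind the phone demo, and the function names and numbers are hypothetical.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def dot(a, b):
    """Ordinary full-precision dot product."""
    return sum(x * y for x, y in zip(a, b))


def binary_dot(alpha, signs, inputs):
    """Dot product with binarized weights: each multiply becomes an add or a
    subtract, followed by one multiplication by the shared scale."""
    total = sum(x if s > 0 else -x for s, x in zip(signs, inputs))
    return alpha * total


# Made-up weights and inputs for illustration.
weights = [0.4, -0.6, 0.5, -0.5]
inputs = [1.0, 2.0, 3.0, 4.0]

alpha, signs = binarize(weights)
print("scale:", alpha, "signs:", signs)
print("full precision:", dot(weights, inputs))
print("binarized:     ", binary_dot(alpha, signs, inputs))
```

Storing one bit per weight instead of a 32-bit float cuts weight storage by about 32 times, and the result is only an approximation of the full-precision output, which is the trade-off that makes small devices feasible.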

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
