Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
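
The gap between a classifier's output and a detector's output can be sketched as a data structure. This is only an illustration: the `Detection` type, its field names, and every number below are made up for the example and are not Darknet's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a class label, a confidence score, and a box."""
    label: str
    score: float  # confidence in [0, 1]
    x: float      # left edge of the box, in pixels
    y: float      # top edge of the box, in pixels
    w: float      # box width, in pixels
    h: float      # box height, in pixels

    def area(self) -> float:
        return self.w * self.h

    def center(self) -> tuple:
        return (self.x + self.w / 2, self.y + self.h / 2)


# A classifier returns a single label for the whole image.
classification = "malamute"

# A detector returns one entry per object (hypothetical values).
detections = [
    Detection("dog", 0.92, x=40, y=120, w=300, h=260),
    Detection("cat", 0.88, x=380, y=200, w=180, h=150),
    Detection("book", 0.55, x=500, y=30, w=90, h=60),
]

# Relative location and size can be read directly off the boxes.
dog, cat = detections[0], detections[1]
dog_is_left_of_cat = dog.center()[0] < cat.center()[0]
dog_is_larger = dog.area() > cat.area()
```

With boxes attached to each label, questions such as "which object is larger" or "what is to the left of what" become simple comparisons, which a single image-level label cannot answer.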

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
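
The arithmetic behind these speeds can be checked with a short sketch. The per-image times come from the talk; the car speed of 30 m/s and the helper names are assumptions added only to illustrate how far the world can move between two predictions.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert processing time per image into a frame rate."""
    return 1.0 / seconds_per_image


def distance_travelled(speed_m_per_s: float, seconds_per_image: float) -> float:
    """Distance covered while the detector is busy with one image."""
    return speed_m_per_s * seconds_per_image


CAR_SPEED = 30.0  # assumption: roughly highway speed, in metres per second

for latency in (20.0, 2.0, 0.2):  # seconds per image, as quoted in the talk
    fps = frames_per_second(latency)
    gap = distance_travelled(CAR_SPEED, latency)
    print(f"{latency} s/image -> {fps:.2f} frames/s, car moves {gap:.1f} m per frame")
```

Under that assumed speed, even the detector running at five frames per second lets the car cover several metres between consecutive predictions, which is the concern raised above.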

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
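
The contrast between the two approaches can be sketched by counting network evaluations. This is a simplified illustration, not the actual pipeline of any earlier detector or of YOLO: `CountingModel` is a hypothetical stand-in that runs no real network, and the window sizes, strides, and image size are assumptions.

```python
def sliding_regions(width, height, size, stride):
    """Yield (x, y, w, h) square regions that tile the image with overlap."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, size, size)


class CountingModel:
    """Stand-in for a neural network; it only counts how often it is run."""

    def __init__(self):
        self.evaluations = 0

    def classify_region(self, image, region):
        self.evaluations += 1
        return ("background", 0.0)  # placeholder prediction

    def detect_all(self, image):
        self.evaluations += 1
        return []  # placeholder: all boxes and class probabilities at once


def detect_by_regions(model, image, width, height, threshold=0.5):
    """Older style: classify every region, keep the high-scoring ones."""
    detections = []
    for size in (64, 128, 256):  # assumed window sizes, in pixels
        for region in sliding_regions(width, height, size, stride=size // 4):
            label, score = model.classify_region(image, region)
            if score >= threshold:
                detections.append((label, score, region))
    return detections


def detect_single_pass(model, image):
    """Single-network style: one evaluation yields every detection."""
    return model.detect_all(image)


image = None  # placeholder; the stand-in model never reads pixel data

region_model = CountingModel()
detect_by_regions(region_model, image, width=640, height=480)

single_model = CountingModel()
detect_single_pass(single_model, image)

print(region_model.evaluations, "evaluations vs", single_model.evaluations)
```

Even with these modest assumed settings, the region-based loop runs the network more than a thousand times on one image, while the single-pass version runs it once.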

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
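
Turning down the detection threshold amounts to keeping predictions with lower confidence scores. A minimal sketch, with made-up candidate scores and a hypothetical `filter_detections` helper:

```python
def filter_detections(detections, threshold):
    """Keep only (label, score) pairs whose score meets the threshold."""
    return [d for d in detections if d[1] >= threshold]


# Hypothetical raw predictions as (label, confidence score) pairs.
candidates = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("backpack", 0.34),
    ("stop sign", 0.27),
    ("person", 0.18),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print(len(strict), "detections at 0.5;", len(relaxed), "detections at 0.25")
```

A lower threshold surfaces more objects, but it also lets through more low-confidence predictions that may be wrong, so the setting is a trade-off.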

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs, pedestrians or bicycles in a self-driving vehicle can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine and robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
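
The talk does not say how the network was binarized. As an illustration only, one well-known scheme approximates each group of weights by a single scaling factor times the sign of each weight. The function names and the numbers below are assumptions for the sketch.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


weights = [0.8, -0.6, 0.7, -0.9]  # hypothetical full-precision weights
inputs = [1.0, 2.0, -1.0, 0.5]    # hypothetical input activations

alpha, signs = binarize(weights)

exact = dot(weights, inputs)          # full-precision result
approx = alpha * dot(signs, inputs)   # one multiply after a sign-only sum

print("exact:", exact, "approx:", approx)
```

Because the binarized weights are only +1 or -1, the inner sum needs no multiplications and each weight fits in one bit, which is what makes this kind of approximation attractive on a phone. The result is approximate, so accuracy is traded for speed and memory.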

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
