Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

Tien jaar geleden dachten 'computer vision'-onderzoekers dat een computer het verschil leren tussen een kat en een hond bijna onmogelijk zou zijn, al maakte kunstmatige intelligentie nog zulke grote vorderingen. Nu kan het met een nauwkeurigheid van meer dan 99%. Dit heet beeldclassificatie. Je geeft de computer een afbeelding en hij labelt die. Computers kennen nog duizenden andere categorieën.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

Ik ben student aan de Universiteit van Washington. Ik werk aan het project Darknet, een framework voor neurale netwerken waarmee computer vision-modellen worden getraind en getest. Wat zou Darknet van deze afbeelding vinden? Als we de 'classifier' op deze afbeelding toepassen, herkent hij niet alleen het dier, maar zelfs het ras. Zo verfijnd is de classificatie. Mijn hond is inderdaad een malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Er is veel vooruitgang geboekt bij beeldclassificatie. Wat gebeurt er als we de classifier op zo'n afbeelding toepassen? De classifier geeft bijna dezelfde voorspelling. Er is inderdaad een malamute te zien. Maar dat label vertelt ons niet wat er gebeurt in de afbeelding. We hebben iets krachtigers nodig. Ik werk aan het probleem van objectdetectie. Daarbij proberen we alle objecten op een afbeelding te vinden. We plaatsen er een blok omheen en labelen de objecten. Dit is wat er gebeurt als we een detector op deze afbeelding toepassen.
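A detection of the kind described here can be represented as a class label plus a bounding box. The sketch below is only an illustration: it assumes pixel coordinates with the origin at the top-left corner, and the `Detection` class, its field names, and the example boxes are hypothetical, not taken from Darknet.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label and an axis-aligned bounding box."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        """Box area in square pixels, a rough proxy for object size."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)

    def center(self) -> Tuple[float, float]:
        """Box center, useful for comparing relative locations."""
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


# Made-up boxes for illustration only (hypothetical values).
dog = Detection("dog", 100, 200, 300, 500)
cat = Detection("cat", 350, 300, 450, 450)

# A classifier label alone cannot answer questions like these.
dog_is_left_of_cat = dog.center()[0] < cat.center()[0]
dog_is_larger = dog.area() > cat.area()
```

With boxes in hand, questions about relative position and size reduce to simple arithmetic on coordinates.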

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

Met zo'n resultaat kunnen we meer doen met onze computer vision-algoritmen. Hij weet dat er een kat en een hond zijn. Hij weet waar ze zijn en hoe groot ze zijn. Hij ziet ook extra informatie, zoals het boek op de achtergrond. Als je een systeem wilt bouwen op basis van computer vision, zoals een zelfrijdend voertuig of een robotsysteem, dan wil je zulke informatie. Je wilt contact maken met de fysieke wereld. Toen ik begon met werken aan objectdetectie, duurde het verwerken van één afbeelding 20 seconden. Ik zal laten zien waarom snelheid zo belangrijk is op dit gebied. Hier is een objectdetector die er twee seconden over doet om een afbeelding te verwerken. Dat is tien keer sneller dan de detector die er 20 seconden over deed. Zodra de voorspellingen zijn gemaakt is de situatie al veranderd. Dat is niet erg nuttig voor een toepassing.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.

We versnellen dit nog eens tien keer. Dit is een detector die vijf beelden per seconde verwerkt. Dat is een stuk beter. Maar niet als er veel beweging is. Ik zou niet willen dat zo'n systeem mijn auto bestuurt.
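The arithmetic behind these speeds is easy to check. The sketch below converts the per-image latencies mentioned so far into frames per second and, as a rough illustration of why latency matters for driving, into distance travelled between predictions. The vehicle speed is an assumption chosen for the example, and the function names are hypothetical.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image latency into throughput."""
    return 1.0 / seconds_per_image


def distance_between_predictions(speed_m_per_s: float, seconds_per_image: float) -> float:
    """Distance a vehicle covers while a single image is being processed."""
    return speed_m_per_s * seconds_per_image


# Latencies quoted in the talk so far, in seconds per image.
latencies = [20.0, 2.0, 0.2]

# Assumption for illustration: a car moving at 15 m/s (about 54 km/h).
ASSUMED_SPEED = 15.0

for latency in latencies:
    fps = frames_per_second(latency)
    metres = distance_between_predictions(ASSUMED_SPEED, latency)
    print(f"{latency:5.1f} s/image -> {fps:5.2f} fps, {metres:6.1f} m per prediction")
```

Under that assumed speed, even the five-frames-per-second detector lets the car travel several metres between predictions.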

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

Dit is ons detectiesysteem dat in realtime op mijn laptop draait. Hij volgt me terwijl ik rondloop. Hij verwerkt veranderingen in grootte en houding. Naar voren, naar achteren. Dit is wat we nodig hebben voor een systeem op basis van computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

In slechts een paar jaar zijn we van 20 seconden per beeld naar 20 milliseconden per beeld gegaan. Hoe hebben we dat gedaan? Vroeger deelden detectiesystemen zo'n afbeelding op in een aantal gebieden. Een classifier werd toegepast op elk gebied. Als de classifier hoog scoorde, was dat een detectie. De classifier werd dus duizenden keren toegepast. Duizenden evaluaties van het netwerk om een detectie te doen. Wij hebben één netwerk getraind om alle detecties uit te voeren. Hij produceert alle grensvlakken en voorspellingen tegelijkertijd. Met ons systeem kijken we niet duizend keer naar een afbeelding maar slechts één keer, 'you only look once'. Daarom noemen we het de YOLO-methode. Met deze snelheid zijn we niet beperkt tot afbeeldingen. We kunnen video's in realtime verwerken. We zien niet alleen de kat en de hond. We zien ze rondbewegen en op elkaar reageren.
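The difference between the two approaches can be made concrete by counting work. The sketch below is a simplified illustration rather than Darknet code. The region count is an assumption (the talk only says "thousands"), and the grid parameters (a 7 x 7 grid, 2 boxes per cell, 20 classes) follow the configuration in the original YOLO paper; the talk does not state them. Function names are hypothetical.

```python
def region_based_evaluations(num_regions: int) -> int:
    """A region-based pipeline runs the classifier once per candidate region."""
    return num_regions


def single_pass_output_shape(grid_size: int, boxes_per_cell: int, num_classes: int) -> tuple:
    """Shape of the single tensor a YOLO-style network emits for a whole image.

    Each grid cell predicts `boxes_per_cell` boxes, each with five values
    (x, y, width, height, confidence), plus one set of class probabilities.
    """
    values_per_cell = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, values_per_cell)


# Assumption: a few thousand candidate regions per image.
ASSUMED_REGIONS = 2000
old_evaluations = region_based_evaluations(ASSUMED_REGIONS)

# One network evaluation yields every box and class probability at once.
shape = single_pass_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20)
total_boxes = shape[0] * shape[1] * 2

# Speedup quoted in the talk: 20 seconds down to 20 milliseconds per image.
speedup = 20.0 / 0.020
```

Here one forward pass replaces roughly two thousand classifier runs, and the output tensor already contains all 98 candidate boxes with their class probabilities.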

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.

Deze detector hebben we getraind met 80 verschillende categorieën in de COCO-dataset van Microsoft. Die bevat allerlei gewone objecten zoals lepels, vorken en kommen. Hij heeft ook exotischere dingen, zoals auto's, zebra's en giraffes. Nu gaan we iets leuks doen. We gaan kijken wat we in het publiek kunnen detecteren. Wil iemand een knuffelbeest? Hier zijn wat teddyberen. Ik verlaag de detectiegevoeligheid zodat hij meer mensen detecteert. Kijken of hij deze stopborden vindt. Hij vindt wat rugzakken. Even inzoomen. De computer verwerkt alles in realtime.
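Turning down the detection threshold amounts to keeping boxes with lower confidence scores. The sketch below shows the idea with made-up scores. The `filter_detections` helper and the candidate list are hypothetical, and real detectors typically apply further steps, such as suppressing overlapping boxes, that are omitted here.

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d[1] >= threshold]


# (label, confidence) pairs; the values are invented for illustration.
candidates = [
    ("teddy bear", 0.91),
    ("person", 0.62),
    ("backpack", 0.34),
    ("stop sign", 0.27),
    ("person", 0.18),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print(len(strict), "detections at threshold 0.5")
print(len(relaxed), "detections at threshold 0.25")
```

A lower threshold surfaces more objects, at the cost of admitting more false positives.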

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

Dit is een algemeen detectiesysteem. We kunnen hem trainen voor elk soort beeld. Dezelfde code die we gebruiken voor het vinden van stopborden, voetgangers en fietsers in een zelfrijdend voertuig, kan ook gebruikt worden om kankercellen te vinden in een weefselbiopsie. Onderzoekers over de hele wereld gebruiken deze technologie al voor de vooruitgang in de geneeskunde en robotica. Vanochtend las ik een paper over een census van de dieren in het nationaal park van Nairobi. Ze gebruikten YOLO bij het detectiesysteem. Dat kan omdat Darknet opensource is. Iedereen kan het gratis gebruiken.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.

Wij wilden detectie nog toegankelijker maken. Met optimalisatie van het model en binarisatie en approximatie van het netwerk konden we objectdetectie uitvoeren via een telefoon.
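The talk does not say which binarization scheme was used. As a rough illustration of the general idea, the sketch below shows one common approach: each weight is replaced by its sign, and a single scaling factor (the mean absolute weight) is kept so that the binary weights approximate the originals. The function names and example weights are hypothetical.

```python
def binarize(weights):
    """Return a scale factor and a list of +1/-1 signs approximating `weights`.

    This is one common scheme, assumed here for illustration; it is not
    necessarily the method used for the phone demo.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def approximate(alpha, signs):
    """Reconstruct approximate real-valued weights from the binary form."""
    return [alpha * s for s in signs]


# Invented example weights.
weights = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize(weights)
approx = approximate(alpha, signs)
```

Storing one sign bit per weight instead of a 32-bit float shrinks the model substantially, which is part of what makes running on a phone plausible.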

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Ik ben er enthousiast over want nu hebben we een krachtige oplossing voor dit computer vision-probleem. Iedereen kan er iets mee maken. De rest is aan iedereen met toegang tot deze software. Ik kan niet wachten om te zien wat men met deze technologie maakt.

Thank you.

Bedankt.

(Applause)
