Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.
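As a rough illustration of "give it an image, put a label to that image": a classifier typically ends by turning one score per category into probabilities and reporting the most likely label. The sketch below assumes the network has already produced the scores; the labels and numbers are hypothetical and do not come from Darknet.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical labels and scores, standing in for a real network's output.
labels = ["cat", "dog", "zebra"]
scores = [1.2, 4.0, -0.5]
label, prob = classify(scores, labels)
print(f"{label}: {prob:.2f}")
```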

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
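To make the difference from a bare label concrete, a detection can be pictured as a label plus a bounding box. This is a minimal sketch with a hypothetical `Detection` type and made-up pixel coordinates, not Darknet's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float
    # Box corners in pixels: (left, top) and (right, bottom).
    left: float
    top: float
    right: float
    bottom: float

    def area(self):
        """Box size in square pixels."""
        return (self.right - self.left) * (self.bottom - self.top)

    def center(self):
        """Box center as (x, y)."""
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)

def describe(a, b):
    """Say where detection a sits horizontally relative to detection b."""
    side = "left of" if a.center()[0] < b.center()[0] else "right of"
    return f"{a.label} is {side} {b.label}"

# Hypothetical detector output for an image with a dog and a cat.
detections = [
    Detection("dog", 0.92, 40, 60, 300, 400),
    Detection("cat", 0.88, 320, 150, 520, 400),
]
print(describe(detections[0], detections[1]))
print(detections[0].area(), detections[1].area())
```

With boxes in hand, questions a single label cannot answer, such as which object is larger or where each one sits, become simple arithmetic.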

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world.

Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
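The speeds mentioned so far can be put side by side with a little arithmetic. The sketch below converts seconds per image into frames per second and, as an illustration of why that matters for driving, estimates how far a car travels while one prediction is being computed. The vehicle speed is an assumption chosen for the example, not a figure from the talk.

```python
def frames_per_second(seconds_per_image):
    """Convert processing time per image into a frame rate."""
    return 1.0 / seconds_per_image

def distance_during_one_frame(speed_m_per_s, seconds_per_image):
    """Distance covered while the detector works on a single image."""
    return speed_m_per_s * seconds_per_image

# Assumption: 15 m/s (54 km/h), a hypothetical city driving speed.
ASSUMED_SPEED = 15.0

for latency in (20.0, 2.0, 0.2):
    fps = frames_per_second(latency)
    travelled = distance_during_one_frame(ASSUMED_SPEED, latency)
    print(f"{latency:>4.1f} s/image = {fps:.2f} fps, "
          f"{travelled:.0f} m travelled per prediction")
```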

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
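One way to picture a single network producing every box and class probability at once is a grid laid over the image, where each cell reports a box, a confidence that an object is present, and class probabilities. The sketch below decodes such a grid in one pass. The cell layout is a simplification (one box per cell) loosely based on the published YOLO description; the function name `decode_grid` and all numbers are hypothetical.

```python
def decode_grid(grid, labels, threshold=0.5):
    """Turn one network output (an S x S grid) into a list of detections.

    Each cell holds (x, y, w, h, objectness, class_probs), where x and y are
    offsets inside the cell and w and h are fractions of the whole image.
    """
    size = len(grid)
    detections = []
    for row, cells in enumerate(grid):
        for col, (x, y, w, h, objectness, class_probs) in enumerate(cells):
            best = max(range(len(labels)), key=class_probs.__getitem__)
            # Class-specific score: "is there an object" times "which class".
            score = objectness * class_probs[best]
            if score < threshold:
                continue
            # Convert cell-relative offsets to image-relative coordinates.
            center_x = (col + x) / size
            center_y = (row + y) / size
            detections.append((labels[best], score, center_x, center_y, w, h))
    return detections

labels = ["cat", "dog"]
empty = (0.0, 0.0, 0.0, 0.0, 0.0, [0.5, 0.5])
grid = [[empty, empty], [empty, empty]]
grid[0][0] = (0.5, 0.5, 0.4, 0.6, 0.9, [0.1, 0.9])    # a dog in the top-left cell
grid[1][1] = (0.5, 0.5, 0.3, 0.3, 0.8, [0.95, 0.05])  # a cat in the bottom-right cell

for detection in decode_grid(grid, labels):
    print(detection)
```

The point of the sketch is the cost model: the grid comes from one network evaluation, and decoding it is a cheap loop, whereas a region-based pipeline pays for a classifier run on every region.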

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
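Turning down the detection threshold, as in the demo, amounts to keeping candidates with lower confidence. A minimal sketch, with hypothetical candidates and confidence values:

```python
def keep_confident(detections, threshold):
    """Keep only (label, confidence) pairs at or above the threshold."""
    return [d for d in detections if d[1] >= threshold]

# Hypothetical candidates, as a detector might score them.
candidates = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("backpack", 0.34),
    ("stop sign", 0.27),
]

strict = keep_confident(candidates, 0.5)
relaxed = keep_confident(candidates, 0.25)
print([label for label, _ in strict])
print([label for label, _ in relaxed])
```

A lower threshold finds more objects but also lets through more false positives, so the right value depends on the application.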

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
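The talk does not say how the binarization was done. One published approach keeps only the sign of each weight plus a single scaling factor, so that a dot product needs additions and subtractions followed by one multiplication. The sketch below illustrates that idea under this assumption; it is not the speaker's implementation, and the weights and inputs are made up.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def dot_full(weights, inputs):
    """Ordinary dot product: one multiplication per weight."""
    return sum(w * x for w, x in zip(weights, inputs))

def dot_binary(alpha, signs, inputs):
    """Binarized dot product: add or subtract inputs, then scale once."""
    return alpha * sum(x if s > 0 else -x for s, x in zip(signs, inputs))

# Hypothetical weights and inputs for a single neuron.
weights = [0.8, -0.6, 0.7, -0.9]
inputs = [1.0, 2.0, 3.0, 4.0]

alpha, signs = binarize(weights)
print("full precision:", dot_full(weights, inputs))
print("binarized:     ", dot_binary(alpha, signs, inputs))
```

The binarized result is only an approximation of the full-precision one; the trade is accuracy for much cheaper storage and arithmetic, which is what makes small devices viable.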

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
