Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advances in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, it puts a label on that image -- and computers know thousands of other categories as well.
I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.
So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
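The output of a detector like the one described above can be pictured as a list of labeled, scored boxes. This is only an illustrative data shape (the class names and numbers are made up), not Darknet's actual API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a class label, a confidence score,
    and a bounding box in pixel coordinates."""
    label: str
    confidence: float  # 0.0 .. 1.0
    x: int  # top-left corner
    y: int
    w: int  # box width
    h: int  # box height

# A plausible result for the cat-and-dog image described in the talk
# (the coordinates and scores here are invented for illustration):
detections = [
    Detection("dog", 0.96, x=120, y=80, w=300, h=260),
    Detection("cat", 0.91, x=420, y=150, w=180, h=160),
]

for d in detections:
    print(f"{d.label}: {d.confidence:.0%} at ({d.x}, {d.y}, {d.w}x{d.h})")
```

This is exactly the kind of structured result -- what is in the image, and where -- that a downstream system like a robot or a self-driving car can act on.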
Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.
If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
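The speedups above are easier to compare as frame rates; the conversion is plain arithmetic, nothing specific to Darknet:

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image latency into a frame rate."""
    return 1.0 / seconds_per_image

# The three detectors mentioned above: 20 s, 2 s, and 0.2 s per image.
for latency in (20.0, 2.0, 0.2):
    print(f"{latency:>5} s/image -> {frames_per_second(latency):g} fps")
# The 0.2 s/image detector is the 5 fps one shown here; real-time video
# needs roughly an order of magnitude more than that.
```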
This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.
(Applause)
So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of the detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
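The single-pass idea can be sketched very roughly: the network emits one output tensor, and one loop over grid cells decodes it into boxes and class probabilities, with no per-region classifier reruns. The grid size, box count, and class list below are toy values for illustration, not YOLO's real configuration:

```python
import random

S, B, C = 7, 2, 3  # grid size, boxes per cell, classes (illustrative)
CLASSES = ["dog", "cat", "book"]  # toy class list, not the real one

def decode(output, conf_threshold=0.5):
    """Turn one network output tensor (here a nested list where each
    grid cell holds [x, y, w, h, confidence] for B boxes followed by
    C class probabilities) into (class, score, box) detections --
    a single pass over the output, not thousands of evaluations."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row][col]
            class_probs = cell[5 * B:]
            for b in range(B):
                x, y, w, h, conf = cell[5 * b:5 * b + 5]
                best = max(range(C), key=lambda c: class_probs[c])
                score = conf * class_probs[best]
                if score >= conf_threshold:
                    detections.append((CLASSES[best], score, (x, y, w, h)))
    return detections

# Fake network output: S x S cells, each with B boxes and C class probs.
random.seed(0)
output = [[[random.random() for _ in range(5 * B + C)]
           for _ in range(S)] for _ in range(S)]
print(len(decode(output)), "detections above threshold")
```

Contrast this with the older pipeline: a sliding-window classifier evaluated thousands of crops per image, while here the cost is one network evaluation regardless of how many objects appear.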
This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
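Turning down the threshold, as in the audience demo, just widens the cut on confidence scores. A minimal sketch -- the detections and scores here are invented:

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence clears the threshold."""
    return [(label, conf) for label, conf in detections if conf >= threshold]

# Hypothetical raw detections from a crowd shot:
raw = [("person", 0.92), ("teddy bear", 0.81), ("person", 0.34),
       ("backpack", 0.62), ("stop sign", 0.28)]

print(filter_detections(raw, 0.5))   # only the confident hits
print(filter_detections(raw, 0.25))  # lower threshold: more people found
```

Lowering the threshold trades precision for recall: more of the audience shows up, along with more false positives.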
And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.
(Applause)
But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
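One common form of network binarization replaces each real-valued weight vector with its signs plus a single scaling factor, so expensive multiplies become cheap sign flips. This is a generic sketch of that standard scheme, not necessarily the exact procedure behind the phone demo:

```python
def binarize(weights):
    """Approximate a weight vector w by alpha * sign(w), where alpha
    is the mean absolute value of w -- a standard binarization scheme."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs

def reconstruct(alpha, signs):
    """Recover the low-precision approximation of the weights."""
    return [alpha * s for s in signs]

weights = [0.4, -0.2, 0.1, -0.5]
alpha, signs = binarize(weights)
print(alpha, signs)  # alpha is approximately 0.3
print(reconstruct(alpha, signs))
```

Storing one float plus one bit per weight, instead of a full float per weight, is what makes a model small and fast enough to run on a phone.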
(Applause)
And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.
Thank you.
(Applause)