Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
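
To make the difference between the two tasks concrete, here is a minimal Python sketch. It is not Darknet's actual API. A classifier returns one label for the whole image, while a detector returns a list of labeled boxes. The `Detection` class, its field names, and every coordinate and score below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Detection:
    """One detected object: what it is, how confident we are, and where it is."""
    label: str
    confidence: float  # score in [0, 1]
    x: float           # left edge of the bounding box, in pixels
    y: float           # top edge of the bounding box, in pixels
    width: float
    height: float

    def area(self) -> float:
        """Size of the box in square pixels."""
        return self.width * self.height

    def center(self) -> Tuple[float, float]:
        """Midpoint of the box, useful for comparing relative locations."""
        return (self.x + self.width / 2, self.y + self.height / 2)


# A classifier gives a single label for the whole image (made-up value).
classification = "malamute"

# A detector gives one entry per object (made-up values).
detections = [
    Detection("dog", 0.92, x=40, y=120, width=300, height=260),
    Detection("cat", 0.88, x=380, y=200, width=180, height=150),
]

for d in detections:
    print(f"{d.label}: center={d.center()}, area={d.area()}")
```

With boxes like these, a downstream system can reason about where each object is and how large it appears, which a single label cannot provide.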

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
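
The arithmetic behind these speeds can be sketched in a few lines of Python. The per-image times (20 seconds, 2 seconds, and 0.2 seconds, which is five frames per second) come from the talk. The vehicle speed of 15 m/s is an assumption added here to show how far a car would move while waiting for one prediction.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image processing time into a frame rate."""
    return 1.0 / seconds_per_image


def metres_travelled(seconds_per_image: float, speed_m_per_s: float) -> float:
    """Distance a vehicle covers while one image is being processed."""
    return seconds_per_image * speed_m_per_s


ASSUMED_SPEED_M_PER_S = 15.0  # hypothetical vehicle speed (54 km/h)

# Per-image times mentioned in the talk: 20 s, 2 s, and 0.2 s.
for latency in (20.0, 2.0, 0.2):
    fps = frames_per_second(latency)
    distance = metres_travelled(latency, ASSUMED_SPEED_M_PER_S)
    print(f"{latency:>5.1f} s/image -> {fps:.2f} fps, "
          f"{distance:.1f} m travelled per prediction")
```

Under that assumed speed, even the five-frames-per-second detector leaves the car moving several metres between predictions.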

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
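
A rough count shows why the older approach needed thousands of classifier runs per image. The sketch below uses a plain sliding window as a simple stand-in for "splitting the image into regions"; the talk does not say how regions were chosen. The image size, window sizes, and stride are all assumptions made for illustration.

```python
def count_windows(image_w: int, image_h: int, window: int, stride: int) -> int:
    """Number of square windows of a given size that fit when sliding by `stride`."""
    if window > image_w or window > image_h:
        return 0
    cols = (image_w - window) // stride + 1
    rows = (image_h - window) // stride + 1
    return cols * rows


# Hypothetical setup: a 640x480 image scanned at three window sizes.
IMAGE_W, IMAGE_H = 640, 480
WINDOW_SIZES = (64, 128, 256)
STRIDE = 16

# Region-by-region approach: one classifier evaluation per window.
per_region_evaluations = sum(
    count_windows(IMAGE_W, IMAGE_H, size, STRIDE) for size in WINDOW_SIZES
)

# Single-network approach described in the talk: one evaluation per image.
single_pass_evaluations = 1

print("classifier evaluations, region by region:", per_region_evaluations)
print("network evaluations, single pass:", single_pass_evaluations)
```

The exact count depends entirely on the assumed window sizes and stride, but the gap between per-region evaluation and a single pass is what the thousandfold speedup rests on.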

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
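
Turning down the detection threshold can be pictured as a simple filter on confidence scores. This is a minimal sketch, not Darknet's implementation; the function name `filter_detections`, the labels, and the scores are hypothetical.

```python
def filter_detections(detections, threshold):
    """Keep only (label, confidence) pairs scoring at or above the threshold."""
    return [(label, conf) for label, conf in detections if conf >= threshold]


# Made-up raw detector output: (label, confidence) pairs.
raw = [
    ("person", 0.91),
    ("teddy bear", 0.64),
    ("backpack", 0.42),
    ("stop sign", 0.31),
    ("person", 0.27),
]

strict = filter_detections(raw, threshold=0.5)
relaxed = filter_detections(raw, threshold=0.25)

print("threshold 0.50:", [label for label, _ in strict])
print("threshold 0.25:", [label for label, _ in relaxed])
```

A lower threshold surfaces more objects, at the cost of letting through more low-confidence guesses that may be wrong.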

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
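
The talk does not say which binarization scheme was used. As an illustration of the general idea, one common formulation replaces each real-valued weight with its sign and keeps a single scaling factor equal to the mean absolute weight. The sketch below applies that to a small made-up weight vector; the function names and numbers are assumptions, not taken from the talk.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def exact_dot(weights, inputs):
    """Full-precision dot product, for comparison."""
    return sum(w * x for w, x in zip(weights, inputs))


def binary_dot(alpha, signs, inputs):
    """Dot product using only +1/-1 weights and one scaling multiply."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))


# Made-up weights and inputs.
weights = [0.5, -1.5, 1.0, -1.0]
inputs = [1.0, 2.0, 3.0, 4.0]

alpha, signs = binarize(weights)
print("alpha:", alpha, "signs:", signs)
print("exact:", exact_dot(weights, inputs))
print("binary approximation:", binary_dot(alpha, signs, inputs))
```

Each weight now needs one bit instead of a full floating-point number, and the multiplications inside the sum reduce to sign flips. The result is only an approximation, which is the trade-off that makes running on a phone feasible.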

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
