Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

Десять років тому дослідники комп'ютерного зору вважали, що навчити комп'ютер визначити різницю між котом і собакою — майже неможливо, навіть зі значним розвитком в галузі штучного інтелекту. Зараз ми вміємо це робити з точністю понад 99%. Це називається класифікація зображень — маючи зображення, присвоїти йому мітку — і комп'ютери знають також тисячі інших категорій.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

Я аспірант Університету Вашингтона, і я працюю над проектом під назвою Даркнет, це фреймворк нейронних мереж для навчання і тестування моделей комп'ютерного зору. Дізнаймося, що думає Даркнет про це наше зображення. Коли ми запускаємо класифікатор для цього зображення, ми отримуємо не просто прогноз щодо кішки чи собаки, насправді ми отримуємо визначення конкретної породи. Ось такий рівень деталізації ми маємо зараз. І прогноз правильний. Мій собака справді маламут.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Тож ми досягли відчутного успіху у класифікації зображень, але що станеться, якщо запустити класифікатор на отакому зображенні? Ну... Ми бачимо, що класифікатор видає дуже подібний прогноз. І це правда, на зображенні є маламут, але, маючи лише цю мітку, ми небагато знаємо про те, що саме відбувається на зображенні. Нам потрібно щось потужніше. Я працюю над проблемою під назвою визначення об'єктів, де ми дивимося на зображення і намагаємося знайти усі об'єкти, обвести їх рамками і підписати, що це таке. Ось що трапиться, якщо запустити детектор для цього зображення.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

З таким результатом ми можемо зробити набагато більше з нашими алгоритмами комп'ютерного зору. Ми бачимо, що він знає, що це кіт і собака. Він знає їхнє відносне розташування, їхній розмір. Він може навіть знати деяку додаткову інформацію. Ось там на фоні книжка. І якщо ви хочете побудувати систему на основі комп'ютерного зору, наприклад, машину на автопілоті чи роботизовану систему, ви потребуватимете саме такої інформації. Ви хочете чогось, що дозволить взаємодіяти з матеріальним світом. Коли я почав працювати над визначенням об'єктів, обробка одного зображення займала 20 секунд. І щоб зрозуміти, чому швидкість настільки важлива в цій галузі, ось приклад детектора об'єктів, що потребує 2 секунди на обробку зображення. Тож він у десять разів швидший за той 20-секундний детектор, але видно, що коли він робить свій прогноз, стан навколишнього світу вже повністю змінився. Тож це не дуже корисно для практичного застосування.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.

Якщо ми прискоримо це ще вдесятеро, ось детектор, що працює зі швидкістю 5 кадрів на секунду. Це набагато краще, але, наприклад, якщо відбувається помітний рух, я б не хотів, щоб така система керувала моєю автівкою.
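
As a rough illustration of the arithmetic in this part of the talk, the sketch below converts per-image processing time into frames per second. The function name `frames_per_second` is our own and is not taken from Darknet; only the three latencies (20 seconds, then two successive tenfold speedups) come from the talk.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image latency (seconds) into throughput (frames per second)."""
    if seconds_per_image <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / seconds_per_image


# Latencies mentioned in the talk: 20 s, then 10x faster, then 10x faster again.
latencies = [20.0, 2.0, 0.2]
rates = [frames_per_second(t) for t in latencies]

for latency, rate in zip(latencies, rates):
    print(f"{latency:>5} s per image -> {rate:.2f} frames per second")
```

The last entry is the five-frames-per-second detector described above: 0.2 seconds per image is still a noticeable lag when the scene is moving.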

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

Це наша система визначення, що працює в реальному часі на моєму лептопі. Вона плавно відстежує мене, поки я рухаюсь у кадрі, і вона стійка до різноманітних змін розміру, пози, руху вперед і назад. Це чудово. Саме це нам і потрібно, якщо ми хочемо будувати системи на основі комп'ютерного зору.

(Applause)

(Оплески)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

Тож лише за кілька років ми пройшли від 20 секунд на зображення до 20 мілісекунд на зображення, у тисячу разів швидше. Як ми цього досягли? У минулому система визначення об'єктів взяла б подібне зображення і поділила на безліч регіонів, а потім запустила б класифікатор для кожного регіону, і високі бали такого класифікатора вважалися б визначеннями на зображенні. Але це потребувало б запуску класифікатора тисячі разів на одному зображенні, тисячі обчислень нейронної мережі, щоб зробити визначення. Натомість ми натренували єдину мережу робити усі визначення для нас. Вона видає усі рамки й ймовірності класів одночасно. З нашою системою, замість того щоб тисячі разів дивитися на зображення для визначення, ми дивимося лише один раз: "you only look once", і тому ми назвали наш метод визначення об'єктів YOLO. Тож з такою швидкістю ми не обмежені зображеннями, ми можемо обробляти відео в реальному часі. І тепер, замість того щоб просто бачити тих кота і собаку, ми можемо бачити, як вони рухаються і взаємодіють.
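
To make the "thousands of evaluations versus one" contrast concrete, here is a minimal sketch that counts classifier calls for a region-based approach. It assumes a fixed-size sliding window as the way regions are generated, which is only one simple possibility and not necessarily what earlier detectors did; the image size, window size and stride are hypothetical numbers chosen for illustration.

```python
def sliding_window_evaluations(width: int, height: int, window: int, stride: int) -> int:
    """Count classifier calls when one square window slides over an image."""
    if window > width or window > height:
        return 0  # the window does not fit anywhere
    cols = (width - window) // stride + 1
    rows = (height - window) // stride + 1
    return cols * rows


# Hypothetical setup: a 640x480 image, a 64-pixel window, an 8-pixel stride.
region_based = sliding_window_evaluations(width=640, height=480, window=64, stride=8)

# The single-network approach from the talk: one evaluation per image
# produces all bounding boxes and class probabilities at once.
single_pass = 1

print(f"region-based: {region_based} evaluations, single pass: {single_pass}")
```

Even this small setup needs 3869 classifier calls per image, and a real region-based system would typically also vary the window size, which multiplies the count further.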

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.

Цей детектор ми тренували на 80 різних класах з набору даних Microsoft COCO. Там є різноманітні речі: ложка і виделка, миска, подібні звичні речі. Там є й безліч незвичніших речей: тварини, авто, зебри, жирафи. І зараз ми зробимо дещо цікаве. Ми просто підемо у залу і побачимо, які речі зможемо визначити. Хтось хоче м'яку іграшку? Тут є іграшкові ведмеді. І ми можемо трошки знизити поріг розпізнавання, щоб знайти більше вас у залі. Побачимо, чи ми визначимо ці знаки "стоп". Ми знайшли наплечники. Тепер наблизьмо трошки. І це чудово. І уся обробка відбувається в реальному часі на лептопі.
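
Turning down the detection threshold, as in the demo above, can be sketched as a simple filter over scored detections. The `Detection` class, the `filter_detections` helper and all the scores and boxes below are invented for illustration; they are not Darknet's data structures or real output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    confidence: float                 # score in [0, 1]
    box: Tuple[int, int, int, int]    # (x, y, width, height) in pixels


def filter_detections(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose confidence is at least the threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up candidate detections for a single frame.
candidates = [
    Detection("person", 0.91, (40, 60, 120, 300)),
    Detection("teddy bear", 0.55, (300, 200, 80, 90)),
    Detection("backpack", 0.32, (500, 310, 60, 70)),
    Detection("stop sign", 0.18, (610, 40, 30, 30)),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print([d.label for d in strict])
print([d.label for d in relaxed])
```

A lower threshold surfaces more objects, at the cost of admitting lower-confidence detections that are more likely to be wrong.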

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

І важливо пам'ятати, що це система визначення об'єктів загального призначення, тож ми можемо натренувати її для будь-якого типу зображень. Той самий код, що використовується, щоб знаходити знаки "стоп", пішоходів, велосипеди для машини на автопілоті, можна використати, щоб знаходити ракові клітини у біопсії тканини. І дослідники по всьому світу вже використовують цю технологію, щоб розвивати медицину чи робототехніку. Цього ранку я прочитав статтю про те, як підраховували тварин в Національному парку Найробі, використовуючи YOLO як частину системи розпізнавання. А це тому, що Даркнет — проект з відкритим кодом, що перебуває в суспільному надбанні, безкоштовний для використання.

(Applause)

(Оплески)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.

Але ми хотіли зробити визначення ще доступнішим і зручнішим, тож завдяки поєднанню оптимізації моделі, бінаризації та апроксимації мережі наше визначення об'єктів працює на телефоні.

(Applause)

(Оплески)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

І я дуже схвильований, бо зараз у нас є досить потужний розв'язок цієї низькорівневої задачі комп'ютерного зору, і кожен може взяти його і щось з цим побудувати. Тож решта залежить від вас усіх, від людей по всьому світу, які мають доступ до цієї програми, і я із нетерпінням чекаю на речі, які люди створять із цією технологією.

Thank you.

Дякую.

(Applause)

(Оплески)