Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

On yıl evvel, bilgisayarlı görme araştırmacıları, bir bilgisayara bir köpekle kedi arasında ayrım yaptırmanın neredeyse imkânsız olacağını düşünürdü, hem de yapay zekâ alanındaki büyük gelişmelere rağmen. Şu anda bunu, yüzde 99'ı aşkın doğruluk payı ile tamamlayabiliyoruz. Buna görüntü sınıflandırması deniyor; bir görüntü yükleyin, bu görüntüyü etiketleyin ve bilgisayarlar binlerce diğer sınıflandırmaları da biliyor.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

Washington Üniversitesi'nde lisansüstü öğrencisiyim ve bilgisayarlı görme modellerini eğitmek ve test etmek için geliştirilen bir sinir ağı çatısı olan "Darknet" adlı bir proje üzerinde çalışıyorum. Darknet'in şu fotoğraf hakkında neler düşündüğüne bir göz atalım. Sınıflandırıcımızı bu fotoğraf üzerinde çalıştırdığımızda, sadece kedi veya köpek tahmini elde etmiyoruz, belirli cins tahminleri de elde ediyoruz. Şu anda ulaştığımız ayrıntı düzeyi bu. Ve doğru. Köpeğimin cinsi gerçekten Alaska Malamutu.
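The classification step described above (one image in, ranked labels out) can be sketched roughly as follows. This is a minimal illustration, not Darknet's code: the label list and the raw scores are made up, and `softmax` and `top_k` are hypothetical helper names. It only shows how raw network scores are commonly turned into probabilities and ranked.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and scores, standing in for a real network's output.
labels = ["malamute", "husky", "tabby cat", "Eskimo dog"]
scores = [4.0, 2.5, 0.1, 2.0]

predictions = top_k(labels, scores, k=3)
for label, prob in predictions:
    print(f"{label}: {prob:.2%}")
```

A real classifier would rank thousands of categories the same way; only the length of the label list changes.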

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Görüntü sınıflandırmasında büyük yol katettik, fakat sınıflandırıcımızı, şöyle bir fotoğrafta aktifleştirirsek ne olur dersiniz? Evet... Sınıflandırıcının oldukça benzer bir tahmin sunduğunu görebiliriz. Sonuç doğru, görüntüde Alaska Malamutu var, fakat yalnızca bu etiket ile birlikte, görüntüde neler olduğunu çok iyi anlayamıyoruz. Daha güçlü bir şeye ihtiyacımız var. Nesne belirleme dediğimiz konu üzerine çalışmaktayım; görüntüye bakıp içerisindeki bütün nesneleri bulmaya çalışıyor, etrafına sınırlayıcı kutu koyup nesneleri tanımlıyoruz. Bu görüntü üzerinde saptama uygulaması yapalım.
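To make the difference from classification concrete, here is a small sketch of what a detector's output might look like as data. The `Detection` class, the pixel coordinates and the confidence values are all assumptions for illustration; the talk does not specify a format. The point is that each object carries its own label and bounding box, so locations and sizes can be compared.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a label, a confidence and a bounding box in pixels."""
    label: str
    confidence: float
    x: float  # left edge
    y: float  # top edge
    w: float  # box width
    h: float  # box height

    @property
    def area(self):
        return self.w * self.h

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

def left_to_right(detections):
    """Order labels by the horizontal position of each box center."""
    return [d.label for d in sorted(detections, key=lambda d: d.center[0])]

# Made-up detections for an image containing a dog, a cat and a book.
detections = [
    Detection("dog", 0.92, 50, 80, 300, 260),
    Detection("cat", 0.88, 400, 150, 180, 160),
    Detection("book", 0.55, 420, 20, 90, 60),
]

largest = max(detections, key=lambda d: d.area)
print(left_to_right(detections), largest.label)
```

A single classification label cannot answer questions like "which object is leftmost" or "which is largest"; a list of boxes can.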

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

Bu tarz bir bulgu ile, bilgisayarlı görme algoritması bizi daha ileriye taşıyabilir. Kedi ve köpek olduğunu saptadığını biliyoruz. Birbirleri arasındaki mesafeyi, onların boyutlarını biliyor. Daha fazlasını bile biliyor olabilir. Arka planda bir kitap var. Bilgisayarlı görme üzerine bir sistem kurmak isterseniz, sürücüsüz araç veya robot sistemi gibi, bu tarz bir bilgiye ihtiyacınız olacaktır. Dış dünya ile temas kurabileceğiniz bir şey olması gerekir. Nesne belirleme üzerine çalışmaya başladığımda, tek bir görüntüyü işlemesi 20 saniye sürüyordu. Hızın bu alanda neden bu denli önemli olduğundan bahsedecek olursak; bir görüntüyü işlemesi 2 saniye süren bir nesne belirleyicisine göz atalım. 20 saniyede resmi işleyebilene göre, 10 kat daha hızlı çalışıyor ve gördüğünüz üzere, o tahmin edene kadar, dünyadaki bütün durum değişmiş oluyor ve uygulamada pek de fayda sağlamıyor.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.

Bunu bir 10 kat daha hızlandırırsak, saniyede 5 kare hızında çalışan bir belirleyici elde ediyoruz. Durum çok daha iyi, ama örneğin kayda değer bir hareket varsa, böyle bir sistemin arabamı sürmesini istemezdim.
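The speed figures quoted so far (20 seconds, 2 seconds, and five frames per second, i.e. 0.2 seconds per image) can be put side by side with a bit of arithmetic. The vehicle speed below is an assumption chosen only for illustration (15 m/s, roughly 54 km/h); it is not from the talk. The sketch shows how far such a vehicle would travel while the detector is still working on a single image.

```python
def frames_per_second(seconds_per_image):
    """Throughput implied by a given per-image processing time."""
    return 1.0 / seconds_per_image

def distance_travelled(speed_m_per_s, seconds_per_image):
    """Distance covered while one image is being processed."""
    return speed_m_per_s * seconds_per_image

ASSUMED_SPEED = 15.0  # metres per second; an illustrative assumption

latencies = [20.0, 2.0, 0.2]  # seconds per image, as quoted in the talk
report = [
    (t, frames_per_second(t), distance_travelled(ASSUMED_SPEED, t))
    for t in latencies
]

for seconds, fps, metres in report:
    print(f"{seconds:>5} s/image  {fps:>5.2f} fps  {metres:>6.1f} m travelled")
```

Even at five frames per second, the assumed vehicle moves about three metres between predictions, which is why that detector still feels too slow for driving.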

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

Gördüğünüz, bilgisayarımdaki eş zamanlı belirleme sistemimiz. Kare etrafında hareket ederken, usulca beni takip ediyor; boyut, duruş, ileri, geri gibi çok çeşitli değişkenlere kolayca adapte olabiliyor. Harika. Bilgisayarlı görme üzerine sistem ekleyeceksek, işte tam da buna ihtiyacımız var.

(Applause)

(Alkış)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

Birkaç yıl içinde, görüntü başına 20 saniyeden görüntü başına 20 milisaniyeye geçtik, bu bin kat daha hızlı demek oluyor. Bunu nasıl başardık? Önceleri nesne belirleme sistemleri böyle bir görüntüyü alıp onu birçok bölgeye bölerdi, bu bölgelerin her birinde bir sınıflandırıcı çalıştırırdı ve sınıflandırıcının yüksek puan verdiği bölgeler görüntüdeki tespitler olarak kabul edilirdi. Ancak bu, bir görüntü üzerinde sınıflandırıcıyı binlerce kez çalıştırmak, yani belirleme yapabilmek için binlerce sinir ağı değerlendirmesi demekti. Bunun yerine, bütün belirlemeyi bizim için yapan tek bir ağ eğittik. Bu ağ bütün sınırlayıcı kutuları ve sınıf olasılıklarını aynı anda üretiyor. Bizim sistemimizde, belirleme yapmak için bir görüntüye binlerce kez bakmak yerine yalnızca bir kez bakıyorsunuz ve bu yüzden bu nesne belirleme yöntemine YOLO adını verdik. Bu hız sayesinde görüntülerle sınırlı kalmıyoruz, videoları gerçek zamanlı olarak işleyebiliyoruz. Ve artık o kedi ve köpeği sadece görmekle kalmıyor, etrafta hareket edip birbirleriyle etkileşimlerini de görebiliyoruz.
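The idea of one network producing all boxes and class probabilities at once can be sketched as decoding a single output vector. The talk does not describe the output layout, so everything specific here is an assumption: the image is imagined as an S x S grid, each cell predicts B boxes of (x, y, w, h, confidence) followed by C class probabilities, and `decode_cell` is a hypothetical helper. The grid numbers follow the commonly cited first YOLO configuration, used only as an example.

```python
def decode_cell(cell, num_boxes, num_classes):
    """Turn one grid cell's values into (class_index, score, box) tuples.

    Assumed layout: num_boxes groups of (x, y, w, h, confidence),
    followed by num_classes class probabilities shared by the cell.
    """
    expected = num_boxes * 5 + num_classes
    if len(cell) != expected:
        raise ValueError(f"expected {expected} values, got {len(cell)}")
    class_probs = cell[num_boxes * 5:]
    best = max(range(num_classes), key=lambda c: class_probs[c])
    results = []
    for b in range(num_boxes):
        x, y, w, h, confidence = cell[b * 5:(b + 1) * 5]
        # Class-specific score: box confidence times class probability.
        results.append((best, confidence * class_probs[best], (x, y, w, h)))
    return results

# Size of the whole output for an assumed 7 x 7 grid, 2 boxes, 20 classes.
S, B, C = 7, 2, 20
output_size = S * S * (B * 5 + C)

# One made-up cell with 2 boxes and 3 classes.
cell = [0.5, 0.5, 0.2, 0.3, 0.9,
        0.4, 0.6, 0.1, 0.1, 0.2,
        0.1, 0.7, 0.2]
decoded = decode_cell(cell, num_boxes=2, num_classes=3)
print(output_size, decoded)
```

Under these assumptions the network is evaluated once per image and every candidate box falls out of that one result, instead of running a classifier separately on thousands of regions.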

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.

Microsoft'un COCO verisetindeki 80 farklı sınıfta bu belirleyiciyi geliştirdik. Kaşık, çatal, kase gibi her türlü sıradan nesneyi bünyesinde barındırıyor. Daha egzotik çeşitlilikleri de algılıyor; hayvanlar, arabalar, zebralar, zürafalar. Şimdi eğlenceli bir şey yapacağız. Seyirciye yönelteceğiz ve neler belirleyeceğimize bir bakalım. Peluş hayvan isteyen var mı? Ayısı olanlar var. Belirleme eşiğimizi birazcık düşürelim, böylece sizleri daha çok görmüş oluruz. DUR işaretlerini algılayacak mı bakalım. Sırt çantaları görüyoruz. Biraz daha yakınlaştıralım. Harika. Bütün işlemler, bilgisayarda eş zamanlı olarak eşleşiyor.
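Turning down the detection threshold, as in the demo above, amounts to keeping more of the low-confidence boxes. A minimal sketch, with made-up labels and confidence values and a hypothetical `filter_detections` helper:

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [(label, conf) for label, conf in detections if conf >= threshold]

# Made-up (label, confidence) pairs, as a detector might report them.
raw = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("backpack", 0.41),
    ("stop sign", 0.33),
    ("person", 0.28),
]

strict = filter_detections(raw, threshold=0.5)
relaxed = filter_detections(raw, threshold=0.25)
print(len(strict), len(relaxed))
```

A lower threshold finds more objects but also lets through more mistaken boxes, so the right value depends on the application.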

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

Şunu da unutmamak gerek: bu, genel amaçlı bir nesne belirleme sistemi, dolayısıyla onu herhangi bir görüntü alanı için eğitebiliriz. Sürücüsüz bir araçta DUR işaretlerini, yayaları veya bisikletleri bulmak için kullandığımız kodun aynısı, doku biyopsisinde kanser hücrelerini tespit etmek için de kullanılabilir. Tıp, robotik gibi alanlarda ilerleme kaydetmek adına dünyanın dört bir yanında bu teknolojiyi şimdiden kullanan araştırmacılar var. Bu sabah, Nairobi Milli Parkı'ndaki hayvanların sayımını, belirleme sisteminin bir parçası olarak YOLO kullanarak yaptıkları bir makale okudum. Bunun nedeni, Darknet'in açık kaynak ve kamu malı olması, herkesin ücretsiz kullanabilmesi.

(Applause)

(Alkış)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.

Fakat biz belirlemeyi daha da erişilebilir ve kullanışlı kılmak istedik; model optimizasyonu, ağ ikilileştirme ve yaklaşıklamayı bir arada kullanarak nesne belirlemeyi bir telefonda çalıştırmayı başardık.
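The talk names network binarization without saying how it is done. One common approach, shown here purely as an assumption and not as the speaker's method, replaces each weight with its sign and keeps a single scaling factor equal to the mean absolute weight. The helper names `binarize` and `binary_dot` are hypothetical.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def binary_dot(alpha, signs, inputs):
    """Dot product using only +1/-1 weights and one scaling factor."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))

# Made-up weights and inputs for a single neuron.
weights = [0.4, -0.2, 0.1, -0.5]
inputs = [1.0, 2.0, 3.0, 4.0]

alpha, signs = binarize(weights)
exact = sum(w * x for w, x in zip(weights, inputs))
approx = binary_dot(alpha, signs, inputs)
print(f"exact={exact:.2f} approx={approx:.2f}")
```

Each binarized weight needs one bit instead of, say, a 32-bit float, so storage shrinks by up to 32 times; the price is an approximation error, visible in the gap between the two printed values.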

(Applause)

(Alkış)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Bu beni gerçekten heyecanlandırıyor, çünkü artık düşük seviyeli bilgisayarlı görme sorununa, oldukça güçlü bir çözüm bulmuş durumdayız ve herkes kullanabilir ve bununla bir şey geliştirebilir. Artık kalanı sizlere ve dünya çapında bu yazılıma erişimi olanlara kalmış ve insanların bu teknolojiyi kullanarak neler ortaya koyacağını sabırsızlıkla bekliyorum.

Thank you.

Teşekkürler.

(Applause)

(Alkış)