Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.
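
As a rough illustration of "give it an image, put a label to that image", here is a minimal sketch of the final step of a classifier: turning per-class scores into one label. The label set, the scores, and the function names are assumptions made up for this example; in a real system the scores would come from a trained neural network.

```python
import math

LABELS = ["cat", "dog"]  # hypothetical label set

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Stand-in for a trained network's output on one image (values made up).
raw_scores = [0.3, 4.1]
label, prob = classify(raw_scores)
print(f"{label}: {prob:.3f}")
```

A classifier covering thousands of categories works the same way in this sketch, only with a longer label list and one score per category.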

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
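
The speed figures quoted so far can be checked with a little arithmetic. This sketch only converts the per-image times from the talk (20 seconds, 2 seconds, and a further tenfold speedup) into frames per second; the function name is an assumption for illustration.

```python
# Seconds per image at each stage mentioned in the talk:
# the original detector, one 10x faster, and one another 10x faster.
seconds_per_image = [20.0, 2.0, 0.2]

def frames_per_second(seconds: float) -> float:
    """Convert per-image processing time into throughput."""
    return 1.0 / seconds

for s in seconds_per_image:
    print(f"{s:>5.1f} s/image -> {frames_per_second(s):.2f} frames/s")
```

The last entry, 0.2 seconds per image, works out to the five frames per second mentioned above.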

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
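
To make "all of the bounding boxes and class probabilities simultaneously" concrete, here is a minimal sketch of decoding the output of one forward pass. The talk does not describe the output layout, so the grid formulation used here (each grid cell predicts one box, a confidence, and class probabilities) is an assumption, and the grid size, class list, and numbers are made up for illustration.

```python
from typing import List, NamedTuple, Tuple

class Detection(NamedTuple):
    label: str
    score: float  # box confidence times class probability
    box: Tuple[float, float, float, float]  # (x_center, y_center, width, height), relative to the image

CLASSES = ["cat", "dog"]  # hypothetical label set
S = 2                     # hypothetical grid size (S x S cells)

def decode(output: List[List[List[float]]]) -> List[Detection]:
    """Turn one S x S grid of predictions into a flat list of detections.

    Each cell holds [x, y, w, h, confidence, p(class 0), p(class 1), ...],
    where x and y are offsets inside the cell.
    """
    detections = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf, *class_probs = output[row][col]
            best = max(range(len(CLASSES)), key=lambda i: class_probs[i])
            detections.append(Detection(
                label=CLASSES[best],
                score=conf * class_probs[best],
                box=((col + x) / S, (row + y) / S, w, h),
            ))
    return detections

# Stand-in for what a single network evaluation might return (values made up).
fake_output = [
    [[0.5, 0.5, 0.4, 0.6, 0.9, 0.8, 0.2], [0.5, 0.5, 0.1, 0.1, 0.1, 0.5, 0.5]],
    [[0.5, 0.5, 0.1, 0.1, 0.0, 0.5, 0.5], [0.2, 0.4, 0.5, 0.5, 0.8, 0.1, 0.9]],
]

for det in decode(fake_output):
    print(det)
```

The point of the sketch is that every candidate box comes out of the same output, so the network is evaluated once per image rather than once per region.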

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
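
Turning down the detection threshold can be sketched as a simple filter over scored detections. The labels below are ones mentioned in the talk, but the scores, threshold values, and function name are assumptions made up for illustration, not the demo's actual numbers.

```python
# (label, confidence score) pairs; the scores are made up.
detections = [
    ("teddy bear", 0.82),
    ("person", 0.55),
    ("backpack", 0.31),
    ("stop sign", 0.27),
    ("person", 0.12),
]

def keep_confident(dets, threshold):
    """Keep only detections whose score meets the threshold."""
    return [(label, score) for label, score in dets if score >= threshold]

print(keep_confident(detections, 0.5))   # stricter threshold, fewer detections
print(keep_confident(detections, 0.25))  # lower threshold, more detections
```

A lower threshold surfaces more objects at the cost of admitting less certain predictions.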

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
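
The talk does not say which binarization scheme is used, so the following is only a sketch of one common approach, taken as an assumption: replace each weight with its sign and keep a single scale factor (the mean absolute weight) per group of weights. The weights, inputs, and function names are made up for illustration.

```python
from typing import List, Tuple

def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate `weights` as scale * signs, with signs in {-1, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs

def dot_binary(scale: float, signs: List[int], inputs: List[float]) -> float:
    """Dot product using only additions and subtractions, plus one multiply."""
    total = sum(x if s > 0 else -x for s, x in zip(signs, inputs))
    return scale * total

weights = [0.4, -0.6, 0.5, -0.5]  # made-up full-precision weights
inputs = [1.0, 2.0, 3.0, 4.0]     # made-up inputs

scale, signs = binarize(weights)
exact = sum(w * x for w, x in zip(weights, inputs))
approx = dot_binary(scale, signs, inputs)
print(f"scale={scale}, signs={signs}")
print(f"exact={exact:.2f}, binary approximation={approx:.2f}")
```

Storing one bit per weight and avoiding most multiplications is what makes this kind of approximation attractive on a phone, at the price of some accuracy.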

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
