Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
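
To make the arithmetic behind these speeds concrete, here is a minimal sketch that converts the per-image times mentioned in the talk into frame rates. The car speed is an assumed value chosen purely for illustration; it does not come from the talk.

```python
# Illustrative arithmetic only; the car speed below is an assumed value.

def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image processing time into a frame rate."""
    return 1.0 / seconds_per_image


def metres_travelled(speed_m_per_s: float, seconds_per_image: float) -> float:
    """Distance covered while the detector is busy with one image."""
    return speed_m_per_s * seconds_per_image


ASSUMED_CAR_SPEED = 30.0  # metres per second, hypothetical

# Per-image times quoted in the talk: 20 s, 2 s, 0.2 s (5 frames/s), 20 ms.
for latency in (20.0, 2.0, 0.2, 0.02):
    fps = frames_per_second(latency)
    gap = metres_travelled(ASSUMED_CAR_SPEED, latency)
    print(f"{latency:>5} s/image -> {fps:g} frames/s, car moves {gap:g} m per frame")
```

Under that assumed speed, a detector running at five frames per second would let the car travel roughly six metres between predictions, which is why that rate is still too slow for driving.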

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
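
The difference between the two approaches can be sketched by counting neural network evaluations. The image size, window sizes and stride below are assumptions for illustration, and a plain sliding window stands in for the region-based methods described; the only point is that classifying each region separately costs thousands of evaluations, while a single-network detector costs one.

```python
# Rough evaluation count; all sizes here are hypothetical, not from the talk.

def sliding_window_evaluations(img_w: int, img_h: int, window: int, stride: int) -> int:
    """Number of square windows of a given size that fit in the image."""
    across = (img_w - window) // stride + 1
    down = (img_h - window) // stride + 1
    return across * down


IMG_W, IMG_H = 640, 480       # assumed image size in pixels
WINDOW_SIZES = (64, 128, 256)  # assumed region sizes
STRIDE = 16                    # assumed step between neighbouring regions

# Region-based approach: one classifier run per region.
per_region_total = sum(
    sliding_window_evaluations(IMG_W, IMG_H, w, STRIDE) for w in WINDOW_SIZES
)

# Single-network approach: one run produces all boxes and class probabilities.
single_pass_total = 1

print(f"classifier runs, region by region: {per_region_total}")
print(f"network runs, looking once:        {single_pass_total}")
```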

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
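
Turning down the detection threshold can be pictured as a simple filter over the detector's output. This is a minimal sketch, not Darknet's actual code: the `Detection` type, the `keep_confident` helper and the sample scores are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detected object (hypothetical structure for illustration)."""
    label: str
    confidence: float                # score between 0 and 1
    box: Tuple[int, int, int, int]   # (x, y, width, height) in pixels


def keep_confident(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up detector output for a single frame.
detections = [
    Detection("person", 0.92, (40, 60, 120, 300)),
    Detection("teddy bear", 0.55, (300, 200, 80, 90)),
    Detection("backpack", 0.31, (500, 260, 60, 80)),
    Detection("stop sign", 0.22, (610, 50, 40, 40)),
]

for threshold in (0.5, 0.2):
    kept = keep_confident(detections, threshold)
    print(threshold, [d.label for d in kept])
```

A lower threshold reports more objects, at the cost of admitting less certain ones.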

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
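
The talk does not say how the network was binarized. As one common scheme, assumed here for illustration, each weight is replaced by its sign and a single scaling factor equal to the mean absolute weight is kept, so that a layer stores one bit per weight plus one number. The function names below are hypothetical.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def binary_dot(alpha: float, signs: List[int], inputs: List[float]) -> float:
    """Dot product using only the signs and the shared scale factor."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))


weights = [0.5, -0.25, 0.75, -0.5]   # made-up full-precision weights
inputs = [1.0, 2.0, 3.0, 4.0]        # made-up activations

alpha, signs = binarize(weights)
exact = sum(w * x for w, x in zip(weights, inputs))
approx = binary_dot(alpha, signs, inputs)

print("scale:", alpha, "signs:", signs)
print("exact dot product:", exact, "binarized approximation:", approx)
```

The approximation loses some accuracy, which is the price paid for a model small and cheap enough for a phone.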

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
