Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well. I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute. So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.
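
As a rough illustration of why a detection result carries more information than a single label, here is a minimal Python sketch. The `Detection` class, the `is_left_of` helper, and every coordinate below are hypothetical values invented for this example; they are not Darknet's data structures or real outputs from the talk.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a label plus a bounding box in pixels (hypothetical layout)."""
    label: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    width: float
    height: float

    def area(self) -> float:
        """Size of the box, a crude proxy for how large the object appears."""
        return self.width * self.height

    def center(self) -> tuple:
        """Center point of the box as (x, y)."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def is_left_of(a: Detection, b: Detection) -> bool:
    """True when the center of `a` lies to the left of the center of `b`."""
    return a.center()[0] < b.center()[0]


# Invented boxes for the dog, cat, and book scene described above.
detections = [
    Detection("dog", 40, 60, 200, 300),
    Detection("cat", 300, 150, 120, 160),
    Detection("book", 460, 40, 80, 60),
]

for d in detections:
    print(f"{d.label}: center={d.center()}, area={d.area()}")

print("dog is left of cat:", is_left_of(detections[0], detections[1]))
```

A classifier would return only one label for the whole image; with boxes, relative position and apparent size can be computed directly from the output.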

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car. This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.
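
The speed figures mentioned so far can be checked with a few lines of arithmetic. This is only a sketch; the function name `frames_per_second` is made up for the example and is not part of Darknet.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert a per-image processing time into throughput."""
    return 1.0 / seconds_per_image


# Per-image processing times mentioned in the talk, in seconds.
latencies = {
    "first detector": 20.0,
    "10 times faster": 2.0,
    "another factor of 10": 0.2,
}

for name, seconds in latencies.items():
    fps = frames_per_second(seconds)
    print(f"{name}: {seconds} s per image = {fps:g} frames per second")
```

Two seconds per image divided by 10 is 0.2 seconds per image, which is where the five frames per second comes from.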

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
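
The difference between the two approaches can be sketched by counting network evaluations. Everything below is a stand-in: `CountingModel` does no real inference, and the image size, region size, and stride are assumptions chosen only to show how quickly the region count reaches the thousands.

```python
class CountingModel:
    """Stand-in for a neural network; it only counts how often it is evaluated."""

    def __init__(self):
        self.evaluations = 0

    def __call__(self, pixels):
        self.evaluations += 1
        return []  # a real network would return scores or boxes here


def sliding_regions(width, height, size, stride):
    """Yield (x, y, size, size) square regions that tile the image with overlap."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, size, size)


width, height = 640, 480  # assumed image size

# Older approach: run a classifier once per region.
region_classifier = CountingModel()
for region in sliding_regions(width, height, size=64, stride=8):
    region_classifier(region)

# Single-network approach: one evaluation over the whole image.
single_network = CountingModel()
single_network((0, 0, width, height))

print("region-based evaluations:", region_classifier.evaluations)
print("single-pass evaluations:", single_network.evaluations)
```

With these assumed settings the region-based loop evaluates the model several thousand times for one image, while the single network is evaluated once. Going from 20 seconds to 20 milliseconds per image is the factor of 1,000 quoted above (20 / 0.020 = 1,000).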

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
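
Turning down the detection threshold can be pictured as a simple filter over scored candidates. The sketch below is illustrative only: `filter_detections` is a hypothetical helper, and the labels and confidence scores are invented rather than taken from the demo.

```python
def filter_detections(candidates, threshold):
    """Keep only the (label, score) pairs whose score meets the threshold."""
    return [(label, score) for label, score in candidates if score >= threshold]


# Invented candidate detections with confidence scores between 0 and 1.
candidates = [
    ("person", 0.92),
    ("teddy bear", 0.61),
    ("backpack", 0.48),
    ("person", 0.31),
    ("stop sign", 0.27),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print("threshold 0.5:", strict)
print("threshold 0.25:", relaxed)
```

A lower threshold lets more candidates through, which is why more people in the audience show up, at the cost of admitting less certain detections.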

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
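
The talk does not say how the network was binarized. As a hedged illustration of the general idea, one common formulation replaces each weight with its sign and keeps a single shared scale (the mean absolute value), so each weight needs one bit plus a shared number. The functions `binarize` and `approximate` and the sample weights are assumptions made for this sketch, not the method used for the phone demo.

```python
def binarize(weights):
    """Return (scale, signs): a shared scale and a +1/-1 sign per weight."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def approximate(scale, signs):
    """Rebuild approximate weights from the shared scale and the signs."""
    return [scale * s for s in signs]


# Invented full-precision weights for illustration.
weights = [0.5, -0.25, 0.75, -0.5]

scale, signs = binarize(weights)
restored = approximate(scale, signs)

print("scale:", scale)
print("signs:", signs)
print("approximation:", restored)
```

The approximation loses precision in exchange for much smaller storage and cheaper arithmetic, which is the kind of trade-off that makes running on a phone plausible.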

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)