Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
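
The speed figures quoted so far can be checked with a line of arithmetic: frame rate is the reciprocal of the time spent per image. The sketch below is only an illustration of that conversion; the function name and the labels in the dictionary are made up for this example and are not part of Darknet.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image processing time into a frame rate."""
    if seconds_per_image <= 0:
        raise ValueError("seconds_per_image must be positive")
    return 1.0 / seconds_per_image


# Per-image times mentioned in the talk up to this point (in seconds).
# The labels are illustrative only.
latencies = {
    "early detector": 20.0,
    "10x faster": 2.0,
    "100x faster": 0.2,
}

rates = {name: frames_per_second(t) for name, t in latencies.items()}

for name, fps in rates.items():
    print(f"{name}: {latencies[name]} s per image = {fps:g} frames per second")
```

Two seconds per image is half a frame per second, which is why the scene has already changed by the time a prediction appears; 0.2 seconds per image gives the five frames per second mentioned above.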

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
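
The difference between the two approaches comes down to how many times the network is evaluated per image. The following is a minimal sketch of that contrast, not the actual Darknet or YOLO code: the image size, window size, stride, and the two stub functions standing in for real networks are all assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]        # (x, y, width, height) in pixels
Detection = Tuple[Box, str, float]     # (box, label, score)


def sliding_regions(img_w: int, img_h: int, win: int, stride: int) -> List[Box]:
    """Enumerate square candidate regions that fit inside the image."""
    return [(x, y, win, win)
            for y in range(0, img_h - win + 1, stride)
            for x in range(0, img_w - win + 1, stride)]


def detect_per_region(regions: List[Box],
                      classify: Callable[[Box], Tuple[str, float]],
                      threshold: float) -> List[Detection]:
    """Older approach: one classifier evaluation for every region."""
    detections = []
    for box in regions:
        label, score = classify(box)
        if score >= threshold:
            detections.append((box, label, score))
    return detections


def detect_single_pass(network: Callable[[], List[Detection]],
                       threshold: float) -> List[Detection]:
    """Single-pass approach: one evaluation yields all boxes and scores."""
    return [d for d in network() if d[2] >= threshold]


calls = {"classifier": 0, "network": 0}


def stub_classifier(box: Box) -> Tuple[str, float]:
    # Hypothetical stand-in for a real classifier; it "sees" a dog in one region.
    calls["classifier"] += 1
    return ("dog", 0.9) if box == (64, 64, 64, 64) else ("background", 0.0)


def stub_network() -> List[Detection]:
    # Hypothetical stand-in for a real detector; the boxes and scores are made up.
    calls["network"] += 1
    return [((64, 64, 64, 64), "dog", 0.9),
            ((200, 120, 80, 60), "cat", 0.8),
            ((10, 10, 20, 20), "book", 0.1)]


regions = sliding_regions(img_w=416, img_h=416, win=64, stride=8)
region_hits = detect_per_region(regions, stub_classifier, threshold=0.5)
single_hits = detect_single_pass(stub_network, threshold=0.5)

print(f"per-region approach: {calls['classifier']} evaluations")
print(f"single-pass approach: {calls['network']} evaluation")
```

Even with a single window size, the per-region loop calls the classifier a couple of thousand times for one image, while the single-pass function calls its network once.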

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
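
Turning down the detection threshold, as in the demo, means keeping candidate detections with lower confidence scores. Here is a small sketch of that idea; the `Detection` class, the field names, and every score and coordinate in it are hypothetical and are not taken from Darknet.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float   # score in [0, 1]
    x: float            # box center and size, as fractions of the image
    y: float
    w: float
    h: float


def filter_detections(candidates: List[Detection], threshold: float) -> List[Detection]:
    """Keep only the candidates whose confidence reaches the threshold."""
    return [d for d in candidates if d.confidence >= threshold]


# Made-up candidates, loosely modeled on the objects named in the demo.
candidates = [
    Detection("teddy bear", 0.82, 0.30, 0.55, 0.10, 0.15),
    Detection("person", 0.64, 0.50, 0.40, 0.08, 0.20),
    Detection("person", 0.31, 0.72, 0.35, 0.05, 0.12),
    Detection("stop sign", 0.27, 0.15, 0.20, 0.04, 0.04),
    Detection("backpack", 0.22, 0.60, 0.70, 0.06, 0.08),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.2)

print("threshold 0.5:", [d.label for d in strict])
print("threshold 0.2:", [d.label for d in relaxed])
```

A lower threshold surfaces more objects, such as distant audience members, at the cost of admitting more uncertain predictions.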

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
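
The talk does not describe how the binarization was done. One common formulation, assumed here purely for illustration, replaces each weight with its sign and keeps a single scale factor equal to the mean absolute weight, so a dot product needs only additions, subtractions, and one multiplication. The function names and the numbers below are hypothetical.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as scale * sign(w), with scale = mean(|w|)."""
    if not weights:
        raise ValueError("weights must not be empty")
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dot_full(weights: List[float], inputs: List[float]) -> float:
    """Ordinary dot product: one multiplication per weight."""
    return sum(w * x for w, x in zip(weights, inputs))


def dot_binarized(scale: float, signs: List[int], inputs: List[float]) -> float:
    """Binarized dot product: add or subtract each input, then scale once."""
    total = sum(x if s > 0 else -x for s, x in zip(signs, inputs))
    return scale * total


# Made-up weights and inputs for a single neuron.
weights = [0.5, -0.25, 0.75, -0.5]
inputs = [4.0, 1.0, 2.0, 1.0]

scale, signs = binarize(weights)
exact = dot_full(weights, inputs)
approx = dot_binarized(scale, signs, inputs)

print(f"scale={scale}, signs={signs}")
print(f"full precision: {exact}, binarized approximation: {approx}")
```

The binarized result only approximates the full-precision one; the trade is accuracy for much cheaper arithmetic and storage, which is what makes it attractive on a phone.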

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
