Fei-Fei Li: How we're teaching computers to understand pictures

Let me show you something.

(Video) Girl: Okay, that's a cat sitting in a bed. The boy is petting the elephant. Those are people that are going on an airplane. That's a big airplane.

Fei-Fei Li: This is a three-year-old child describing what she sees in a series of photos. She might still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that can play only music we like. Yet, our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most frontier and potentially revolutionary technologies in computer science.

Yes, we have prototyped cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over massive land, but don't have enough vision technology to help us to track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view, and you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.

"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting lights into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding. In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.
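
To make the "lifeless numbers" point concrete, here is a minimal sketch of what a camera hands to a computer. The pixel values and the helper names `image_shape` and `mean_brightness` are made up for illustration; nothing here comes from the talk beyond the idea of a two-dimensional array of numbers.

```python
# A tiny 3x4 grayscale "image": each number is one pixel's brightness (0-255).
# The values are invented for illustration only.
image = [
    [0,   0, 255, 255],
    [0, 128, 128, 255],
    [0,   0, 255, 255],
]


def image_shape(pixels):
    """Return (height, width) of a rectangular 2D pixel grid."""
    return len(pixels), len(pixels[0])


def mean_brightness(pixels):
    """Average every pixel value in the grid."""
    height, width = image_shape(pixels)
    total = sum(sum(row) for row in pixels)
    return total / (height * width)


print(image_shape(image))
print(mean_brightness(image))
```

The program can measure the grid and average it, but no line of it knows whether the picture shows a cat or a runway. That gap between numbers and meaning is the problem described above.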

So for 15 years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence. So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them.

The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In its simplest terms, imagine this teaching process as showing the computers some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? After all, a cat is just a collection of shapes and colors, and this is what we did in the early days of object modeling. We'd tell the computer algorithm in a mathematical language that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It's all curled up. Now you have to add another shape and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple as a household pet can present an infinite number of variations to the object model, and that's just one object.
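
A hand-written object model of the kind described above might look roughly like the sketch below. The `Shape` fields and the `looks_like_cat` rule are hypothetical, chosen only to mirror the four features named in the talk; they are not the actual early models.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Features a hypothetical detector reports for one object in a photo."""
    round_face: bool
    chubby_body: bool
    pointy_ears: int
    long_tail: bool


def looks_like_cat(s: Shape) -> bool:
    """Hand-written rule: round face, chubby body, two pointy ears, long tail."""
    return s.round_face and s.chubby_body and s.pointy_ears == 2 and s.long_tail


sitting_cat = Shape(round_face=True, chubby_body=True, pointy_ears=2, long_tail=True)
# Curled up: one ear and the tail are hidden from the camera.
curled_up_cat = Shape(round_face=True, chubby_body=True, pointy_ears=1, long_tail=False)

print(looks_like_cat(sitting_cat))
print(looks_like_cat(curled_up_cat))
```

The rule accepts the sitting cat and rejects the curled-up one, so each new pose or occlusion would need yet another hand-written case.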

So about eight years ago, a very simple and profound observation changed my thinking. No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples. If you consider a child's eyes as a pair of biological cameras, they take one picture about every 200 milliseconds, the average time an eye movement is made. So by age three, a child would have seen hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.
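
The "hundreds of millions" figure can be checked with a quick back-of-the-envelope calculation. The 200-millisecond interval is from the talk; the 12 waking hours per day and the function name `pictures_seen` are assumptions made here for the sketch.

```python
MS_PER_PICTURE = 200  # one "picture" per eye movement, as stated in the talk


def pictures_seen(years, waking_hours_per_day):
    """Estimate how many pictures a pair of 'biological cameras' takes."""
    waking_ms = years * 365 * waking_hours_per_day * 3600 * 1000
    return waking_ms // MS_PER_PICTURE


# Assumption: roughly 12 waking hours a day for the first three years.
print(pictures_seen(3, 12))
```

With those assumptions the estimate is 236,520,000 pictures. Even with fewer waking hours per day the count stays in the hundreds of millions.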

Once we knew this, we knew we needed to collect a data set that has far more images than we have ever had before, perhaps thousands of times more, and together with Professor Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction of the imagery a child's mind takes in in the early developmental years.
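
The talk does not say how the answers from many workers were combined. One common approach to crowdsourced labeling is to show the same image to several workers and keep a label only when enough of them agree. The sketch below assumes that approach; `majority_label` and the 0.6 agreement threshold are illustrative choices, not ImageNet's documented procedure.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.6):
    """Return the most common label if enough workers agree, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None  # too much disagreement: discard or send for more votes


print(majority_label(["cat", "cat", "dog"]))
print(majority_label(["cat", "dog"]))
```

In this sketch a candidate image that splits the workers gets no label. That is one way a pool of nearly a billion candidates could be cleaned down to a much smaller labeled set.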

In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey for quite a while. Some very friendly colleagues advised me to do something more useful for my tenure, and we were constantly struggling for research funding. Once, I even joked to my graduate students that I would just reopen my dry cleaner's shop to fund ImageNet. After all, that's how I funded my college years.

So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species of domestic and wild cats. We were thrilled to have put together ImageNet, and we wanted the whole research world to benefit from it, so in the TED fashion, we opened up the entire data set to the worldwide research community for free. (Applause)

Now that we have the data to nourish our computer brain, we're ready to come back to the algorithms themselves. As it turned out, the wealth of information provided by ImageNet was a perfect match to a particular class of machine learning algorithms called convolutional neural networks, pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun back in the 1970s and '80s. Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands or even millions of nodes are organized in hierarchical layers, also similar to the brain. A typical neural network we use to train our object recognition model has 24 million nodes, 140 million parameters, and 15 billion connections. That's an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here's a computer algorithm telling us the picture contains a boy and a teddy bear; a dog, a person, and a small kite in the background; or a picture of very busy things like a man, a skateboard, railings, a lamppost, and so on. Sometimes, when the computer is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer instead of committing too much, just like we would do, but other times our computer algorithm is remarkable at telling us what exactly the objects are, like the make, model, year of the cars.
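
The node-and-layer idea can be sketched in a few lines. This is a deliberately tiny, fully connected toy with hand-picked weights, not a convolutional network and not the model from the talk; the names `node`, `layer`, and `forward`, and the sigmoid squashing function, are assumptions made for the illustration.

```python
import math


def node(inputs, weights, bias):
    """One neuron-like node: weighted sum of its inputs, squashed into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid


def layer(inputs, weight_rows, biases):
    """A layer is many nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Pass the outputs of each layer on as the inputs of the next."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


# Hand-picked weights for illustration: 2 inputs -> 3 nodes -> 1 node.
network = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.5]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]

output = forward([0.2, 0.8], network)
print(output)
```

This toy has 4 nodes and 13 parameters. The network described above has 24 million nodes and 140 million parameters, and its weights are learned from the data rather than typed in by hand.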

We applied this algorithm to millions of Google Street View images across hundreds of American cities, and we have learned something really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices also correlate well with crime rates in cities, or voting patterns by zip codes.
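
"Correlate" here can be read as a statistical correlation coefficient. The talk does not name the measure that was used, so the sketch below assumes the common Pearson coefficient. The function name `pearson` and the input numbers are placeholders, not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: +1 rises together, -1 moves oppositely, 0 unrelated."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Placeholder sequences only; these are not measurements of anything.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```

In the setting described above, one sequence would hold the average detected car price per zip code and the other would hold a statistic such as household income for the same zip codes.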

So wait a minute. Is that it? Has the computer already matched or even surpassed human capabilities? Not so fast. So far, we have just taught the computer to see objects. This is like a small child learning to utter a few nouns. It's an incredible accomplishment, but it's only the first step. Soon, another developmental milestone will be hit, and children begin to communicate in sentences. So instead of saying this is a cat in the picture, you already heard the little girl telling us this is a cat lying on a bed.

So to teach a computer to see a picture and generate sentences, the marriage between big data and machine learning algorithm has to take another step. Now, the computer has to learn from both pictures as well as natural language sentences generated by humans. Just like the brain integrates vision and language, we developed a model that connects parts of visual things like visual snippets with words and phrases in sentences.
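
The talk does not describe how visual snippets are connected to words. One way to picture it is to give each snippet and each word a vector of numbers and pair a snippet with the word whose vector points in the most similar direction. The sketch below assumes that picture; the vectors are hand-picked and the names `cosine` and `best_word` are hypothetical, whereas a real model would learn such vectors from pictures and human-written sentences.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_word(snippet_vec, word_vecs):
    """Pick the word whose vector is most similar to the snippet's vector."""
    return max(word_vecs, key=lambda word: cosine(snippet_vec, word_vecs[word]))


# Hand-picked toy vectors; a trained model would learn these.
word_vecs = {
    "cat": [0.9, 0.1, 0.0],
    "bed": [0.1, 0.9, 0.1],
    "airplane": [0.0, 0.1, 0.9],
}
snippet = [0.8, 0.2, 0.1]  # stands in for a cropped region of a photo

print(best_word(snippet, word_vecs))
```

Here the snippet lands closest to "cat". Generating a sentence such as "a cat lying on a bed" additionally requires a language component that orders the words, which this sketch leaves out.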

About four months ago, we finally tied all this together and produced one of the first computer vision models that is capable of generating a human-like sentence when it sees a picture for the first time. Now, I'm ready to show you what the computer says when it sees the picture that the little girl saw at the beginning of this talk.

(Video) Computer: A man is standing next to an elephant. A large airplane sitting on top of an airport runway.

FFL: Of course, we're still working hard to improve our algorithms, and it still has a lot to learn. (Applause)

And the computer still makes mistakes.

(Video) Computer: A cat lying on a bed in a blanket.

FFL: So of course, when it sees too many cats, it thinks everything might look like a cat.

(Video) Computer: A young boy is holding a baseball bat. (Laughter)

FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.

(Video) Computer: A man riding a horse down a street next to a building. (Laughter)

FFL: We haven't taught Art 101 to the computers.

(Video) Computer: A zebra standing in a field of grass.

FFL: And it hasn't learned to appreciate the stunning beauty of nature like you and I do.

So it has been a long journey. To get from age zero to three was hard. The real challenge is to go from three to 13 and far beyond. Let me remind you with this picture of the boy and the cake again. So far, we have taught the computer to see objects or even tell us a simple story when seeing a picture.

(Video) Computer: A person sitting at a table with a cake.

FFL: But there's so much more to this picture than just a person and a cake. What the computer doesn't see is that this is a special Italian cake that's only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father after a trip to Sydney, and you and I can all tell how happy he is and what's exactly on his mind at that moment.

This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have extra pairs of tireless eyes to help them to diagnose and take care of patients. Cars will run smarter and safer on the road. Robots, not just humans, will help us to brave the disaster zones to save the trapped and wounded. We will discover new species, better materials, and explore unseen frontiers with the help of the machines.

Little by little, we're giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won't be the only ones pondering and exploring our world. We will not only use the machines for their intelligence, we will also collaborate with them in ways that we cannot even imagine.

This is my quest: to give computers visual intelligence and to create a better future for Leo and for the world.

Thank you.

(Applause)
