Fei-Fei Li: How we're teaching computers to understand pictures

Let me show you something.

اجازه دهید چیزی را به شما نشان دهم.

(Video) Girl: Okay, that's a cat sitting in a bed. The boy is petting the elephant. Those are people that are going on an airplane. That's a big airplane.

(ویدیو)دختر: بسیار خوب، آن گربه روی یک تخت خواب نشسته است. این پسر در حال نوازش فیل است. آنها مردمی هستند در حال سوار شدن به هواپیما. این یک هواپیمای بزرگ است.

Fei-Fei Li: This is a three-year-old child describing what she sees in a series of photos. She might still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that can play only music we like. Yet, our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most frontier and potentially revolutionary technologies in computer science.

فی-فی-لی: این یک کودک سه ساله است که آنچه که در مجموعه ای از عکسها می‎بیند را توصیف می‎کند. ممکن است او هنوز چیزهای زیادی برای یادگیری درباره این جهان داشته باشد. اما او در یک کار خیلی مهم دیگه تخصص دارد: درک کردن آنچه که می‎بیند. جامعه ما از لحاظ فناوری از هر زمان دیگر پیشرفته‎تر است. ما آدمها را به ماه می‎فرستیم، تلفنهایی ساختیم که با ما صحبت می‎کنند یا ایستگاههای رادیویی سفارشی طراحی کردیم که می توانند فقط موسیقی را که دوست داریم پخش کنند. با این حال پیشرفته ترین ماشینها و رایانه‎های ما هنوز هم در این کار (درک تصاویر) مشکل دارند. بنابراین امروز من اینجا هستم که یک گزارش پیشرفت به شما بدهم در مورد آخرین پیشرفت در تحقیق ما بر روی بینایی رایانه‎ای، یکی از پیشرفته‎ترین و بصورت بالقوه انقلابی‎ترین فن آوریها در علوم رایانه‎ای.

Yes, we have prototyped cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over massive land, but don't have enough vision technology to help us to track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view, and you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.

بله، ما نمونه اولیه ماشینهایی را داریم که خودشان می‎توانند رانندگی کنند، اما بدون دید هوشمند (smart vision) نمی توانند فرق بگذارند بین پاکت کاغذی مچاله در جاده که میشه از روش با ماشین رد شد. و یک سنگ همان اندازه که نباید از روش رد شد ما دوربینهای (با وضوح) مگاپیکسل عالی ساخته ایم، اما به نابیناها بینایی نداده‎ایم. هواپیماهای بدون سرنشین که برفراز زمینهای وسیع پرواز کنند، ولی فناوری بینایی کافی برای کمک به ما در رهگیری تغییرات جنگلهای بارانی نداریم. دوربین های امنیتی همه جا هست، ولی وقتی یک کودک در استخر در حال غرق شدن است به ما هشدار نمیدهند. تصاویر و ویدیوها در حال تبدیل شدن به جز مهمی از زندگی جهانی هستند. تصاویر با سرعتی فراتر از آنچه هر انسان یا گروهی از انسانها، بتواند امیدوار به دیدن آنها باشد تولید می‎شوند، و من و شما در این TED یعنی تولید تصاویر مشارکت می‎کنیم. با این وجود پیشرفته‎ترین نرم افزارها همچنان در فهم و مدیریت این حجم عظیم مشکل دارند. به عبارت دیگر در مجموع به عنوان جامعه ما کاملا کور هستیم، چون باهوشترین ماشینهای ما هنوز نابینا هستند.

"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting lights into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding. In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.

شاید بپرسید "چرا انقدر سخته؟" دوربین‎ها می‎توانند تصاویری مثل این را بگیرند: با تبدیل نور به آرایه دو بعدی اعداد به نام "پیکسل" ولی اینها فقط اعداد بی روح هستند، هیچ معنی به خودی خود ندارند. درست همان‌طور که شنیدن با گوش کردن یکی نیست، عکس گرفتن هم با دیدن یکی نیست، و منظور ما از دیدن در واقع فهمیدن است. در حقیقت ۵۴۰ میلیون سال وقت مادر طبیعت صرف انجام این کار سخت شده و بیشتر این تلاش به تکامل ابزار پردازش دید مغزمان اختصاص داده شده و نه به خود چشمها. پس، دیدن با چشم آغاز میشود، ولی در حقیقت در مغز شکل می‌گیرد.
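To make the "two-dimensional array of numbers" concrete, here is a minimal sketch in Python. The 4x4 grid and its brightness values are invented for illustration and do not come from the talk. It shows that a program can compute statistics over pixels without knowing anything about what the picture depicts.

```python
# A tiny 4x4 grayscale "photo": each number is one pixel's brightness
# (0 = black, 255 = white). The values are made up for illustration.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 0,   0],
    [255, 255, 0,   0],
]

height = len(image)
width = len(image[0])

# Arithmetic over the numbers is easy; meaning is not in them.
mean_brightness = sum(sum(row) for row in image) / (height * width)

print(f"{width}x{height} pixels, mean brightness {mean_brightness}")
```

Nothing in these sixteen numbers says "checkerboard", let alone "cat"; that interpretation has to come from a model.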

So for 15 years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence. So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them.

برای ۱۵ سال با شروع از دکترا در کل‌تک و سپس رهبری آزمایشگاه بینایی در استانفورد، من با مربی هایم، همکارانم و شاگردانم تلاش کرده ام که به رایانه ها یاد بدهیم که ببینند. اسم زمینه تحقیقاتی ما بینایی رایانه ای و آموزش ماشین هست. این بخشی از زمینه عمومی تر هوش مصنوعی هست در نهایت میخواهیم به ماشین ها یاد بدهیم که ببینند همانند ما: اسم گذاشتن بر روی اشیا، تشخیص افراد ، استنباط سه بعدی از اشیا فهم ارتباط، احساسات، اعمال و نیت ها. من و شما وقتی نگاهمون به آدمها، مکانها و اشیا میافتد دربارشون قصه میسازیم.

The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In its simplest terms, imagine this teaching process as showing the computers some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? After all, a cat is just a collection of shapes and colors, and this is what we did in the early days of object modeling. We'd tell the computer algorithm in a mathematical language that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It's all curled up. Now you have to add another shape and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple as a household pet can present an infinite number of variations to the object model, and that's just one object.

اولین قدم در راه این هدف این هست که به رایانه‎ها یاد بدهیم تا اشیا را ببینند؛ سنگ بنای دنیای بصری. به ساده ترین حالت این فرایند آموزش را مانند نشان دادن تعدادی عکس آموزشی از یک شی خاص مثلا گربه ها به رایانه تصور کنید. و طراحی یک مدل (برای رایانه) که ازدیدن این عکسها یاد می‎گیرد. اینکار چقدر میتونه سخت باشه؟ بالاخره یک گربه مجموعه ایست از شکل ها و رنگها، و این کاری هست که در روزهای ابتدایی طراحی اشیا انجام می‎دادیم. ما به الگوریتم رایانه به زبان ریاضی می‎گوییم که یک گربه صورت گرد دارد، بدن تپل دارد، دو تا گوش تیز دارد و یک دم دراز و این کافی بود. ولی این یکی گربه چطور؟ (خنده حضار) این یکی کاملا خم شده حالا شما باید یک شکل و زاویه دید دیگه به مدل شی اضافه کنید ولی اگه گربه‎ها قایم شده باشند چی؟ این گربه های بامزه چطور؟ جالا متوجه منظور من می‎شوید. حتی یک چیز ساده مثل حیوان خانگی میتونه مدلهای بینهایت گونه گون از مدل شی را ارائه کند، و این تازه فقط یک شی هست.
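The hand-written object model described above can be sketched as a fixed rule. This is an illustrative toy, not the actual early systems: the `Shape` fields and the `looks_like_cat` function are hypothetical names chosen to mirror the four features listed in the talk.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Hypothetical features a detector might report for one object."""
    round_face: bool
    chubby_body: bool
    pointy_ears: int
    long_tail: bool


def looks_like_cat(s: Shape) -> bool:
    # The rule from the talk: round face, chubby body, two pointy ears, long tail.
    return s.round_face and s.chubby_body and s.pointy_ears == 2 and s.long_tail


sitting_cat = Shape(round_face=True, chubby_body=True, pointy_ears=2, long_tail=True)
# A curled-up cat: one ear and the tail are hidden from view.
curled_up_cat = Shape(round_face=True, chubby_body=True, pointy_ears=1, long_tail=False)

print(looks_like_cat(sitting_cat))    # True
print(looks_like_cat(curled_up_cat))  # False, although it is still a cat
```

Every new pose or occlusion would need another hand-written rule, which is the scaling problem the paragraph describes.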

So about eight years ago, a very simple and profound observation changed my thinking. No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples. If you consider a child's eyes as a pair of biological cameras, they take one picture about every 200 milliseconds, the average time an eye movement is made. So by age three, a child would have seen hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.

تقریبا هشت سال پیش یک مشاهده ساده و عمیق طرز فکر من را تغییر داد. کسی به یک کودک نمی‎گه چطور ببیند، به ویژه در سالهای ابتدایی. اونها این کار را از طریق تجربیات و مثالهای دنیای واقعی یاد می‎گیرند. اگر چشمهای یک کودک را مثل یک جفت دوربین بیولوژیک در نظر بگیرید، آنها هر ۲۰۰ میلی ثانیه یک تصویر می‎گیرند، مدت زمان متوسطی که حرکت چشم صورت می‎گیرد. پس تا سه سالگی یک کودک صدها میلیون تصویر از دنیای واقعی دیده این تعداد زیادی از مثال‎های آموزشی هست. پس بجای تمرکزصرف بر الگوریتمهای بهتر و بهتر نگرش من این بود که به الگوریتمها ـآن دسته از داده‎های آموزشی که به یک کودک از طریق تجربه داده می‎شود را در همان حجم و کیفیت بدهیم.
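The "hundreds of millions of pictures" figure can be checked with quick arithmetic. The 200-millisecond interval and the age of three come from the talk; the 12 waking hours per day is an assumption added here for the estimate.

```python
MS_PER_PICTURE = 200        # from the talk: one picture about every 200 ms
YEARS = 3                   # from the talk: by age three
WAKING_HOURS_PER_DAY = 12   # assumption for this estimate, not from the talk

waking_seconds = YEARS * 365 * WAKING_HOURS_PER_DAY * 3600
pictures = waking_seconds * 1000 // MS_PER_PICTURE

print(f"{pictures:,} pictures")  # 236,520,000 pictures
```

Under that assumption the estimate is roughly 237 million; with eyes open around the clock it would double, so "hundreds of millions" holds either way.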

Once we knew this, we knew we needed to collect a data set that has far more images than we have ever had before, perhaps thousands of times more, and together with Professor Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction of the imagery a child's mind takes in in the early developmental years.

وقتی این را فهمیدیم متوجه شدیم که به جمع آوری مجموعه اطلاعات نیاز داریم که خیلی بیشتر از آنچه تاکنون داشته ایم عکس داشته باشد، احتمالا هزاران بار بیشتر، و با همکاری پرفسور کای لی در دانشگاه پرینستون ما پروژه ImageNet را در سال ۲۰۰۷ راه اندازی کردیم. خوشبختانه احتیاج نداشتیم که یک دوربین روی سرمان نصب کنیم و سالها منتظر بمانیم. رفتیم سراغ اینترنت بزرگترین گنجینه عکسها که انسانها تاکنون آفریده اند. نزدیک به یک میلیارد عکس دانلود کردیم و از فناوری CrowdSourcing همانند Amazon Mechanical Turk platform استفاده کردیم تا برای برچسب زدن این عکسها به ما کمک کند. در اوج خودش، ImageNet از بزرگترین کارفرماهای Amazon Mechanical Turk بود در مجموع تقریبا ۵۰٫۰۰۰ کارمند از ۱۶۷ کشور جهان به ما کمک کردند تا نزدیک به یک میلیارد عکس منتخب را اصلاح، منظم و برچسب گذاری کنند. این میزانی بود که زحمت برد برای ثبت کسری از تصویرگری که ذهن یک کودک در سالهای اولیه تکامل خود انجام می‎دهد.

In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey for quite a while. Some very friendly colleagues advised me to do something more useful for my tenure, and we were constantly struggling for research funding. Once, I even joked to my graduate students that I would just reopen my dry cleaner's shop to fund ImageNet. After all, that's how I funded my college years.

پس از گذشت زمان و کسب تجربه ایده استفاده از حجم عظیم داده‎ها برای آموزش الگوریتم رایانه‎ها، شاید الان بدیهی بنظر برسد، ولی قبلا در سال ۲۰۰۷ انقدر واضح نبود. ما توی این سفر برای مدتی کاملا تنها بودیم. بعضی از همکاران نزدیکم به من توصیه کردند که برای استخدام قطعی من کار مفیدتری بکنم و مدام برای بودجه تحقیقاتی مشکل داشتیم. یکبار با دانشجوهای تحصیلات تکمیلی‎ام شوخی کردم که برای تامین بودجه ImageNet خشکشویی‎ام را دوباره باز می‎کنم. بهر حال این راهی بود که من پول تحصیل‎ام را در آورده بودم.

So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species of domestic and wild cats. We were thrilled to have put together ImageNet, and we wanted the whole research world to benefit from it, so in the TED fashion, we opened up the entire data set to the worldwide research community for free. (Applause)

پس ادامه دادیم. در سال ۲۰۰۹ پروژه ImageNet یک پایگاه داده از ۱۵ میلیون عکس در وسعت ۲۲٫۰۰۰ کلاس از شی ها که با کلمات انگلیسی روزمره منظم شده بودند تحویل داد. از لحاظ کیفیت و کمیت این مقیاس بی‎سابقه بود. بعنوان مثال در مورد گربه‎ها بیش از ۶۲٫۰۰۰ (تصویر) گربه در انواع شکل ها و فرم بدن و در تمام گونه‌های اهلی و وحشی داشتیم. ما از اینکه ImageNet را ساخته بودیم هیجان زده بودیم و می‎خواستیم که تمام دنیای تحقیقات از آن بهره ببرند پس به شیوه TED تمام مجموعه داده را برای دنیای تحقیقات بصورت رایگان باز کردیم. (تشویق حضار)

Now that we have the data to nourish our computer brain, we're ready to come back to the algorithms themselves. As it turned out, the wealth of information provided by ImageNet was a perfect match to a particular class of machine learning algorithms called convolutional neural networks, pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun back in the 1970s and '80s. Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands or even millions of nodes are organized in hierarchical layers, also similar to the brain. In a typical neural network we use to train our object recognition model, it has 24 million nodes, 140 million parameters, and 15 billion connections. That's an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here's a computer algorithm telling us the picture contains a boy and a teddy bear; a dog, a person, and a small kite in the background; or a picture of very busy things like a man, a skateboard, railings, a lamppost, and so on. Sometimes, when the computer is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer instead of committing too much, just like we would do, but other times our computer algorithm is remarkable at telling us what exactly the objects are, like the make, model, year of the cars.

حالا که داده‎ها را برای تغذیه مغز رایانه هایمان داریم، آماده ایم که برگردیم سراغ خود الگوریتم ها. اینطور شد که وفور اطلاعات تهیه شده توسط ImageNet خیلی خوب به کلاس خاصی از الگوریتمهای یادگیری ماشینی به نام "شبکه های عصبی در هم تنیده" تطابق داشت، که پیشگامانش کونیهیکو فوکوشیما و جف هینتون و یان لیکان در دهه‎های ۱۹۷۰ و ۱۹۸۰ بودند. درست مثل مغز که از میلیاردها نورون پیوسته تشکیل شده یک واحد عملیاتی بنیادی در یک شبکه عصبی یک گره نورون-مانند است. از گره‎های دیگر ورودی می‎گیرد و و خروجی را به دیگر گره‎ها می‎فرستند. به علاوه، این صدها یا هزاران یا حتی میلیونها گره در لایه‎هایی با سلسله مراتب منظم شده‎اند، مانند مغز. در یک شبکه عصبی نوعی، برای آموزش مدل تشخیص اشیا، ۲۴ میلیون گره، ۱۴۰ میلیون پارامتر، و ۱۵ میلیارد اتصال وجود دارد. این یک مدل عظیم است. با استفاده از نیروی عظیم داده ها از ImageNet و CPU و GPU های مدرن برای آموزش چنین مدل یکدستی، "شبکه عصبی در هم تنیده"... به شکلی که کسی انتظار نداشت شکوفا شد. تبدیل شد به معماری برتر برای تولید نتایج تازه و هیجان انگیز در تشخیص اشیا. این یک کامپیوتر هست که به ما میگه این تصویر شامل یک گربه است و اینکه گربه کجاست. البته چیزهای بیشتری از گربه وجود دارد، پس این یک الگوریتم رایانه‎ای هست که به ما می‎گوید تصویر شامل یک پسر هست و یک عروسک خرس؛ یک سگ، یک آدم، و بادبادک کوچک در پس زمینه؛ یا تصویر چیزهای شلوغ‎تر مثل یک مرد، تخته اسکیت، نرده‎ها، تیر چراغ برق و چیزهای دیگر. بعضی وقتها که رایانه مطمئن نیست از چیزی که به آن نگاه می‎کند، بهش یاد دادیم که به اندازه کافی باهوش باشد تا به جای کار زیادی یک جواب مطمئن به ما بدهد، درست مثل کاری که ما انجام می‎دهیم، ولی در موارد دیگر الگوریتم رایانه ای ما در گفتن اینکه اشیا چه هستند فوق العاده است مثل نوع ، مدل و سال ساخت ماشین.
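As a rough illustration of what one layer of neuron-like nodes computes in a convolutional network, here is a minimal pure-Python sketch. It is not the model from the talk: the image, the kernel weights, and the `conv2d` helper are invented for illustration, and in a real network the weights are learned from data rather than written by hand. Like most deep learning libraries, the sketch slides the kernel without flipping it (strictly speaking, cross-correlation).

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output node takes a weighted sum of a small input patch.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(max(0, total))  # ReLU: keep only positive responses
        output.append(row)
    return output


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked weights that respond where brightness rises from left to right.
kernel = [[-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The output lights up only along the vertical edge. Stacking many such layers, each feeding the next, gives the hierarchical organization the paragraph describes.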

We applied this algorithm to millions of Google Street View images across hundreds of American cities, and we have learned something really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices also correlate well with crime rates in cities, or voting patterns by zip codes.

ما این الگوریتم را به میلیونها عکس "منظره خیابان گوگل" در صدها شهر آمریکا اعمال کردیم و چیز جالبی را متوجه شدیم: اول اینکه عقل سلیم ما را تایید کرد که قیمت خودرو وابستگی زیادی به درآمد خانوارها دارد. اما تعجب اینکه، قیمت خودرو بستگی زیادی هم به نرخ جرایم در شهرها، یا الگوی رای دادن در شهرها بر اساس کدپستی دارد.
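"Correlate" here means that two quantities tend to rise and fall together, which is usually measured with the Pearson correlation coefficient. The sketch below shows the calculation only. The five data points are invented placeholders, not results from the Street View study, and `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented numbers for five hypothetical zip codes (thousands of dollars).
avg_car_price = [12, 18, 25, 31, 40]
median_income = [35, 48, 60, 75, 98]

r = pearson(avg_car_price, median_income)
print(f"r = {r:.3f}")
```

A value near 1 means the two series move together closely; a value near 0 means no linear relationship. Correlation alone says nothing about which quantity, if either, causes the other.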

So wait a minute. Is that it? Has the computer already matched or even surpassed human capabilities? Not so fast. So far, we have just taught the computer to see objects. This is like a small child learning to utter a few nouns. It's an incredible accomplishment, but it's only the first step. Soon, another developmental milestone will be hit, and children begin to communicate in sentences. So instead of saying this is a cat in the picture, you already heard the little girl telling us this is a cat lying on a bed.

صبر کن ببینم! همین؟! آیا دیگر توانایی رایانه با توانایی انسان مطابقت دارد یا از آن پیشی گرفته؟ نه به این زودی. تا حالا به رایانه یاد دادیم که اشیا را ببیند. این مثل این هست که کودک یاد بگیرد چند اسم بگوید. این یک موفقیت باورنکردنی است، اما فقط اولین قدم است. بزودی یک مرحله مهم طی خواهد شد و کودکان یاد می‎گیرند تا بصورت گفتن جمله ارتباط برقرار کنند. پس به جای اینکه بگوید این یک گربه در این عکس است که قبلا شنیدید دختر کوچولو به ما گفت این یک گربه خوابیده روی تخت است.

So to teach a computer to see a picture and generate sentences, the marriage between big data and machine learning algorithm has to take another step. Now, the computer has to learn from both pictures as well as natural language sentences generated by humans. Just like the brain integrates vision and language, we developed a model that connects parts of visual things like visual snippets with words and phrases in sentences.

برای یاد دادن به رایانه که تصویری را ببیند و جملاتی تولید کند، پیوند بین داده‎های عظیم و الگوریتم آموزش ماشین باید گام دیگری بردارد. حالا رایانه باید هم از تصاویر یاد بگیرد هم از جملات زبان طبیعی که توسط انسان تولید می‎شوند. درست مثل مغز که بینایی و زبان را به هم می‎آمیزد ما هم مدلی ایجاد کردیم که قسمت های اجسام بصری مانند خرده تصاویر را به کلمات و عبارات در جملات پیوند میزند.
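One common way to connect image regions with words is to place both in a shared vector space and score each pair by a dot product. The sketch below illustrates that general idea only and is not necessarily how the model in the talk works. The region names, the words, and all vector values are made up.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings for two image regions ("visual snippets").
snippets = {
    "region_1": [0.9, 0.1, 0.0],
    "region_2": [0.0, 0.2, 0.9],
}
# Hypothetical embeddings for words, in the same three-dimensional space.
words = {
    "cat": [1.0, 0.0, 0.0],
    "lying": [0.0, 1.0, 0.0],
    "bed": [0.0, 0.0, 1.0],
}

# Align each region with the word whose vector scores highest against it.
alignment = {
    name: max(words, key=lambda w: dot(vec, words[w]))
    for name, vec in snippets.items()
}

print(alignment)  # {'region_1': 'cat', 'region_2': 'bed'}
```

In a trained system both sets of vectors would be learned from pictures paired with human-written sentences rather than typed in by hand.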

About four months ago, we finally tied all this together and produced one of the first computer vision models that is capable of generating a human-like sentence when it sees a picture for the first time. Now, I'm ready to show you what the computer says when it sees the picture that the little girl saw at the beginning of this talk.

حدود چهار ماه پیش، بالاخره همه اینها را به هم پیوند زدیم و یکی از اولین مدلهای دید رایانه‎ای را که وقتی یک تصویر را برای اولین بار می‎بیند قادر به تولید جملات همانند انسانها هست تولید کردیم. حالا آماده هستم که بهتون نشان دهم یک رایانه وقتی تصویری را می‎بیند که اون دختر کوچولوی اول سخنرانی دید، چه می‎گوید.

(Video) Computer: A man is standing next to an elephant. A large airplane sitting on top of an airport runway.

(صدای رایانه): یک مرد کنار یک فیل ایستاده است. یک هواپیمای بزرگ روی باند پروازفرودگاه نشسته.

FFL: Of course, we're still working hard to improve our algorithms, and it still has a lot to learn. (Applause)

(سخنران): البته ما هنوز داریم سخت تلاش می‎کنیم که الگوریتم‎مان را بهتر کنیم، و هنوز چیزهای زیادی هست که باید یاد بگیرد. (تشویق حضار)

And the computer still makes mistakes.

و رایانه هنوز اشتباه می‎کند.

(Video) Computer: A cat lying on a bed in a blanket.

(صدای رایانه): یک گربه زیر لحاف دراز کشیده روی تخت.

FFL: So of course, when it sees too many cats, it thinks everything might look like a cat.

(سخنران): قطعا وقتی تعداد زیادی گربه می‎بیند ممکن است فکر کند که همه چیز شبیه گربه است.

(Video) Computer: A young boy is holding a baseball bat. (Laughter)

(صدای رایانه): یک پسربچه یک چوب بیسبال در دست دارد. (خنده حضار)

FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.

(سخنران): و اگر مسواک ندیده باشد آن را با چوب بیسبال اشتباه می‎گیرد.

(Video) Computer: A man riding a horse down a street next to a building. (Laughter)

(صدای رایانه): مردی که در خیابان کنار یک ساختمان اسب سواری می‎کند. (خنده حضار)

FFL: We haven't taught Art 101 to the computers.

(سخنران): ما به رایانه‎ها کلاس هنر پایه تدریس نکردیم.

(Video) Computer: A zebra standing in a field of grass.

(صدای رایانه): یک گورخر ایستاده در زمینی پوشیده از علف.

FFL: And it hasn't learned to appreciate the stunning beauty of nature like you and I do.

(سخنران): و یاد نگرفته که قدر زیبایی مسحور کننده طبیعت را مثل من و شما بداند.

So it has been a long journey. To get from age zero to three was hard. The real challenge is to go from three to 13 and far beyond. Let me remind you with this picture of the boy and the cake again. So far, we have taught the computer to see objects or even tell us a simple story when seeing a picture.

بله، سفر درازی بوده تا از سن صفر به سه سالگی برسیم دشوار بود. سختی واقعی رفتن از سه سالگی به ۱۳ سالگی و فراتر هست. اجازه بدهید به شما با این تصویر پسر و کیک یادآوری کنم. تا الان به رایانه یاد دادیم که اجسام را ببیند یا حتی وقتی یک تصویر را می‎بیند یک داستان ساده به ما بگوید.

(Video) Computer: A person sitting at a table with a cake.

(صدای رایانه): یک شخص نشسته سر یک میز با یک کیک.

FFL: But there's so much more to this picture than just a person and a cake. What the computer doesn't see is that this is a special Italian cake that's only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father after a trip to Sydney, and you and I can all tell how happy he is and what's exactly on his mind at that moment.

(سخنران): اما در این عکس خیلی چیزهای دیگر غیر از یک آدم و کیک هست. چیزی که رایانه نمی‎بیند این است که این یک کیک مخصوص ایتالیایی که فقط در زمان عید پاک پخته می‎شود هست. پسر تی‎شرت مورد علاقه‎اش را پوشیده که توسط پدرش بعنوان هدیه بعد از سفر به سیدنی به او داده شده. و من و شما همه می‎توانیم بگویم که چقدر خوشحال هست و دقیقا در آن لحظه در ذهنش چه می‎گذرد.

This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have extra pairs of tireless eyes to help them to diagnose and take care of patients. Cars will run smarter and safer on the road. Robots, not just humans, will help us to brave the disaster zones to save the trapped and wounded. We will discover new species, better materials, and explore unseen frontiers with the help of the machines.

این پسر من "لیو" هست. در جستجوی من برای هوش بصری مدام به "لیو" فکر می‎کنم و آینده‎ای که او زندگی خواهد کرد. زمانی که ماشینها می‎توانند ببینند، پزشکان و پرستاران یک جفت چشم خستگی ناپذیراضافه خواهند داشت که به آنها کمک خواهد کرد برای تشخیص و مراقبت از بیماران. خودروها هوشمندانه‎تر و ایمن‎تر در جاده‎ها حرکت خواهند کرد. ربات‎ها، نه فقط انسانها به ما در خطرکردن در مناطق فاجعه‎زده برای نجات مصدومان و زخمی‎ها کمک خواهند کرد. گونه‎های جدید خواهیم یافت، مواد بهتر، و مرزهای نادیده را با کمک ماشینها اکتشاف خواهیم کرد.

Little by little, we're giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won't be the only ones pondering and exploring our world. We will not only use the machines for their intelligence, we will also collaborate with them in ways that we cannot even imagine.

کم کم داریم به ماشینها بینایی می‎بخشیم. ابتدا ما به آنها دیدن را می‎آموزیم. سپس آنها به ما کمک می‎کنند تا بهتر ببینیم. برای اولین بار چشمان انسان تنها چشمانی نخواهند بود که تفکر می‎کنند و جهان ما را کاوش می‎کنند. ما نه تنها از ماشینها برای هوش آنها استفاده می‎کنیم، بلکه با آنها به روش هایی که نمی‎توانیم تصور کنیم همکاری خواهیم کرد.

This is my quest: to give computers visual intelligence and to create a better future for Leo and for the world.

این جستجوی من است: تا به رایانه ها هوش بصری بدهم و آینده بهتری برای "لیو" و جهان خلق کنم.

Thank you.

متشکرم.

(Applause)

(تشویق حضار)