Fei-Fei Li: How we're teaching computers to understand pictures

Let me show you something.

(Video) Girl: Okay, that's a cat sitting in a bed. The boy is petting the elephant. Those are people that are going on an airplane. That's a big airplane.

Fei-Fei Li: This is a three-year-old child describing what she sees in a series of photos. She might still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that can play only music we like. Yet, our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most cutting-edge and potentially revolutionary technologies in computer science.

Yes, we have prototyped cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over vast stretches of land, but don't have enough vision technology to help us to track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view, and you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.

"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting light into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding. In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.
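To make the "lifeless numbers" point concrete, here is a minimal Python sketch. It assumes a tiny hypothetical 4×4 grayscale image stored as a list of rows, with each pixel's brightness running from 0 (black) to 255 (white). Simple statistics fall straight out of the array, but nothing in it says what the picture shows.

```python
# A hypothetical 4x4 grayscale "photo": each number is one pixel's brightness
# (0 = black, 255 = white). Real cameras produce millions of such values,
# usually with three channels (red, green, blue) per pixel.
image = [
    [12, 40, 38, 10],
    [45, 200, 210, 42],
    [50, 205, 198, 47],
    [11, 44, 41, 13],
]

height = len(image)
width = len(image[0])

# Simple statistics are easy to compute from the raw numbers...
mean_brightness = sum(sum(row) for row in image) / (height * width)
brightest = max(max(row) for row in image)

print(f"{height}x{width} pixels, mean brightness {mean_brightness:.1f}, max {brightest}")

# ...but no field in the array answers "is this a cat?". That meaning has to
# be inferred by a model, which is the hard part of computer vision.
```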

So for 15 years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence. So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them.

The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In its simplest terms, imagine this teaching process as showing the computers some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? After all, a cat is just a collection of shapes and colors, and this is what we did in the early days of object modeling. We'd tell the computer algorithm in a mathematical language that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It's all curled up. Now you have to add another shape and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple as a household pet can present an infinite number of variations to the object model, and that's just one object.
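The early, hand-written style of object modeling described above can be sketched in Python as a fixed checklist of features. The feature names and rules here are hypothetical, chosen only to mirror the talk's description (round face, chubby body, two pointy ears, long tail); they show why a rigid rule set breaks as soon as the same cat curls up.

```python
# A hypothetical hand-written "cat model" in the spirit of early object
# modeling: a fixed checklist of shape features. The feature names are
# illustrative, not taken from any real system.
CAT_RULES = {
    "round_face": True,
    "chubby_body": True,
    "pointy_ears": 2,
    "long_tail": True,
}


def looks_like_cat(observed: dict) -> bool:
    """Return True only if every hand-written rule matches the observation."""
    return all(observed.get(feature) == value for feature, value in CAT_RULES.items())


# A cat seen sitting up, facing the camera: every rule fires.
sitting_cat = {"round_face": True, "chubby_body": True, "pointy_ears": 2, "long_tail": True}

# The same cat curled up: the tail is tucked away and one ear is hidden,
# so the rigid checklist rejects it even though it is still a cat.
curled_cat = {"round_face": True, "chubby_body": True, "pointy_ears": 1, "long_tail": False}

print(looks_like_cat(sitting_cat))  # True
print(looks_like_cat(curled_cat))   # False
```

Every new pose, occlusion, or breed would need yet another hand-written rule, which is the "infinite number of variations" problem in miniature.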

So about eight years ago, a very simple and profound observation changed my thinking. No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples. If you consider a child's eyes as a pair of biological cameras, they take one picture about every 200 milliseconds, roughly the average interval between eye movements. So by age three, a child would have seen hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.
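The "hundreds of millions" figure follows from simple arithmetic on the talk's 200-millisecond estimate. The sketch below assumes the eyes only "take pictures" during waking hours and brackets that with two assumed values, 12 and 24 hours per day; leap days are ignored.

```python
# Back-of-the-envelope estimate of how many "pictures" a child's eyes take
# by age three, using the talk's figure of one picture per ~200 ms.
MS_PER_PICTURE = 200
pictures_per_second = 1000 // MS_PER_PICTURE  # 5 pictures per second

SECONDS_PER_HOUR = 3600
DAYS = 3 * 365  # three years, ignoring leap days


def pictures_by_age_three(waking_hours_per_day: int) -> int:
    """Total pictures seen, assuming the eyes only 'shoot' while awake."""
    seconds_awake = DAYS * waking_hours_per_day * SECONDS_PER_HOUR
    return seconds_awake * pictures_per_second


# Assumption: somewhere between 12 and 24 hours of looking per day.
low = pictures_by_age_three(12)
high = pictures_by_age_three(24)
print(f"{low:,} to {high:,} pictures")  # 236,520,000 to 473,040,000 pictures
```

Both bounds land in the hundreds of millions, consistent with the estimate in the talk.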

Once we knew this, we realized we needed to collect a data set that has far more images than we have ever had before, perhaps thousands of times more, and together with Professor Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction of the imagery a child's mind takes in during the early developmental years.
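The talk does not say how ImageNet combined the answers of different workers. One simple approach, shown here purely as an assumption and not as the project's actual quality-control procedure, is to ask several workers about the same image and keep a label only if more than half of them agree. The image IDs and votes are made up.

```python
from collections import Counter
from typing import Optional

# Hypothetical labels from several independent crowd workers for each
# candidate image. The image IDs and answers are made up for illustration.
votes = {
    "img_001": ["cat", "cat", "cat", "dog", "cat"],
    "img_002": ["dog", "cat", "dog", "dog", "fox"],
    "img_003": ["cat", "dog", "fox", "cat", "dog"],
}


def aggregate(labels: list[str], majority: float = 0.5) -> Optional[str]:
    """Return the most common label if its share of votes exceeds `majority`.

    `majority` is an assumed threshold; images without a clear winner get None.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > majority else None


for image_id, labels in votes.items():
    print(image_id, aggregate(labels))
```

In a scheme like this, an image without a clear majority, such as `img_003`, would be sent out for more votes or dropped rather than labeled.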

In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey for quite a while. Some very friendly colleagues advised me to do something more useful for my tenure, and we were constantly struggling for research funding. Once, I even joked to my graduate students that I would just reopen my dry cleaner's shop to fund ImageNet. After all, that's how I funded my college years.

So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species of domestic and wild cats. We were thrilled to have put together ImageNet, and we wanted the whole research world to benefit from it, so in the TED fashion, we opened up the entire data set to the worldwide research community for free. (Applause)
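"Organized by everyday English words" means the categories form a hierarchy of nouns (ImageNet builds on the WordNet noun hierarchy), so a specific class such as a tabby cat also counts toward broader ones like "cat" and "animal". The Python sketch below uses a tiny hypothetical slice of such a hierarchy with made-up image counts to show how that roll-up works.

```python
# A tiny, hypothetical slice of a noun hierarchy: each category points to its
# more general parent. These entries and counts are illustrative only.
PARENT = {
    "tabby cat": "domestic cat",
    "Siamese cat": "domestic cat",
    "domestic cat": "cat",
    "lion": "big cat",
    "snow leopard": "big cat",
    "big cat": "cat",
    "cat": "animal",
}

# Made-up image counts per leaf category.
image_counts = {"tabby cat": 3, "Siamese cat": 2, "lion": 4, "snow leopard": 1}


def ancestors(category: str) -> list[str]:
    """Walk up the hierarchy from a category to the root."""
    chain = []
    while category in PARENT:
        category = PARENT[category]
        chain.append(category)
    return chain


def images_under(category: str) -> int:
    """Count images whose leaf label falls anywhere beneath `category`."""
    return sum(
        n for leaf, n in image_counts.items()
        if leaf == category or category in ancestors(leaf)
    )


print(ancestors("tabby cat"))   # ['domestic cat', 'cat', 'animal']
print(images_under("cat"))      # 10
print(images_under("big cat"))  # 5
```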

Now that we have the data to nourish our computer brain, we're ready to come back to the algorithms themselves. As it turned out, the wealth of information provided by ImageNet was a perfect match to a particular class of machine learning algorithms called convolutional neural networks, pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun back in the 1970s and '80s. Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands or even millions of nodes are organized in hierarchical layers, also similar to the brain. A typical neural network we use to train our object recognition model has 24 million nodes, 140 million parameters, and 15 billion connections. That's an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here's a computer algorithm telling us the picture contains a boy and a teddy bear; a dog, a person, and a small kite in the background; or a picture of very busy things like a man, a skateboard, railings, a lamppost, and so on. Sometimes, when the computer is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer instead of committing too much, just like we would do, but other times our computer algorithm is remarkable at telling us what exactly the objects are, like the make, model, and year of the cars.
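The neuron-like node and the "convolutional" idea can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the network from the talk: the node computes a weighted sum of its inputs plus a bias and passes it through a ReLU nonlinearity (one common choice), and a convolutional layer reuses the same small grid of weights, a kernel, at every position of the image. Here the kernel is hand-picked to detect a vertical edge; real networks learn their kernels from data and stack many layers, each taking the previous layer's output as input.

```python
# A neuron-like node: weighted sum of inputs plus a bias, passed through a
# nonlinearity (ReLU here, one common choice among several).
def relu(x: float) -> float:
    return max(0.0, x)


def node(inputs: list[float], weights: list[float], bias: float) -> float:
    return relu(sum(i * w for i, w in zip(inputs, weights)) + bias)


# A convolutional layer applies the *same* small set of weights (a kernel)
# at every position of the image, so one kind of node scans the whole picture.
def conv2d(image: list[list[float]], kernel: list[list[float]], bias: float = 0.0):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    output = []
    for r in range(out_h):
        row_out = []
        for c in range(out_w):
            # The patch of pixels under the kernel is this node's input.
            patch = [image[r + i][c + j] for i in range(kh) for j in range(kw)]
            row_out.append(node(patch, flat_kernel, bias))
        output.append(row_out)
    return output


# Toy 4x4 image with a vertical dark-to-bright edge, and a hand-picked
# 2x2 kernel that responds to exactly that kind of edge.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, edge_kernel)
for row in feature_map:
    print(row)  # [0.0, 18.0, 0.0]: the node fires only where the edge is
```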

We applied this algorithm to millions of Google Street View images across hundreds of American cities, and we have learned something really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices also correlate well with crime rates in cities, or voting patterns by zip codes.
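"Correlate" here refers to a statistical relationship. The talk does not name the statistic used; the Pearson correlation coefficient, which runs from -1 to +1, is assumed below as the standard choice and computed in plain Python. The neighborhood numbers are invented solely to exercise the function and are not data from the Street View study. A strong correlation also says nothing, on its own, about cause.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient: +1 perfect positive, -1 perfect negative."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-neighborhood numbers, purely to exercise the function;
# they are NOT data from the Street View study.
avg_car_price = [18_000, 22_000, 25_000, 31_000, 40_000]
median_income = [38_000, 45_000, 52_000, 61_000, 80_000]

print(round(pearson(avg_car_price, median_income), 3))
```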

So wait a minute. Is that it? Has the computer already matched or even surpassed human capabilities? Not so fast. So far, we have just taught the computer to see objects. This is like a small child learning to utter a few nouns. It's an incredible accomplishment, but it's only the first step. Soon, another developmental milestone will be hit, and children begin to communicate in sentences. So instead of saying this is a cat in the picture, you already heard the little girl telling us this is a cat lying on a bed.

So to teach a computer to see a picture and generate sentences, the marriage between big data and machine learning algorithms has to take another step. Now, the computer has to learn from both pictures as well as natural language sentences generated by humans. Just like the brain integrates vision and language, we developed a model that connects parts of visual things like visual snippets with words and phrases in sentences.
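The talk does not spell out the model's internals. A widely used way to connect visual snippets with words, offered here as an assumption rather than as the lab's exact method, is to map both into a shared vector space and match them by cosine similarity. The 3-dimensional vectors below are hand-picked and hypothetical; in a real model, learned networks produce much larger embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings. In a real model, both image regions
# and words are mapped by learned networks into a shared, much larger space;
# these hand-picked vectors only illustrate the matching step.
region_vectors = {
    "region_A": [0.9, 0.1, 0.0],  # imagine: the snippet containing the cat
    "region_B": [0.1, 0.8, 0.2],  # imagine: the snippet containing the bed
}
word_vectors = {
    "cat": [1.0, 0.0, 0.1],
    "bed": [0.0, 1.0, 0.1],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_region_for(word: str) -> str:
    """Align a word with the image region whose embedding points the same way."""
    w = word_vectors[word]
    return max(region_vectors, key=lambda r: cosine(region_vectors[r], w))


for word in ["cat", "bed"]:
    print(word, "->", best_region_for(word))
```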

About four months ago, we finally tied all this together and produced one of the first computer vision models that is capable of generating a human-like sentence when it sees a picture for the first time. Now, I'm ready to show you what the computer says when it sees the picture that the little girl saw at the beginning of this talk.

(Video) Computer: A man is standing next to an elephant. A large airplane sitting on top of an airport runway.

FFL: Of course, we're still working hard to improve our algorithms, and it still has a lot to learn. (Applause)

And the computer still makes mistakes.

(Video) Computer: A cat lying on a bed in a blanket.

FFL: So of course, when it sees too many cats, it thinks everything might look like a cat.

(Video) Computer: A young boy is holding a baseball bat. (Laughter)

FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.

(Video) Computer: A man riding a horse down a street next to a building. (Laughter)

FFL: We haven't taught Art 101 to the computers.

(Video) Computer: A zebra standing in a field of grass.

FFL: And it hasn't learned to appreciate the stunning beauty of nature like you and I do.

So it has been a long journey. To get from age zero to three was hard. The real challenge is to go from three to 13 and far beyond. Let me remind you with this picture of the boy and the cake again. So far, we have taught the computer to see objects or even tell us a simple story when seeing a picture.

(Video) Computer: A person sitting at a table with a cake.

FFL: But there's so much more to this picture than just a person and a cake. What the computer doesn't see is that this is a special Italian cake that's only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father after a trip to Sydney, and you and I can all tell how happy he is and exactly what's on his mind at that moment.

This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have extra pairs of tireless eyes to help them to diagnose and take care of patients. Cars will run smarter and safer on the road. Robots, not just humans, will help us to brave the disaster zones to save the trapped and wounded. We will discover new species, better materials, and explore unseen frontiers with the help of the machines.

Little by little, we're giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won't be the only ones pondering and exploring our world. We will not only use the machines for their intelligence, we will also collaborate with them in ways that we cannot even imagine.

This is my quest: to give computers visual intelligence and to create a better future for Leo and for the world.

Thank you.

(Applause)
