Fei-Fei Li: How we're teaching computers to understand pictures

Let me show you something.

(Video) Girl: Okay, that's a cat sitting in a bed. The boy is petting the elephant. Those are people that are going on an airplane. That's a big airplane.

Fei-Fei Li: This is a three-year-old child describing what she sees in a series of photos. She might still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that can play only music we like. Yet, our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most frontier and potentially revolutionary technologies in computer science.

Yes, we have prototyped cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over massive land, but don't have enough vision technology to help us to track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view, and you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.

"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting light into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding. In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.

So for 15 years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence. So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them.

The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In its simplest terms, imagine this teaching process as showing the computers some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? After all, a cat is just a collection of shapes and colors, and this is what we did in the early days of object modeling. We'd tell the computer algorithm in a mathematical language that a cat has a round face, a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It's all curled up. Now you have to add another shape and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple as a household pet can present an infinite number of variations to the object model, and that's just one object.

So about eight years ago, a very simple and profound observation changed my thinking. No one tells a child how to see, especially in the early years. They learn this through real-world experiences and examples. If you consider a child's eyes as a pair of biological cameras, they take one picture about every 200 milliseconds, the average time an eye movement is made. So by age three, a child would have seen hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing solely on better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given through experiences in both quantity and quality.
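
A rough back-of-the-envelope check of the "hundreds of millions of pictures" estimate above. The talk gives only the 200-millisecond interval and the age of three; the 12 waking hours per day used here is an assumption added for illustration.

```python
# Back-of-the-envelope check: how many "pictures" does a child see by age three?
# MS_PER_PICTURE and AGE_YEARS come from the talk.
# WAKING_HOURS_PER_DAY is an assumed value, not a figure from the talk.
MS_PER_PICTURE = 200
AGE_YEARS = 3
WAKING_HOURS_PER_DAY = 12  # assumption for illustration


def pictures_seen(age_years, waking_hours_per_day, ms_per_picture):
    """Count eye-movement 'snapshots' taken during waking hours."""
    pictures_per_second = 1000 / ms_per_picture
    waking_seconds = age_years * 365 * waking_hours_per_day * 3600
    return int(waking_seconds * pictures_per_second)


total = pictures_seen(AGE_YEARS, WAKING_HOURS_PER_DAY, MS_PER_PICTURE)
print(f"{total:,} pictures by age {AGE_YEARS}")
```

Under that assumption the count lands in the hundreds of millions, in line with the figure in the talk; a different number of waking hours shifts the total but not the order of magnitude.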

Once we knew this, we knew we needed to collect a data set that has far more images than we have ever had before, perhaps thousands of times more, and together with Professor Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction of the imagery a child's mind takes in during the early developmental years.

In hindsight, this idea of using big data to train computer algorithms may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey for quite a while. Some very friendly colleagues advised me to do something more useful for my tenure, and we were constantly struggling for research funding. Once, I even joked to my graduate students that I would just reopen my dry cleaner's shop to fund ImageNet. After all, that's how I funded my college years.

So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species of domestic and wild cats. We were thrilled to have put together ImageNet, and we wanted the whole research world to benefit from it, so in the TED fashion, we opened up the entire data set to the worldwide research community for free. (Applause)

Now that we have the data to nourish our computer brain, we're ready to come back to the algorithms themselves. As it turned out, the wealth of information provided by ImageNet was a perfect match to a particular class of machine learning algorithms called convolutional neural networks, pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun back in the 1970s and '80s. Just like the brain consists of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands or even millions of nodes are organized in hierarchical layers, also similar to the brain. A typical neural network we use to train our object recognition model has 24 million nodes, 140 million parameters, and 15 billion connections. That's an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here's a computer algorithm telling us the picture contains a boy and a teddy bear; a dog, a person, and a small kite in the background; or a picture of very busy things like a man, a skateboard, railings, a lamppost, and so on. Sometimes, when the computer is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer instead of committing too much, just like we would do, but other times our computer algorithm is remarkable at telling us what exactly the objects are, like the make, model, year of the cars.
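
The neuron-like node described above, which takes input from other nodes and sends output to others, can be sketched in a few lines. This is a minimal illustration of a plain fully connected node stacked into two layers, not the convolutional architecture from the talk; the weights, biases, input values, and the choice of ReLU as the nonlinearity are all made up for the example rather than learned from data.

```python
# Minimal sketch of neuron-like nodes organized in layers.
# All numbers below are hypothetical; a real network learns its weights.


def node(inputs, weights, bias):
    """One node: weighted sum of its inputs, then a simple nonlinearity (ReLU)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)


def layer(inputs, weight_rows, biases):
    """A layer is a list of nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]


pixels = [0.2, 0.8, 0.5]  # stand-in for a few pixel values

# Lower layer: two nodes reading the pixels.
hidden = layer(pixels, [[1.0, -1.0, 0.5], [0.5, 0.5, 0.5]], [0.0, -0.25])

# Higher layer: one node reading the outputs of the lower layer.
output = layer(hidden, [[1.0, 2.0]], [0.0])

print(hidden, output)
```

Each layer feeds the next, which is the hierarchical arrangement the talk refers to; the models described there follow the same idea with millions of nodes.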

We applied this algorithm to millions of Google Street View images across hundreds of American cities, and we have learned something really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices also correlate well with crime rates in cities, or voting patterns by zip codes.
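
To make "correlate" concrete, here is a small sketch of one common measure, the Pearson correlation coefficient. The talk does not say which measure was used, and the four price and income values below are invented purely to exercise the function; they are not data from the study.

```python
# Sketch of a Pearson correlation between two lists of numbers.
# The sample values are hypothetical and are not results from the talk.
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-neighborhood averages, in thousands of dollars.
car_prices = [10, 20, 30, 40]
household_incomes = [30, 45, 80, 85]

r = pearson(car_prices, household_incomes)
print(f"r = {r:.3f}")
```

A value near 1 means the two quantities rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.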

So wait a minute. Is that it? Has the computer already matched or even surpassed human capabilities? Not so fast. So far, we have just taught the computer to see objects. This is like a small child learning to utter a few nouns. It's an incredible accomplishment, but it's only the first step. Soon, another developmental milestone will be hit, and children begin to communicate in sentences. So instead of saying this is a cat in the picture, you already heard the little girl telling us this is a cat lying on a bed.

So to teach a computer to see a picture and generate sentences, the marriage between big data and machine learning algorithms has to take another step. Now, the computer has to learn from both pictures and natural language sentences generated by humans. Just like the brain integrates vision and language, we developed a model that connects parts of visual things like visual snippets with words and phrases in sentences.

About four months ago, we finally tied all this together and produced one of the first computer vision models that is capable of generating a human-like sentence when it sees a picture for the first time. Now, I'm ready to show you what the computer says when it sees the picture that the little girl saw at the beginning of this talk.

(Video) Computer: A man is standing next to an elephant. A large airplane sitting on top of an airport runway.

FFL: Of course, we're still working hard to improve our algorithms, and it still has a lot to learn. (Applause)

And the computer still makes mistakes.

(Video) Computer: A cat lying on a bed in a blanket.

FFL: So of course, when it sees too many cats, it thinks everything might look like a cat.

(Video) Computer: A young boy is holding a baseball bat. (Laughter)

FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.

(Video) Computer: A man riding a horse down a street next to a building. (Laughter)

FFL: We haven't taught Art 101 to the computers.

(Video) Computer: A zebra standing in a field of grass.

FFL: And it hasn't learned to appreciate the stunning beauty of nature like you and I do.

So it has been a long journey. To get from age zero to three was hard. The real challenge is to go from three to 13 and far beyond. Let me remind you with this picture of the boy and the cake again. So far, we have taught the computer to see objects or even tell us a simple story when seeing a picture.

(Video) Computer: A person sitting at a table with a cake.

FFL: But there's so much more to this picture than just a person and a cake. What the computer doesn't see is that this is a special Italian cake that's only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father after a trip to Sydney, and you and I can all tell how happy he is and what's exactly on his mind at that moment.

This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have extra pairs of tireless eyes to help them to diagnose and take care of patients. Cars will run smarter and safer on the road. Robots, not just humans, will help us to brave the disaster zones to save the trapped and wounded. We will discover new species, better materials, and explore unseen frontiers with the help of the machines.

Little by little, we're giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won't be the only ones pondering and exploring our world. We will not only use the machines for their intelligence, we will also collaborate with them in ways that we cannot even imagine.

This is my quest: to give computers visual intelligence and to create a better future for Leo and for the world.

Thank you.

(Applause)

Related talks

Jeremy Howard: The wonderful and terrifying implications of computers that can learn

Pawan Sinha: How brains learn to see

Patricia Kuhl: The linguistic genius of babies

Joseph Redmon: How computers learn to recognize objects instantly

Sebastian Thrun and Chris Anderson: What AI is -- and isn't

Linda Liukas: A delightful way to teach kids about computers
