Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

10年前コンピュータービジョンの研究者はコンピューターで犬と猫を見分けるのはほとんど無理だと考えていました人工知能の大きな発展にもかかわらずです現在では99%以上の精度で見分けられるようになっていますこれは「画像分類」と呼ばれる問題でコンピューターに画像のラベル付けをさせるものですコンピューターは何千種もの物を識別できるようになっています

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

私はワシントン大学の大学院生で Darknetというプロジェクトに取り組んでいますコンピュータービジョンのモデルをトレーニングしテストするためのニューラルネット・フレームワークです Darknetがあの犬の画像を何だと思うか見てみましょうあの画像を私たちの画像分類プログラムにかけると犬か猫かだけでなく具体的な犬種まで言い当てますそこまで細かいことが分かるようになっていますそして正しい答えを出しています [マラミュート犬 37% ハスキー犬 15% エスキモー犬 12%] 私の犬は確かにマラミュート犬です
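The breed predictions described here are a ranked list of labels with probabilities. A minimal sketch of how such a list could be produced from a classifier's raw scores is shown below. The labels and scores are made up for illustration, and the softmax-then-sort approach is a common convention, not something the talk states about Darknet itself.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and raw scores, chosen only for illustration.
labels = ["malamute", "husky", "Eskimo dog", "tabby cat"]
scores = [2.0, 1.1, 0.9, -1.0]

for label, prob in top_k(labels, scores):
    print(f"{label}: {prob:.0%}")
```

A real classifier spreads probability over thousands of categories, so the top entry can still be correct at a modest probability.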

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

画像分類は驚くほど進歩しましたがこういう複数の物が写った写真を画像分類にかけたらどうなるのでしょう？結果は— 前とほぼ同じになっています [マラミュート犬 7% エスキモー犬 6% ハスキー犬 6%] それは正しくて画像の中には確かにマラミュート犬がいますがそのラベルだけではこの画像の中でどんなことが起きているのかあまりわかりませんもっと強力なものがほしいところです私は「物体検出」と呼ばれる問題に取り組んでいてそれは画像を見てその中にある物体をすべて検出しそれぞれの物を箱で囲ってそれが何か識別するという問題ですこの画像を物体検出プログラムにかけるとどうなるか見てみましょう

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

得られる結果はこういうもので色んなことができます猫と犬がいることがわかり相対的な位置や大きさもわかりますおまけの情報もあります向こうに本があるとかコンピュータービジョンを使ったシステム自動運転車やロボットを作ろうとするならこれはまさに欲しい情報でしょう周りの世界と作用し合えるようにしてくれるものが欲しいのです私が物体検出に取り組み始めた頃は１つの画像の処理に 20秒かかっていましたこの領域でなぜスピードが重要なのか分かってもらうため物体検出で画像の処理に２秒かかるとどんな具合か見ていただきましょうこれは画像１つにつき20秒かかる画像検出プログラムより 10倍速いわけですがプログラムが答えを出したときには状況は既に変わっているためあまりアプリケーションの役には立ちません

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.

さらに10倍高速化してみましょう毎秒 5フレーム処理していますだいぶマシにはなりましたが何か大きな動きがあるとズレが出ますこのようなシステムに自分の車を運転して欲しくはありません
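The speed figures quoted so far (20 seconds, 2 seconds, and five frames per second) can be put side by side with a little arithmetic. The sketch below converts per-image latency into frames per second and estimates how far a moving object travels while one image is being processed. The 15 m/s speed is an assumed value for illustration, not a number from the talk.

```python
def frames_per_second(seconds_per_image):
    """Throughput of a detector that handles one image at a time."""
    return 1.0 / seconds_per_image

def distance_moved(speed_m_per_s, seconds_per_image):
    """How far an object travels while a single image is being processed."""
    return speed_m_per_s * seconds_per_image

# Latencies mentioned in the talk up to this point, in seconds per image.
latencies = [20.0, 2.0, 0.2]

# Assumed speed of a car in city traffic (roughly 54 km/h); illustrative only.
car_speed = 15.0

for latency in latencies:
    fps = frames_per_second(latency)
    moved = distance_moved(car_speed, latency)
    print(f"{latency:5.1f} s/image = {fps:5.2f} fps, car moves {moved:6.1f} m per prediction")
```

Under that assumption, even the five-frames-per-second detector lets the car travel several meters between predictions, which is the concern raised above.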

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

これは私たちの物体検出システムでノートPC上でリアルタイムで動いています私が動き回ってもスムーズに追尾します様々な種類の変化にも対応できます大きさとかポーズとか前向き後ろ向きとてもいいですこれこそコンピュータービジョンを使ったシステムを作ろうというときに欲しいものです

(Applause)

(拍手)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

ほんの数年で１画像あたり20秒から 20ミリ秒へと 1000倍高速化しましたどうやって実現したのか？以前の物体検出システムはこのような画像を受け取ると沢山の領域に分割しそれぞれの領域を分類プログラムにかけ高いスコアが出たところに物体が検出されたと見なしていましたこの方法だと１つの画像に対し分類プログラムを何千回も走らせニューラルネットによる評価が何千回も必要になりますそうする代わりに１つのニューラルネットですべての検出を行うようトレーニングしました境界の箱や分類の確からしさの確率をすべて同時に生成するのです我々のシステムでは物体検出を行うために画像を何千回も見る代わりにたった一度しか見ないのですそれがYOLO (You Only Look Once)の名の所以ですこれだけ速いと画像だけでなく映像もリアルタイムで処理できます猫と犬を検出するだけでなくそれぞれが動き回り相手に反応しているのが分かります
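To make the "look once" idea concrete, here is a rough sketch of how a single network output might be decoded into detections. It assumes the network emits a small grid where each cell holds one box (x, y, w, h), an objectness score, and per-class probabilities. That layout, the grid size, the threshold, and all names here are assumptions for illustration; the talk does not describe the actual output format.

```python
def decode_grid(output, class_names, threshold=0.25):
    """Turn one S x S grid of predictions into a flat list of detections.

    Assumed cell layout: [x, y, w, h, objectness, p(class 0), p(class 1), ...]
    x and y are offsets inside the cell; w and h are fractions of the image.
    """
    size = len(output)
    detections = []
    for row in range(size):
        for col in range(size):
            x, y, w, h, objectness, *class_probs = output[row][col]
            # Pick the most likely class for this cell.
            best = max(range(len(class_probs)), key=class_probs.__getitem__)
            score = objectness * class_probs[best]
            if score < threshold:
                continue
            # Convert the in-cell offset to a position relative to the whole image.
            center_x = (col + x) / size
            center_y = (row + y) / size
            detections.append((class_names[best], score, center_x, center_y, w, h))
    return detections

# Toy 2 x 2 output standing in for the result of ONE network evaluation.
empty = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5]
grid = [
    [[0.5, 0.5, 0.4, 0.6, 0.9, 0.1, 0.9], empty],
    [empty, [0.5, 0.5, 0.3, 0.3, 0.8, 0.95, 0.05]],
]
names = ["cat", "dog"]

for detection in decode_grid(grid, names):
    print(detection)
```

The point of the sketch is the cost model: every box and class score comes out of one forward pass, and decoding is a cheap loop, whereas the older approach ran a classifier separately on each of thousands of regions.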

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.

この検出プログラムは MicrosoftのCOCOデータセットにある 80種の物に対してトレーニングしてありますスプーンやフォークといった日常的な物もあればもっと変わった物もあります動物車シマウマキリンちょっと面白いことをやりましょう客席からどんなものが検出できるか試してみますぬいぐるみの動物が欲しい人？そこかしこにテディベアがあります検出器の閾値を少し下げて客席の皆さんを検出できるようにしましょう「一時停止」の標識を検出できるでしょうかバックパックがいくつかありますねもう少しズームしましょう素晴らしいですすべての処理がノートPC上でリアルタイムで実行されています
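Turning down the detection threshold, as done in the demo, simply means keeping detections with lower confidence scores. A minimal sketch, with made-up labels and scores:

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical detections from one frame; the scores are illustrative.
detections = [
    {"label": "person", "score": 0.82},
    {"label": "teddy bear", "score": 0.55},
    {"label": "backpack", "score": 0.31},
    {"label": "stop sign", "score": 0.18},
]

for threshold in (0.5, 0.25, 0.1):
    kept = filter_detections(detections, threshold)
    print(threshold, [d["label"] for d in kept])
```

A lower threshold surfaces more objects at the cost of more false positives, so the right value depends on the application.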

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

重要なのはこれが汎用物体検出システムだということでどのような領域の画像に対してもトレーニングできます自動運転車が一時停止の標識や歩行者や自転車を検知するのに使うのと同じプログラムを組織生検でガンを見つけるためにも使えるのですすでに世界中の研究者達がこの技術を使って医学やロボット工学を前進させています今朝論文で読んだんですがナイロビ国立公園では YOLOを検出システムとして使って動物の個体数調査をしているそうですそれというのもDarknetはオープンソースでパブリックドメインなため誰でも無料で使えるからです

(Applause)

(拍手)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.

私たちは物体検出技術をさらに近づきやすく使いやすいものにしたいと思いモデルの最適化やネットワーク・バイナリゼーション近似を組み合わせることでスマートフォン上で動かせるようにしました
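The talk does not say which binarization scheme was used. As one common illustration of the idea, the sketch below replaces each weight with its sign and a single shared scale (the mean absolute weight), so a dot product needs only additions, subtractions, and one multiplication. All values and function names are hypothetical.

```python
def binarize(weights):
    """Approximate weights as scale * sign(w), with scale = mean(|w|)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs

def dot_full(weights, inputs):
    """Ordinary full-precision dot product, for comparison."""
    return sum(w * x for w, x in zip(weights, inputs))

def dot_binary(scale, signs, inputs):
    """Dot product with binarized weights: add or subtract, then scale once."""
    return scale * sum(x if s > 0 else -x for s, x in zip(signs, inputs))

# Illustrative values only.
weights = [0.8, -0.6, 0.7, -0.9]
inputs = [1.0, 2.0, 3.0, 4.0]

scale, signs = binarize(weights)
print("full precision:", dot_full(weights, inputs))
print("binarized:     ", dot_binary(scale, signs, inputs))
```

The binarized result is only an approximation of the full-precision one; the trade is some accuracy for much smaller models and cheaper arithmetic, which is what makes running on a phone plausible.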

(Applause)

(拍手)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

私はすごくワクワクしていますいまやこの基本的なコンピュータービジョンの問題に対してとても強力な解があり誰でもそれを使って何か作り出すことができるんですあとは皆さんやこのソフトウェアを使える世界中の人々にかかっていますこの技術を使ってみんながどんなものを作ってくれるか楽しみです

Thank you.

ありがとうございました

(Applause)

(拍手)