Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
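
As a rough illustration of how a detector's output differs from a single classification label, the sketch below models each detection as a labeled box. The `Detection` type, its field names, and all coordinates and scores are hypothetical values chosen for illustration; they are not Darknet's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a class label plus a bounding box in pixels."""
    label: str
    confidence: float
    x: float       # left edge of the box
    y: float       # top edge of the box
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height

    def center(self) -> tuple:
        return (self.x + self.width / 2, self.y + self.height / 2)


def left_of(a: Detection, b: Detection) -> bool:
    """True if the center of a lies to the left of the center of b."""
    return a.center()[0] < b.center()[0]


# Hypothetical detections for an image containing a dog and a cat.
detections = [
    Detection("dog", 0.92, 40, 80, 200, 260),
    Detection("cat", 0.88, 300, 150, 120, 140),
]

for d in detections:
    print(f"{d.label}: confidence {d.confidence:.2f}, "
          f"center {d.center()}, area {d.area():.0f} px^2")
```

A classifier would return only a label for the whole image; a list like this also carries where each object is and how large it is, which is what lets a downstream system reason about relative position.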

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
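
The arithmetic behind these speeds can be sketched as follows. The per-image times (20 seconds, 2 seconds, and a further tenfold speedup to 0.2 seconds) come from the talk; the vehicle speed of 15 meters per second is a hypothetical value added only to show how far the world moves between two predictions.

```python
def frames_per_second(seconds_per_image: float) -> float:
    """Convert per-image processing time into throughput."""
    return 1.0 / seconds_per_image


def distance_between_predictions(seconds_per_image: float,
                                 speed_m_per_s: float) -> float:
    """Distance an object at constant speed covers while one image is processed."""
    return seconds_per_image * speed_m_per_s


ASSUMED_SPEED_M_PER_S = 15.0  # hypothetical vehicle speed, not from the talk

for latency in (20.0, 2.0, 0.2):
    fps = frames_per_second(latency)
    travelled = distance_between_predictions(latency, ASSUMED_SPEED_M_PER_S)
    print(f"{latency} s per image: {fps:.2f} frames per second, "
          f"about {travelled:.0f} m travelled per prediction")
```

Under that assumed speed, even the five-frames-per-second detector leaves a gap of a few meters between consecutive predictions.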

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
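
To make the difference in workload concrete, here is a minimal sketch that only counts network evaluations. It assumes a simple sliding window as the way regions are generated, which is one possible scheme and not necessarily what earlier detectors used. `CountingNetwork`, the 448 by 448 image size, the window size, and the stride are all hypothetical; no real network is run.

```python
class CountingNetwork:
    """Stand-in for a neural network that only counts its own evaluations."""

    def __init__(self):
        self.evaluations = 0

    def __call__(self, region):
        self.evaluations += 1
        return 0.0  # dummy score; a real network would return predictions


def sliding_window_regions(width, height, window, stride):
    """Yield (left, top, width, height) boxes covering the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, window, window)


def detect_by_regions(network, width, height, window=64, stride=8):
    """Older style: run the classifier once per region."""
    for region in sliding_window_regions(width, height, window, stride):
        network(region)
    return network.evaluations


def detect_single_pass(network, width, height):
    """Single-network style: one evaluation over the whole image."""
    network((0, 0, width, height))
    return network.evaluations


region_evaluations = detect_by_regions(CountingNetwork(), 448, 448)
single_evaluations = detect_single_pass(CountingNetwork(), 448, 448)
print(f"region-based: {region_evaluations} evaluations")
print(f"single pass:  {single_evaluations} evaluation")
```

With these assumed settings the region-based loop evaluates the network a couple of thousand times for one image, while the single-pass approach evaluates it once.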

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
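
Turning down the detection threshold can be pictured as a simple filter over confidence scores. The sketch below assumes each candidate is a `(label, confidence)` pair; the labels echo objects mentioned in the demo, but every score is made up for illustration.

```python
def filter_detections(candidates, threshold):
    """Keep only candidates whose confidence meets the threshold."""
    return [(label, score) for label, score in candidates if score >= threshold]


# Hypothetical candidates from one frame of the audience.
candidates = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("backpack", 0.38),
    ("stop sign", 0.27),
    ("person", 0.22),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print(f"threshold 0.50 keeps {len(strict)} detections: {strict}")
print(f"threshold 0.25 keeps {len(relaxed)} detections: {relaxed}")
```

A lower threshold surfaces more objects at the cost of admitting less certain predictions.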

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
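
The talk does not say how the binarization is done. As one common formulation, offered here purely as an assumption, a weight vector can be approximated by its signs multiplied by a single scale factor (the mean absolute weight), so that most of the arithmetic reduces to additions and subtractions. The weights and inputs below are hypothetical.

```python
def binarize(weights):
    """Approximate weights as scale * sign(weight), with one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical full-precision weights and inputs.
weights = [0.5, -0.25, 0.75, -0.5]
inputs = [1.0, 2.0, -1.0, 0.5]

scale, signs = binarize(weights)
exact = dot(weights, inputs)
approximate = scale * dot(signs, inputs)

print(f"scale = {scale}, signs = {signs}")
print(f"exact dot product:       {exact}")
print(f"binarized approximation: {approximate}")
```

The approximation trades some accuracy for much cheaper storage and computation, which is the kind of trade that makes running on a phone plausible.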

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
