Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.
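
To make the difference between the two kinds of output concrete, here is a minimal Python sketch. A classifier returns one label for the whole image, while a detector returns a list of labeled boxes. The `Detection` type, its field names, and every number below are illustrative assumptions, not Darknet's actual data structures or real model output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected object: a class label plus a bounding box (hypothetical layout)."""
    label: str
    confidence: float  # score in [0, 1]
    x: float           # left edge, in pixels
    y: float           # top edge, in pixels
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height


# A classifier gives a single label for the entire image (made-up value).
classification: str = "malamute"

# A detector gives one entry per object it finds (made-up values).
detections: List[Detection] = [
    Detection("dog", 0.92, x=40, y=80, width=300, height=260),
    Detection("cat", 0.88, x=380, y=150, width=180, height=170),
]

for det in detections:
    print(f"{det.label}: box at ({det.x}, {det.y}), area {det.area()} px^2")

# Box coordinates let us reason about layout, which a single label cannot.
dog, cat = detections
print("dog is left of cat:", dog.x + dog.width <= cat.x)
```

The point of the sketch is only the shape of the result: with boxes, questions about position and size become simple arithmetic.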

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world.

Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
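
The arithmetic behind these speeds is easy to check. The sketch below converts per-image latency into frames per second and, purely as an illustration, estimates how far a vehicle travels while one image is being processed. The 15 m/s speed (about 54 km/h) is an assumed figure, not one from the talk.

```python
def frames_per_second(latency_s: float) -> float:
    """Frames processed per second when each image takes latency_s seconds."""
    return 1.0 / latency_s


def distance_during_detection(latency_s: float, speed_m_per_s: float) -> float:
    """Metres travelled while a single image is being processed."""
    return latency_s * speed_m_per_s


ASSUMED_SPEED_M_PER_S = 15.0  # assumption for illustration, about 54 km/h

# The three detectors mentioned so far: 20 s, 2 s, and 0.2 s per image.
latencies_s = [20.0, 2.0, 0.2]

for latency in latencies_s:
    fps = frames_per_second(latency)
    metres = distance_during_detection(latency, ASSUMED_SPEED_M_PER_S)
    print(f"{latency:5.1f} s per image = {fps:4.2f} frames/s, "
          f"vehicle moves about {metres:.0f} m per prediction")
```

Under that assumed speed, even the five-frames-per-second detector lets the vehicle cover a few metres between predictions, which is the concern raised above.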

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detections. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detections, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
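
The cost difference between the two approaches can be sketched by counting network evaluations. In the Python sketch below, `CountingModel` is a hypothetical stand-in that only counts forward passes, and the sliding-window scheme with its image size, window sizes, and stride is an assumed example of "splitting the image into regions", not the exact method used by any particular earlier system.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in pixels


def sliding_regions(img_w: int, img_h: int, size: int, stride: int) -> List[Box]:
    """All square windows of a given size that fit inside the image."""
    return [
        (x, y, size, size)
        for y in range(0, img_h - size + 1, stride)
        for x in range(0, img_w - size + 1, stride)
    ]


class CountingModel:
    """Hypothetical stand-in for a neural network; it only counts evaluations."""

    def __init__(self) -> None:
        self.evaluations = 0

    def classify_region(self, box: Box) -> Tuple[str, float]:
        self.evaluations += 1
        return ("background", 0.0)  # placeholder result

    def detect_whole_image(self) -> List[Tuple[str, float, Box]]:
        self.evaluations += 1
        return []  # placeholder: all boxes and class scores from one pass


# Region-based approach: one classifier evaluation per candidate region.
region_model = CountingModel()
regions: List[Box] = []
for window_size in (64, 128, 256):  # assumed window sizes
    regions += sliding_regions(448, 448, window_size, stride=8)
scores = [region_model.classify_region(box) for box in regions]

# Single-pass approach: one evaluation for the whole image.
single_pass_model = CountingModel()
detections = single_pass_model.detect_whole_image()

print("region-based evaluations:", region_model.evaluations)
print("single-pass evaluations:", single_pass_model.evaluations)
```

The exact count depends entirely on the assumed window sizes and stride; what matters is that the first approach scales with the number of regions while the second stays at one evaluation per image.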

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
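
Turning down the detection threshold can be illustrated with a small Python sketch: each candidate detection carries a confidence score, and only those at or above the threshold are reported. The function name, the labels chosen, and all scores are made-up assumptions for illustration, not output from the demo.

```python
from typing import List, Tuple

ScoredLabel = Tuple[str, float]  # (class label, confidence in [0, 1])


def filter_by_threshold(candidates: List[ScoredLabel],
                        threshold: float) -> List[ScoredLabel]:
    """Keep only the candidates whose confidence meets the threshold."""
    return [c for c in candidates if c[1] >= threshold]


# Hypothetical raw candidates for one video frame.
candidates: List[ScoredLabel] = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("backpack", 0.34),
    ("stop sign", 0.27),
    ("person", 0.18),
]

strict = filter_by_threshold(candidates, threshold=0.5)
relaxed = filter_by_threshold(candidates, threshold=0.25)

print("threshold 0.50:", [label for label, _ in strict])
print("threshold 0.25:", [label for label, _ in relaxed])
```

A lower threshold surfaces more objects at the price of admitting less certain predictions, so the right value depends on the application.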

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
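
The talk does not say how the binarization is done. As a rough idea of what weight binarization can look like, the Python sketch below uses one common formulation: each weight is replaced by its sign, and a single scaling factor (the mean absolute weight) preserves the overall magnitude. This is an assumed illustration, not necessarily the scheme used in the phone demo, and all names and values are hypothetical.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def full_precision_dot(weights: List[float], inputs: List[float]) -> float:
    """Ordinary dot product using the original floating-point weights."""
    return sum(w * x for w, x in zip(weights, inputs))


def binary_dot(alpha: float, signs: List[int], inputs: List[float]) -> float:
    """Dot product using only +1/-1 weights and one scaling multiply."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))


# Made-up weights and inputs for a single neuron.
weights = [0.5, -0.25, 0.75, -0.5]
inputs = [1.0, -1.0, 2.0, -2.0]

alpha, signs = binarize(weights)
exact = full_precision_dot(weights, inputs)
approx = binary_dot(alpha, signs, inputs)

print("scale:", alpha, "signs:", signs)
print("full-precision result:", exact)
print("binarized result:", approx)
```

Storing one bit per weight instead of a 32-bit float shrinks the model considerably, and the multiplications inside the sum reduce to sign flips; the trade-off is some approximation error in the result.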

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
