Joseph Redmon: How computers learn to recognize objects instantly

Ten years ago, computer vision researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible, even with the significant advance in the state of artificial intelligence. Now we can do it at a level greater than 99 percent accuracy. This is called image classification -- give it an image, put a label to that image -- and computers know thousands of other categories as well.

I'm a graduate student at the University of Washington, and I work on a project called Darknet, which is a neural network framework for training and testing computer vision models. So let's just see what Darknet thinks of this image that we have. When we run our classifier on this image, we see we don't just get a prediction of dog or cat, we actually get specific breed predictions. That's the level of granularity we have now. And it's correct. My dog is in fact a malamute.

So we've made amazing strides in image classification, but what happens when we run our classifier on an image that looks like this? Well ... We see that the classifier comes back with a pretty similar prediction. And it's correct, there is a malamute in the image, but just given this label, we don't actually know that much about what's going on in the image. We need something more powerful. I work on a problem called object detection, where we look at an image and try to find all of the objects, put bounding boxes around them and say what those objects are. So here's what happens when we run a detector on this image.

Now, with this kind of result, we can do a lot more with our computer vision algorithms. We see that it knows that there's a cat and a dog. It knows their relative locations, their size. It may even know some extra information. There's a book sitting in the background. And if you want to build a system on top of computer vision, say a self-driving vehicle or a robotic system, this is the kind of information that you want. You want something so that you can interact with the physical world. Now, when I started working on object detection, it took 20 seconds to process a single image. And to get a feel for why speed is so important in this domain, here's an example of an object detector that takes two seconds to process an image. So this is 10 times faster than the 20-seconds-per-image detector, and you can see that by the time it makes predictions, the entire state of the world has changed, and this wouldn't be very useful for an application.

If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
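
The arithmetic behind these speeds is easy to check. The sketch below converts the per-image latencies quoted so far into frames per second, and estimates how far a vehicle travels while the detector works on a single image. The vehicle speed of 15 m/s (about 54 km/h) is an assumed value chosen for illustration, not a figure from the talk.

```python
# Per-image latencies quoted in the talk, in seconds.
LATENCIES_S = [20.0, 2.0, 0.2]

# Hypothetical vehicle speed, chosen only for illustration (about 54 km/h).
ASSUMED_SPEED_M_PER_S = 15.0


def frames_per_second(latency_s: float) -> float:
    """Throughput of a detector that processes one image at a time."""
    return 1.0 / latency_s


def distance_between_predictions(latency_s: float, speed_m_per_s: float) -> float:
    """Distance covered while the detector is busy with a single image."""
    return latency_s * speed_m_per_s


for latency in LATENCIES_S:
    fps = frames_per_second(latency)
    gap_m = distance_between_predictions(latency, ASSUMED_SPEED_M_PER_S)
    print(f"{latency:5.1f} s/image -> {fps:5.2f} fps, {gap_m:6.1f} m travelled per prediction")
```

Under that assumed speed, even the 5-frames-per-second detector lets the vehicle move about 3 m between predictions, which is why it still feels too slow for driving.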

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
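
To make the "look once" idea concrete, here is a minimal sketch of how one network output could be decoded into boxes and class scores. It assumes a layout in the style of the original YOLO paper: the image is divided into an S x S grid, and each cell predicts B boxes (x, y, w, h, confidence) followed by C class probabilities. The names `decode_cell` and `decode_output`, the tiny grid, and the example numbers are illustrative assumptions, not Darknet's actual code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Box:
    x: float  # box center, as a fraction of image width
    y: float  # box center, as a fraction of image height
    w: float  # box width, as a fraction of image width
    h: float  # box height, as a fraction of image height
    class_id: int
    score: float  # box confidence * class probability


def decode_cell(cell: Sequence[float], row: int, col: int,
                grid: int, boxes_per_cell: int, num_classes: int) -> List[Box]:
    """Decode one grid cell (assumed layout: B * 5 box values, then C class probabilities)."""
    start = boxes_per_cell * 5
    class_probs = cell[start:start + num_classes]
    best_class = max(range(num_classes), key=lambda c: class_probs[c])
    decoded = []
    for b in range(boxes_per_cell):
        x_off, y_off, w, h, conf = cell[b * 5:b * 5 + 5]
        decoded.append(Box(
            x=(col + x_off) / grid,  # center offsets are relative to the cell
            y=(row + y_off) / grid,
            w=w,
            h=h,
            class_id=best_class,
            score=conf * class_probs[best_class],
        ))
    return decoded


def decode_output(output: Sequence[Sequence[Sequence[float]]],
                  boxes_per_cell: int, num_classes: int) -> List[Box]:
    """Turn the whole S x S grid into boxes; no per-region classifier runs are needed."""
    grid = len(output)
    boxes = []
    for row in range(grid):
        for col in range(grid):
            boxes.extend(decode_cell(output[row][col], row, col,
                                     grid, boxes_per_cell, num_classes))
    return boxes


# Toy example: a 2 x 2 grid, 1 box per cell, 2 classes (say 0 = cat, 1 = dog).
empty = [0.5, 0.5, 0.1, 0.1, 0.0, 0.5, 0.5]
dog_cell = [0.5, 0.5, 0.4, 0.6, 0.9, 0.1, 0.9]
toy_output = [[empty, empty], [empty, dog_cell]]

boxes = decode_output(toy_output, boxes_per_cell=1, num_classes=2)
best = max(boxes, key=lambda box: box.score)
print(best)
```

The point of the sketch is that every candidate box and its class score come out of a single array, so one forward pass replaces the thousands of per-region classifier evaluations described above.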

This is a detector that we trained on 80 different classes in Microsoft's COCO dataset. It has all sorts of things like spoon and fork, bowl, common objects like that. It has a variety of more exotic things: animals, cars, zebras, giraffes. And now we're going to do something fun. We're just going to go out into the audience and see what kind of things we can detect. Does anyone want a stuffed animal? There are some teddy bears out there. And we can turn down our threshold for detection a little bit, so we can find more of you guys out in the audience. Let's see if we can get these stop signs. We find some backpacks. Let's just zoom in a little bit. And this is great. And all of the processing is happening in real time on the laptop.
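
Turning down the detection threshold, as in the demo, simply lets lower-confidence boxes through, so more objects appear on screen at the cost of more doubtful ones. A minimal sketch, with made-up labels and scores that do not come from the talk:

```python
from typing import List, Tuple

# (label, confidence) pairs; hypothetical values for illustration only.
Detection = Tuple[str, float]

detections: List[Detection] = [
    ("person", 0.92),
    ("teddy bear", 0.61),
    ("backpack", 0.34),
    ("stop sign", 0.27),
    ("person", 0.18),
]


def filter_detections(candidates: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in candidates if d[1] >= threshold]


strict = filter_detections(detections, threshold=0.5)
relaxed = filter_detections(detections, threshold=0.25)

print(len(strict), "detections at threshold 0.5")
print(len(relaxed), "detections at threshold 0.25")
```

With these sample scores, the stricter threshold keeps two detections and the relaxed one keeps four, including the backpack and the stop sign.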

And it's important to remember that this is a general purpose object detection system, so we can train this for any image domain. The same code that we use to find stop signs or pedestrians, bicycles in a self-driving vehicle, can be used to find cancer cells in a tissue biopsy. And there are researchers around the globe already using this technology for advances in things like medicine, robotics. This morning, I read a paper where they were taking a census of animals in Nairobi National Park with YOLO as part of this detection system. And that's because Darknet is open source and in the public domain, free for anyone to use.

(Applause)

But we wanted to make detection even more accessible and usable, so through a combination of model optimization, network binarization and approximation, we actually have object detection running on a phone.
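
The talk does not say which binarization scheme is used. As one common illustration, the sketch below approximates a list of real-valued weights by a single scale factor times their signs, so each weight needs only one bit plus a shared scale. The function names and sample values are assumptions made for this example.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate `weights` as scale * signs, with signs in {-1, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dot_binarized(scale: float, signs: List[int], inputs: List[float]) -> float:
    """Dot product using only additions and subtractions, then one multiply by the scale."""
    total = 0.0
    for s, x in zip(signs, inputs):
        total += x if s == 1 else -x
    return scale * total


weights = [0.8, -0.6, 0.7, -0.9]  # hypothetical full-precision weights
inputs = [1.0, 2.0, 3.0, 4.0]     # hypothetical layer inputs

scale, signs = binarize(weights)
exact = sum(w * x for w, x in zip(weights, inputs))
approx = dot_binarized(scale, signs, inputs)

print("scale:", scale, "signs:", signs)
print("exact:", exact, "approx:", approx)
```

The binarized result is only an approximation of the exact dot product; the trade is a small loss in accuracy for much cheaper storage and arithmetic, which is what makes a phone a plausible target.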

(Applause)

And I'm really excited because now we have a pretty powerful solution to this low-level computer vision problem, and anyone can take it and build something with it. So now the rest is up to all of you and people around the world with access to this software, and I can't wait to see what people will build with this technology.

Thank you.

(Applause)
