Kenneth Cukier: Big data is better data

America's favorite pie is?

Koja je omiljena američka pita?

Audience: Apple. Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.

Publika: Od jabuke. Kenet Kukir: Od jabuke. Naravno. Kako to znamo? Zbog podataka. Posmatramo rasprodaju u supermarketima, prodaju zamrznutih pita prečnika 30 cm, i jabuka pobeđuje. Bez konkurencije. Najveći deo prodaje je od jabuka. Zatim su supermarketi počeli da prodaju manje pite, pite prečnika 11 cm. Odjednom, jabuka pada na četvrto ili peto mesto. Zašto? Šta se dogodilo? Dobro. Razmislite o tome. Kada kupite pitu od 30cm, cela porodica mora da se složi, a pita od jabuka je svima drugi omiljeni izbor. (Smeh) Ali kad kupite zasebnu pitu od 11cm, možete da kupite onu koju vi hoćete. Možete da uzmete vaš prvi izbor. Imate više podataka. Možete da vidite nešto što niste mogli da vidite kada ste ih imali u manjim količinama.

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.

Dakle, poenta je da više podataka ne samo što nam omogućava da vidimo više, više o tome što posmatramo. Više podataka nam omogućava da vidimo novo. Omogućava nam da vidimo bolje. Omogućava nam da vidimo različito. U ovom slučaju, omogućava nam da vidimo koja je omiljena američka pita: nije od jabuka.

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges — to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming — is because of the effective use of data.

Svi ste verovatno čuli izraz "veliki podaci". Verovatno vam je i loše na pomenu izraza "veliki podaci". Tačno je da se podigla velika buka oko ovog izraza, što je loše. Zato što su veliki podaci veoma važan alat pomoću kog će društvo da napreduje. U prošlosti smo posmatrali "male podatke" i razmišljali o tome šta bi značilo da pokušamo da razumemo svet, a sada ih imamo mnogo više, više nego što smo ikada imali. Shvatili smo da kada imamo mnogo podataka, u principu možemo uraditi stvari koje nismo mogli sa manje podataka. Veliki podaci su bitni, i to je nešto novo, kada razmislimo o tome, jedini način na koji će se planeta suočiti sa svojim globalnim izazovima - nahraniti ljude, obezbediti im medicinsku negu, pružiti im energiju, struju, da se pobrine da ne izgore zbog globalnog zagrevanja - jeste zbog efikasne upotrebe podataka.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there's inscriptions on this disc, but we actually don't know what it means. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Šta je novo u vezi sa velikim podacima? U čemu je velika caka? Da bismo odgovorili na to pitanje, razmislimo kako su informacije izgledale, fizički izgledale u prošlosti. 1908. godine na Kritu, arheolozi su pronašli glineni disk. Smestili su ga oko 2000. g. pre Hrista, dakle star je 4000 godina. Na tom disku postoji zapis, ali ne znamo šta on znači. Potpuna je zagonetka, ali poenta je u tome da su tako informacije izgledale pre 4000 godina. Tako je društvo čuvalo i prenosilo informacije.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered off of Crete that's 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Društvo nije baš toliko napredovalo. I dalje čuvamo informacije na diskovima, ali danas možemo da čuvamo mnogo više, više nego ikada. Pretraživanje je lakše. Kopiranje je lakše. Deljenje je lakše. Obrada je lakša. Možemo da koristimo te informacije iznova, na načine na koje nismo ni zamišljali kada smo počeli da sakupljamo podatke. U tom smislu, podaci su prešli iz skladištenja u protok. Od nečega što je stacionarno i statično do nečega što je fluidno i dinamično. Ako ćemo tako, informacija je kao tečnost. Disk, koji je otkriven u blizini Krita, pre 4000 godina, je težak. Ne sadrži puno informacija, i te informacije su nepromenljive. Nasuprot tome, svi fajlovi koje je Edvard Snouden uzeo od Državne bezbednosne agencije u SAD-u staju na memorijski uređaj veličine nokta, i mogu se razmenjivati brzinom svetlosti. Još podataka. Više.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting it into data. Think, for example, the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.

Jedan razlog zašto danas imamo toliko podataka je što sakupljamo stvari o kojima smo uvek skupljali informacije, ali drugi razlog je zato što uzimamo stvari koje su uvek bile informativne ali nikad nisu prebačene u oblik podataka i stavljamo ih u podatke. Zamislite, npr. pitanje lokacije. Uzmimo Martina Lutera za primer. Da smo 1500. god. želeli da znamo gde je Martin Luter, morali bismo da ga pratimo u svakom trenutku, možda sa perom i mastilom, i da to beležimo, ali razmislite kako to izgleda danas. Znate da negde, verovatno u bazi podataka telefonskog operatera, postoji tabela ili bar podatak u bazi koji beleži informaciju o tome gde ste bili u svakom momentu. Ako imate mobilni telefon, koji ima GPS, čak i ako nema GPS, on čuva informacije. U ovom smislu, lokacija je postala "podatkovana".

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

Razmislimo, npr. o pitanju držanja, načinu na koji upravo sedite, načinu na koji vi sedite, načinu na koji vi sedite, i vi. Svi se razlikuju, i zavise od dužine nogu i leđa i od konture leđa, i, ako bih postavio senzore, možda 100 senzora u sve vaše stolice, našao bih indeks koji je jedinstven za svakoga, kao otisak prsta, ali nije od prsta.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to stream off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.

Međutim, šta bismo mogli sa tim? Istraživači u Tokiju ga koriste kao potencijalni alarmni uređaj u kolima. Ideja je da ako za volan sedne lopov, pokuša da pobegne, ali automobil prepozna da za volanom nije odobreni vozač, možda zaustavi motor, osim ako vozač ne unese šifru u kontrolnu tablu da kaže: "Hej, imam dozvolu da vozim". Odlično!
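The seat "fingerprint" idea above can be sketched in a few lines. This is a minimal illustration, not the Tokyo researchers' method: the four-sensor readings, the `enroll` and `is_approved` helpers, and the `THRESHOLD` value are all invented for the example (the talk imagines around 100 sensors per seat and gives no numbers).

```python
import math


def enroll(readings):
    """Average several seat-sensor readings into one reference profile."""
    count = len(readings)
    return [sum(values) / count for values in zip(*readings)]


def is_approved(profile, reading, threshold):
    """Accept a reading if it lies close enough to the enrolled profile."""
    return math.dist(profile, reading) <= threshold


# Made-up pressure values from four hypothetical seat sensors.
owner_readings = [
    [0.82, 0.40, 0.13, 0.65],
    [0.80, 0.42, 0.15, 0.63],
    [0.84, 0.38, 0.14, 0.66],
]
profile = enroll(owner_readings)
THRESHOLD = 0.1  # assumed tolerance; a real system would have to tune this

owner_today = [0.81, 0.41, 0.14, 0.64]
stranger = [0.55, 0.70, 0.30, 0.35]

print(is_approved(profile, owner_today, THRESHOLD))
print(is_approved(profile, stranger, THRESHOLD))
```

A plain distance threshold is only the simplest possible matcher. It is used here because it makes the "index that's fairly unique to you" concrete, not because the talk describes it.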

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

Šta ako bi svaki automobil u Evropi imao ovu tehnologiju? Šta bismo mogli tada? Možda, kada bismo nagomilali podatke, mogli bismo da uočimo znakove upozorenja koji najbolje predviđaju da će se dogoditi automobilska nesreća u narednih pet sekundi. Tada bismo u obliku podataka beležili zamor vozača, i svrha bi bila da kada kola osete da je vozač upao u određeni položaj, automatski kaže: "Hej, pusti interni alarm." kojim bi zavibrirao volan, zatrubio iznutra i rekao "Hej, budi se! obrati više pažnje na put." To su neke stvari koje možemo da uradimo kada prebacimo u podatke više aspekata naših života.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.

Koja je vrednost velikih podataka? Pa, razmislite o tome. Imate više informacija. Možete da uradite ono što niste mogli ranije. Jedna od najimpresivnijih oblasti u kojoj ovaj koncept igra ulogu jeste mašinsko učenje. Mašinsko učenje je grana veštačke inteligencije, koja je grana računarskih nauka. Glavna ideja je da umesto da kažemo računaru šta da radi, jednostavno ubacimo podatke u problem i kažemo računaru da ga reši sam. Pomoći će vam da ga razumete gledajući u njegove korene. U 1950-im, informatičar u IBM-u, Artur Semjuel, voleo je da igra "Damu", te je napisao kompjuterski program kako bi igrao protiv računara. Igrao je. Pobedio je. Igrao je. Pobedio je. Igrao, pobedio. Jer je računar znao dozvoljene poteze. Artur Semjuel je znao nešto drugo. Artur Semjuel je poznavao strategiju. Napisao je mali potprogram, pored ovog, koji je radio u pozadini, i samo računao verovatnoću da data situacija na tabli pre vodi ka pobedničkoj tabli nego ka gubitničkoj, nakon svakog poteza. Igra protiv računara. Pobeđuje. Igra protiv računara. Pobeđuje. Igra protiv računara. Pobeđuje. Zatim je Artur Semjuel pustio računar da igra protiv sebe. Igrao je. Sakupljao je više podataka. Sakupljajući više podataka, povećavao je tačnost svog predviđanja. Zatim se Artur Semjuel vratio do računara. Igra, i gubi. Igra, i gubi. Igra, i gubi. I tako je Artur Semjuel stvorio mašinu koja prevazilazi njegove mogućnosti u igri kojoj ju je naučio.
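The self-play loop described above can be sketched on a much smaller game. This is not Samuel's checkers program: as a stand-in it uses a stone-taking game (remove 1 to 3 stones per turn, taking the last stone wins), and the names `estimate`, `best_move`, `self_play` and the parameters `episodes`, `explore` and `seed` are assumptions made for the illustration. The program starts out knowing only the legal moves, plays against itself, and keeps a tally of how often each position left after a move led to a win.

```python
import random

MAX_TAKE = 3  # a move removes 1, 2 or 3 stones; taking the last stone wins


def estimate(stats, pile):
    """Smoothed estimate that the player who just left `pile` stones goes on to win."""
    wins, visits = stats.get(pile, (0, 0))
    return (wins + 1) / (visits + 2)  # 0.5 for positions never seen


def best_move(stats, pile):
    """Pick the move whose resulting position has the highest estimated win rate."""
    moves = range(1, min(MAX_TAKE, pile) + 1)
    return max(moves, key=lambda take: estimate(stats, pile - take))


def self_play(episodes=20000, explore=0.2, seed=0):
    """Let the program play itself and tally how often each position led to a win."""
    rng = random.Random(seed)
    stats = {}  # pile left after a move -> (wins, visits) for the player who moved
    for _ in range(episodes):
        pile, player, history = rng.randint(1, 12), 0, []
        while pile > 0:
            if rng.random() < explore:
                take = rng.randint(1, min(MAX_TAKE, pile))  # occasional random move
            else:
                take = best_move(stats, pile)
            pile -= take
            history.append((player, pile))
            player = 1 - player
        winner = history[-1][0]  # whoever took the last stone
        for mover, left in history:
            wins, visits = stats.get(left, (0, 0))
            stats[left] = (wins + (mover == winner), visits + 1)
    return stats


stats = self_play()
print({pile: best_move(stats, pile) for pile in (5, 6, 7)})
```

Nothing in the code states the winning strategy for this game, which is to leave the opponent a multiple of four stones. After enough games against itself the tallies should point there anyway, which is the sense in which more data "increases the accuracy of its prediction".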

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Ova ideja mašinskog učenja se širi na sve strane. Šta mislite, odakle nam samoupravljajuća vozila? Da li napredujemo kao društvo ubacivanjem svih pravila vožnje u softver? Ne. Memorija je jeftinija. Ne. Algoritmi su brži. Ne. Procesori su brži. Ne. Sve to je bitno, ali ne zbog toga. Nego zato što smo promenili koren problema. Promenili smo prirodu problema od one u kojoj smo direktno objasnili računaru kako da vozi, do one u kojoj kažemo: "Evo ti mnogo podataka u vezi sa vozilom. Shvati sam. Shvati da je ovo svetlo na semaforu. Da je crveno, a ne zeleno. Da to znači da moraš da staneš, a ne da nastaviš."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to identify by looking at the data and survival rates to determine whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that this biopsy of the breast cancer cells are indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Mašinsko učenje je u osnovi mnogih stvari na mreži. Pretraživači, Amazonov personalizovani algoritam, računarsko prevođenje, sistemi za prepoznavanje glasa. Istraživači su skoro posmatrali problem biopsije. Biopsije raka. Pitali su računar da ustanovi posmatrajući podatke i stopu preživljavanja, da odluči da li su ćelije zapravo kancerogene ili ne. Zasigurno, kada ubacite podatke, pomoću algoritma mašinskog učenja, mašina je postala sposobna da prepozna 12 znakova koji najbolje predviđaju da je biopsija raka ćelija dojke zaista zahvaćena rakom. Problem? Medicinska literatura je poznavala samo devet od njih. Tri od tih simptoma su bili oni koje ljudi nisu trebali da traže, ali ih je mašina uočila.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop on location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

Ali, postoji loša strana velikih podataka. Unaprediće naše živote, ali postoje problemi kojih moramo biti svesni. Prvi od njih je ideja da možemo biti kažnjeni za predviđanja, da policija može koristiti velike podatke u svoje svrhe, nešto poput filma "Suvišni izveštaj". Ovaj izraz zovemo sposobnost predviđanja ili algoritamska kriminologija, i ideja je da ako uzmemo mnogo podataka npr. mesta prošlih zločina, znamo gde da pošaljemo patrole. To ima smisla, ali problem je, naravno, u tome što se neće završiti samo na podacima o lokaciji. Ići će do ličnog nivoa. Zašto ne koristimo podatke o nečijim ocenama iz srednje škole? Možda da iskoristimo činjenice o zaposlenosti, o kreditnom stanju, o ponašanju na internetu, da li su budni noću. Ako njihov Fitbit može da prepozna njihove biohemijske parametre, pokazaće kada imaju agresivne misli. Možemo imati algoritme koji bi mogli predviđati šta ćemo uraditi, i mogu nas smatrati odgovornim pre nego što delamo. Privatnost je bila centralni izazov u eri malih podataka. U danima velikih podataka, izazov će biti zaštita slobodne volje, moralnih izbora, ljudske volje, ljudske odlučnosti.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Postoji još jedan problem. Veliki podaci će nam ukrasti poslove. Veliki podaci i algoritmi će izazvati kancelarijske, visoko obrazovane radnike dvadeset prvog veka slično kao što su automatizacija i pokretna traka izazvale radničku klasu u 20. veku. Setimo se laboratorijskog tehničara, koji pod mikroskopom posmatra biopsiju raka da bi zaključio da li je zahvaćena rakom. Ova osoba je završila fakultet. Ona kupuje imovinu. On ili ona glasa. On ili ona je član društva. Posao ove osobe, i celog niza stručnjaka kao što je ova osoba, shvatiće da se njihov posao znatno menja ili da će potpuno nestati. Volimo da mislimo da će vremenom tehnologija praviti poslove iza kratkog, privremenog doba dislokacije, što je i tačno za taj referentni okvir u kom svi živimo, industrijsku revoluciju, jer tako se tačno i dogodilo. Međutim, u toj analizi zaboravljamo da postoje kategorije poslova koje će jednostavno nestati i neće se vratiti. Industrijska revolucija nije bila dobra ako ste bili konj. Dakle, moramo biti pažljivi, i moramo velike podatke prilagoditi našim potrebama, našim ljudskim potrebama. Moramo biti gospodari tehnologije, a ne njene sluge. Na samom smo početku doba velikih podataka, i iskreno, za sada ne rukujemo dobro podacima koje sada možemo da prikupimo. To nije problem samo Državne bezbednosne agencije. Firme sakupljaju dosta podataka, i takođe ih ne koriste dobro, moramo ovladati time, a za to je potrebno vreme. Podseća na situaciju kada se primitivni čovek suočio sa vatrom. To je alat, ali alat koji će nas opeći ako ne budemo pažljivi.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

Veliki podaci će promeniti naš način života, način rada i razmišljanja. Pomoći će nam da organizujemo svoje karijere i da živimo zadovoljno i sa nadom, u sreći i zdravlju. Ranije smo često od informacionih tehnologija gledali samo u T, u tehnologiju, u hardver, zato što je to ono što je opipljivo. Sada moramo da bacimo oko na I, na informacije, na ono manje uočljivo, ali na određeni način mnogo bitnije. Čovečanstvo konačno uči iz informacija koje može da prikupi, kao deo našeg vanvremenskog zadatka da shvatimo svet i naše mesto u njemu i zato veliki podaci jesu velika stvar.

(Applause)

(Aplauz)

Related talks

David McCandless: The beauty of data visualization

Talithia Williams: Own your body's data

Tim Berners-Lee: The next web

Shyam Sankar: The rise of human-computer cooperation

Giorgia Lupi: How we can find ourselves in data

Anders Ynnerman: Visualizing the medical data explosion
