Kenneth Cukier: Big data is better data

America's favorite pie is?

Audience: Apple.

Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.
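The pie argument can be made concrete with a small sketch. Everything in it is an assumption for illustration: the households, the rankings, and the rule that a family buys the pie with the best combined rank are all hypothetical, not supermarket data from the talk.

```python
from collections import Counter

# Hypothetical data: three households, each member ranking four pies, best first.
households = [
    [
        ["cherry", "apple", "pecan", "pumpkin"],
        ["pecan", "apple", "cherry", "pumpkin"],
        ["pumpkin", "apple", "pecan", "cherry"],
    ],
    [
        ["cherry", "apple", "pumpkin", "pecan"],
        ["pumpkin", "apple", "cherry", "pecan"],
        ["pecan", "apple", "cherry", "pumpkin"],
    ],
    [
        ["apple", "cherry", "pecan", "pumpkin"],
        ["cherry", "apple", "pumpkin", "pecan"],
        ["cherry", "pecan", "apple", "pumpkin"],
    ],
]


def family_choice(rankings):
    """Assumed rule: the family buys the pie with the lowest total rank."""
    pies = rankings[0]
    return min(pies, key=lambda pie: sum(r.index(pie) for r in rankings))


# One large pie per household: the compromise pick.
large_pie_sales = Counter(family_choice(h) for h in households)

# One small pie per person: everyone buys their first choice.
small_pie_sales = Counter(r[0] for h in households for r in h)

print(large_pie_sales)  # apple takes 2 of the 3 large pies
print(small_pie_sales)  # apple is the first choice of only 1 of 9 people
```

With these made-up rankings, apple wins the large-pie count because it is nearly everyone's second choice, yet it comes last once each person buys individually.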

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges — to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming — is because of the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there are inscriptions on this disc, but we actually don't know what they mean. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered on Crete, which is 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting them into data. Think, for example, of the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to speed off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type a password into the dashboard to say, "Hey, I have authorization to drive." Great.
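One way to picture the posture "fingerprint" is as a vector of seat-sensor readings compared against enrolled drivers. The sketch below is an assumption, not the Tokyo researchers' method: it uses four sensors instead of a hundred, made-up pressure values, a plain Euclidean distance, and a hypothetical `threshold` for deciding that nobody matches.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length sensor readings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(reading, profiles, threshold):
    """Return the closest enrolled driver, or None if no profile is close enough."""
    best_name, best_dist = None, float("inf")
    for name, profile in profiles.items():
        d = distance(reading, profile)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None


# Hypothetical enrolled posture profiles (normalized pressure per sensor).
profiles = {
    "owner": [0.8, 0.6, 0.3, 0.9],
    "spouse": [0.5, 0.7, 0.6, 0.4],
}

print(identify([0.79, 0.62, 0.31, 0.88], profiles, threshold=0.1))  # close to "owner"
print(identify([0.2, 0.2, 0.9, 0.1], profiles, threshold=0.1))      # matches nobody
```

A car using such a check would ask for the password only when `identify` returns `None`.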

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
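The self-play idea can be sketched on a much smaller game than checkers. This is not Samuel's program: as a stand-in, it uses a toy game (players alternately take one or two stones from a pile, and whoever takes the last stone wins), plays random games against itself, records how often the player to move ended up winning from each pile size, and then picks the move that leaves the opponent in the position with the lowest recorded win rate. All names and the game itself are assumptions chosen for brevity.

```python
import random


def legal_moves(pile):
    """A player may take 1 or 2 stones, but never more than remain."""
    return [m for m in (1, 2) if m <= pile]


def self_play(games, start_pile, rng):
    """Play random games; count wins and visits per pile size for the player to move."""
    wins = [0] * (start_pile + 1)
    visits = [0] * (start_pile + 1)
    for _ in range(games):
        pile, player = start_pile, 0
        seen = [[], []]  # pile sizes each player faced on their turn
        while pile > 0:
            seen[player].append(pile)
            pile -= rng.choice(legal_moves(pile))
            winner = player  # the last player to move took the last stone
            player = 1 - player
        for p in (0, 1):
            for size in seen[p]:
                visits[size] += 1
                wins[size] += p == winner
    return wins, visits


def win_rate(pile, wins, visits):
    """Estimated chance that the player to move wins from this pile size."""
    if pile == 0:
        return 0.0  # no stones left: the player to move has already lost
    return wins[pile] / visits[pile] if visits[pile] else 0.5


def best_move(pile, wins, visits):
    """Leave the opponent in the position where they have won least often."""
    return min(legal_moves(pile), key=lambda m: win_rate(pile - m, wins, visits))


wins, visits = self_play(games=20000, start_pile=10, rng=random.Random(0))
for pile in (4, 5, 7):
    print(pile, "->", "take", best_move(pile, wins, visits))
```

The program is only told the legal moves. The preference for leaving the opponent a pile that is a multiple of three, which is the known winning strategy for this game, comes from the data it collected by playing itself.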

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to determine, by looking at the data and survival rates, whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that a biopsy of breast cells is indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop on location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

(Applause)