Kenneth Cukier: Big data is better data

America's favorite pie is?

Audience: Apple.

Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.
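The pie effect can be sketched in a few lines of code. This is only an illustration: the family members, their flavor rankings, and the rule that a shared pie goes to the flavor with the lowest total rank are all made-up assumptions, not supermarket data.

```python
from collections import Counter

# Hypothetical rankings (best first). Apple is everyone's second choice.
FAMILY = {
    "parent": ["cherry", "apple", "pecan", "pumpkin"],
    "teen": ["pecan", "apple", "pumpkin", "cherry"],
    "child": ["pumpkin", "apple", "cherry", "pecan"],
}

def shared_pie(family):
    """One big pie: pick the flavor with the lowest total rank (an assumed compromise rule)."""
    flavors = next(iter(family.values()))
    return min(flavors, key=lambda f: sum(r.index(f) for r in family.values()))

def individual_pies(family):
    """Small pies: everyone simply buys their own first choice."""
    return Counter(ranking[0] for ranking in family.values())

print(shared_pie(FAMILY))                # apple
print(individual_pies(FAMILY)["apple"])  # 0
```

With one shared purchase, apple wins. With one purchase per person, apple is never bought, and the individual records show preferences that the family-level record hid.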

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges (to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming) is through the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there are inscriptions on this disc, but we actually don't know what they mean. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered on Crete, that's 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting them into data. Think, for example, of the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to stream off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.
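As a rough sketch of how a seat could recognize its driver: treat the sensor readings as a list of numbers and compare them with profiles stored for approved drivers. Everything here is an assumption for illustration (the four readings instead of 100, the distance measure, the threshold, the function names); the talk does not describe how the Tokyo researchers actually do it.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length sensor readings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_approved(reading, enrolled_profiles, threshold=0.1):
    """True if the reading is close enough to any enrolled driver's profile."""
    return any(distance(reading, p) <= threshold for p in enrolled_profiles)

# Hypothetical pressure values from four seat sensors.
owner_profile = [0.80, 0.20, 0.50, 0.90]
owner_today = [0.82, 0.19, 0.52, 0.88]
stranger = [0.30, 0.70, 0.90, 0.10]

print(is_approved(owner_today, [owner_profile]))  # True
print(is_approved(stranger, [owner_profile]))     # False
```

A real system would need many more sensors and a threshold tuned on real drivers, which is exactly where more data helps.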

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
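The idea of scoring positions from self-play can be sketched on a much smaller game than checkers. The example below is not Samuel's program. It assumes a toy game (one pile of stones, each player takes one to three, whoever takes the last stone wins), lets the computer play itself with random legal moves, and records how often each position a player leaves behind ends in a win for that player. All names and parameters are hypothetical.

```python
import random
from collections import defaultdict

def self_play(num_games=20000, start=10, seed=0):
    """Play random games against itself; count wins per position left behind."""
    rng = random.Random(seed)
    wins, visits = defaultdict(int), defaultdict(int)
    for _ in range(num_games):
        stones, player = start, 0
        left_by = {0: [], 1: []}  # positions each player handed to the other
        while stones > 0:
            stones -= rng.randint(1, min(3, stones))  # any legal move
            left_by[player].append(stones)
            winner = player  # the last player to move took the last stone
            player = 1 - player
        for p in (0, 1):
            for position in left_by[p]:
                visits[position] += 1
                wins[position] += (p == winner)
    return wins, visits

def best_move(stones, wins, visits):
    """Pick the move whose resulting position has the best recorded win rate."""
    def score(take):
        left = stones - take
        return wins[left] / visits[left] if visits[left] else 0.5
    return max(range(1, min(3, stones) + 1), key=score)

wins, visits = self_play()
print(best_move(5, wins, visits))  # 1: leave four stones
```

The program is only told what a legal move is. The preference for leaving four stones comes entirely from the recorded outcomes, and the estimates sharpen as more games are played.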

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to determine, by looking at the data and survival rates, whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that a biopsy of breast cancer cells is indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop at location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

(Applause)
