Kenneth Cukier: Big data is better data

America's favorite pie is?

Audience: Apple.

Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.
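The pie effect can be reproduced with a toy calculation. The sketch below is only an illustration: the household rankings are made up (they are not supermarket data), and it assumes a family settles on the flavor with the best combined rank while an individual simply buys a first choice. The names `households` and `shared_pick` are hypothetical.

```python
from collections import Counter

# Made-up preference rankings (best first), one list per person,
# grouped by household. Illustrative values only, not real sales data.
households = [
    [["cherry", "apple", "pecan", "pumpkin"],
     ["pecan", "apple", "pumpkin", "cherry"],
     ["pumpkin", "apple", "cherry", "pecan"]],
    [["cherry", "apple", "pumpkin", "pecan"],
     ["pecan", "apple", "pumpkin", "cherry"]],
    [["apple", "cherry", "pecan", "pumpkin"],
     ["pumpkin", "apple", "cherry", "pecan"]],
]

def shared_pick(family):
    """Flavor with the lowest total rank position across the whole family."""
    flavors = family[0]
    return min(flavors, key=lambda f: sum(person.index(f) for person in family))

# One big pie per household: everyone has to agree.
family_sales = Counter(shared_pick(family) for family in households)

# One small pie per person: everyone buys a first choice.
individual_sales = Counter(person[0] for family in households for person in family)

print("large pies:", dict(family_sales))
print("small pies:", dict(individual_sales))
```

With these invented rankings, apple takes every shared pie yet comes last once each person buys individually, which is the pattern described in the sales figures.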

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges (to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming) is because of the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there's inscriptions on this disc, but we actually don't know what it means. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered off of Crete that's 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting it into data. Think, for example, of the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information.

In this respect, location has been datafied. Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to speed off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.
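The talk does not say how the Tokyo researchers match a sitter to a profile, so the following is only a sketch of one plausible approach: treat the seat readings as a vector and accept the driver when it lies close enough to an enrolled profile. The function `is_authorized`, the threshold, the four-sensor seat (instead of 100) and all readings are assumptions made for illustration.

```python
import math

def is_authorized(reading, enrolled_profiles, threshold=5.0):
    """Return True if the seat reading is close to any enrolled driver profile.

    Closeness is plain Euclidean distance between sensor vectors; the
    threshold is an arbitrary illustrative value, not a calibrated one.
    """
    return any(
        math.dist(reading, profile) <= threshold
        for profile in enrolled_profiles.values()
    )

# Hypothetical pressure readings from a four-sensor seat.
enrolled = {"owner": [42.0, 38.5, 51.0, 47.5]}
owner_today = [41.0, 39.0, 50.5, 48.0]   # same person, slightly different posture
stranger = [55.0, 30.0, 44.0, 60.0]      # someone else behind the wheel

print(is_authorized(owner_today, enrolled))  # close to the enrolled profile
print(is_authorized(stranger, enrolled))     # far from it: ask for the password
```

A real system would need many more sensors and a tolerance learned from data, since the same person never sits in exactly the same way twice.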

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
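The loop described here (score positions, play yourself, gather outcomes, score better) can be sketched in a few lines. This is not Samuel's checkers program: it is a toy stand-in that assumes a much simpler take-away game, where players alternately remove one or two counters and whoever takes the last one wins. The names `win_rate`, `best_move` and `self_play`, the exploration rate and the game itself are all choices made for this illustration.

```python
import random

MOVES = (1, 2)   # a player may take one or two counters
START = 10       # counters on the table at the start of each game

wins = {}        # pile size -> games won by the player who moved from it
visits = {}      # pile size -> number of times that position was seen

def win_rate(pile):
    """Estimated chance that the player to move from `pile` goes on to win."""
    if pile == 0:
        return 0.0                      # nothing left: the previous player won
    if not visits.get(pile):
        return 0.5                      # no data yet, so no opinion
    return wins.get(pile, 0) / visits[pile]

def best_move(pile):
    """Pick the move that leaves the opponent the worst-scoring position."""
    legal = [m for m in MOVES if m <= pile]
    return min(legal, key=lambda m: win_rate(pile - m))

def self_play(games, explore=0.2, seed=0):
    """Let the program play itself and record who won from each position."""
    rng = random.Random(seed)
    for _ in range(games):
        pile, player, seen = START, 0, []
        while pile > 0:
            seen.append((pile, player))
            legal = [m for m in MOVES if m <= pile]
            # Mostly follow the current scores, sometimes try a random move.
            move = rng.choice(legal) if rng.random() < explore else best_move(pile)
            pile -= move
            winner = player             # the last player to move takes the game
            player = 1 - player
        for state, mover in seen:
            visits[state] = visits.get(state, 0) + 1
            wins[state] = wins.get(state, 0) + (mover == winner)

self_play(5000)
for pile in range(1, 7):
    print(pile, round(win_rate(pile), 2), "-> take", best_move(pile))
```

Nobody tells the program that leaving the opponent a multiple of three is the winning idea; the scores it collects from its own games are enough for it to start choosing those moves.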

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper? No. Algorithms are faster? No. Processors are better? No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to identify by looking at the data and survival rates to determine whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that this biopsy of the breast cancer cells are indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop on location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white-collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue-collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal. (Applause)
