Kenneth Cukier: Big data is better data

America's favorite pie is?

Audience: Apple. Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.
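
The pie effect can be sketched in a few lines of Python. The household rankings below are invented for illustration (they are not the supermarket data from the talk), and the family's shared choice is modeled with a Borda count, which is only one assumed way a group might compromise.

```python
from collections import Counter

# Hypothetical rankings (best first) for one four-person household.
# These values are made up for illustration, not real sales data.
family = [
    ["cherry", "apple", "pecan", "pumpkin"],
    ["pecan", "apple", "pumpkin", "cherry"],
    ["pumpkin", "apple", "cherry", "pecan"],
    ["cherry", "apple", "pumpkin", "pecan"],
]

def shared_pick(rankings):
    """Pick one pie for the whole group with a Borda count (assumed rule)."""
    scores = Counter()
    for ranking in rankings:
        for position, pie in enumerate(ranking):
            # First place earns the most points, last place earns zero.
            scores[pie] += len(ranking) - 1 - position
    return max(scores, key=scores.get)

def individual_picks(rankings):
    """Count purchases when each person buys their own first choice."""
    return Counter(ranking[0] for ranking in rankings)

big_pie = shared_pick(family)           # one large pie for everyone
small_pies = individual_picks(family)   # one small pie per person
print(big_pie)
print(small_pies)
```

In this made-up household, apple wins the shared purchase even though nobody ranks it first, and it disappears entirely once each person buys individually.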

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges — to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming — is because of the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there are inscriptions on this disc, but we actually don't know what they mean. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered on Crete, which is 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting them into data. Think, for example, of the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to stream off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.
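
One way to picture the seat "fingerprint" is as a list of pressure readings compared against an enrolled profile. The sketch below is an assumption for illustration only, not the Tokyo researchers' method: it uses four invented sensor values instead of 100, a hypothetical `APPROVED` table, and a plain Euclidean distance with an arbitrary threshold.

```python
import math

# Hypothetical enrolled seat-pressure profiles (4 sensors instead of ~100).
# All numbers here are invented for illustration.
APPROVED = {
    "owner": [0.90, 0.40, 0.70, 0.20],
}

def is_approved(reading, profiles=APPROVED, threshold=0.2):
    """Return True if the reading is close to any enrolled profile."""
    # math.dist computes the Euclidean distance between two points.
    return any(math.dist(reading, p) <= threshold for p in profiles.values())

owner_today = [0.88, 0.42, 0.69, 0.22]  # small day-to-day variation
stranger = [0.50, 0.80, 0.30, 0.60]     # a very different sitting posture

print(is_approved(owner_today))
print(is_approved(stranger))
```

A real system would need a far more careful choice of features and threshold; the point is only that posture, once measured, becomes a vector that software can compare.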

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
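
The self-play idea can be sketched on a much smaller game than checkers. The code below is not Samuel's program: it swaps in a simple stone-taking game (take 1 to 3 stones, whoever takes the last stone wins) and a plain win-rate table, and every name in it (`self_play`, `score`, `best_move`) is hypothetical. The program is told only the legal moves; it estimates the probability that each position leads to a win by playing against itself.

```python
import random

MOVES = (1, 2, 3)  # a player may take 1, 2 or 3 stones per turn

def self_play(games=20000, start=10, explore=0.2, seed=0):
    """Estimate, by self-play, how often leaving n stones wins the game."""
    rng = random.Random(seed)
    wins = [0] * start    # wins[n]: games won by the player who left n stones
    visits = [0] * start  # visits[n]: times any player left n stones

    def score(n):
        # Estimated win probability; unseen positions get a neutral 0.5.
        return wins[n] / visits[n] if visits[n] else 0.5

    def best_move(stones):
        # Choose the legal move whose resulting position scores highest.
        legal = [m for m in MOVES if m <= stones]
        return max(legal, key=lambda m: score(stones - m))

    for _ in range(games):
        stones, player = start, 0
        left = {0: [], 1: []}  # positions each player handed to the opponent
        while stones > 0:
            legal = [m for m in MOVES if m <= stones]
            # Mostly play the best known move, sometimes try a random one.
            if rng.random() < explore:
                move = rng.choice(legal)
            else:
                move = best_move(stones)
            stones -= move
            left[player].append(stones)
            winner = player  # whoever moved last took the final stone
            player = 1 - player
        # Collect more data: credit every position with the game's outcome.
        for p in (0, 1):
            for n in left[p]:
                visits[n] += 1
                wins[n] += (p == winner)
    return score, best_move

score, best_move = self_play()
print(best_move(10), best_move(5), best_move(3))
```

With these settings the learned policy tends to leave the opponent a multiple of four stones, which is the known winning strategy for this game, although nobody wrote that rule into the program.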

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to identify by looking at the data and survival rates to determine whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that this biopsy of the breast cancer cells are indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop on location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

(Applause)
