Kenneth Cukier: Big data is better data

America's favorite pie is?

Audience: Apple. Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.
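The pie effect can be reproduced with a toy model. The sketch below is illustrative only: the households, flavors, and rankings are invented, and the rule that a family buys the flavor with the best combined ranking is an assumption about how a shared purchase gets negotiated, not something taken from sales data.

```python
from collections import Counter

FLAVORS = ["apple", "cherry", "pecan", "pumpkin"]

# Hypothetical households; each person ranks all flavors, favorite first.
# Apple is nobody's first choice but everybody's second.
households = [
    [["cherry", "apple", "pecan", "pumpkin"],
     ["pecan", "apple", "pumpkin", "cherry"],
     ["pumpkin", "apple", "cherry", "pecan"]],
    [["cherry", "apple", "pecan", "pumpkin"],
     ["pumpkin", "apple", "pecan", "cherry"]],
    [["pecan", "apple", "pumpkin", "cherry"],
     ["cherry", "apple", "pumpkin", "pecan"]],
]

def family_pick(household):
    """Assumed compromise rule: the flavor with the lowest total rank wins."""
    return min(FLAVORS, key=lambda f: sum(ranking.index(f) for ranking in household))

# One large pie per household: the whole family has to agree.
large_pie_sales = Counter(family_pick(h) for h in households)

# One small pie per person: everyone buys their own first choice.
small_pie_sales = Counter(ranking[0] for h in households for ranking in h)

print("large pies:", large_pie_sales.most_common())
print("small pies:", small_pie_sales.most_common())
```

In this made-up data apple wins every shared purchase yet sells no individual pies at all, which is the pattern described above: the finer-grained data shows something the aggregated data hides.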

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges — to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming — is because of the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there are inscriptions on this disc, but we actually don't know what they mean. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered on Crete, which is 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting it into data. Think, for example, the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to stream off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.
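As a rough illustration of how a posture "fingerprint" could gate the engine, here is a minimal sketch. Everything in it is assumed for the example: the driver names, the four-sensor readings (standing in for the hundred or so sensors mentioned above), the distance threshold, and the function names. It is not a description of the Tokyo researchers' actual system.

```python
import math

# Hypothetical enrolled seat-pressure profiles, one value per sensor.
approved_drivers = {
    "owner": [0.82, 0.40, 0.65, 0.31],
    "spouse": [0.55, 0.72, 0.38, 0.60],
}
MATCH_THRESHOLD = 0.15  # assumed tolerance; a real system would have to tune this

def identify_driver(reading):
    """Return the closest approved driver, or None if nobody is close enough."""
    name, profile = min(
        approved_drivers.items(),
        key=lambda item: math.dist(item[1], reading),
    )
    return name if math.dist(profile, reading) <= MATCH_THRESHOLD else None

def engine_may_run(reading, password_ok=False):
    """The engine runs for a recognized posture or after a valid password."""
    return identify_driver(reading) is not None or password_ok

print(identify_driver([0.80, 0.42, 0.66, 0.30]))   # a reading close to the owner's profile
print(engine_may_run([0.30, 0.30, 0.90, 0.90]))    # an unfamiliar posture
```

The password override mirrors the fallback in the talk: an unrecognized posture stops the engine unless the driver proves authorization some other way.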

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
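The self-play idea can be sketched in a few lines. To be clear about the assumptions: this is not Samuel's checkers program. It swaps in a tiny hypothetical stone-taking game (remove one to three stones, and whoever takes the last stone wins) and a plain table of win frequencies, so that the loop of "play yourself, collect data, sharpen the prediction" fits on one screen. All names and constants are made up for the example.

```python
import random

START_STONES = 10   # hypothetical toy game, standing in for checkers
MAX_TAKE = 3        # a player removes 1 to 3 stones; taking the last one wins
EXPLORE = 0.2       # chance of trying a random move during self-play

def estimate(stats, stones):
    """Estimated chance that the player about to move wins with `stones` left."""
    if stones == 0:
        return 0.0  # nothing left to take: the previous player already won
    wins, visits = stats.get(stones, (0, 0))
    return (wins + 1) / (visits + 2)  # smoothed win frequency, 0.5 when unseen

def legal_moves(stones):
    return range(1, min(MAX_TAKE, stones) + 1)

def best_move(stats, stones):
    """Leave the opponent in the position where they look least likely to win."""
    return min(legal_moves(stones), key=lambda m: estimate(stats, stones - m))

def self_play(games, seed=0):
    rng = random.Random(seed)
    stats = {}  # stones left -> (wins, visits) for the player to move
    for _ in range(games):
        stones, player = START_STONES, 0
        seen = {0: [], 1: []}  # positions each side had to move from
        while stones > 0:
            seen[player].append(stones)
            if rng.random() < EXPLORE:
                move = rng.choice(list(legal_moves(stones)))
            else:
                move = best_move(stats, stones)
            stones -= move
            winner = player  # whoever moved last took the last stone
            player = 1 - player
        # More games means more data, which sharpens the win estimates.
        for side in (0, 1):
            for position in seen[side]:
                wins, visits = stats.get(position, (0, 0))
                stats[position] = (wins + (side == winner), visits + 1)
    return stats

stats = self_play(20000)
for stones in range(1, START_STONES + 1):
    print(stones, round(estimate(stats, stones), 2), "take", best_move(stats, stones))
```

The program is told only the legal moves and who won each game. With enough self-play it tends to settle on leaving the opponent a multiple of four stones, a strategy nobody wrote into the code.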

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to identify by looking at the data and survival rates to determine whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that this biopsy of the breast cancer cells are indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop on location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

(Applause)
