Kenneth Cukier: Big data is better data

America's favorite pie is?

Audience: Apple. Kenneth Cukier: Apple. Of course it is. How do we know it? Because of data. You look at supermarket sales. You look at supermarket sales of 30-centimeter pies that are frozen, and apple wins, no contest. The majority of the sales are apple. But then supermarkets started selling smaller, 11-centimeter pies, and suddenly, apple fell to fourth or fifth place. Why? What happened? Okay, think about it. When you buy a 30-centimeter pie, the whole family has to agree, and apple is everyone's second favorite. (Laughter) But when you buy an individual 11-centimeter pie, you can buy the one that you want. You can get your first choice. You have more data. You can see something that you couldn't see when you only had smaller amounts of it.

Now, the point here is that more data doesn't just let us see more, more of the same thing we were looking at. More data allows us to see new. It allows us to see better. It allows us to see different. In this case, it allows us to see what America's favorite pie is: not apple.
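
The pie effect can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the household, its rankings, and the use of a Borda count to model "the whole family has to agree" are all assumptions, not data or a method from the talk.

```python
from collections import Counter

# Hypothetical ranked preferences for one four-person household (made-up data).
household = [
    ["cherry", "apple", "pecan"],
    ["pecan", "apple", "cherry"],
    ["pumpkin", "apple", "pecan"],
    ["cherry", "apple", "pumpkin"],
]

def family_pie(rankings):
    """Pick one shared pie with a Borda count (an assumed stand-in for family agreement).

    A flavor earns more points the higher each person ranks it.
    """
    scores = Counter()
    for ranking in rankings:
        for position, flavor in enumerate(ranking):
            scores[flavor] += len(ranking) - position
    return scores.most_common(1)[0][0]

def individual_pies(rankings):
    """Count what gets bought when everyone picks their own first choice."""
    return Counter(ranking[0] for ranking in rankings)

print(family_pie(household))       # the single large pie the family settles on
print(individual_pies(household))  # the small pies each person would buy
```

In this made-up household, the shared purchase records only the compromise flavor, while the individual purchases record each person's first choice, which is the extra information the smaller pies expose.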

Now, you probably all have heard the term big data. In fact, you're probably sick of hearing the term big data. It is true that there is a lot of hype around the term, and that is very unfortunate, because big data is an extremely important tool by which society is going to advance. In the past, we used to look at small data and think about what it would mean to try to understand the world, and now we have a lot more of it, more than we ever could before. What we find is that when we have a large body of data, we can fundamentally do things that we couldn't do when we only had smaller amounts. Big data is important, and big data is new, and when you think about it, the only way this planet is going to deal with its global challenges — to feed people, supply them with medical care, supply them with energy, electricity, and to make sure they're not burnt to a crisp because of global warming — is because of the effective use of data.

So what is new about big data? What is the big deal? Well, to answer that question, let's think about what information looked like, physically looked like in the past. In 1908, on the island of Crete, archaeologists discovered a clay disc. They dated it from 2000 B.C., so it's 4,000 years old. Now, there's inscriptions on this disc, but we actually don't know what it means. It's a complete mystery, but the point is that this is what information used to look like 4,000 years ago. This is how society stored and transmitted information.

Now, society hasn't advanced all that much. We still store information on discs, but now we can store a lot more information, more than ever before. Searching it is easier. Copying it is easier. Sharing it is easier. Processing it is easier. And what we can do is we can reuse this information for uses that we never even imagined when we first collected the data. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is, if you will, a liquidity to information. The disc that was discovered off of Crete that's 4,000 years old, is heavy, it doesn't store a lot of information, and that information is unchangeable. By contrast, all of the files that Edward Snowden took from the National Security Agency in the United States fit on a memory stick the size of a fingernail, and they can be shared at the speed of light. More data. More.

Now, one reason why we have so much data in the world today is we are collecting things that we've always collected information on, but another reason why is we're taking things that have always been informational but have never been rendered into a data format and we are putting them into data. Think, for example, of the question of location. Take, for example, Martin Luther. If we wanted to know in the 1500s where Martin Luther was, we would have to follow him at all times, maybe with a feathery quill and an inkwell, and record it, but now think about what it looks like today. You know that somewhere, probably in a telecommunications carrier's database, there is a spreadsheet or at least a database entry that records your information of where you've been at all times. If you have a cell phone, and that cell phone has GPS, but even if it doesn't have GPS, it can record your information. In this respect, location has been datafied.

Now think, for example, of the issue of posture, the way that you are all sitting right now, the way that you sit, the way that you sit, the way that you sit. It's all different, and it's a function of your leg length and your back and the contours of your back, and if I were to put sensors, maybe 100 sensors into all of your chairs right now, I could create an index that's fairly unique to you, sort of like a fingerprint, but it's not your finger.

So what could we do with this? Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker sits behind the wheel, tries to stream off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine just stops, unless you type in a password into the dashboard to say, "Hey, I have authorization to drive." Great.
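
One way to picture the seat check is as a comparison between a fresh sensor reading and a stored profile. The sketch below is an assumption-laden illustration, not the Tokyo researchers' method: the four-value readings (standing in for the hundred or so sensors), the Euclidean distance, the threshold, and every name in it are hypothetical.

```python
import math

def distance(reading, profile):
    """Euclidean distance between two equally long lists of pressure values."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reading, profile)))

def is_approved_driver(reading, enrolled_profiles, threshold=0.2):
    """Accept the driver if the reading is close to any enrolled posture profile.

    The threshold is an arbitrary illustrative value, not a calibrated one.
    """
    return any(distance(reading, profile) <= threshold for profile in enrolled_profiles)

# Hypothetical normalized seat-pressure values.
owner_profile = [0.80, 0.60, 0.30, 0.90]
owner_today = [0.82, 0.58, 0.31, 0.88]
stranger = [0.40, 0.90, 0.70, 0.50]

print(is_approved_driver(owner_today, [owner_profile]))
print(is_approved_driver(stranger, [owner_profile]))
```

A real system would need far more sensors and a tolerance tuned on actual data; the point here is only that posture becomes a vector that can be compared like any other measurement.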

What if every single car in Europe had this technology in it? What could we do then? Maybe, if we aggregated the data, maybe we could identify telltale signs that best predict that a car accident is going to take place in the next five seconds. And then what we will have datafied is driver fatigue, and the service would be when the car senses that the person slumps into that position, automatically knows, hey, set an internal alarm that would vibrate the steering wheel, honk inside to say, "Hey, wake up, pay more attention to the road." These are the sorts of things we can do when we datafy more aspects of our lives.

So what is the value of big data? Well, think about it. You have more information. You can do things that you couldn't do before. One of the most impressive areas where this concept is taking place is in the area of machine learning. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. The general idea is that instead of instructing a computer what to do, we are going to simply throw data at the problem and tell the computer to figure it out for itself. And it will help you understand it by seeing its origins. In the 1950s, a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played. He won. He played. He won. He played. He won, because the computer only knew what a legal move was. Arthur Samuel knew something else. Arthur Samuel knew strategy. So he wrote a small sub-program alongside it operating in the background, and all it did was score the probability that a given board configuration would likely lead to a winning board versus a losing board after every move. He plays the computer. He wins. He plays the computer. He wins. He plays the computer. He wins. And then Arthur Samuel leaves the computer to play itself. It plays itself. It collects more data. It collects more data. It increases the accuracy of its prediction. And then Arthur Samuel goes back to the computer and he plays it, and he loses, and he plays it, and he loses, and he plays it, and he loses, and Arthur Samuel has created a machine that surpasses his ability in a task that he taught it.
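
The self-play idea can be sketched in miniature. What follows is not Samuel's checkers program: it swaps in a much smaller game (players alternately take 1 to 3 stones, and whoever takes the last stone wins) and a plain win-rate average as the scoring sub-program. The class, its parameters, and the game itself are assumptions chosen to keep the example short.

```python
import random

def legal_moves(stones):
    """The only rule knowledge built in: take 1, 2, or 3 stones."""
    return [take for take in (1, 2, 3) if take <= stones]

class SelfPlayLearner:
    """Learns, from its own games, how often leaving a given position leads to a win."""

    def __init__(self, seed=0, explore=0.2):
        self.rng = random.Random(seed)
        self.explore = explore  # chance of trying a random legal move
        self.wins = {}          # position left for the opponent -> later wins
        self.visits = {}        # position left for the opponent -> times reached

    def win_estimate(self, stones_left):
        """Estimated chance that the player who leaves this position goes on to win."""
        seen = self.visits.get(stones_left, 0)
        return self.wins.get(stones_left, 0) / seen if seen else 0.5

    def choose(self, stones, explore=True):
        moves = legal_moves(stones)
        if explore and self.rng.random() < self.explore:
            return self.rng.choice(moves)
        # Otherwise pick the move whose resulting position has scored best so far.
        return max(moves, key=lambda take: self.win_estimate(stones - take))

    def play_one_game(self, start=10):
        stones, player = start, 0
        left_by = {0: [], 1: []}
        while stones > 0:
            stones -= self.choose(stones)
            left_by[player].append(stones)
            winner = player  # whoever moved last took the final stone
            player = 1 - player
        # Collect more data: record the outcome for every position each player left.
        for who, positions in left_by.items():
            for position in positions:
                self.visits[position] = self.visits.get(position, 0) + 1
                self.wins[position] = self.wins.get(position, 0) + (who == winner)

    def train(self, games=20000):
        for _ in range(games):
            self.play_one_game()

learner = SelfPlayLearner()
learner.train()
print(learner.choose(5, explore=False))  # move preferred with 5 stones left
```

Nothing about strategy is written into the program. Any preference it ends up with, such as leaving the opponent a multiple of four stones, comes only from the outcomes of the games it played against itself.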

And this idea of machine learning is going everywhere. How do you think we have self-driving cars? Are we any better off as a society enshrining all the rules of the road into software? No. Memory is cheaper. No. Algorithms are faster. No. Processors are better. No. All of those things matter, but that's not why. It's because we changed the nature of the problem. We changed the nature of the problem from one in which we tried to overtly and explicitly explain to the computer how to drive to one in which we say, "Here's a lot of data around the vehicle. You figure it out. You figure it out that that is a traffic light, that that traffic light is red and not green, that that means that you need to stop and not go forward."

Machine learning is at the basis of many of the things that we do online: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. Researchers recently have looked at the question of biopsies, cancerous biopsies, and they've asked the computer to identify by looking at the data and survival rates to determine whether cells are actually cancerous or not, and sure enough, when you throw the data at it, through a machine-learning algorithm, the machine was able to identify the 12 telltale signs that best predict that this biopsy of the breast cancer cells are indeed cancerous. The problem: The medical literature only knew nine of them. Three of the traits were ones that people didn't need to look for, but that the machine spotted.

Now, there are dark sides to big data as well. It will improve our lives, but there are problems that we need to be conscious of, and the first one is the idea that we may be punished for predictions, that the police may use big data for their purposes, a little bit like "Minority Report." Now, it's a term called predictive policing, or algorithmic criminology, and the idea is that if we take a lot of data, for example where past crimes have been, we know where to send the patrols. That makes sense, but the problem, of course, is that it's not simply going to stop on location data, it's going to go down to the level of the individual. Why don't we use data about the person's high school transcript? Maybe we should use the fact that they're unemployed or not, their credit score, their web-surfing behavior, whether they're up late at night. Their Fitbit, when it's able to identify biochemistries, will show that they have aggressive thoughts. We may have algorithms that are likely to predict what we are about to do, and we may be held accountable before we've actually acted. Privacy was the central challenge in a small data era. In the big data age, the challenge will be safeguarding free will, moral choice, human volition, human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated. Now, we like to think that technology creates jobs over a period of time after a short, temporary period of dislocation, and that is true for the frame of reference with which we all live, the Industrial Revolution, because that's precisely what happened. But we forget something in that analysis: There are some categories of jobs that simply get eliminated and never come back. The Industrial Revolution wasn't very good if you were a horse. So we're going to need to be careful and take big data and adjust it for our needs, our very human needs. We have to be the master of this technology, not its servant. We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too, and we need to get better at this, and this will take time. It's a little bit like the challenge that was faced by primitive man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, how we work and how we think. It is going to help us manage our careers and lead lives of satisfaction and hope and happiness and health, but in the past, we've often looked at information technology and our eyes have only seen the T, the technology, the hardware, because that's what was physical. We now need to recast our gaze at the I, the information, which is less apparent, but in some ways a lot more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it, and that's why big data is a big deal.

(Applause)
