Tim Smith: Big Data

Big data is an elusive concept. It represents an amount of digital information that is uncomfortable to store, transport, or analyze. Big data is so voluminous that it overwhelms the technologies of the day and challenges us to create the next generation of data storage tools and techniques. So, big data isn't new. In fact, physicists at CERN have been wrangling with the challenge of their ever-expanding big data for decades.
Fifty years ago, CERN's data could be stored in a single computer. OK, so it wasn't your usual computer; this was a mainframe computer that filled an entire building. To analyze the data, physicists from around the world traveled to CERN to connect to the enormous machine.
In the 1970s, our ever-growing big data was distributed across different sets of computers, which mushroomed at CERN. Each set was joined together in dedicated, homegrown networks. But physicists collaborated without regard for the boundaries between sets, and hence needed to access data on all of them. So, we bridged the independent networks together in our own CERNET.
In the 1980s, islands of similar networks speaking different dialects sprang up all over Europe and the States, making remote access possible but torturous. To make it easy for our physicists across the world to access the ever-expanding big data stored at CERN without traveling, the networks needed to speak the same language. We adopted the fledgling internetworking standard from the States, followed by the rest of Europe, and we established the principal link at CERN between Europe and the States in 1989. The truly global internet took off! Physicists could then easily access the terabytes of big data remotely from around the world, generate results, and write papers in their home institutes.
Then, they wanted to share their findings with all their colleagues. To make this information sharing easy, we created the web in the early 1990s. Physicists no longer needed to know where the information was stored in order to find it and access it on the web, an idea which caught on across the world and has transformed the way we communicate in our daily lives.
During the early 2000s, the continued growth of our big data outstripped our capability to analyze it at CERN, despite having buildings full of computers. We had to start distributing the petabytes of data to our collaborating partners in order to employ local computing and storage at hundreds of different institutes. In order to orchestrate these interconnected resources with their diverse technologies, we developed a computing grid, enabling the seamless sharing of computing resources around the globe. This relies on trust relationships and mutual exchange. But this grid model could not be transferred out of our community so easily, where not everyone has resources to share, nor could companies be expected to have the same level of trust. Instead, an alternative, more business-like approach for accessing on-demand resources has been flourishing recently, called cloud computing, which other communities are now exploiting to analyze their big data.
It might seem paradoxical for a place like CERN, a lab focused on the study of the unimaginably small building blocks of matter, to be the source of something as big as big data. But the way we study the fundamental particles, as well as the forces by which they interact, involves creating them fleetingly, colliding protons in our accelerators and capturing a trace of them as they zoom off near light speed. To see those traces, our detector, with 150 million sensors, acts like a really massive 3-D camera, taking a picture of each collision event, up to 14 million times per second. That makes a lot of data.
But if big data has been around for so long, why do we suddenly keep hearing about it now? Well, as the old metaphor explains, the whole is greater than the sum of its parts, and it is no longer just science that is exploiting this. The fact that we can derive more knowledge by joining related information together and spotting correlations can inform and enrich numerous aspects of everyday life, either in real time, such as traffic or financial conditions, in short-term evolutions, such as medical or meteorological, or in predictive situations, such as business, crime, or disease trends.
Virtually every field is turning to gathering big data, with mobile sensor networks spanning the globe, cameras on the ground and in the air, archives storing information published on the web, and loggers capturing the activities of Internet citizens the world over. The challenge is on to invent new tools and techniques to mine these vast stores, to inform decision making, to improve medical diagnosis, and otherwise to answer needs and desires of tomorrow's society in ways that are unimagined today.
