Tim Smith: Big Data

Big data is an elusive concept. It represents an amount of digital information that is uncomfortable to store, transport, or analyze. Big data is so voluminous that it overwhelms the technologies of the day and challenges us to create the next generation of data storage tools and techniques.

So, big data isn't new. In fact, physicists at CERN have been wrangling with the challenge of their ever-expanding big data for decades. Fifty years ago, CERN's data could be stored in a single computer. OK, so it wasn't your usual computer; this was a mainframe computer that filled an entire building. To analyze the data, physicists from around the world traveled to CERN to connect to the enormous machine.

In the 1970s, our ever-growing big data was distributed across different sets of computers, which mushroomed at CERN. Each set was joined together in dedicated, homegrown networks. But physicists collaborated without regard for the boundaries between sets, and hence needed to access data on all of them. So, we bridged the independent networks together in our own CERNET.

In the 1980s, islands of similar networks speaking different dialects sprang up all over Europe and the States, making remote access possible but torturous. To make it easy for our physicists across the world to access the ever-expanding big data stored at CERN without traveling, the networks needed to speak the same language. We adopted the fledgling internetworking standard from the States, followed by the rest of Europe, and we established the principal link at CERN between Europe and the States in 1989, and the truly global internet took off!

Physicists could then easily access the terabytes of big data remotely from around the world, generate results, and write papers in their home institutes. Then, they wanted to share their findings with all their colleagues. To make this information sharing easy, we created the web in the early 1990s. Physicists no longer needed to know where the information was stored in order to find it and access it on the web, an idea which caught on across the world and has transformed the way we communicate in our daily lives.

During the early 2000s, the continued growth of our big data outstripped our capability to analyze it at CERN, despite having buildings full of computers. We had to start distributing the petabytes of data to our collaborating partners in order to employ local computing and storage at hundreds of different institutes. In order to orchestrate these interconnected resources with their diverse technologies, we developed a computing grid, enabling the seamless sharing of computing resources around the globe. This relies on trust relationships and mutual exchange. But this grid model could not be transferred out of our community so easily, where not everyone has resources to share, nor could companies be expected to have the same level of trust. Instead, an alternative, more business-like approach for accessing on-demand resources has been flourishing recently, called cloud computing, which other communities are now exploiting to analyze their big data.

It might seem paradoxical for a place like CERN, a lab focused on the study of the unimaginably small building blocks of matter, to be the source of something as big as big data. But the way we study the fundamental particles, as well as the forces by which they interact, involves creating them fleetingly, colliding protons in our accelerators and capturing a trace of them as they zoom off near light speed. To see those traces, our detector, with 150 million sensors, acts like a really massive 3-D camera, taking a picture of each collision event. That's up to 14 million times per second. That makes a lot of data.

But if big data has been around for so long, why do we suddenly keep hearing about it now? Well, as the old metaphor explains, the whole is greater than the sum of its parts, and it is no longer just science that is exploiting this. The fact that we can derive more knowledge by joining related information together and spotting correlations can inform and enrich numerous aspects of everyday life, either in real time, such as traffic or financial conditions, in short-term evolutions, such as medical or meteorological, or in predictive situations, such as business, crime, or disease trends. Virtually every field is turning to gathering big data, with mobile sensor networks spanning the globe, cameras on the ground and in the air, archives storing information published on the web, and loggers capturing the activities of Internet citizens the world over. The challenge is on to invent new tools and techniques to mine these vast stores, to inform decision making, to improve medical diagnosis, and otherwise to answer needs and desires of tomorrow's society in ways that are unimagined today.
