Tim Smith: Big Data

Big data is an elusive concept. It represents an amount of digital information that is uncomfortable to store, transport, or analyze. Big data is so voluminous that it overwhelms the technologies of the day and challenges us to create the next generation of data storage tools and techniques. So, big data isn't new. In fact, physicists at CERN have been wrangling with the challenge of their ever-expanding big data for decades.
Fifty years ago, CERN's data could be stored in a single computer. OK, so it wasn't your usual computer; this was a mainframe computer that filled an entire building. To analyze the data, physicists from around the world traveled to CERN to connect to the enormous machine.
In the 1970s, our ever-growing big data was distributed across different sets of computers, which mushroomed at CERN. Each set was joined together in dedicated, homegrown networks. But physicists collaborated without regard for the boundaries between sets, and hence needed to access data on all of these. So, we bridged the independent networks together in our own CERNET.
In the 1980s, islands of similar networks speaking different dialects sprang up all over Europe and the States, making remote access possible but torturous. To make it easy for our physicists across the world to access the ever-expanding big data stored at CERN without traveling, the networks needed to speak the same language. We adopted the fledgling internetworking standard from the States, followed by the rest of Europe, and we established the principal link at CERN between Europe and the States in 1989, and the truly global internet took off!
Physicists could then easily access the terabytes of big data remotely from around the world, generate results, and write papers in their home institutes. Then, they wanted to share their findings with all their colleagues. To make this information sharing easy, we created the web in the early 1990s.
Physicists no longer needed to know where the information was stored in order to find it and access it on the web, an idea which caught on across the world and has transformed the way we communicate in our daily lives.
During the early 2000s, the continued growth of our big data outstripped our capability to analyze it at CERN, despite having buildings full of computers. We had to start distributing the petabytes of data to our collaborating partners in order to employ local computing and storage at hundreds of different institutes. In order to orchestrate these interconnected resources with their diverse technologies, we developed a computing grid, enabling the seamless sharing of computing resources around the globe. This relies on trust relationships and mutual exchange. But this grid model could not be transferred out of our community so easily, where not everyone has resources to share, nor could companies be expected to have the same level of trust. Instead, an alternative, more business-like approach for accessing on-demand resources has been flourishing recently, called cloud computing, which other communities are now exploiting to analyze their big data.
It might seem paradoxical for a place like CERN, a lab focused on the study of the unimaginably small building blocks of matter, to be the source of something as big as big data. But the way we study the fundamental particles, as well as the forces by which they interact, involves creating them fleetingly, colliding protons in our accelerators and capturing a trace of them as they zoom off near light speed. To see those traces, our detector, with 150 million sensors, acts like a really massive 3-D camera, taking a picture of each collision event, and that's up to 14 million times per second. That makes a lot of data. But if big data has been around for so long, why do we suddenly keep hearing about it now?
Well, as the old metaphor explains, the whole is greater than the sum of its parts, and it is no longer just science that is exploiting this. The fact that we can derive more knowledge by joining related information together and spotting correlations can inform and enrich numerous aspects of everyday life, either in real time, such as traffic or financial conditions, in short-term evolutions, such as medical or meteorological, or in predictive situations, such as business, crime, or disease trends. Virtually every field is turning to gathering big data, with mobile sensor networks spanning the globe, cameras on the ground and in the air, archives storing information published on the web, and loggers capturing the activities of Internet citizens the world over. The challenge is on to invent new tools and techniques to mine these vast stores, to inform decision making, to improve medical diagnosis, and otherwise to answer the needs and desires of tomorrow's society in ways that are unimagined today.

