Tim Smith: Big Data

Big data is an elusive concept. It represents an amount of digital information that is uncomfortable to store, transport, or analyze. Big data is so voluminous that it overwhelms the technologies of the day and challenges us to create the next generation of data storage tools and techniques. So, big data isn't new. In fact, physicists at CERN have been wrangling with the challenge of their ever-expanding big data for decades.

Fifty years ago, CERN's data could be stored in a single computer. OK, so it wasn't your usual computer; this was a mainframe computer that filled an entire building. To analyze the data, physicists from around the world traveled to CERN to connect to the enormous machine.

In the 1970s, our ever-growing big data was distributed across different sets of computers, which mushroomed at CERN. Each set was joined together in a dedicated, homegrown network. But physicists collaborated without regard for the boundaries between sets, and hence needed to access data on all of them. So, we bridged the independent networks together in our own CERNET.

In the 1980s, islands of similar networks speaking different dialects sprang up all over Europe and the States, making remote access possible but torturous. To make it easy for our physicists across the world to access the ever-expanding big data stored at CERN without traveling, the networks needed to speak the same language. We adopted the fledgling internetworking standard from the States, the rest of Europe followed, and in 1989 we established the principal link at CERN between Europe and the States. The truly global internet took off!

Physicists could then easily access the terabytes of big data remotely from around the world, generate results, and write papers in their home institutes. Then, they wanted to share their findings with all their colleagues. To make this information sharing easy, we created the web in the early 1990s.
Physicists no longer needed to know where the information was stored in order to find it and access it on the web, an idea which caught on across the world and has transformed the way we communicate in our daily lives.

During the early 2000s, the continued growth of our big data outstripped our capability to analyze it at CERN, despite having buildings full of computers. We had to start distributing the petabytes of data to our collaborating partners in order to employ local computing and storage at hundreds of different institutes. In order to orchestrate these interconnected resources with their diverse technologies, we developed a computing grid, enabling the seamless sharing of computing resources around the globe. This relies on trust relationships and mutual exchange. But this grid model could not be transferred out of our community so easily: elsewhere, not everyone has resources to share, nor could companies be expected to have the same level of trust. Instead, an alternative, more business-like approach for accessing on-demand resources has been flourishing recently, called cloud computing, which other communities are now exploiting to analyze their big data.

It might seem paradoxical for a place like CERN, a lab focused on the study of the unimaginably small building blocks of matter, to be the source of something as big as big data. But the way we study the fundamental particles, as well as the forces by which they interact, involves creating them fleetingly, colliding protons in our accelerators and capturing a trace of them as they zoom off at near light speed. To see those traces, our detector, with 150 million sensors, acts like a really massive 3-D camera, taking a picture of each collision event; that's up to 14 million times per second. That makes a lot of data.

But if big data has been around for so long, why do we suddenly keep hearing about it now?
Well, as the old metaphor explains, the whole is greater than the sum of its parts, and it is no longer just science that is exploiting this. The fact that we can derive more knowledge by joining related information together and spotting correlations can inform and enrich numerous aspects of everyday life, whether in real time, such as traffic or financial conditions, in short-term evolutions, such as medical or meteorological conditions, or in predictive situations, such as business, crime, or disease trends. Virtually every field is turning to gathering big data, with mobile sensor networks spanning the globe, cameras on the ground and in the air, archives storing information published on the web, and loggers capturing the activities of Internet citizens the world over. The challenge is on to invent new tools and techniques to mine these vast stores, to inform decision making, to improve medical diagnosis, and otherwise to answer the needs and desires of tomorrow's society in ways that are unimagined today.


Related talks

Sajan Saini: The hidden network that makes the internet possible

Mark Liddell: How statistics can be misleading

George Zaidan: Why is ketchup so hard to pour?
