Tim Smith: Big Data

Big data is an elusive concept. It represents an amount of digital information that is uncomfortable to store, transport, or analyze. Big data is so voluminous that it overwhelms the technologies of the day and challenges us to create the next generation of data storage tools and techniques. So, big data isn't new. In fact, physicists at CERN have been wrangling with the challenge of their ever-expanding big data for decades. Fifty years ago, CERN's data could be stored in a single computer. OK, so it wasn't your usual computer, this was a mainframe computer that filled an entire building. To analyze the data, physicists from around the world traveled to CERN to connect to the enormous machine. In the 1970s, our ever-growing big data was distributed across different sets of computers, which mushroomed at CERN. Each set was joined together in dedicated, homegrown networks. But physicists collaborated without regard for the boundaries between sets, hence needed to access data on all of these. So, we bridged the independent networks together in our own CERNET. In the 1980s, islands of similar networks speaking different dialects sprang up all over Europe and the States, making remote access possible but torturous. To make it easy for our physicists across the world to access the ever-expanding big data stored at CERN without traveling, the networks needed to be talking with the same language. We adopted the fledgling internetworking standard from the States, followed by the rest of Europe, and we established the principal link at CERN between Europe and the States in 1989, and the truly global internet took off! Physicists could then easily access the terabytes of big data remotely from around the world, generate results, and write papers in their home institutes. Then, they wanted to share their findings with all their colleagues. To make this information sharing easy, we created the web in the early 1990s.
Physicists no longer needed to know where the information was stored in order to find it and access it on the web, an idea which caught on across the world and has transformed the way we communicate in our daily lives. During the early 2000s, the continued growth of our big data outstripped our capability to analyze it at CERN, despite having buildings full of computers. We had to start distributing the petabytes of data to our collaborating partners in order to employ local computing and storage at hundreds of different institutes. In order to orchestrate these interconnected resources with their diverse technologies, we developed a computing grid, enabling the seamless sharing of computing resources around the globe. This relies on trust relationships and mutual exchange. But this grid model could not be transferred out of our community so easily, where not everyone has resources to share nor could companies be expected to have the same level of trust. Instead, an alternative, more business-like approach for accessing on-demand resources has been flourishing recently, called cloud computing, which other communities are now exploiting to analyze their big data. It might seem paradoxical for a place like CERN, a lab focused on the study of the unimaginably small building blocks of matter, to be the source of something as big as big data. But the way we study the fundamental particles, as well as the forces by which they interact, involves creating them fleetingly, colliding protons in our accelerators and capturing a trace of them as they zoom off near light speed. To see those traces, our detector, with 150 million sensors, acts like a really massive 3-D camera, taking a picture of each collision event - that's up to 14 million times per second. That makes a lot of data. But if big data has been around for so long, why do we suddenly keep hearing about it now?
Well, as the old metaphor explains, the whole is greater than the sum of its parts, and it is no longer just science that is exploiting this. The fact that we can derive more knowledge by joining related information together and spotting correlations can inform and enrich numerous aspects of everyday life, either in real time, such as traffic or financial conditions, in short-term evolutions, such as medical or meteorological, or in predictive situations, such as business, crime, or disease trends. Virtually every field is turning to gathering big data, with mobile sensor networks spanning the globe, cameras on the ground and in the air, archives storing information published on the web, and loggers capturing the activities of Internet citizens the world over. The challenge is on to invent new tools and techniques to mine these vast stores, to inform decision making, to improve medical diagnosis, and otherwise to answer the needs and desires of tomorrow's society in ways that are unimagined today.

