Tim Smith: Big Data

Big data is an elusive concept. It represents an amount of digital information that is uncomfortable to store, transport, or analyze. Big data is so voluminous that it overwhelms the technologies of the day and challenges us to create the next generation of data storage tools and techniques. So, big data isn't new. In fact, physicists at CERN have been wrangling with the challenge of their ever-expanding big data for decades.
Fifty years ago, CERN's data could be stored in a single computer. OK, so it wasn't your usual computer; this was a mainframe computer that filled an entire building. To analyze the data, physicists from around the world traveled to CERN to connect to the enormous machine.
In the 1970s, our ever-growing big data was distributed across different sets of computers, which mushroomed at CERN. Each set was joined together in dedicated, homegrown networks. But physicists collaborated without regard for the boundaries between sets, and hence needed to access data on all of these. So, we bridged the independent networks together in our own CERNET.
In the 1980s, islands of similar networks speaking different dialects sprang up all over Europe and the States, making remote access possible but torturous. To make it easy for our physicists across the world to access the ever-expanding big data stored at CERN without traveling, the networks needed to speak the same language. We adopted the fledgling internetworking standard from the States, followed by the rest of Europe, and we established the principal link at CERN between Europe and the States in 1989, and the truly global internet took off!
Physicists could then easily access the terabytes of big data remotely from around the world, generate results, and write papers in their home institutes. Then, they wanted to share their findings with all their colleagues. To make this information sharing easy, we created the web in the early 1990s.
Physicists no longer needed to know where the information was stored in order to find it and access it on the web, an idea which caught on across the world and has transformed the way we communicate in our daily lives.
During the early 2000s, the continued growth of our big data outstripped our capability to analyze it at CERN, despite having buildings full of computers. We had to start distributing the petabytes of data to our collaborating partners in order to employ local computing and storage at hundreds of different institutes. In order to orchestrate these interconnected resources with their diverse technologies, we developed a computing grid, enabling the seamless sharing of computing resources around the globe. This relies on trust relationships and mutual exchange. But this grid model could not be transferred out of our community so easily, where not everyone has resources to share, nor could companies be expected to have the same level of trust. Instead, an alternative, more business-like approach for accessing on-demand resources has been flourishing recently, called cloud computing, which other communities are now exploiting to analyze their big data.
It might seem paradoxical for a place like CERN, a lab focused on the study of the unimaginably small building blocks of matter, to be the source of something as big as big data. But the way we study the fundamental particles, as well as the forces by which they interact, involves creating them fleetingly, colliding protons in our accelerators and capturing a trace of them as they zoom off near light speed. To see those traces, our detector, with 150 million sensors, acts like a really massive 3-D camera, taking a picture of each collision event, up to 14 million times per second. That makes a lot of data.
But if big data has been around for so long, why do we suddenly keep hearing about it now?
Well, as the old metaphor explains, the whole is greater than the sum of its parts, and it is no longer just science that is exploiting this. The fact that we can derive more knowledge by joining related information together and spotting correlations can inform and enrich numerous aspects of everyday life, either in real time, such as traffic or financial conditions, in short-term evolutions, such as medical or meteorological, or in predictive situations, such as business, crime, or disease trends. Virtually every field is turning to gathering big data, with mobile sensor networks spanning the globe, cameras on the ground and in the air, archives storing information published on the web, and loggers capturing the activities of Internet citizens the world over. The challenge is on to invent new tools and techniques to mine these vast stores, to inform decision making, to improve medical diagnosis, and otherwise to answer the needs and desires of tomorrow's society in ways that are unimagined today.
