Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Erez Lieberman Aiden: Každý vie, že obrázok je hoden tisíc slov. Ale my na Harvarde sme sa zamysleli, či je to naozaj pravda. (Smiech) Zhromaždili sme teda tím odborníkov z Harvardu, MIT, The American Heritage Dictionary, Encyklopédie Britannica a aj od našich hrdých sponzorov z Googlu. A uvažovali sme o tom asi štyri roky. A došli sme k prekvapujúcemu záveru. Dámy a páni, obrázok nie je hoden tisíc slov. V skutočnosti sme našli obrázky hodné 500 miliárd slov.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

Jean-Baptiste Michel: Takže, ako sme dospeli k tomuto záveru? Erez a ja sme premýšľali o cestách k získaniu celistvého obrazu o ľudskej kultúre a ľudskej histórii: ich zmenách v priebehu času. Tak veľa kníh bolo napísaných za všetky tie roky. Takže sme si pomysleli: najlepší spôsob, ako sa z nich poučiť, je prečítať všetky tieto milióny kníh. Samozrejme, ak si predstavíme mieru úžasnosti niečoho takého, toto musí bodovať veľmi, veľmi vysoko. Problém je, že k tomu prislúcha aj X-ová os - os praktičnosti. Toto je veľmi, veľmi nízko.

(Applause)

(Potlesk)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

Ľudia zvyknú používať alternatívny prístup, vyberú zopár prameňov a prečítajú ich veľmi pozorne. Toto je veľmi praktické, ale nie až také úžasné. Čo naozaj chcete dosiahnuť, je umiestniť sa do úžasnej, ešte však praktickej časti tohto priestoru. Tak sa stalo, že kúsok cez rieku bola spoločnosť nazývaná Google, ktorá pred pár rokmi začala digitalizačný projekt, ktorý by akurát mohol umožniť takýto prístup. Digitalizovali milióny kníh. To znamená, že je možné použiť výpočtové metódy na čítanie všetkých týchto kníh stlačením klávesy. To je veľmi praktické a extrémne úžasné.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

ELA: Dovoľte mi rozpovedať vám o tom, odkiaľ knihy prichádzajú. Od nepamäti existovali spisovatelia. Títo spisovatelia sa snažili písať knihy. A to sa im významne zjednodušilo s rozvojom kníhtlače pred niekoľkými storočiami. Odvtedy sa spisovateľom podarilo, pri 129 miliónoch rôznych príležitostiach, vydať knihu. Ak sa tieto knihy nestratili v prúde času, potom sú niekde v nejakej knižnici, a mnoho z týchto kníh bolo získaných z týchto knižníc a digitalizovaných v Google, ktorý doteraz oskenoval 15 miliónov kníh.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."

Keď Google digitalizuje knihu, uložia ju do ozaj pekného formátu. Máme dáta a navyše máme aj metadáta. Máme informácie o veciach ako je miesto vydania, autor, obdobie vydania. A naša činnosť potom spočíva v prehliadaní týchto záznamov a vylúčení všetkého, okrem dát najvyššej kvality. Čo nám zostane, je súbor piatich miliónov kníh, 500 miliárd slov, reťazec znakov tisíckrát dlhší než ľudský genóm -- text, ktorý, ak by sme ho napísali, by sa tiahol odtiaľ na Mesiac a späť 10 krát -- ozajstný úlomok nášho kultúrneho genómu. Samozrejme, čo sme urobili, čeliac takejto hroznej hyperbole ... (Smiech) sme urobili to, čo by býval urobil každý výskumník so štipkou sebaúcty. Vybrali sme stránku z XKCD, a riekli, "Ustúp. Ideme vyskúšať vedu."

(Laughter)

(Smiech)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

JM: Samozrejme, uvažovali sme, skúsme my len najprv zverejniť dáta, pre ostatných nech si na tom robia vedu. A tak uvažujeme, ktoré dáta môžeme zverejniť? Samozrejme, chcete vziať knihy a vydať plný text týchto piatich miliónov kníh. Google a osobitne Jon Orwant, nám ukázali malú rovnicu, ktorú sme sa museli naučiť. Vezmite päť miliónov kníh, to znamená päť miliónov autorov a päť miliónov žalobcov a máte masívny súdny proces. Takže, aj keď by to bolo veľmi, veľmi úžasné, opäť, extrémne, extrémne nepraktické. (Smiech)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.

Opäť sme to svojim spôsobom vyriešili a zvolili sme veľmi praktický prístup, ktorý bol o kúsok menej úžasný. Povedali sme si, namiesto zverejnenia plného textu zverejníme štatistické informácie o knihách. Napríklad "A gleam of happiness" ("Záblesk šťastia"). To sú štyri slová: nazývame to štyr-gram. Povieme vám, koľkokrát sa určitý štyr-gram objavuje v knihách v rokoch 1801, 1802, 1803, až do roku 2008. To nám dáva časovú závislosť frekvencie použitia určitej vety v priebehu času. Urobíme to pre všetky slová a frázy, ktoré sa objavujú v týchto knihách a to nám dáva veľkú tabuľku s dvoma miliardami riadkov, ktorá nám hovorí a cestách kultúrnych zmien.
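
As a rough illustration of the counting described above, here is a minimal Python sketch that builds a per-year table of n-gram counts. The corpus format (a list of year and text pairs), the function name `ngram_counts_by_year`, and the whitespace tokenization are all assumptions made for this example; it is not the pipeline used to produce the actual data set.

```python
from collections import Counter, defaultdict


def ngram_counts_by_year(corpus, n):
    """Count how often each n-gram occurs in each year.

    corpus: iterable of (year, text) pairs (hypothetical toy format).
    Returns a mapping {ngram_tuple: Counter({year: count})}.
    """
    table = defaultdict(Counter)
    for year, text in corpus:
        # Naive lowercase + whitespace tokenization (an assumption;
        # real text would need proper tokenization and OCR cleanup).
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n])][year] += 1
    return table


# Invented toy sentences, only to show the shape of the result.
toy_corpus = [
    (1801, "a gleam of happiness lit her face"),
    (1801, "there was a gleam of happiness"),
    (1802, "not a gleam of hope remained"),
]

counts = ngram_counts_by_year(toy_corpus, 4)
series = counts[("a", "gleam", "of", "happiness")]
print(sorted(series.items()))
```

Each row of the resulting table is one n-gram with its yearly counts, which is the kind of time series the talk refers to. Dividing each count by the total number of n-grams for that year would give a frequency rather than a raw count.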

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

ELA: Teda tie dve miliardy riadkov, nazývame ich dve miliardy n-gramov. Čo nám hovoria? Individuálne n-gramy sú mierou kultúrnych trendov. Dovoľte mi uviesť vám jeden príklad. Predpokladajme, že je mi skvele, a potom zajtra vám chcem povedať, ako dobre mi bolo. A teda by som mohol povedať "Včera som si voľkal." Alternatívne by som mohol povedať "Včera som sa tešil." Ktorý z nich by som mal použiť? Ako sa rozhodnúť?

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

Už približne šesť mesiacov špičkový prístup v tejto oblasti je, že by ste, napríklad, navštívili nasledujúceho psychológa s úžasným účesom, a riekli by ste, "Steve, vy ste expert na nepravidelné slovesá. Čo by som mal robiť?" A on by vám povedal, "Väčšina ľudí hovorí tešiť sa, ale niektorí ľudia hovoria voľkať si." A tiež ste vedeli, viac-menej, že, ak by ste sa presunuli späť v čase o 200 rokov a opýtali sa nasledujúceho štátnika s rovnako úžasným účesom: (Smiech) "Tom, čo by som mal povedať?" On by odpovedal, "Za mojich čias, väčšina ľudí používala voľkať si, no niektorí používali tešiť sa." Takže to, čo vám teraz ukážem sú iba holé dáta. Dva riadky z tabuľky s dvoma miliardami záznamov. To, čo vidíte je frekvencia výskytu, rok za rokom, "tešiť sa" a "voľkať si" v priebehu času. Toto sú iba dva z dvoch miliárd riadkov. Takže, celý set dát je miliardukrát úžasnejší než tento obrázok.

(Laughter)

(Smiech)

(Applause)

(Potlesk)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

JM: Je mnoho ďalších obrázkov, ktoré sú hodné 500 miliárd slov. Napríklad tento. Ak vezmete slovo influenza, spozorujete zvýšený výskyt v časoch, o ktorých je známe, že chrípkové epidémie práve zabíjali ľudí po svete.

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

ELA: Ak ešte nie ste presvedčení, hladiny morí stúpajú, rovnako aj atmosférický CO2 a globálna teplota.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

JM: Mohol by vás zaujímať aj tento partikulárny n-gram, ktorý Nietzschemu hovorí, že Boh nie je mŕtvy, aj keď by ste mohli súhlasiť, že by sa mu hodil lepší PR manažér.

(Laughter)

(Smiech)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

ELA: S touto vecičkou môžete dospieť k pekne abstraktným konceptom. Napríklad, dovoľte mi rozpovedať vám históriu roku 1950. Podstatnú väčšinu dejín, nikto na rok 1950 ani nekýchol v rokoch 1700, 1800, 1900, nik sa nezaujímal. V priebehu 30-tych a 40-tych, sa nik nezaujímal. Zrazu, v polovici 40-tych nastal šum. Ľudia si uvedomili, že rok 1950 prichádza a mohol by byť veľkolepý. (Smiech) Avšak nič ľudí nezaujalo počas roku 1950, tak, ako rok 1950. (Smiech) Ľudia chodili ako posadnutí. Nemohli prestať hovoriť o všetkom, čo robili počas roku 1950, všetkom, čo plánovali robiť v roku 1950, všetkých snoch, ktoré si chceli splniť v roku 1950. Fakticky, rok 1950 bol taký fascinujúci, že celé roky potom ľudia jednoducho ďalej hovorili o všetkých úžasných veciach, ktoré sa udiali. v rokoch 51, 52, 53. Konečne, v roku 1954 sa ktosi prebral a nahliadol, že rok 1950 je akosi passé. (Smiech) A takto bublina spľasla.

(Laughter)

(Smiech)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.

A príbeh roku 1950, je príbehom každého roku, o ktorom máme záznamy. s malým háčikom, pretože teraz máme tieto pekné tabuľky. A pretože máme tieto pekné tabuľky, môžeme veci merať. Môžeme sa opýtať: "Hm, ako rýchlo bublina spľasne?" A ukazuje sa, že to môžeme merať veľmi presne. Rovnice boli odvodené, grafy vytvorené, a výsledok je, že bubliny spľasnú rýchlejšie a rýchlejšie každým odchádzajúcim rokom. Záujem o minulosť strácame rýchlejšie.

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.

JM: Teraz malá rada ku kariérnemu rastu. Takže pre tých z vás, ktorí chcú byť slávni, sa môžeme poučiť od 25 najznámejších politikov, spisovateľov, hercov a tak ďalej. Takže ak sa chcete stať slávnym čo najskôr, mali by ste byť hercom, pretože potom vaša sláva začne rásť ešte pred tridsiatkou -- ste ešte mladý, je to ozaj super. Ak môžete chvíľu počkať, staňte sa spisovateľom, pretože potom môžete dosiahnuť k výšinám, ako Mark Twain, napríklad: extrémne slávny. Ale ak chcete naozaj na vrchol, mali by ste odložiť príjemnosti a samozrejme, stať sa politikom. Takže tu sa stávate slávnym pred vašou šesťdesiatkou, a následne sa stávate veľmi, veľmi slávnym. Vedci sa k sláve dostávajú ako omnoho starší. Tak napríklad, biológovia a fyzici sú takmer takí slávni ako herci. Chyby, ktorej by ste sa mali vyvarovať je stať sa matematikom. (Smiech) Ak to urobíte, môžete si myslieť: "Ó, skvelé, do tridsiatky urobím svoju najlepšiu prácu." Ale hádajte čo? Nikoho to nebude naozaj zaujímať.

(Laughter)

(Smiech)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

ELA: N-gramy prinášajú ešte viac vytriezvujúcich poznatkov. Napríklad tu je trajektória Marca Chagalla, umelca narodeného v roku 1887. A toto vyzerá ako normálna trajektória slávnej osoby. Stáva sa slávnejším a slávnejším, s výnimkou, ak hľadáte v nemčine. Ak hľadáte v nemčine, uvidíte niečo úplne zvláštne, niečo, čo sa takmer nikdy neobjaví, teda, že sa stáva extrémne slávnym a potom z ničoho nič zmizne, prechádzajúc úplným minimom medzi rokmi 1933 a 1945, a následne opätovne narastajúc. Samozrejme, to, čo vidíme, je skutočnosť, že Marc Chagall bol židovským umelcom v nacistickom Nemecku.

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.

Tieto signály sú v skutočnosti také silné, že nepotrebujeme vedieť, či bol niekto cenzúrovaný. Môžeme na to jednoducho prísť použitím naozaj základného spracovania signálov. Tu je jednoduchý spôsob, ako to urobiť. Je rozumné predpokladať, že sláva danej osoby počas istého časového úseku, by mala byť približne priemerom jej slávy pred a slávy po ňom. Takže očakávame takéto niečo. A porovnáme to so slávou, ktorú pozorujeme. A jednoducho vydelíme jednu druhou, aby sme dostali niečo, čo nazývame index supresie. Ak je index supresie veľmi, veľmi, veľmi malý, potom je dosť možné, že ste potláčaný. Ak je veľmi veľký, je možné, že si pomáhate propagandou.
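
The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `suppression_index`, the `window` parameter (how many years before and after are averaged), and the fame values are invented for the example and are not taken from the talk or the underlying study.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def suppression_index(fame_by_year, start, end, window=10):
    """Observed fame during [start, end] divided by expected fame.

    Expected fame is taken as the average of the mean fame in the
    `window` years before the period and the `window` years after it.
    The window length is an assumption made for this sketch.
    """
    during = [fame_by_year[y] for y in range(start, end + 1)]
    before = [fame_by_year[y] for y in range(start - window, start)]
    after = [fame_by_year[y] for y in range(end + 1, end + 1 + window)]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Invented fame values (e.g. mentions per million words), not real data.
fame = {}
for year in range(1923, 1933):
    fame[year] = 4.0   # before the period
for year in range(1933, 1946):
    fame[year] = 1.0   # during the period
for year in range(1946, 1956):
    fame[year] = 6.0   # after the period

index = suppression_index(fame, 1933, 1945)
print(round(index, 3))
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 1.0, so the index comes out well below one, the pattern the talk associates with suppression. A value well above one would correspond to the propaganda case.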

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.

JM: Vskutku sa môžete pozrieť na distribúciu indexov supresie cez celé populácie. Napríklad, tu -- tento index supresie je vyrátaný pre 5000 ľudí vybraných v anglických knihách. Kde nie je žiadna supresia -- vyzeralo by to takto, tesne centrované okolo jednotky. Čo očakávate, je, v podstate, to, čo pozorujete. Toto je distribúcia pozorovaná v Nemecku -- veľmi rozdielna, je posunutá doľava. Ľudia o tom hovorili asi dvakrát menej ako by sa dalo očakávať, ale čo je ešte dôležitejšie, distribúcia je oveľa širšia. Je mnoho ľudí, ktorý skončia na ľavom konci tejto distribúcie, o ktorých sa hovorí asi 10 ráz menej, než by sa malo. Ale tiež mnoho ľudí na pravom konci, ktorým, zdá sa, pomáha propaganda. Tento obrázok predstavuje etalón cenzorstva v knižných záznamoch.

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection and analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

ELA: Takže kulturonómia je termín, ktorý používame pre túto metódu. Je podobná genomike. Zatiaľ, čo genomika je objektívom biológie cez okno sekvencie ľudského genómu, kulturonómia je podobná. Je to aplikácia analýzy dát masívneho rozsahu pre štúdium ľudskej kultúry. Tu je genóm nahradený objektívom digitalizovaných historických záznamov. Skvelé na kulturonómii je, že ju môže robiť každý. Prečo každý? Môže ju robiť ktokoľvek, pretože traja chlapíci, Jon Orwant, Matt Gray a Will Brockman z Google sa pozreli na prototyp Ngram Viewer a povedali si, "Toto je taká zábava, musíme ju sprístupniť ľuďom!" Takže za dva týždne - dva týždne pred vydaním nášho článku - naprogramovali verziu Ngram Viewer-u pre verejnosť. Takže teraz môžete vpísať akékoľvek slovo alebo frázu, ktorá vás zaujíma a okamžite vidieť príslušný N-gram, a tiež prezerať príklady všetkých rôznych kníh, v ktorých sa objavuje váš N-gram.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the basic standards in the sciences.

JM: Aplikácia bola použitá viac ako miliónkrát počas prvého dňa, a toto je naozaj najlepší zo všetkých dotazov. Takže ľudia sa snažia robiť všetko najlepšie ("their best") v službách pokroku. Ale ukazuje sa, že v 18-tom storočí, sa o to nestarali vôbec. Nechceli robiť "their best", robili "their beft". Čo sa stalo, je, samozrejme, iba chyba. Nebola to snaha po priemernosti, išlo len o to, že "s" sa písalo odlišne, podobne ako "f." Samozrejme, Google o tom vtedy ešte nevedel, takže sme to reportovali v našom odbornom článku. Ale to je iba pripomienka, že aj keď je toto veľká zábava, pri interpretácii grafov musíte byť veľmi opatrní a používať základné vedecké pravidlá.

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.

ELA: Ľudia to používajú na všetky možné srandovné účely. (Smiech) Vskutku, nemusíme ani rozprávať, iba vám mlčky ukážeme všetky zostávajúce obrázky Túto osobu zaujímala história frustrácie. Existujú rôzne druhy frustrácie. Ak si prepichnete prst je to "argh" (ach) s jedným "a" Ak je planéta Zem anihilovaná Vogónmi za účelom uvoľnenia priestoru pre vesmírnu diaľnicu, je to "aaaaaaaargh" o ôsmich "a." Táto osoba skúmala všetky "argh", s jedným až ôsmimi "a" A ukazuje sa že menej frekventované "arghs" sú, samozrejme, tie, ktoré zodpovedajú veciam, ktoré sú frustrujúcejšie -- s výnimkou, prekvapujúco, začiatku 80-tych. Myslíme, že by to mohlo mať dočinenia s Reaganom.

(Laughter)

(Smiech)

JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there's newspapers, there's things that are not text, like art and paintings. These will all be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

JM: Je veľa použití pre tieto dáta, ale najpodstatnejšie je, že historické záznamy sú digitalizované. Google začal s digitalizáciou 15 miliónov kníh. To je 12 percent všetkých kníh, ktoré kedy boli vydané. To predstavuje veľkú časť ľudskej kultúry. Kultúra je oveľa širšia: spadajú tam rukopisy, noviny, patria tam veci, ktoré nie sú textom, ako výtvarné umenie a maľby. Toto všetko bude na našich počítačoch, na počítačoch po celom svete. Až sa toto stane, transformuje to náš prístup k porozumeniu našej minulosti, prítomnosti a ľudstvu.

Thank you very much.

Ďakujeme veľmi pekne.

(Applause)

(Potlesk)
