Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.
(Applause)
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.
ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."
(Laughter)
JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)
Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
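As a rough illustration of the table described here (not the speakers' actual pipeline), the sketch below counts how often each n-gram occurs per publication year in a tiny invented corpus. The function names `ngrams` and `yearly_counts`, the whitespace tokenization, the `(year, text)` input layout, and the sample texts are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def yearly_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    `books` is an iterable of (year, text) pairs (a hypothetical layout).
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table

# Invented toy corpus standing in for the digitized books.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and a gleam of happiness again"),
    (1803, "no happiness here"),
]

table = yearly_counts(books, 4)
series = table["a gleam of happiness"]
# One row of the table: the four-gram's count for each year.
print([(year, series[year]) for year in (1801, 1802, 1803)])
```

Each key of `table` plays the role of one row in the big table: a phrase together with its year-by-year counts. Dividing each count by the total number of words published that year would turn the counts into frequencies that can be compared across years.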
ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?
As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.
(Laughter)
(Applause)
JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.
ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.
JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.
(Laughter)
ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.
(Laughter)
And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.
(Laughter)
ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.
Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
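The calculation just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speakers' actual code: the function name `suppression_index`, the choice to use every available year before and after the window, and the fame values are all invented for the example. The index is taken as observed fame divided by expected fame, which matches the statement that a very small value suggests suppression.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one data point in each window")
    return sum(values) / len(values)

def suppression_index(fame_by_year, start, end):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of mean fame before the period and
    mean fame after it. The window handling is an assumption of this sketch.
    """
    before = [f for y, f in fame_by_year.items() if y < start]
    during = [f for y, f in fame_by_year.items() if start <= y <= end]
    after = [f for y, f in fame_by_year.items() if y > end]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected

# Invented fame values (say, mentions per million words), not real data.
fame = {year: 4.0 for year in range(1930, 1933)}
fame.update({year: 0.5 for year in range(1933, 1946)})
fame.update({year: 6.0 for year in range(1946, 1949)})

index = suppression_index(fame, 1933, 1945)
print(round(index, 3))
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 0.5, so the index comes out well below one, the pattern associated with suppression. A value well above one would correspond to the propaganda case.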
JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.
JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the Science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.
ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.
(Laughter)
JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there's newspapers, there's things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.
Thank you very much.
(Applause)