Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
Jean-Baptiste Michel: So how did we get to this conclusion? Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books have been written over the years. So we were thinking, well, the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.
(Applause)
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google that had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books at the click of a button. That's very practical and extremely awesome.
ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where it was published, who the author was, and when it was published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome, a text which, when written out, would stretch from here to the Moon and back 10 times over: a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."
(Laughter)
JM: Now of course, we were thinking, well, let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. Five million books means five million authors, and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)
Now again, we kind of caved in, and we took the very practical approach, which was a bit less awesome. We said, well, instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular phrase was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
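The counting described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline the speakers used: the helper names `ngrams` and `yearly_counts`, the whitespace tokenization, and the three toy "books" are all made up for the example.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return every run of n consecutive words in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_counts(books, n):
    """Build {ngram: Counter({year: count})} from (year, text) pairs."""
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Hypothetical toy corpus: (publication year, text)
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and a gleam of happiness again"),
    (1803, "no joy was recorded that year"),
]

table = yearly_counts(books, 4)
# One row of the table: the time series for a single four-gram.
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` plays the role of one line in the big table: an n-gram together with its count for every year.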
ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well, the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well, which one should I use? How to know?
As of about six months ago, the state of the art in this field was that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well, most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is the year-by-year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.
(Laughter)
(Applause)
JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the times when big flu epidemics were known to be killing people around the globe.
ELA: If you were not yet convinced, sea levels are rising, and so are atmospheric CO2 and global temperature.
JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.
(Laughter)
ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.
(Laughter)
And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well, how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s. You're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. Scientists also tend to get famous when they're much older. For instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.
(Laughter)
ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is that he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact that Marc Chagall was a Jewish artist in Nazi Germany.
Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
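As a rough sketch of the calculation just described, the index can be written as observed fame divided by expected fame. The talk does not give the exact formula, so the details here are assumptions: the function name `suppression_index`, the use of simple means over each period, and the made-up frequency values.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def suppression_index(before, during, after):
    """Observed fame in a period divided by the fame expected for it.

    Expected fame is assumed to be the average of the mean fame before
    and the mean fame after the period, as described in the talk.
    """
    expected = (mean(before) + mean(after)) / 2
    if expected == 0:
        raise ValueError("expected fame is zero; the index is undefined")
    return mean(during) / expected


# Made-up yearly mention frequencies for one person (arbitrary units).
before = [3.0, 4.0, 5.0]   # years leading up to the period
during = [1.0, 1.0]        # the period under study
after = [5.0, 4.0, 3.0]    # years following the period

index = suppression_index(before, during, after)
print(index)  # 0.25
```

An index well below one, as in this toy case, is the pattern the speakers associate with suppression; an index well above one is the pattern they associate with propaganda.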
JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here is the suppression index for 5,000 people picked from English books, where there's no known suppression. It looks like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany: very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about 10 times less than they should have been. But then there are also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection and analysis to the study of human culture. Here, instead of through the lens of a genome, it's through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat, the two weeks before our paper came out, they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately, and also browse examples of all the various books in which your n-gram appears.
JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the basic standards of the sciences.
ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating, except, oddly, in the early 80s. We think that might have something to do with Reagan.
(Laughter)
JM: There are many uses for this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there are manuscripts, there are newspapers, there are things that are not text, like art and paintings. These will all end up on our computers, on computers across the world. And when that happens, that will transform the way we understand our past, our present and human culture.
Thank you very much.
(Applause)