Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

(Applause)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."

(Laughter)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
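As a rough illustration of the table described here, the sketch below counts four-grams per publication year in a toy corpus. The function names, the `(year, text)` input layout and the three sample sentences are all invented for this example; they are assumptions, not the format or pipeline behind the released data.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield each run of n consecutive words in text as a tuple."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def yearly_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    books is an iterable of (year, text) pairs (a hypothetical layout).
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Toy corpus, invented for illustration only.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1801, "there was a gleam of happiness in it"),
    (1802, "not a gleam of happiness remained"),
]

table = yearly_counts(books, 4)
series = table[("a", "gleam", "of", "happiness")]
print(sorted(series.items()))  # [(1801, 2), (1802, 1)]
```

Each key of `table` plays the role of one line in the big table: an n-gram together with its count in every year.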

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

(Laughter)

(Applause)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

(Laughter)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

(Laughter)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.

(Laughter)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
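The calculation just described can be sketched in a few lines. The version below takes the observed fame during a period and divides it by the average of the fame before and after, so a small value suggests suppression. The 10-year comparison window, the function names and the frequency values are assumptions made for this example; the talk does not specify them.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean frequency over the `window`
    years before `start` and over the `window` years after `end`.
    The window length is an assumption of this sketch.
    """
    before = [freq_by_year[y] for y in range(start - window, start)]
    during = [freq_by_year[y] for y in range(start, end + 1)]
    after = [freq_by_year[y] for y in range(end + 1, end + 1 + window)]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Invented frequencies (arbitrary units) for a hypothetical name.
freq = {}
for year in range(1923, 1933):
    freq[year] = 4.0   # before the period
for year in range(1933, 1946):
    freq[year] = 1.0   # during the period
for year in range(1946, 1956):
    freq[year] = 6.0   # after the period

print(suppression_index(freq, 1933, 1945))  # 0.2
```

Here the expected fame is (4.0 + 6.0) / 2 = 5.0 and the observed fame is 1.0, so the index is 0.2: the name appears one fifth as often as expected.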

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection and analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.
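One way to see the "best" versus "beft" effect is to add the count of a word to the count of its likely misreading. The sketch below assumes, as a simplification, that every "s" except a word-final one was printed as a long s and read as "f". The function names and the counts are invented for illustration, and this is not how the Ngram Viewer itself handles the problem.

```python
def long_s_misreading(word):
    """Return word with every non-final 's' replaced by 'f'.

    Simplifying assumption: a long s was printed everywhere except at
    the end of a word, and was always misread as 'f'.
    """
    if not word:
        return word
    return word[:-1].replace("s", "f") + word[-1]


def combined_count(counts, word):
    """Add the count of word to the count of its long-s misreading."""
    variant = long_s_misreading(word)
    total = counts.get(word, 0)
    if variant != word:
        total += counts.get(variant, 0)
    return total


# Invented counts for a single hypothetical year.
counts_1750 = {"best": 12, "beft": 340, "left": 95}

print(long_s_misreading("best"))            # beft
print(combined_count(counts_1750, "best"))  # 352
```

The merge is only safe when the misread form is not itself a real word. "sat" would collide with "fat", for example, which is one more reason to interpret these graphs carefully.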

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.

(Laughter)

JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there are manuscripts, there are newspapers, there are things that are not text, like art and paintings. These will all end up on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

Thank you very much.

(Applause)
