Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

(Applause)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."

(Laughter)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
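
To make the counting concrete, here is a minimal Python sketch of how a per-year table for one n-gram could be built. It is an illustration only, not the pipeline used for the Google Books data: the toy `books` list, the function names `ngrams` and `ngram_time_series`, and the whitespace tokenization are all assumptions, and it reports raw counts rather than frequencies normalized by the number of words printed each year.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_time_series(books, n):
    """Map each n-gram to a Counter of {year: occurrences}.

    `books` is an iterable of (year, text) pairs (hypothetical input format).
    """
    series = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive tokenization, an assumption
        for gram in ngrams(tokens, n):
            series[" ".join(gram)][year] += 1
    return series


# Hypothetical toy corpus standing in for the digitized books.
books = [
    (1801, "A gleam of happiness crossed her face"),
    (1802, "a gleam of happiness and a gleam of happiness again"),
    (1803, "no such phrase here"),
]

series = ngram_time_series(books, 4)
counts = series["a gleam of happiness"]
for year in (1801, 1802, 1803):
    print(year, counts[year])
```

Each key of `series` corresponds to one row of the big table: one n-gram and its counts year by year.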

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

(Laughter)

(Applause)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

(Laughter)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

(Laughter)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.

(Laughter)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
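
The ratio described here can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' actual computation: the function name `suppression_index`, the 10-year `window` used for "before" and "after", and the toy frequencies are all hypothetical. The index is the observed fame divided by the expected fame, so that a small value points to suppression.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean frequency in the `window`
    years before `start` and in the `window` years after `end`.
    The window length is an assumption made for this sketch.
    """
    before = [freq_by_year[y] for y in range(start - window, start)]
    during = [freq_by_year[y] for y in range(start, end + 1)]
    after = [freq_by_year[y] for y in range(end + 1, end + 1 + window)]
    expected = (mean(before) + mean(after)) / 2
    observed = mean(during)
    return observed / expected


# Hypothetical yearly frequencies for one name (made-up numbers).
freq = {}
for year in range(1923, 1933):
    freq[year] = 4.0
for year in range(1933, 1946):
    freq[year] = 0.5
for year in range(1946, 1956):
    freq[year] = 6.0

index = suppression_index(freq, 1933, 1945)
print(index)
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 0.5, so the index comes out far below one. A name whose frequency stays flat across all three periods would score exactly one.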

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about 10 times less than they should have been. But then there are also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the basic standards of the sciences.

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.

(Laughter)

JM: There are many uses of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there are manuscripts, there are newspapers, there are things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

Thank you very much.

(Applause)
