Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
Erez Lieberman Aiden:大家都知道 一張圖勝過千言萬語 但我們在哈佛時 卻在思考這道理是否真是如此 (笑聲) 所以我們由來自哈佛大學 麻省理工學院 美國傳統英語詞典,大英百科全書 甚至我們偉大的贊助商─Google的專家們 組成一個團隊 我們花了四年的時間 在思考這個問題 然後我們得到了一個驚人的結論 女士先生們,一張圖片其實不只勝過千言萬語 事實上,我們發現某些圖片 更是勝過五千億個字
Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.
Jean-Baptiste Michel:我們是如何得出這項結論的呢? Erez和我思考了不同的方式 想更加了解人類文化 以及人類歷史從古到今的變化的全景 事實上,多年來已經出版了許多書籍。 所以我們認為最好的學習方式 就是將這上百萬的書全讀過一遍 如果能有一個尺規來說明此舉的驚人程度 這將會相當驚人 但問題是這裡的X軸 是表示實用程度 這相當不實用
(Applause)
(掌聲)
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.
現在人們希望用別的方式 可以讀少一點書,但讀得非常仔細 這會相當實用,但這一點都不吸引人 我們真正想做的是 要用一種吸引人且實用的方法來閱讀這些書 所以在河的對岸有間公司叫做Google 他們幾年之前開始了一項數字化計畫 這項計畫讓我們能實踐剛說的方法 他們已將數百萬本書給數位化 這意味著,我們可以透過電腦 簡單按個按鈕就能閱讀所有的書 這非常實用而且相當棒
ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
ELA:讓我為各位介紹這些書都來自何方 自古以來,有非常多作家 這些作家一直努力寫作 但現在寫作變得相當容易 這歸功於幾世紀前印刷術的革新 自那時起作家們 能在一億兩千九百萬個不同的地方 出版書籍 如果那些書沒有因為時代交替而遺失 那麼那些書可能在某個圖書館的一處 有相當多書可以從圖書館中被借閱 由Google將其數位化 迄今Google已經掃描了一千五百萬本書
Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."
Google將一本書數位化,並以優良的型式呈現 現在我們有了這些數據,加上這些詮釋資料 我們有了相關的資訊,比如出版地區, 作者,出版時間 我們所做的就是透過這些記錄 並剔除不是最精華的資料 我們後來得到的是 五百萬本書 五千億個詞 這是一串比人類基因組 還要長上一千倍的字符 如果寫成文章 將會是從這裡到月球來回距離 的十倍以上 這是我們文化基因名副其實的的一部分 當然當我們面臨 如此誇張的情況時 (笑聲) 我們也跟每一位有自尊心的研究人員一樣 會做相同的事 我們也和四格漫畫一樣 我們決定「等等 我們要用科學的方式來處理。」
(Laughter)
(笑聲)
JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)
JM:當然,我們在思考 首先我們先把資料提取出來 讓其他人以科學的方式去分析 現在我們在思考,我們能發行何種數據? 當然,我們想拿這些書 將這五百萬本書的內容全部釋出 現在Google,特別是Jon Orwant 告訴我們一個我們該注意的小方程式 我們有五百萬本書,也就是有五百萬名作者 而五百萬名原告是一場龐大的訴訟 雖然這個過程是相當地驚人 但這還是極度的不切實際 (笑聲)
Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
然後,我們似乎有點妥協 我們試了比較實際的方式,這方法不怎麼吸引人 我們認為,與其釋出全部的書籍資料 我們選擇將這些書的數據資料給呈現出來 舉個例子「幸福的光」 這是四個字,我們稱做「四字詞」 我們要告訴各位一個特定的四字詞 從1801,1802,1803年開始出現在書本裡 直到2008年 這給我們一個時間軸來了解 這些特定的字句從過去到現在的使用頻率 我們計算了所有出現在這些書中的字詞 彙整出的資料畫出了二十億條曲線 這告訴了我們文化是如何改變的
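The counting step described above can be sketched in a few lines. This is a minimal illustration, not the pipeline behind the released data set: the `(year, text)` input format, the function names `ngrams` and `count_by_year`, and the three-book corpus are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def count_by_year(books, n):
    """Map each n-gram to a {year: count} table.

    books is an iterable of (year, text) pairs; this toy input format is
    an assumption made for illustration.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Tiny made-up corpus, for illustration only.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
    (1803, "no such phrase here"),
]

table = count_by_year(books, 4)

# One row of the table: the yearly counts for a single four-gram.
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` plays the role of one line in the big table from the talk, and the list printed at the end is the time series for one four-gram.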
ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?
ELA:這二十億條曲線 我們稱為二十億組詞 這告訴了我們 每一組詞代表了不同的文化趨勢 讓我舉個例子 假設我做了件不得了的事 明天我要告訴你是多不得了 我可能會說「"Yesterday, I throve."」 或者,我也可以說「"Yesterday, I thrived."」 但我應該說哪一種呢? 要怎麼知道
As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.
大概在六個月前 要知道這一領域最尖端的方法 你可能得要去詢問 一位有著時髦髮型的心理學家 你可能會問 「史蒂夫,你是不規則動詞的專家。 我該怎麼說呢?」 而他會告訴你「嗯,大部分的人會說"thrive" 但有些人會說"throve"。」 而你也或多或少知道 如果我們回到兩百年前 去問一位同樣也有時髦髮型的政治家 (笑聲) 「湯姆,我應該怎麼說呢?」 他說「嗯,在我的年代,大部份的人說"throve", 但少部分的人說"thrived"」 現在我要向各位展示原始數據 這二十億條目資料中的其中兩條數據 各位將會看到的是"thrived"和"throve"兩個字 在各年時期的出現頻率 這只是二十億筆資料中 其中兩個詞條的資訊 這全部的數據資料 將會比此張投影片還要驚人億萬倍
(Laughter)
(笑聲)
(Applause)
(掌聲)
JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.
JM:還有其他圖片也具有五千億字的價值 例如這張 如果談到感冒 從這幾個高峰點我們可以知道 感冒病毒的大流行在全球造成人類死亡
ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.
ELA:如果各位還不太相信 其他像是海平面升高 大氣中的二氧化碳和全球暖化
JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.
JM:你也許會想看看這組特別的詞組 「告訴尼采,上帝還沒死」 也許你可能還會認為,他可能需要一個更好的公關
(Laughter)
(笑聲)
ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.
ELA:從這當中,各位也能獲得一些相當抽象的概念 例如,讓我跟各位說說 有關「1950年」的歷史 幾乎在絕大多數的歷史裡 沒有特別談論1950這一年 在1700年,在1800年,1900年 沒有人在乎 甚至到30年代和40年代 也沒有人在談論 突然到了40年代中期 開始出現了風潮 人們意識到1950年就要來臨 這是件大事 (笑聲) 但也沒有因此讓大眾對該年份產生興趣 像是「那1950年」 (笑聲) 人們開始對這一年著迷 大家無法停止談論 有關他們在1950年所做的一切 所有他們計畫要在1950年所做的事 所有他們要在1950年完成的夢想 事實上,1950年跟往後幾年相較 是相當迷人的一年 人們不停談論所有發生在 '51,'52,'53年的驚奇事件 直到1954年 有人驚覺而且意識到 1950年已經變得過時了 (笑聲) 這一切就像泡沫破滅一樣
(Laughter)
(笑聲)
And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
1950年的情況 其實就是我們數據上每一個年份的情況一樣 稍微編排一下,我們有這些精美的圖表 因為有這些不錯的圖表,我們就能計算 我們可以了解「風潮消逝的速度是多快?」 結果就是我們能很精確測量出一份數據 有了方程式,也有圖表 最終的結果就是 談論年份的風潮一年比一年 消退的更快 我們對於過去的興趣日漸消逝
JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not do is become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.
JM:這張圖是有關職業建議 對於那些想成名的人 我們可以知道二十五位最有名的政治人物 作家、演員等等 如果各位想在年輕時就成名,那麼各位應該要當演員 因為你的名氣會從二十歲後開始累積 那時正值青春年華,會相當不錯 如果各位有耐心一點,那麼就應該當個作家 因為各位就能攀上高峰 成為像是馬克吐溫這樣有名望的作家 但如果各位想攀上最頂尖的位置 就得延後滿足自己的慾望 然後當一位政治家 那麼各位會在五十歲過後開始成名 然後你的名氣會在未來持續延續 科學家也往往是在老年時才成名 而生物學家和物理學家一樣 往往也是和演員一樣著名 唯一不要做的職業就是變成數學家 (笑聲) 如果各位真要做這行 各位可能會想「太好了,當我在二十多歲時,我會盡一切努力。」 但事實上,沒人會真正去在乎你所做的事
(Laughter)
(笑聲)
ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.
ELA:在我們的資料裡 還有其他更發人省思的紀錄 例如馬克‧夏卡爾的名字出現的頻率軌跡 夏卡爾是位1887年出生的藝術家 這看起來是一位名人名字正常出現在書中的軌跡 他的名氣日益響亮 但如果看德國的數據就不是如此 如果看德國的數據,會看到某部份是非常奇怪的 這是幾乎不太可能看到的 就是他變得非常有名 卻突然在1933年至1945年間 聲勢跌落谷底 又反彈回升 當然我們看的出來 這是因為馬克‧夏卡爾是一位猶太裔藝術家 當時德國是納粹統治
Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
這些指標 事實上相當明確 我們不需要知道有人在審查書籍 我們能運用基本的信號運算方式 實際了解當時狀況 我們可以用簡單的方式來做 合理的預期是 在一段特定的時間裡某人的名氣指數 應該會是他們成名前 和成名後的指數的平均值 這大概是我們預期的結果 我們比較了我們觀察到的名人 我們將前後的數值相除 得到的數值,我們稱作抑制指數 如果抑制指數的值非常的小 那麼就表示此人也許遭受到打壓 但如果數值非常大,也許此人獲得大量的推廣
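The suppression index described above comes down to one division. The sketch below follows the talk's wording (expected fame is the average of fame before and fame after; the index is observed fame divided by expected fame). How fame itself is measured is not spelled out here, so the plain numbers passed in, the function name `suppression_index`, and the sample values are assumptions for illustration.

```python
def suppression_index(before, during, after):
    """Observed fame during a period divided by the fame expected for it.

    Expected fame is the average of the 'before' and 'after' values, as
    described in the talk. Treating fame as a single number per period
    (for example, mean mentions per year) is an assumption.
    """
    expected = (before + after) / 2
    if expected == 0:
        raise ValueError("no fame before or after; the index is undefined")
    return during / expected


# Made-up values, for illustration only.
steady = suppression_index(before=10.0, during=10.0, after=10.0)
dipped = suppression_index(before=10.0, during=1.0, after=10.0)
boosted = suppression_index(before=10.0, during=30.0, after=10.0)

print(steady, dipped, boosted)  # 1.0 0.1 3.0
```

An index near 1 means the observed fame matches the expectation, a value far below 1 is the pattern associated with suppression, and a value far above 1 is the pattern associated with propaganda.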
JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is distribution as seen in Germany -- very different, it's shifted to the left. People talked about it twice less as it should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times fewer than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
JM:各位現在可以看到 抑制指數在抽樣整體人數中的分佈情況 所以,例如這裡 -- 這個抑制指數的抽樣人數是五千人 選自出版時期沒有打壓限制的英文書籍來做調查 曲線基本上會在數值1的地方呈現高峰 基本上預期的會和觀察到的數值是相同的 這份分佈圖則是德國的部分 -- 相當不同,曲線移往左側 人們談論事物的次數比預期的少了兩倍 更重要的是,整體分佈的情況更寬廣 有相當多人是落在圖表較左側的位置 因為他們比應該被提及的次數少了十倍 但也有相當多人是落在較右側的部分 似乎是因為被大量宣傳 這張圖是明顯看出書本中具有審查制度
ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.
ELA:文化組學 是我們用的方法 這和基因組學有些類似 不過基因組學是透過生物學 基本的序列基礎來檢視人類基因組 文化組學是類似的 這是應用收集分析規模龐大的數據 來研究人類文化 不透過檢視基因組 而是檢視歷史紀錄的數位資料 文化組學的好處是 每個人都能執行 為何每個人都能做呢? 因為這三位人士 Google的Jon Orwant,Matt Gray還有Will Brockman 他們看到Ngram瀏覽器的原型 他們說「這太有趣了。」 我們要讓大家都可以使用這功能 所以在兩週的時間 -- 我們的報告出來的兩週前 -- 他們編寫了一個大眾版本的Ngram瀏覽器 各位可以打上任何各位有興趣的字或詞組 然後立即看到該字詞的頻率變化 -- 同時根據你搜尋的字詞 瀏覽不同書籍中的各種例子
JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.
JM:這功能在首日就被使用了超過一百萬次 這也是各種查詢工具中最好的一個 人們希望做到最好的,以最好的狀態像前進 但事實證明在18世紀,人們一點也不關心這一切 他們不想做到最好,他們想變成"beft" 這是怎麼回事,當然這只是個錯誤 這並不是說他們想要平凡 這只是因為"S"常被寫的不一樣,寫得像"F" 當然,Google並沒有挑出來 所以我們在自己寫科學文章中提到此事 不過這只是個提醒 雖然這相當有趣 當你要解讀這些圖表,你必須非常謹慎 而且必須採納科學的基礎標準
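One way to guard against the "beft" artifact when reading such a graph is to add the counts of a word and its f-for-s misreading together. The sketch below is an assumption-laden illustration, not what Google or the speakers did: it assumes every "s" except a word-final one may have been printed as a long s and read as "f", and the helper names and the counts are invented.

```python
def long_s_misreading(word):
    """Return the word with every non-final 's' replaced by 'f'.

    This mimics a long s being read as 'f'. Treating every non-final 's'
    as a long s is a simplification made for illustration.
    """
    if not word:
        return word
    return word[:-1].replace("s", "f") + word[-1]


def combined_count(counts, word):
    """Add the count of a word to the count of its f-for-s misreading."""
    variant = long_s_misreading(word)
    total = counts.get(word, 0)
    if variant != word:
        total += counts.get(variant, 0)
    return total


# Made-up counts for a single year, for illustration only.
counts_1750 = {"best": 3, "beft": 40}

print(long_s_misreading("best"))            # beft
print(combined_count(counts_1750, "best"))  # 43
```

The merge is deliberately crude: it would also fold a genuine word such as "fat" into "sat", which is one more reason to read these graphs with care.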
ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.
ELA:大家一直在使用這工具來滿足各種樂趣 (笑聲) 事實上,我們不需要說明的 我們原本只想播放所有的投影片然後在一旁保持沉默 此人對於挫折的歷史感興趣 挫折有非常多種方式 如果你踢到腳趾,哀叫聲「啊」就是一個"A"的"argh" 如果地球被外星人毀滅 變成星際間的通道 那麼哀叫聲「啊」就是有八個"A"的"aaaaaaaargh" 此人研究了所有書籍上出現的哀叫聲「啊」 有從一個"A"到八個"A" 結果是 較不頻繁的「啊」“arghs” 對應了那些相對較令人沮喪的的事情 也有例外,奇怪的是在80年代初 我們認為這也許是受到雷根的影響
(Laughter)
(笑聲)
JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there's newspapers, there's things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.
JM:這份書據資料有相當多用途 不過最終就是歷史紀錄都被數位化了 Google已經開始將一千五百萬本書數位化 其中百分之十二的書是已出版的 這涵蓋了相當大量的人類文化 這當中有非常多的文化資料:裡頭有手稿,報紙 也有不是文字的資料,像是藝術品和畫作 現在這都存放在我們的電腦裡 在世界各處的電腦裡 如果這一切成真,就會改變 我們了解過去、現在和人類文化的方式
Thank you very much.
非常謝謝各位
(Applause)
(掌聲)