Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

이레즈: 누구나 아는 '백문이 불여일견'이라는 말이 있습니다. 하지만 하버드에서 우리는 저 말이 참인지 거짓인지를 논하곤 했죠. (웃음) 그래서 우리는 하버트와 MIT에 걸쳐 전문가들을 모집하고 아메리칸 헤리티지 사전, 브리태니커 백과사전 그리고 심지어 우리의 자랑스런 후원, 구글까지 포괄하는 팀을 구성했습니다. 그리고 우리는 이것에 대해 약 4년 동안 깊이있게 연구했죠. 우리는 놀라운 결론에 도달했습니다. 신사 숙녀 여러분, 한 그림은 천 단어의 가치가 없습니다. [역: '일견'이 백문의 가치가 되지 않습니다.] 사실, 우리는 몇 가지 사진들의 경우 5천억 단어 정도의 가치가 있음을 발견했죠.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

미셸 : 어떻게 우리가 이 결론에 도달했을까요? 이레즈와 전, 연구 방법에 대해 생각하고 있었습니다. 어떻게 하면 인간 문화와 역사의 큰 그림을 얻을 수 있을까: 시간에 따라 변화되는 것을 포함해서 실제로 수 많은 책들은 지난 수년 동안 기록되었습니다. 그래서 우리가 그들로 부터 배울 수 있는 가장 좋은 방법은 이 수천 수만권의 책들을 다 읽는거라 생각했습니다. 물론, 저 일이 얼마나 멋진 일인지 측정할 수 있다면 저것은 매우, 아주 높은 순위가 매겨질 것입니다. 문제는, 그곳에 x축이 있다는 거죠. 실용성을 나타내는 축이죠. 이 축에서의 점수는 매우 낮습니다.

(Applause)

(박수)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

현재, 사람들은 대안으로 몇 가지 소스들을 선택해서 그것들을 주의깊게 읽어나가죠. 이 방식은 매우 실용적이지만 아주 멋지지는 않습니다. 당신이 정말하고 원하는 것은 아주 멋진 일을 아주 실용적으로 하는 거죠. 그래서 보니 강 건너에 구글이라 불리는 회사가 있더군요. 몇 년 전에 디지털화 프로젝트를 시작했었던 회사죠. 그것이 우리의 접근방식을 가능케 할수도 있겠더군요. 그들은 수백만권의 책을 디지털화 했습니다. 그것이 무슨 뜻인고 하니, 누군가 원하면 단 하나의 클릭으로 책을 한권을 훑어볼 수 있다는 뜻이죠. 아주 실용적이이며 극도로 멋진 일이죠.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

이레즈: 제가 책들이 어디서 왔는지 설명을 좀 하죠. 태고적부터, 작가는 늘 존재해 왔습니다. 이 저자들은 책을 쓰기 위해 분투해왔죠. 그 일은 점점 쉬워졋습니다. 몇 세기전의 인쇄기 발달과 함께말이죠. 그 이후로 부터는 저자들의 승리였죠. 뚜렷이 1억2천9백만번 동안 책을 출판했으니까요 역사 속에 분실되지 않았다면 해당 도서는 지금 어느 도서관 어딘가에 있는 것입니다. 그 도서의 대부분이 도서관에서 회수되어져 구글에 의해 디지털화 되고 있습니다. 현재까지 천오백만권의 도서를 스캔했습니다.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."

지금 구글이 책을 디지털화하면, 좋은 포맷으로 바꿔두죠. 이제 우리는 데이터가 있고 그에 관한 속성 정보까지 있죠. 우리에겐 그것이 어디서 출판되었고 누가 썼으며 언제 발행되었는지에 관한 정보도 있습니다. 해서, 우리가 가진 모든 자료들을 훑어서 상태가 좋지않은 데이터는 전부 제하여 추려서 남은 것이 오백만권의 책 입니다. 5천억개의 단어들, 일렬로 나열했을 경우 우리 유전자의 총체, 인간 게놈보다 천배 이상 긴 겁니다. 이 텍스트들을 모두 모아서 한 줄로 쓰면 여기서 달까지 10번 왔다갔다 할 만큼 나오죠. 진정 우리 문화 게놈의 한 조각이라 할 수 있죠. 물론 이런 말도 안되는 과장에 직면하게 되면 우리가 할 수 있는 일이라곤 (웃음) 자존감있는 연구원이라면 누구나 했을 법한 일이죠. XKCD의 한 페이지를 꺼내 들고 외치는 거죠. "뒤로 물러나. 우리는 이제 과학을 시도 할 것이야."

(Laughter)

(웃음) [역: XKCD.com 미국의 유명 웹툰. 웹사이트에서 해당 문구의 티셔츠를 판매하고 있음]

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

JM은 : 지금은 물론, 우리는 생각하고 있었죠, 물론 그냥 먼저 밖으로 데이터를 넣어 봅시다 그것을 할 과학을 하는 사람들을 위해서말이죠. 지금 우리가 생각하고, 우리는 어떤 데이터를 공개할 수 있습니까? 그럼요, 당신은 책을 취해서 이러한 오백만 도서의 전체 텍스트를 놓고 싶어합니다. 특히 이제 Google과 존 Orwant, 우리가 배워야할 방정식이 조금있다고 말했습니다. 그래서 5 백만 작가, 즉, 5 백만 달러를 가지고 그리고 5 백만 원고측은 대규모의 소송이다. 그럼, 그건 정말 굉장한 것이긴 하지만 다시말해, 그건 극히, 극히 비실용적입니다. (웃음)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.

이제 다시, 우리는 굴복한것처럼 되어서, 그리고 약간 덜 굉장하지만, 아주 실용적인 접근을 하게 되었습니다. 우리가 말하길, "글쎄, 전체 텍스트를 발표하는 대신 우리는 도서에 대한 통계를 공개할거야. 예를 들어, '행복의 광채"를 봅시다. 그것은 네 단어입니다; 우리는 4 그램이라고 부릅니다. 우리는 특정 4 그램이 1801, 1802, 1803, 2008년까지 죽 올라가서 책에 몇번이나 나타나는지 여러분께 말할겁니다. 그것은 우리에게 이 특정 문장은 시간이 지남에 따라 얼마나 자주 사용되었는지 시간 시리즈를 제공합니다. 우리가 그 도서에 나타나는 모든 단어와 구문에 대해 그렇게 하면, 그것은 우리에게 이십억 줄의 큰 테이블을 제공하는데 그것은 방식 문화가 변경되는 방법에 관해서 우리에게 알려줍니다.
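[Editor's sketch] The yearly n-gram table described above can be illustrated with a few lines of Python. This is only a minimal sketch, not the pipeline the speakers used: the `books` list, the function names, and the whitespace tokenization are all assumptions made for illustration, and the sample texts are invented.

```python
from collections import Counter

def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def yearly_counts(books, n):
    """Map each n-gram to {year: occurrences in that year's books}.

    `books` is a hypothetical list of (publication_year, text) pairs.
    """
    table = {}
    for year, text in books:
        for gram, count in Counter(ngrams(text, n)).items():
            per_year = table.setdefault(gram, {})
            per_year[year] = per_year.get(year, 0) + count
    return table

# Invented sample data, standing in for the digitized books.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
]

table = yearly_counts(books, 4)
series = table["a gleam of happiness"]  # one row of the table: a time series
print(series)
```

Each key of `table` corresponds to one line of the big table mentioned in the talk, and its value is the time series for that phrase. A real release would presumably also normalize by the total number of words printed each year, which this sketch leaves out.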

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

ELA : 그럼 그 이십억 라인, 우리는 그들 이십억 N -그램. 그들이 우리에게 뭐라고 할까요? 그럼 각각의 N - 그램은 문화동향을 측정합니다. 한가지 예를 들어 드리겠습니다. 내가 번성하고 있다고 가정해 봅시다 그러면 내일은 내가 얼마나 잘했는지 말해주고 싶어요. 그래서 난 "어제 내가 번성했어요(throve)."말할지도 모릅니다. 또 저는 "어제, 내가 번창했어요 (thrived)." 라고 할 수 도 있습니다. 글쎄, 어떤것을 사용해야 할까요? 어떻게 압니까?

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

약 6 개월 전의 시기에, 이 분야에서 예술의 상태는 예를 들어, 당신이, 멋진 머리를 가진 심리학자를 따라 올라가, 당신이 말하길, "스티브, 당신은 불규칙 동사에 관한 전문가입니다. 제가 어떻게 해야 할까요? " 그거면 그는, "글쎄요, 대부분의 사람들이 말하길 번성했다(thrive) 고 했지만, 몇몇 사람은 번창했다(throve) 라고 했어요." 그래서 여러분은 당신은 또한 다소는 만일 이백년전 이전으로 거슬러 올라가서 그리고, 똑같이 멋진 머리를 가진 다음의 정치가에게 묻는다면, (웃음) "톰, 내가 무슨 말을해야합니까?" 그는 "글쎄, 나의 세대는 대부분의 사람들이 번성했다 (throve) 라고 말했지만 몇몇사람은 번창했다 (thrive)라고 말했어요." 할겁니다. 그래서 제가 여러분에게 그냥 보여드리려고 하는것은 원래의 데이터입니다. 이십억 항목의 이 테이블에서 두 줄입니다. 여러분이 지금보고 계시는 것은 번성했다(throve)와 번창했다(thrive)의 오랜시간에 걸친 각 년도의 빈도입니다. 이제 이십억 행에서 이 두 개만 있습니다 따라서 전체 데이터 세트는 이 슬라이드보다 억 배 이상 굉장한 것입니다.

(Laughter)

(웃음)

(Applause)

(박수)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

JM : 지금 5천억 단어의 가치가 있는 다른 그림도 많이 있습니다. 예를 들어, 이것을 보세요. 인플루엔자(influenza)라는 단어만 놓고 보면, 대규모 독감 유행이 전 세계에서 사람들의 목숨을 앗아가던 시기에 정점이 나타나는 것을 볼 수 있습니다.

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

ELA : 여러분이 아직도 납득되지 않으셨다면, 해수면이 상승하고 있으며, 그래서 대기 CO2와 지구의 온도도 상승하고 있습니다.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

JM : 당신은 또한,이 특정 N - 그램을 보고싶어할지도 모르고, 그것은 니체에게 하나님이 죽은것이 아니라고 말하는 것입니다, 여러분은 니체가 더 나은 홍보가가 필요하다는데 동의할 지 모르지만요.

(Laughter)

(웃음)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

ELA : 당신은 이런 비슷한것들로 꽤 추상적인 개념을 얻을 수 있습니다. 예를 들어, 내가 여러분에게 1950년도의 역사를 알려드리겠습니다. 역사의 대부분에 대해서 그 누구도 1950에 대해 주의를 기울이지 않았습니다 1700 년, 1800 년, 1900 년에, 그 누구도 신경 쓰지 않았어요. 30년대와 40년대를 통과하며, 그 누구도 신경 쓰지 않았어요. 갑자기 40 년대 중반에 얘깃거리가 생기기 시작했습니다. 사람들은 1950 년이 일어날 것이라는것과 그게 큰일일 것이라는 것을 깨닫게 되었지요. (웃음) 그러나 아무것도 1950 년과 같이 1950년에 사람들에게 관심이있는것은 없었습니다. (웃음) 사람들은 집착해서 돌아나녔습니다 그들은 그들이 1950 년 한 모든 것에 대해, 말을 멈출수 없었습니다, 그들이 1950년에 할 준비를 하고있던 모든것들, 그들이 1950 년에 달성하고 싶어했던 모든 꿈에 대해. 사실 1950 년 정말 매혹적이어서 그 이후 년 동안 사람들은 51년, 52년, 53년에 일어난 모든 놀라운 일들에 대해 얘기를 계속했습니다. 결국 1954년에, 누군가가 잠에 깨어 일어나서는 1950은 다소 지나갔다는것을 깨달았습니다. (웃음) 그리고 그냥 그렇게, 그 거품이 터졌지요.

(Laughter)

(웃음)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.

그리고 1950 년 이야기는 우리가 기록을 보유하고 있는 매년의 이야기가 지금은 이 좋은 차트를 가지고 있기 때문에 약간 꼬여 있어요. 그리고 우리가이 멋진 차트를 가지고 있기 때문에, 우리는 물건을 측정할 수 있습니다. 우리는 "글쎄 얼마나 빨리 거품이 터질까?" 라고 말할 수도 있습니다. 그리고 그것은 우리가 매우 정확하게 측정할 수있다는 게 밝혀졌습니다. 방정식이 도출되었고, 그래프가 만들어졌고, 그리고 그 실제 결과는 우리가 그 거품이 터지는것이 각 지나가는 해와 더불어 점점 더 빨라지는것을 발견했다는 것입니다. 우리는 더 빨리 과거에 흥미를 잃어 가고있습니다.

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.

JM : 지금 경력 조언의 작은 조각. 그래서 유명한 사람이 되기를 추구하는 여러분들을 위해, 우리는 25에서 가장 유명한 정치적 인물들에게서, 저자, 배우 등등에게서 배울 수 있습니다. 당신이 빨리 유명해지고 싶다면, 당신은 배우가 되어야합니다 그리고 명성이 20대의 마지막에 상승하기 시작하기 때문에 - 여러분이 아직 어리다면, 정말 좋아요. 당신은 조금 기다릴 수있다면, 이제 당신은 저자되어야합니다 다음 아주 좋은 높이로 상승하기 때문인데, 극히 유명한 사람과 같이 말이죠. 하지만 당신이 맨 상위에 도달하려는 경우, 당신은 만족을 지연해야하고 그리고, 물론, 정치가가 되야 합니다. 그럼 여기서 당신은 당신의 50 대 말까지 유명 될 것입니다 그리고 그 이후에는 아주 유명하게 됩니다. 그래서 과학자들은 또한 훨씬 나이들었을 때 유명해지는 경향이 있습니다. 예를 들어, 생물학 및 물리학에 대한 마찬가지로 배우만큼이나 유명해지는 경향이 있습니다. 당신이 범하지 말아야 할 한가지 실수는 수학자가 되는 것입니다. (웃음) 만약 당신이 그렇게한다면, 당신은 "좋아. 아 내가 내가 20대에 있을 때 내 최고의 작업을 할거야."라고 생각할 수도 있지만 그러나 짐작해보세요, 아무도 상관하지 않습니다.

(Laughter)

(웃음)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

ELA: N-그램사이에 보다 냉정한 노트가 있습니다. 예를 들어, 여기, 1887년에 태어난 마크 샤갈의 탄도가 있습니다. 그리고 이것은 유명한 사람의 정상적인 궤도 같습니다. 그는 점점 더 유명해집니다, 독일어로 여러분이 보는 경우를 제외하고는요. 당신이 독일어로 보면, 당신은 완전히 이상한 무언가를 봅니다, 당신은 거의 못 볼 것을말이죠, 그것은 그가 극도로 유명하게되고 그리고 갑자기 곤두박질을 하는것입니다, 1933과 1945년 사이의 최하점을 겪으면서, 그 이후 복귀하기 전에요. 그리고 물론, 우리가 보는것은 사실 마크 샤갈은 나치 독일에서의 유대인 예술가였다는 사실입니다.

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.

지금 이러한 신호들은 실제로 대단히 강해서 우리는 누군가가 검열 받았는지 알 필요가 없습니다. 우리는 실제로 기본적인 신호 처리를 사용해서 실제로 그것을 알아낼 수 있습니다. 여기 그것을하는 간단한 방법이 있습니다. 음, 합리적인 기대는 주어진 시간안에 누군가의 명성은 대략 그들의 명성의 이전과 이후의 평균으로 되어야 합니다. 그래서 그것은 우리가 기대하는 어떤것입니다. 그리고 우리는 우리가 관찰하는 명성에 그것을 비교합니다. 그리고 우리는 다른 것을 1로 나누어서 우리가 억제 지수라고 부르는 무언가를 생산합니다. 만일 그 억제 지수가 매우, 매우, 매우 작으면, 그다음에 당신은 잘 억압될 수도 있습니다. 만일 그것이 매우 크면, 아마 당신이 선전에서 혜택을 받는것일겁니다.
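[Editor's sketch] The suppression index described above can be written out as a short calculation: observed fame during a period divided by the expected fame, where the expectation is the average of fame before and fame after. This is a minimal sketch under stated assumptions. The function names, the `window` parameter, and the frequency values are invented for illustration; the talk does not give the exact windows or normalization used in the published analysis.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def suppression_index(freq_by_year, start, end, window):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of mean fame in the `window` years
    before `start` and mean fame in the `window` years after `end`.
    `freq_by_year` is a hypothetical {year: frequency} mapping.
    """
    before = [f for y, f in freq_by_year.items() if start - window <= y < start]
    during = [f for y, f in freq_by_year.items() if start <= y <= end]
    after = [f for y, f in freq_by_year.items() if end < y <= end + window]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected

# Invented frequencies for one name: a dip between 1933 and 1945.
freq = {year: 4.0 for year in range(1930, 1933)}
freq.update({year: 0.5 for year in range(1933, 1946)})
freq.update({year: 6.0 for year in range(1946, 1949)})

index = suppression_index(freq, start=1933, end=1945, window=3)
print(round(index, 3))
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 0.5, so the index comes out well below one, the pattern the talk associates with suppression. A value well above one would correspond to the "benefiting from propaganda" case.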

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is distribution as seen in Germany -- very different, it's shifted to the left. People talked about it twice less as it should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times fewer than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.

JM이 : 이제 여러분은 전체 인구에 대한 억제 지수의 분포를 실제로 볼 수 있습니다. 따라서 예를 들어, 여기에 - 이 억제 지수는 알려진 억압이 없는 곳에서 영어로 쓰여진 도서를 고른 5,000 명에 대한 것인데- 그것은 기본적으로 긴밀하게 하나를 중심으로 한 이것과 같은 것입니다. 예상할 수 있는것은 기본적으로 여러분이 관찰하는 것입니다. 독일에서 보여진것과 같이 이 배포는 - 매우 다릅니다, 그것은 왼쪽으로 이동되어 있지요. 사람들은 그것이 해 졌어야만 할 것보다 두 번 이하로 얘기했습니다. 그러나 더 중요하게, 그 배포는 훨씬 더 넓다는 것입니다. 이 배포판에서 맨 왼쪽에 결국 많은 사람들은 그들이 있었어야 할 것보다 10 배 이하로 얘기한 사람들입니다. 하지만 그다음에는 선전의 혜택을 받은것처럼 보이는 맨 오른쪽에도 많은 사람들이있습니다. 이 사진은 책에 기록에 검열의 특징이다.

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

ELA : 그래서 우리는 이 방법을 컬쳐로믹스라고 부릅니다. 그것은 같은 게놈의 일종 이죠. 게노믹스가 인간 게놈에있는 기반의 순서의 창문을 통한 생물학에서는 렌즈라는것을 제외하고는 말입니다. 컬쳐로믹스는 비슷합니다. 그것은 인간 문화의 연구에 거대한 규모의 데이터 수집 분석 응용 프로그램입니다. 여기에서는, 게놈의 렌즈를 통하는것을 대신해서, 역사 기록의 디지털화된 조각의 렌즈를 통합니다. 컬쳐로믹스에 대한 굉장한 점은 모든 사람이 그것을 할 수 있다는 것 입니다. 왜 다들 그것을 할 수 있을까요? 누구나 할 수 있기 때문에 세 남자, 존 오르완트, 매트 그레이와 윌 브록만이 구글에서 N 그램의 뷰어의 프로토 타입을 보고, 그리고 그들이 말하기를, "이건 정말 재미있네. 우리는 사람들이 이걸 사용할 수 있도록해야하겠는걸 "이라고 말했습니다. 그래서 2 주를 쫙 깔아서-- 우리 신문이 나온 두 주 전에 --- 그들은 일반 대중을 위한 N그램 뷰어의 버전을 코드화 했습니다 . 그래서 당신도 당신이 관심이 있는 어떤 단어 또는 구절이든지 타이프칠 수 있고 그 즉시 N 그램을 볼 수 있고 - 또한 여러분의 N그램에 나타나는 다양한 도서의 사례를 탐색할 수 있습니다.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.

JM : 이제 이것은 첫날에 백만 번 이상 사용되었고, 이것은 정말 모든 질문중 최고입니다. 그래서 사람들은 앞으로 최선의 발차취로 그 자신들의 최고가 되고 싶어합니다. 하지만 18 세기에 밝혀졌듯이, 사람들은 전혀 신경 쓰지 않았습니다. 그들은 그들의 최고가 되고 싶지 않아했습니다, 그들은 그들의 방어인들이 되고 싶어했어요. 그래서 무슨 일이 일어났는가 하면, 이건 실수입니다. 이것은, 평범을위한 투지가 아니에요 그것은 S가 F 비슷하게 다르게 쓰여지곤 했다는 것입니다. 지금은 물론, 구글은 당시에 이것을 알아차리지 못했습니다, 그래서 우리는 우리가 쓴 과학 기사에서 이것을 보도했습니다. 그러나 그것은 이것이 단지 이것이 아주 재미있지만, 여러분이 이 그래프를 해석할 때, 여러분이 매우 신중해야 한다는 것을, 그리고 과학에서 기본 표준을 채택해야만 한다는 것을 상기시켜주는 것입니다.

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.

ELA : 사람들은 재미 목적인 종류에 이것을 사용하고 있습니다. (웃음) 사실, 우리는 얘기를 할 수 없어야만 하는 않을 것입니다, 우리는 당신에게 모든 슬라이드를 보여하고 조용히 있을겁니다. 이 사람은 좌절의 역사에 관심이 있었습니다. 다양한 종류의 좌절이 있었습니다. 만일 여러분이 여러분의 발가락을 찌른다면, 그것은 하나의 A "argh."입니다. 만일 지구가 성간 우회를 위한 공간을 마련하기 위한, 보곤에 의해 전멸당하게 되면, 그것은 여덟개의 A "argh" 입니다. 이 사람은 모든 "argh" 를 하나에서부터 8 A를 통해서 공부합니다. 그리고 그것은 그 "arghs" 가 덜 빈번하게 나올때, 물론, 이것들에 해당하는 것들은 더 어렵게됩니다-- 이상하게도 초기 80 년대에서를 제외하고는요. 우리는 레이건과 뭔가 관련이 있을지 모른다고 생각합니다.

(Laughter)

(웃음)

JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there are newspapers, there's things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

JM :이 데이터의 여러 용도가 있습니다, 하지만 요점은 역사적 기록이 디지털화 되고 있다는 점입니다. Google은 천오백만권의 책을 디지털화하기 시작했습니다. 그것은 사상 출판된 모든 책들의 12 % 입니다. 그것은 인간 문화의 상당한 부분입니다. 문화에는 훨씬 더 있습니다: 거기에는 원고, 신문이 있고, 예술과 그림과 같은, 텍스트가 아닌 것들이 있습니다. 이것들은 모두 우리의 컴퓨터위에서 일어났습니다, 전세계에 걸쳐 컴퓨터위에서. 그리고 그것이 일어나는 때면, 우리가 우리의 과거, 현재, 그리고 미래를 이해하는 우리의 과거, 현재 우리의 인간 문화를 이해합니다.

Thank you very much.

정말 감사합니다.

(Applause)

(박수)
