Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

(Applause)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books at the click of a button. That's very practical and extremely awesome.

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."

(Laughter)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
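
As a rough illustration of the counting described here, the following Python sketch builds a tiny table of n-gram counts per year. The three-book corpus is invented, and the names `ngrams` and `yearly_counts` are assumptions made for this example; it is not the pipeline that produced the real two-billion-line table.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def yearly_counts(books, n):
    """Map each n-gram to a Counter of {year: occurrences}."""
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Toy corpus of (publication year, text) pairs, invented for illustration.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and then a gleam of happiness again"),
    (1803, "no happiness was to be found"),
]

table = yearly_counts(books, 4)

# One row of the table: the time series for a single four-gram.
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` plays the role of one line in the big table: a phrase together with its count for every year.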

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

As of about six months ago, the state of the art in this field was that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

(Laughter)

(Applause)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the times when you know big flu epidemics were killing people around the globe.

ELA: If you were not yet convinced, sea levels are rising, and so are atmospheric CO2 and global temperature.

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

(Laughter)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

(Laughter)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
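
The talk does not spell out the equations, so the following Python sketch only shows one plausible way to put a number on how fast the bubble bursts: count the years from the peak until mentions fall to half the peak value or less. The half-life measure, the function name `half_life`, and the toy frequencies are all assumptions made for illustration.

```python
def half_life(series):
    """Years from the peak until frequency first drops to half the peak or less.

    series maps year -> frequency of mentions. Returns None if the
    frequency never falls that far within the data.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None


# Invented frequencies for mentions of "1950"; not real n-gram data.
mentions_1950 = {
    1948: 2.0, 1949: 5.0, 1950: 10.0, 1951: 9.0,
    1952: 8.0, 1953: 7.0, 1954: 4.0, 1955: 3.0,
}

print(half_life(mentions_1950))  # 4
```

Under this assumed measure, computing the same number for every year on record and watching it shrink would correspond to the bubble bursting faster and faster.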

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.

(Laughter)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
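
The recipe just described can be sketched in a few lines of Python. This is a minimal illustration, assuming yearly fame is stored as a dict of year to frequency and that "before" and "after" each mean a ten-year window; the window size, the function names and the toy numbers are assumptions, not details given in the talk.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame during [start, end] divided by expected fame.

    Expected fame is the average of mean fame in the `window` years
    before `start` and mean fame in the `window` years after `end`.
    """
    before = [f for y, f in freq_by_year.items() if start - window <= y < start]
    during = [f for y, f in freq_by_year.items() if start <= y <= end]
    after = [f for y, f in freq_by_year.items() if end < y <= end + window]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Invented fame curve: steady at 4.0, dropping to 0.5 from 1933 to 1945.
fame = {year: 4.0 for year in range(1923, 1956)}
for year in range(1933, 1946):
    fame[year] = 0.5

index = suppression_index(fame, 1933, 1945)
print(index)  # 0.125
```

An index near 1 means observed fame matches expectation, a value well below 1 (as in this toy curve) is the suppression signature, and a value well above 1 would point to the propaganda case.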

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the basic standards of the sciences.

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.

(Laughter)

JM: There are many uses of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there are manuscripts, there are newspapers, there are things that are not text, like art and paintings. These will all end up on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

Thank you very much.

(Applause)
