Jean-Baptiste Michel + Erez Lieberman Aiden: What we learned from 5 million books

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Էրեզ Լիբերման Էյդն.«Բոլորը գիտեն, որ մի նկարը հազարավոր բառեր արժե: Բայց մենք Հարվարդում կասկածում էինք, արդյոք դա ճիշտ է: (Ծիծաղ) Այդ իսկ պատճառով մենք հավաքեցինք մի խումբ փորձագետների` Հարվարդի համալսարանից և Մասաչուսեթսի տեխնոլոգիական ինստիտուտից, Ամերիկյան Ժառանգություն Բառարանի և Բրիտանիկա հանրագիտարանի անձնակազմից և նույնիսկ մեր հպարտ հովանավորներից` Google-ին: Մենք այս մասին մտածել ենք ավելի քան 4 տարի: Եվ հանգեցինք ապշեցուցիչ մի եզրակացության: Տիկնայք և պարոնայք, մի նկարը հազարավոր բառ չարժե: Իրականում, մենք գտանք որոշ նկարներ, որոնք 500 միլիարդ բառ արժեն:

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.

Ժան-Բապտիստա Միշել: Ինչպե՞ս մենք եկանք այս եզրակացության: Էրեզը և ես մտածում էինք այն ուղիների մասին, թե ինչպես կարող ենք գտնել մի ընդհանուր պատկեր մարդկության մշակույթի և պատմության մասին` փոփխված ժամանակի ընթացքում: Տարիներ շարունակ շատ գրքեր են գրվել: Եվ մենք կարծում էինք, որ դրանց ուսումնասիրելու ամենալավ եղանակը այդ միլիոնավոր գրքերը կարդալն է: Իհարկե, եթե լիներ այդ հրաշքը գնահատելու սանդղակ, այն չափազանց արագ, չափազանց բարձր աճ կունենար: Այժմ խնդիրն այն է, որ դրա համար ունենք X-երի առանցքը, որը պրակտիկայի առանցք է: Սա շատ, շատ ցածր է:

(Applause)

(Ծափահարություններ)

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.

Այսօր մարդիկ հակված են օգտագործել այլընտրանքային մոտեցում, այն է, վերցնել մի քանի աղբյուրներ և շատ ուշադիր կարդալ դրանք: Սա չափազանց գործնական է, բայց ոչ այդքան ապշեցուցիչ: Այն, ինչ իրականում ցանկանում եք` հասնել այս գործընթացի ոչ միայն գործնական, այլ նաև ապշեցուցիչ մասին: Փաստորեն, պարզ է դառնում, որ գետի մյուս ափին Google անունով մի ընկերություն կա, որ դեռ մի քանի տարի առաջ էր սկսել թվայնացման ծրագիրը, որը պարզապես հնարավորություն է տալիս անել դա: Նրանք թվայնացրեցին միլիոնավոր գրքեր: Սա նշանակում է, որ կարելի էր օգտագործել հաշվարկման մեթոդներ` բոլոր գրքերը կարդալու համար` կոճակի մի սեղմումով: Սա իրոք շատ պրակտիկ է և չափազանց ապշեցուցիչ:

ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Էրեզ: Թույլ տվեք ձեզ մի փոքր պատմեմ գրքերի ստեղծման մասին: Անհիշելի ժամանակներից կային գրողներ: Այս գրողները ձգտում էին գրքեր գրել: Սա Էապես ավելի հեշտացավ մի քանի դար առաջ տպագրահաստոցի առաջացումից հետո: Դրանից հետո գրողները հաղթեցին. 129 միլիոն գրքերի հրատարակման դեպք գրանցվեց: Եվ եթե այդ գրքերը պատմության մեջ չեն կորել, ուրեմն դրանք գրադարաններում ինչ-որ տեղ են պահվում, այս գրքերից շատերը գրադարաններից ետ վերցվեցին և թվայնացվեցին Google-ի կողմից, որն այսօրվա դրությամբ 15 միլիոն գիրք է սկանավորել:

Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."

Երբ Google-ը թվայնացնում է գրքերը, դրանք իսկապես լավ ձևաչափով են դասակարգվում: Հիմա մենք ունենք տվյալների բազա, դրան գումարած նաև մեթատվյալների բազա: Մենք գիտենք, թե որտեղ են գրքերը հրատարակվել, ով է հեղինակը, երբ է այն հրատարակվել: Եվ մենք ուսումնասիրեցինք այդ բոլոր գրառումները` բացառելով բոլոր ոչ բարձրորակ տվյալները: Այն ինչ մեզ մնաց հինգ միլիոն գրքերի հավաքածուն է, 500 մլրդ բառ, հազար անգամ ավելի շատ տարրերով, քան մարդու գեներում են, եթե գրի առնենք այս տեքստը, ապա այն կունենա դեպի լուսին և ետ ճանապարհի երկարությունը բազմապատկած 10 անգամ, մշակութային գենի իրական մասնիկ: Իհարկե այն, ինչ մենք արեցինք, երբ դեմ առ դեմ կանգնեցին հիպերբոլայի առջև ... (Ծիծաղ) այն էր, ինչ կաներ յուրաքանչյուր իրեն հարգող գիտնական: Մենք վերցրեցինք XKCD-ից մի էջ և ասացինք. «Ետ քաշվեք: Մենք պատրաստվում ենք գիտությամբ զբաղվել»:

(Laughter)

(Ծիծաղ)

JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)

ԺՄ: Իհարկե մենք մտածում էինք, որ առաջնային նպատակը տվյալները հասանելի դարձնել այն մարդկանց համար, ով գիտությամբ է զբաղվում: Հիմա մենք մտածում ենք, թե ո՞ր տվյալները կարող ենք թողարկել: Իհարկե, ցանկություն է առաջանում վերցնել և միանգամից թողարկել այդ 5 միլիոն գրքերի ամբողջական տեքստերը: Google-ը, մասնավորապես Ջոն Օրվանթը, մեզ մի փոքրիկ հավասարում սովորեցրեց: Այսպիով, դուք ունեք 5 միլիոն, այսինքն` 5 միլիոն գրող, իսկ հինգ միլիոն հայցվորները հավասար են զանգվածային դատական գործի: Այսպես, թեև դա իրոք չափազանց ապշեցուցիչ է, մեկ է, այն չափազանց, ծայրահեղ ոչ պրակտիկ է: (Ծիծաղ)

Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.

Կարծես մենք զիջում ենք, և գործին շատ գործնական մոտեցում ցուցաբերեցինք, չնայած մի փոքր պակաս ապշեցուցիչ կերպով: Մենք ասացինք, որ ամբողջական տեքստը հրապարակելու փոխարեն, մենք կհրապարակենք գրքերի մասին վիճակագրությունը: Վերցնենք օրինակ, «A gleam of happiness»-ը: այս բառերը մենք անվանում ենք 4-գրամ Մենք պատրաստվում ենք ձեզ ցույց տալ, թե քանի անգամ է այս 4-գրամը հայտնվել 1801, 1802, 1803 թթ. գրքերում, և այսպես մինչև 2008 թ.: Սա մեզ կտա ժամանակային շարքերի հաճախականությունը, թե տվյալ նախադասությունը քանի անգամ է կրկնվել ժամանակի ընթացքում: Մենք դա արեցին այն բոլոր բառերի և բառակապակցությունների հետ, որ կային այդ գրքերում և դա մեզ տվեց 2 միլիարդ տողանի մի մեծ աղյուսակ, որոնք մեզ հուշում են, թե ինչպիսի փոփոխությունների է ենթարկվել մշակույթը:
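The table described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline the speakers used: the toy corpus, the whitespace tokenizer, and the names `ngrams` and `ngram_time_series` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return every run of n consecutive words in text (naive whitespace split)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_time_series(books, n):
    """Map each n-gram to a Counter of {publication year: occurrences}.

    books is an iterable of (year, text) pairs; this is a hypothetical
    stand-in for the digitized books plus their metadata.
    """
    series = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text.lower(), n):
            series[gram][year] += 1
    return series


# Toy corpus, invented for illustration only.
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and then a gleam of happiness again"),
    (1803, "no joy was recorded that year"),
]

series = ngram_time_series(toy_books, 4)
counts = series["a gleam of happiness"]
# One row of the table: the yearly counts for a single four-gram.
row = [(year, counts[year]) for year in (1801, 1802, 1803)]
print(row)
```

Each key of `series` plays the role of one line in the big table: a phrase together with its year-by-year counts.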

ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?

ԷԼԷ: Այսպես, այդ 2 միլիարդ տողերին մենք անվանում ենք n-գրամ: Ի՞նչ են դրանք մեզ ասում: Առանձին n-գրամերը չափում են մշակութային տենդենցները: Թույլ տվեք բերեմ հետևյալ օրինակը: Ենթադրենք, ես հարստացել եմ, իսկ վաղը ուզում եմ ձեզ ասել իմ կարգավիճակի մասին: Այսպիսով, ես պետք է ասեմ. «Երեկ ես բարգավաճեցի (throve)»: Այլ կերպ ես կարող եմ ասել. «Երեկ ես բարգավաճեցի (thrived)»: Դե, ո՞ր մեկը պետք է օգտագործեմ: Ինչպե՞ս պարզել դա:

As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter) "Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.

Ավելի քան վեց ամիս առաջ այս ոլորտում արվեստի կարգավիճակը այնպիսին էր, որ կարելի էր, օրինակ, մոտենալ այս գեղեցիկ վարսահարդարմամբ հոգեբանին, և հարցնել. «Սթիվ, դուք անկանոն բայերի մասնագետ եք: Ի՞նչ անեմ»: Եվ նա ձեզ կասի. «Դե, շատերը ասում են բարգավաճեցի (thrived), բայց ոմանք էլ ասում են բարգավաճեցի (throve)»: Բայց դուք նաև քիչ թե շատ գիտեք, որ եթե 200 տարով հետ գնայիք ժամանակի մեջ և հարցնեիք այս քաղաքական գործիչին` նույնպես գեղեցիկ վարսահարդարմամբ. (Ծիծաղ) «Թոմ, ի՞նչ պետք է ասեմ»: Նա կպատասխաներ. «Այժմ, մարդկանց մեծ մասը օգտագործում է բարգավաճեցի (throve), իսկ ոմանք էլ բարգավաճեցի (thrived)»: Այնպես որ, այն ինչ հիմա պատրաստվում եմ ձեզ ցույց տալ պարզապես չմշակված տվյալներ են: Այս երկու միլիարդ գրառումներով աղյուսակից 2 տող: Այն, ինչ դուք տեսնում եք «thrived» և «throve»-ի ժամանակի րնթացքում կատարված տարեկան պարբերականն է: Իսկ սա ընդամենը երկու բառ է երկու միլիարդ բառերի շարքից: Այնպես որ, ամբողջ տվյալների համախումբը միլիարդ անգամ ավելի ապշեցուցիչ է, քան այս սլայդը:

(Laughter)

(Ծիծաղ)

(Applause)

(Ծափահարություններ)

JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.

ԺՄ: Կան նաև շատ այլ նկարներ, որ 500 միլիարդ բառ արժեն: Օրինակ, այս մեկը: Եթե դուք հարբուխով հիվանդանաք, կարող եք տեսնել գագաթնակետային վիճակները այն ժամանակ, երբ դուք գիտեիք, որ մեծ գրիպի համաճարակի ընթացքում ամբողջ աշխարհում մարդիկ մահանում էին:

ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.

ԷԼԷ. Դուք դեռ չեք համոզվել, որ ծովի մակարդակը բարձրանում է, այսպես, ինչպես մթնոլորտայն ածխաթթու գազն ու գլոբալ ջերմաստիճանը

JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.

ԺՄ: Դուք նաև կարող եք տեսնել այս որոշակի n-գրամը, իսկ Նիցշեն ասել է, որ Աստված մահացած չէ, չնայած կարելի է համաձայնվել, որ նա լավագույն հրապարակախոսի կարիքն ունի:

(Laughter)

(Ծիծաղ)

ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. (Laughter) But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.

ԷԼԷ: Այս գործիքի օգնությամբ դուք կարող եք ձեռք բերել բավականին աբստրակտ հասկացություններ: Օրինակ, թույլ տվեք ձեզ պատմեմ մի պատմություն, որ տեղի է ունեցել 1950 թ-ին: Պատմության գերակշիռ մասում, 1950 թ-ը ոչ ոքի չի հետաքրքրել: 1700-ին 1800-ին 1800-ին ոչ ոքի դա պետք չէր: Երեսունականներին և քառասունականներին նույնպես ոչ ոք չէր մտածում դրա մասին: Հանկարծակի քառասունականների կեսերին ինչ-որ հետաքրքրություն առաջ եկավ: Մարդիկ հասկացան, որ 1950 թ. մոտենում է, և դա կարող է ահռելի իրադարձություն լինել: (Ծիծաղ) Սակայն ոչինչ չստիպեց մարդկանց հետաքրքրվել 1950 թ-ով այնքան, որքան հենց ինքը` 1950-ը: (Ծիծաղ) Մարդիկ խենթացել էին: Նրանք անկարող էին լռել այն ամենի մասին, ինչ արել էին 1950 թ-ին, այն բոլոր բաների մասին, ինչ նրանք պլանավորում էին անել 1950 թ-ին, այն բոլոր երազանքների մասին, ինչ նրանք ցանկանում էին իրականացնել 1950 թ-ին: Արդյունքում, 1950-ը այնքան հրաշալի էր, որ տարիներ անց, մարդիկ շարունակում էին խոսել բոլոր զարմանալի բաների մասին, որ տեղի էր ունեցել, '51-ին, '52-ին, '53-ին: Վերջապես 1954 թ-ին ինչ-որ մեկը մի օր արթնացավ և հասկացավ, որ 1950թ. արդեն հնացել է: (Ծիծաղ) Հենց այնպես, ինչպես փուչիկն է պայթում:

(Laughter)

(Ծիծաղ)

And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.

Իսկ 1950-ի պատմությունը կրկնվում է յուրաքանչյուր տարվա համար, որի մասին մենք ունենք տեղեկություններ, որոշակի շեղումով, քանի որ հիմա մենք ունենք այս գեղեցիկ գծապատկերները: Շնորհիվ այս հրաշալի գծապատկերների, մենք կարող ենք չափել շատ բաներ: Կարող ենք ասել. «Դե, ին՞չ արագությամբ կարող է փուչիկը պայթել»: Պարզվում է, որ մենք կարող ենք դա ճշտորեն չափել: Հավասարումները դուրս էին գրվել, գրաֆիկները գծագրվել էին, և արդյունքն այն է, որ փուչիկները պայթում են ավելի ու ավելի արագ յուրաքանչյուր հաջորդ տարում: Մենք շատ արագ ենք կորցնում մեր հատաքրքրությունն անցյալի նկատմամբ:

JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.

ԺՄ: Իսկ հիմա մի փոքր խորհուրդ կարիերայի վերաբերյալ: Ձեզանից յուրաքանչյուրի համար, ով ձգտում է հայտնի դառնալ, կարող է սովորել 25ից ավելի հայտնի քաղաքական գործիչներից, գրողներից, դերասաններից և այլն: Եթե դուք ցանկանում եք վաղ տարիքում հայտնի դառնալ, դուք կարող եք դերասան դառնալ, քանի որ այդ համբավը սկսում է մեծանալ, երբ դուք դեռ 20 տարեկան եք. դուք դեռ երիտասարդ եք և դա հրաշալի է: Եթե կարող եք մի փոքր սպասել, դուք կարող եք գրող դառնալ, քանի որ այդ ժամանակ դուք կհասնեք մեծ բարձունքների, ինչպես, օրինակ Մարկ Տվենը, չափազանց հայտնի է: Բայց եթե դուք ուզում եք հասնել փառքի գագաթնակետին, դուք պետք է հրաժարվեք հաճույքերից և, իհարկե, դառնաք քաղաքագետ: Այս դեպքում դուք հայտնի կլինեք, երբ 50 տարեկան դառնաք, և շատ, չափազանց հայտնի կլինեք: Գիտնականները նույնպես հայտնի են դառնում, երբ արդեն շատ ծեր են: Օրինակ, կենսաբաններն ու ֆիզիկոսները այնքան հայտնի են, որքան դերասանները: Սխալը, որ պետք չէ թույլ տալ` մաթեմատիկոս դառնալն է: (Ծիծաղ) Այս դեպքում կարելի է ենթադրել. «Հինալի է, ես իմ ամենալավ աշխատանքը հայտնագործել եմ, երբ ընդամենը 20 տարեկան էի»: Բայց գիտեք ինչ, ոչ ոքի դա պետք չէ:

(Laughter)

(Ծիծաղ)

ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.

ԷԼԷ: n-գրամերը շատ ավելի սթափեցնող հատկություններ ունեն: Օրինակ` ահա Մարկ Շագալի հետագիծը, նկարիչ, որ ծնվել է 1887-ին: Նա ունի հայտնի մարդու սովորական ուղի: Նա ավելի և ավելի հայտնի է դառնում, բացառությամբ գերմանախոսների շրջանում, Եթե գերմաներեն լեզվին նայենք, ապա կտեսնենք մի անհնարին բան, մի բան, որ հազվադեպ եք տեսնում, նա դառնում է չափազանց հայտնի, այնուհետև, միանգամից նվազում է անցնելով 1933-ի և 1945-ի մրջև գտնվող ծայրահեղ անկման շրջանով, շատ ավելի հետ ընկրկելու համար: Իհարկե, այստեղ մենք տեսնում ենք այն փաստը, որ Մարկ Շագալը հրեա նկարիչ էր Նացիստական Գերմանիայում:

Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.

Այս ազդանշանները իրականում այնքան ուժեղ են, որ մեզ պետք չէ իմանալ, որ ինչ-որ մեկը գրաքննվել է: Մենք կարող ենք դա հասկանալ` օգտագործելով ազդանշանների ամենապարզ վերլուծությունը: Ահա դա անելու ամենապարզ եղանակը: Խելամիտ է ենթադրել այն, որ ինչ-ոչ մեկի փառքը տվյալ ժամանակահատվածում պետք է հավասար լինի մինչև նրան և նրանից հետո եղած փառքերի միջինին: Այսինքն, սա հենց այն էր, ինչ մենք սպասում էինք: Եվ դա մենք կհամեմատենք այն բանի հետ, ինչ հետազոտում ենք: Այնուհետև դրանք հարաբերում ենք իրար, որպեսզի ստանանք այն, ինչ կոչում ենք ընկճման ինդեքս: Եթե ընկճման ինդեքսը շատ, շատ, շատ փոքր է, ապա հավանականություն կա, որ ձեզ ընկճում են: Եթե դա մեծ է, ապա ձեզ, հավանաբար, պրոպագանդում են:
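The suppression index described above can be written down directly. The sketch below is one reading of the description: expected fame is the average of the mean frequency before and after the period, and the index is observed divided by expected, so that small values suggest suppression. The 10-year comparison window, the function names, and the toy frequencies are assumptions for illustration, not values from the talk.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    if not values:
        raise ValueError("no data in the requested window")
    return sum(values) / len(values)


def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame during [start, end] divided by expected fame.

    freq_by_year maps year -> how often a name is mentioned.
    Expected fame is the average of the mean frequency in the `window`
    years before `start` and the `window` years after `end`
    (the window length is an assumption of this sketch).
    """
    during = [freq_by_year[y] for y in range(start, end + 1) if y in freq_by_year]
    before = [freq_by_year[y] for y in range(start - window, start) if y in freq_by_year]
    after = [freq_by_year[y] for y in range(end + 1, end + 1 + window) if y in freq_by_year]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected


# Invented frequencies: steady fame, a sharp dip in 1933-1945, then a rebound.
toy = {}
toy.update({y: 4.0 for y in range(1923, 1933)})
toy.update({y: 0.5 for y in range(1933, 1946)})
toy.update({y: 6.0 for y in range(1946, 1956)})

index = suppression_index(toy, 1933, 1945)
print(index)  # far below 1: mentioned much less than expected
```

An index near 1 means the observed fame matches the expectation; values far above 1 would correspond to the propaganda case.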

JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.

ԺՄ: Իսկ այժմ կարող ենք նայել ամբողջ բնակչության նկատմամբ ընկճման ինդեքսների բաշխմանը: Օրինակ այստեղ, այս ընկճման ինդեքսը 5000 մարդու համար է` ընտրված անգլալեզու գրքերից, որտեղ ցենզուրան բացակայում է. դա մոտավորապես այսքան է, կետրոնացված մեկի վրա: Այն ինչ դուք սպասում եք, համընկնում է դիտարկումի հետ Այս բաշխումը կատարվել է Գերմանիայում` սա լրիվ տարբեր է, փոխանցված դեպի ձախ: Մարդիկ դրա մասին խոսել են 2 անգամ ավելի քիչ, քան պետք էր: Սակայն այն, ինչ անհրաժեշտ է, ավելի լայն բաշխումն է: Շատ մարդկանց մասին, ովքեր հայտնվում են այս բաշխման ձախ կողմում, խոսում են 10 անգամ ավելի քիչ, քան պետք է: Իսկ աջ կողմում գտնվող շատ մարդիկ քաղում են պրոպագանդայի պտուղները: Այս նկարը գրքի պատմության ցենզուրայի կնիքն է:
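To compare two populations the way this passage does, one can summarize each set of suppression indexes by its center and its width. The talk does not say which statistics were used; the median and the standard deviation of the base-10 logarithm are assumptions chosen for this sketch, and the index values are invented.

```python
import math
import statistics


def summarize(indexes):
    """Return (median index, spread of log10 index) for a population.

    Working in log10 treats "10 times less" and "10 times more"
    as equally far from 1. Both statistics are illustrative choices.
    """
    logs = [math.log10(x) for x in indexes]
    return statistics.median(indexes), statistics.pstdev(logs)


# Invented suppression indexes for two small populations.
control = [0.9, 1.0, 1.1, 1.0, 0.95]   # no known suppression
censored = [0.1, 0.5, 0.4, 3.0, 0.6]   # shifted left and much wider

control_center, control_spread = summarize(control)
censored_center, censored_spread = summarize(censored)

print(control_center, control_spread)
print(censored_center, censored_spread)
```

A center below 1 together with a much larger spread is the pattern the passage calls the hallmark of censorship.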

ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.

ԷԼԷ: մենք սա անվանում ենք կուլտուրոմիքսի մեթոդ: Սա գենոմիքսի պես բան է: Միայն թե գենոմիքսը կենսաբանության ոսպնյակն է հանդիսանում, մարդու գենի հիմքի հաջորդականության պատուհանից դուրս: Կուլտուրոմիքսը նման է սրան: Սա չափազանց մեծ մասշտաբի տվյալների հավաքականի վերլուծության օգտագործումն է` մարդկության մշակույթը ուսումնասիրելու համար: Սակայն, ի հակադրություն սրան, գենի ոսպնյակը մենք տենում ենք պատմության թվայնացված մասերի ոսպնյակի միջոցով: Կուլտուրոմիքսի դրական կողմն այն է, որ բոլորը կարող են օգտագործել դա: Իսկ ինչո՞ւ բոլորը կարող են դա անել: Բոլորը կարող են անել դա, քանի որ 3 հոգի` Ջոն Օրվանթը, Մետտ Գրեյը և Ուիլ Բրոքմանը Google-ից` տեսնելով Ngram Viewer-ը, ասացին. «Սա շատ զվարճալի բան է: Մենք պետք է սա բոլորի համար հասնելի դարձնենք»: Ուղիղ երկու շաբաթում, մեր հոդվածի հրատարակումից ընդամենը 2 շաբաթ առաջ, նրանք ծրագրավորեցին Ngram Viewer ամբողջ հասարակության համար: Հիմա դուք էլ կարող եք հավաքել ցանկացած բառ կամ նախադասություն, որ ձեզ հետաքրքրում է, և անմիջապես տեսնել դրա n-գրամը, ներառյալ դրանց օրինակները բազմաթիվ այլ գրքերից, որտեղ հանդիպում ենք n-գրամ:

JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.

ԺՄ: Հենց առաջին իսկ օրը դա միլիոն անգամ օգտագործվեց, և սա հարցումներից ամենալավն է: Մարդիկ ցանկանում են իրենց ամենալավ կողմը ցույց տալ: Սակայն պարզ է դառնում, որ 18-րդ դարում դա ընդհանրապես մարդկանց չի հետաքրքրել: Նրանք չեն ցանկանում իրենց ամենալավ (best) կողմը ցույց տալ, նրանք ցանկանում էին իրենց ամենալաֆ (beft) կողմը ցույց տալ: Իհարկե,այն ինչ պատահեց, ուղղակի սխալ էր: Դա միջակության ձգտում չէ, ուղղակի 'Վ' տառը գրվել է այլ կերպ, մի փոքր նման 'Ֆ' տառին: Իհարկե այն ժամանակ Google-ը ուշադրություն չդարձրեց դրան, այդ պատճառով մենք դա մեր հոդվածում օգտագործեցինք: Սակայն պարզ դարձավ, որ սա միայն հիշեցում է, որ չնայաց դա զվարճալի է, այս գրաֆիկները մեկնաբանելիս, պետք է շատ զգույշ լինել, և անհրաժեշտ է օգտագործել գիտության լավագույն չափանիշները:

ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.

ԷԼԷ: Մարդիկ ինչ ձևով ասես, որ չեն օգտագործել դա: (Ծիծաղ) Իրականում, ոչինչ պետք չէ ասել, մենք ցույց կտանք սլայդերը անձայն: Այս մարդուն հետաքրքրել է բացականչությունների պատմությունը: Բացականչությունների տարբեր ձևեր կան: Եթե հարվածել եք ձեր ոտքի բութ մաինը, դա «Ախ» է մեկ Ա-ով: Եթե Երկիր մոլորակը ոչնչացվում է Վոգոնների կողմից, որպեսզի տեղ ազատվի միջտիեզերական շրջանցումների համար, ապա դա «Աաաաաաաախ» է ութ Ա-ով Այս մարդը ուսումնասիրել է բոլոր «Ախերը»` մեկից մինչև ութ Ա պարունակող: Եվ պարզվում է, որ ավելի հազվադեպ «Ախերը», իհարկե, առավել վախեցնող բաների հետ են կապված. բացառությամբ, ինչը շատ տարօրինակ է, 80-ականների սկզբի: Միգուցե, Ռեյգանը ինչ-որ կապ ունի սրա հետ:

(Laughter)

(Ծիծաղ)

JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there are manuscripts, there are newspapers, there are things that are not text, like art and paintings. These will all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.

ԺՄ: Այս տվյալները կարելի է տարբեր կերպ օգտագործել, սակայն խնդրիը պատմական թվայնացման մեջ չէ: Google-ը սկսել է թվայնացնել 15 միլիոն գիրք: Դա երբևէ հրատարակված գրքերի 12 տոկոսն է կազմում: Դա մարդկույթան մշակույթի զգալի մասն է կազմում: Սակայն մշակույթը իր մեջ շատ ավելին է պարունակում. ձեռագրեր, թերթեր, ոչ տեքստային բաներ, ինչպիսին է, օրինակ, արվեստը և նկարչությունը: Այս ամենը կարող է հայտնվել մեր համակարգիչներում, աշխարհի բոլոր համակարգիչներում: Եվ երբ սա պատահի, այն կվերափոխի մեր անցյալը, ներկան և մարդկության ապագան ընկալելու մեր պատկերացումները:

Thank you very much.

Շատ շնորհակալություն:

(Applause)

(Ծափահարություններ)

Related talks

Brewster Kahle: A free digital library

Aaron Koblin: Visualizing ourselves ... with crowd-sourced data

Amit Sood: Building a museum of museums on the web

Chip Kidd: Designing books is no laughing matter. OK, it is.

Ilan Stavans: Why should you read "Don Quixote"?

Chand John: What's the fastest way to alphabetize your bookshelf?
