Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any two languages.
So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject and the rest of the sentence as the predicate, consisting of a verb, "eat," and a direct object, "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "s," used to indicate the plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
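The rule-based steps described above (lexical lookup, morphology, syntax, then reordering for the target language) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not how any real translator is built: the five-entry `LEXICON`, the function names `analyze_word`, `parse`, and `reorder`, and the word-order codes such as "SOV" are all hypothetical, and the output is a list of English lemmas rather than a real target language.

```python
from dataclasses import dataclass
from typing import Optional

# Toy lexical database (hypothetical): surface form -> (lemma, part of speech, number).
# A real system would list every dictionary word and all of its grammatical forms.
LEXICON = {
    "the": ("the", "DET", None),
    "child": ("child", "NOUN", "singular"),
    "children": ("child", "NOUN", "plural"),  # irregular plural, listed explicitly
    "eat": ("eat", "VERB", None),
    "muffin": ("muffin", "NOUN", "singular"),
}


@dataclass
class Token:
    surface: str
    lemma: str
    pos: str
    number: Optional[str]


def analyze_word(word: str) -> Token:
    """Morphology: look the word up, else try stripping a plural "s" suffix."""
    w = word.lower()
    if w in LEXICON:
        lemma, pos, number = LEXICON[w]
        return Token(word, lemma, pos, number)
    if w.endswith("s") and w[:-1] in LEXICON:
        lemma, pos, _ = LEXICON[w[:-1]]
        if pos == "NOUN":
            return Token(word, lemma, pos, "plural")
    raise KeyError(f"unknown word: {word!r}")


def parse(sentence: str) -> dict:
    """Syntax: everything before the verb is the subject, everything after is the object."""
    tokens = [analyze_word(w) for w in sentence.strip(".").split()]
    verb_index = next(i for i, t in enumerate(tokens) if t.pos == "VERB")
    return {
        "subject": tokens[:verb_index],
        "verb": tokens[verb_index],
        "object": tokens[verb_index + 1:],
    }


def reorder(parsed: dict, order: str) -> list:
    """Transfer: lay the lemmas out in the requested word order, e.g. "SOV"."""
    parts = {
        "S": [t.lemma for t in parsed["subject"]],
        "V": [parsed["verb"].lemma],
        "O": [t.lemma for t in parsed["object"]],
    }
    return [lemma for slot in order for lemma in parts[slot]]


parsed = parse("The children eat the muffins.")
print(reorder(parsed, "SOV"))  # ['the', 'child', 'the', 'muffin', 'eat']
print(reorder(parsed, "OVS"))  # ['the', 'muffin', 'eat', 'the', 'child']
```

The second call shows the hazard the talk mentions: if the reader's language assigns roles by position, moving the object to the front makes the muffin eat the child.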
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" (eat) the muffins or "divorano" (devour) them, to use the Italian verbs.
Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities, and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy," is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
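The statistical approach described in the talk, pairing up source and target words that appear together in human translations more often than chance would explain, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the four-sentence English-Italian `PARALLEL_CORPUS` is a made-up toy, the names `train` and `best_match` are hypothetical, and the Dice coefficient stands in for the far more elaborate alignment models that real systems use.

```python
from collections import Counter
from itertools import product

# Made-up toy parallel corpus (illustration only): (English, Italian) sentence pairs.
PARALLEL_CORPUS = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def train(corpus):
    """Count in how many sentence pairs each word, and each word pairing, occurs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count


def best_match(word, src_count, tgt_count, pair_count):
    """Return the target word with the highest Dice score for `word`.

    Dice = 2 * co-occurrences / (source occurrences + target occurrences).
    It is 1.0 when two words always appear together and never apart.
    Raises ValueError if `word` never occurs in the corpus.
    """
    scores = {
        t: 2 * pair_count[(s, t)] / (src_count[s] + tgt_count[t])
        for (s, t) in pair_count
        if s == word
    }
    return max(scores, key=scores.get)


counts = train(PARALLEL_CORPUS)
print(best_match("eat", *counts))       # mangiano
print(best_match("children", *counts))  # bambini
```

With only four sentence pairs the scores are clean, but the sketch also hints at the limitation the talk points out: a word that never appears in the corpus cannot be matched at all, so quality depends on how much translated text is available.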
