Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages.

So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.

A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject and the rest of the sentence as the predicate, consisting of the verb "eat" and the direct object "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "s," used to indicate the plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.

To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
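The rule-based steps described above (look up each word, parse the syntax, then apply target-language rules) can be sketched in a few lines of Python. This is only a toy illustration, not how any real translation program is built: the lexicon, the Italian tables, and the function names `analyze`, `parse_svo`, and `translate` are all assumptions made up for this example, and the parser accepts exactly one sentence pattern.

```python
# Toy lexical database (hypothetical): surface form -> (lemma, part of speech, number).
LEXICON = {
    "the": ("the", "DET", None),
    "child": ("child", "NOUN", "singular"),
    "children": ("child", "NOUN", "plural"),
    "muffin": ("muffin", "NOUN", "singular"),
    "muffins": ("muffin", "NOUN", "plural"),
    "eats": ("eat", "VERB", "singular"),
    "eat": ("eat", "VERB", "plural"),
}

# Toy target-language tables (Italian), also hypothetical.
NOUNS_IT = {
    ("child", "singular"): "il bambino",
    ("child", "plural"): "i bambini",
    ("muffin", "singular"): "il muffin",
    ("muffin", "plural"): "i muffin",
}
VERBS_IT = {("eat", "singular"): "mangia", ("eat", "plural"): "mangiano"}


def analyze(word):
    """Morphology step: map a surface form to its lemma, category and number."""
    lemma, pos, number = LEXICON[word.lower()]
    return {"form": word, "lemma": lemma, "pos": pos, "number": number}


def parse_svo(sentence):
    """Syntax step: accept only 'DET NOUN VERB DET NOUN' and label its parts."""
    tokens = [analyze(w) for w in sentence.rstrip(".").split()]
    pattern = [t["pos"] for t in tokens]
    if pattern != ["DET", "NOUN", "VERB", "DET", "NOUN"]:
        raise ValueError(f"unsupported sentence pattern: {pattern}")
    return {"subject": tokens[1], "verb": tokens[2], "object": tokens[4]}


def translate(sentence):
    """Generation step: look up each element in the target-language tables."""
    parts = parse_svo(sentence)
    subject, verb, obj = parts["subject"], parts["verb"], parts["object"]
    subject_it = NOUNS_IT[(subject["lemma"], subject["number"])]
    # The Italian verb form is chosen to agree with the subject's number.
    verb_it = VERBS_IT[(verb["lemma"], subject["number"])]
    object_it = NOUNS_IT[(obj["lemma"], obj["number"])]
    return f"{subject_it} {verb_it} {object_it}".capitalize()


print(translate("The children eat the muffins."))
print(translate("The muffins eat the child."))
```

Because this sketch identifies subject and object purely by position, swapping the two noun phrases in the English input swaps who eats whom in the output, which is the word-order problem the talk points to.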
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.

Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.

The difficulty that computers have with the exceptions, irregularities and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy," is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.

For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
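The statistical approach mentioned in the talk, finding source and target words that appear together more often than chance would suggest, can also be sketched in Python. The talk does not say how the matches are scored, so this example assumes the Dice coefficient as a simple stand-in. The four-sentence English-Italian corpus and the names `dice_scores` and `best_match` are invented for illustration; real systems work on far larger collections and more elaborate models.

```python
from collections import Counter
from itertools import product

# Tiny human-translated parallel corpus (hypothetical data).
PARALLEL_CORPUS = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def dice_scores(corpus):
    """Score every (source word, target word) pair by how often they co-occur.

    Dice = 2 * pairs_together / (source_count + target_count), counted per
    sentence pair. A score of 1.0 means the two words always appear together.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * together / (src_count[s] + tgt_count[t])
        for (s, t), together in pair_count.items()
    }


def best_match(word, scores):
    """Return the target word with the highest score for a source word."""
    candidates = {t: score for (s, t), score in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(PARALLEL_CORPUS)
for word in ["the", "children", "dogs", "eat", "sleep"]:
    print(word, "->", best_match(word, scores))
```

Even in this toy corpus, "children" co-occurs with "i" as often as with "bambini", but "i" also appears in the sentences about dogs, so its score is lower. With fewer example sentences the counts become too sparse to separate such candidates, which mirrors the talk's point that quality depends on the size of the database.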
