Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like, "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject, and the rest of the sentence as the predicate consisting of a verb, "eat," and a direct object, "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "-s," used to indicate plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
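The rule-based steps described above (morphology, syntax, then lookup in target-language vocabulary) can be sketched as a toy program. This is only an illustration under heavy assumptions: the four-word lexicon, the fixed subject-verb-object pattern, the Italian lookup tables, and every function name here are hypothetical, and a real system would need vastly larger databases and rule sets.

```python
# Toy English lexicon: word -> part of speech (hypothetical, for illustration only).
LEXICON = {"the": "DET", "child": "NOUN", "muffin": "NOUN", "eat": "VERB"}
IRREGULAR_PLURALS = {"children": "child"}

# Toy Italian vocabulary, keyed by (English lemma, grammatical number).
NOUNS_IT = {
    ("child", "singular"): "il bambino",
    ("child", "plural"): "i bambini",
    ("muffin", "singular"): "il muffin",
    ("muffin", "plural"): "i muffin",
}
VERBS_IT = {("eat", "singular"): "mangia", ("eat", "plural"): "mangiano"}


def analyze_word(word):
    """Morphology: reduce a word to its lemma plus grammatical features."""
    w = word.lower().strip(".,!?")
    if w in IRREGULAR_PLURALS:
        return {"lemma": IRREGULAR_PLURALS[w], "pos": "NOUN", "number": "plural"}
    if w in LEXICON:
        features = {"lemma": w, "pos": LEXICON[w]}
        if LEXICON[w] == "NOUN":
            features["number"] = "singular"
        return features
    # Strip the suffix "-s": a plural marker on nouns, an agreement marker on verbs.
    if w.endswith("s") and w[:-1] in LEXICON:
        stem = w[:-1]
        if LEXICON[stem] == "NOUN":
            return {"lemma": stem, "pos": "NOUN", "number": "plural"}
        if LEXICON[stem] == "VERB":
            return {"lemma": stem, "pos": "VERB"}
    raise ValueError(f"unknown word: {word!r}")


def parse(sentence):
    """Syntax: accept only the rigid pattern DET NOUN VERB DET NOUN."""
    tokens = [analyze_word(w) for w in sentence.split()]
    if [t["pos"] for t in tokens] != ["DET", "NOUN", "VERB", "DET", "NOUN"]:
        raise ValueError("sentence does not match the toy grammar")
    return {"subject": tokens[1], "verb": tokens[2], "object": tokens[4]}


def translate(sentence):
    """Transfer: look up each element in the target vocabulary."""
    roles = parse(sentence)
    subject, verb, obj = roles["subject"], roles["verb"], roles["object"]
    words = " ".join([
        NOUNS_IT[(subject["lemma"], subject["number"])],
        VERBS_IT[(verb["lemma"], subject["number"])],  # verb agrees with subject
        NOUNS_IT[(obj["lemma"], obj["number"])],
    ])
    return words[0].upper() + words[1:] + "."


print(translate("The children eat the muffins."))  # I bambini mangiano i muffin.
```

Because this toy grammar assigns roles purely by position, swapping the two nouns in the input makes "muffin" the subject, which is exactly the word-order hazard mentioned above.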
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.
Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities, and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy," is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
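The statistical approach described earlier, finding source and target words that appear together more often than chance would suggest, can be sketched on a tiny parallel corpus. This is a minimal sketch under stated assumptions: the four sentence pairs are invented, the function names are hypothetical, and the Dice coefficient is used here as a simple stand-in association measure. Real statistical systems rely on far larger corpora and more elaborate probabilistic models.

```python
from collections import Counter
from itertools import product

# Invented English-Italian sentence pairs standing in for human translations.
PARALLEL_CORPUS = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def cooccurrence_scores(corpus):
    """Score each (source word, target word) pair with the Dice coefficient."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)   # sentences containing each source word
        tgt_count.update(tgt_words)   # sentences containing each target word
        pair_count.update(product(src_words, tgt_words))  # joint occurrences
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_match(word, scores):
    """Return the target word most strongly associated with a source word."""
    candidates = {t: score for (s, t), score in scores.items() if s == word}
    if not candidates:
        raise KeyError(word)
    return max(candidates, key=candidates.get)


scores = cooccurrence_scores(PARALLEL_CORPUS)
print(best_match("children", scores))  # bambini
```

In this corpus "children" and "bambini" always appear together, so their score is 1.0, while "children" and "i" score lower because "i" also appears in the sentences about dogs. With fewer or less varied sentence pairs the scores would tie or mislead, which mirrors the dependence on database size noted above.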

