Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages. So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.

A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like, "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject, and the rest of the sentence as the predicate consisting of a verb, "eat," and a direct object, "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "s," used to indicate plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.

To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
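The rule-based steps described above (parse the syntax, analyze the morphology, then look up target-language forms) can be sketched in a few lines of Python. This is only a toy under stated assumptions: the tiny lexicon, the single supported sentence pattern "The N V the N," the Italian entries, and every function name here are hypothetical and stand in for the much larger databases and rule sets a real program would need.

```python
# Toy lexical database (hypothetical): English lemmas and irregular forms.
NOUN_LEMMAS = {"child", "muffin"}
IRREGULAR_PLURALS = {"children": "child"}
VERB_LEMMAS = {"eat"}

# Toy target-language entries (Italian, assumed for illustration),
# keyed by (lemma, grammatical number).
IT_NOUN_PHRASES = {
    ("child", "sg"): "il bambino", ("child", "pl"): "i bambini",
    ("muffin", "sg"): "il muffin", ("muffin", "pl"): "i muffin",
}
IT_VERBS = {("eat", "sg"): "mangia", ("eat", "pl"): "mangiano"}


def analyze_noun(word):
    """Morphology: split a noun into its lemma and grammatical number."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word], "pl"
    if word.endswith("s") and word[:-1] in NOUN_LEMMAS:
        return word[:-1], "pl"  # "muffins" -> "muffin" + plural suffix "s"
    if word in NOUN_LEMMAS:
        return word, "sg"
    raise ValueError(f"unknown noun: {word}")


def analyze_verb(word):
    """Morphology: reduce "eat" or "eats" to the lemma "eat"."""
    lemma = word[:-1] if word.endswith("s") and word[:-1] in VERB_LEMMAS else word
    if lemma not in VERB_LEMMAS:
        raise ValueError(f"unknown verb: {word}")
    return lemma


def parse(sentence):
    """Syntax: identify subject, verb, and direct object in 'The N V the N'."""
    words = sentence.lower().rstrip(".").split()
    if len(words) != 5 or words[0] != "the" or words[3] != "the":
        raise ValueError("only 'The N V the N' sentences are supported")
    return analyze_noun(words[1]), analyze_verb(words[2]), analyze_noun(words[4])


def translate_rule_based(sentence):
    """Generate the target sentence; the verb agrees with the subject's number."""
    subject, verb, obj = parse(sentence)
    number = subject[1]
    italian = f"{IT_NOUN_PHRASES[subject]} {IT_VERBS[(verb, number)]} {IT_NOUN_PHRASES[obj]}"
    return italian[0].upper() + italian[1:] + "."


print(translate_rule_based("The children eat the muffins."))
# prints: I bambini mangiano i muffin.
```

Because the sketch relies on fixed word positions to decide who is eating whom, swapping the nouns ("The muffin eats the child.") changes the meaning, which is exactly the word-order sensitivity mentioned above.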
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.

Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.

The difficulty that computers have with the exceptions, irregularities and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy", is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.

For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
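The statistical approach described earlier, finding source and target words that appear together more often than chance would suggest, can also be sketched in Python. This is a minimal illustration, not how production systems work: the five-sentence parallel corpus is invented, the Dice coefficient is a simple association score swapped in for the far more elaborate alignment models real systems use, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Invented toy parallel corpus: (English, Italian) sentence pairs.
PARALLEL = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
    ("the cats sleep", "i gatti dormono"),
]


def learn_table(pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))  # every cross-sentence word pair

    table = {}
    for s in src_count:
        # Dice coefficient: close to 1 when two words almost always appear
        # together, close to 0 when they rarely share a sentence pair.
        scores = {
            t: 2 * joint[(s, t)] / (src_count[s] + tgt_count[t])
            for t in tgt_count
        }
        table[s] = max(scores, key=scores.get)
    return table


def translate_word_by_word(sentence, table):
    """Replace each known word; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())


table = learn_table(PARALLEL)
# "the cats eat" never appears in the corpus, but each word was seen elsewhere.
print(translate_word_by_word("the cats eat", table))
# prints: i gatti mangiano
```

The sketch also shows the limitation noted in the talk: a word that never appears in the corpus is left untranslated, so quality depends directly on how much translated text is available.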
