Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject and the rest of the sentence as the predicate, consisting of a verb, "eat," and a direct object, "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "s," used to indicate the plural. Finally, it needs to understand the semantics: what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
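The analysis steps described above for "The children eat the muffins" can be made concrete with a small sketch. This is only an illustration under strong assumptions, not how any real translation system is built: the `LEXICON` table, the `analyze` and `parse` functions, and the strict subject-verb-object rule are all hypothetical and cover just this one sentence pattern.

```python
# Toy lexical database (hypothetical): surface form -> (lemma, part of speech, number).
LEXICON = {
    "the": ("the", "DET", None),
    "children": ("child", "NOUN", "plural"),  # irregular plural has to be listed
    "eat": ("eat", "VERB", None),
    "muffin": ("muffin", "NOUN", "singular"),
}


def analyze(word):
    """Morphology: look the word up, falling back to stripping a plural 's'."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    if w.endswith("s") and w[:-1] in LEXICON:
        lemma, pos, _ = LEXICON[w[:-1]]
        if pos == "NOUN":
            return (lemma, pos, "plural")
    raise KeyError(f"unknown word: {word}")


def parse(sentence):
    """Syntax: assume strict subject-verb-object order around a single verb."""
    tokens = [analyze(w) for w in sentence.rstrip(".").split()]
    verb_at = next(i for i, t in enumerate(tokens) if t[1] == "VERB")

    def head(phrase):
        # The head of a noun phrase is taken to be its first noun.
        return next(t for t in phrase if t[1] == "NOUN")

    return {
        "subject": head(tokens[:verb_at]),
        "verb": tokens[verb_at],
        "object": head(tokens[verb_at + 1:]),
    }


result = parse("The children eat the muffins.")
swapped = parse("The muffins eat the children.")
print(result)
print(swapped)
```

Because this sketch reads the subject purely from word order, swapping the two noun phrases makes the muffins the ones doing the eating, which is exactly the risk when a fixed-order rule is applied to a language where order is free.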
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.
Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities, and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy," is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
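The statistical approach described earlier, finding source and target words that appear together more often than chance would suggest, can be illustrated with a toy example. Everything here is an assumption made for illustration: the four-sentence English-Italian corpus in `PAIRS` is invented, and the Dice coefficient is a simple stand-in association score, not the alignment models that real statistical systems rely on.

```python
from collections import Counter
from itertools import product

# Tiny invented parallel corpus: (English sentence, human Italian translation).
PAIRS = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]

src_count = Counter()   # number of sentence pairs containing each source word
tgt_count = Counter()   # number of sentence pairs containing each target word
pair_count = Counter()  # number of sentence pairs containing both words

for src, tgt in PAIRS:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    pair_count.update(product(src_words, tgt_words))


def dice(s, t):
    """Dice coefficient: 1.0 when s and t always appear together, 0.0 when never."""
    return 2 * pair_count[(s, t)] / (src_count[s] + tgt_count[t])


def best_match(s):
    """Pick the target word most strongly associated with source word s."""
    return max(tgt_count, key=lambda t: dice(s, t))


lexicon = {s: best_match(s) for s in src_count}
print(lexicon)
```

With so little data the scores are fragile: "children" also co-occurs with "i" in every one of its sentences, and only the fact that "i" appears elsewhere too keeps it from being chosen. That is a small-scale version of the point that quality depends on the size and coverage of the database.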
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.

