Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject and the rest of the sentence as the predicate, consisting of the verb "eat" and the direct object "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "s," used to indicate the plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
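As a rough illustration of the rule-based steps just described, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tiny `LEXICON`, the function names `tag`, `parse_svo`, and `reorder`, and the simplification that a sentence has exactly one verb with the subject before it and the object after it. A real rule-based system would need far larger dictionaries and far more rules.

```python
# Hypothetical toy lexicon: surface form -> (part of speech, morphemes).
# These entries are illustrative assumptions, not real linguistic data.
LEXICON = {
    "the": ("DET", ["the"]),
    "children": ("NOUN", ["child", "PLURAL"]),  # irregular plural, listed whole
    "eat": ("VERB", ["eat"]),
    "muffins": ("NOUN", ["muffin", "s"]),       # stem plus plural suffix
}


def tag(sentence):
    """Look up each word's part of speech and morphemes in the lexicon."""
    words = sentence.lower().rstrip(".").split()
    return [(word, *LEXICON[word]) for word in words]


def parse_svo(tagged):
    """Split a simple sentence into subject, verb, and object around its one verb."""
    verb_at = next(i for i, (_, pos, _) in enumerate(tagged) if pos == "VERB")
    return {
        "subject": [word for word, _, _ in tagged[:verb_at]],
        "verb": [tagged[verb_at][0]],
        "object": [word for word, _, _ in tagged[verb_at + 1:]],
    }


def reorder(parts, order):
    """Arrange the parsed parts in the word order a target language expects."""
    return " ".join(" ".join(parts[role]) for role in order)


tagged = tag("The children eat the muffins.")
parts = parse_svo(tagged)

print(tagged[-1])  # the entry for "muffins", including its morphemes
print(reorder(parts, ["subject", "object", "verb"]))  # a verb-final word order
print(reorder(parts, ["object", "verb", "subject"]))  # careless reordering
```

The last line shows the word-order hazard: rearranging the parts without regard to the language's rules makes the muffins the ones doing the eating.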
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.
Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities, and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy," is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
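The statistical approach mentioned in the talk, finding source and target words that appear together more often than chance would suggest, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the four-sentence English and Italian corpus `PARALLEL` is invented for the example, the names `train` and `best_match` are hypothetical, and the Dice coefficient stands in as a simple association score. It is not presented as the method any particular translation system uses.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus (English, Italian), invented for this sketch.
# Real systems rely on very large collections of human-translated text.
PARALLEL = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def train(pairs):
    """Count how often each word occurs and how often word pairs co-occur."""
    src_count, tgt_count, both = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        both.update(product(src_words, tgt_words))
    return src_count, tgt_count, both


def best_match(word, src_count, tgt_count, both):
    """Return the target word with the highest Dice score for a source word."""
    def dice(target):
        return 2 * both[(word, target)] / (src_count[word] + tgt_count[target])
    return max(tgt_count, key=dice)


counts = train(PARALLEL)
for word in ["children", "dogs", "eat", "sleep"]:
    print(word, "->", best_match(word, *counts))
```

With so few sentences the pairings come out cleanly, but the same counts also show why the talk stresses database size: a word that appears only once, or only alongside the same neighbors, gives the program nothing to tell the candidates apart.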
