Ioannis Papachimonas: How computers translate human language

How is it that so many intergalactic species in movies and TV just happen to speak perfect English? The short answer is that no one wants to watch a starship crew spend years compiling an alien dictionary. But to keep things consistent, the creators of Star Trek and other science-fiction worlds have introduced the concept of a universal translator, a portable device that can instantly translate between any languages.
So is a universal translator possible in real life? We already have many programs that claim to do just that, taking a word, sentence, or entire book in one language and translating it into almost any other, whether it's modern English or Ancient Sanskrit. And if translation were just a matter of looking up words in a dictionary, these programs would run circles around humans. The reality, however, is a bit more complicated.
A rule-based translation program uses a lexical database, which includes all the words you'd find in a dictionary and all the grammatical forms they can take, and a set of rules to recognize the basic linguistic elements in the input language. For a seemingly simple sentence like, "The children eat the muffins," the program first parses its syntax, or grammatical structure, by identifying "the children" as the subject, and the rest of the sentence as the predicate, consisting of a verb, "eat," and a direct object, "the muffins." It then needs to recognize English morphology, or how the language can be broken down into its smallest meaningful units, such as the word "muffin" and the suffix "s," used to indicate plural. Finally, it needs to understand the semantics, what the different parts of the sentence actually mean.
To translate this sentence properly, the program would refer to a different set of vocabulary and rules for each element of the target language. But this is where it gets tricky. The syntax of some languages allows words to be arranged in any order, while in others, doing so could make the muffin eat the child.
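
The rule-based steps just described (look words up in a lexical database, parse the syntax, split off the plural suffix, then apply the target language's own vocabulary and rules) can be sketched in a few lines of Python. This is only a toy under loud assumptions: the tables `LEXICON`, `ITALIAN_NOUN_PHRASES`, and `ITALIAN_VERBS`, and the functions `analyze`, `parse`, and `translate`, are hypothetical names invented for illustration, Italian is picked as the target only because the talk quotes Italian verbs, and the only sentence shape handled is article, noun, verb, article, noun.

```python
# Toy rule-based translation, English -> Italian.
# Every table and function name here is a hypothetical illustration.

# Lexical database: surface form -> (lemma, part of speech, number).
LEXICON = {
    "the": ("the", "DET", None),
    "child": ("child", "NOUN", "sg"),
    "children": ("child", "NOUN", "pl"),  # irregular plural listed explicitly
    "muffin": ("muffin", "NOUN", "sg"),
    "eat": ("eat", "VERB", None),
}

# Target-language vocabulary, keyed by lemma and grammatical number.
ITALIAN_NOUN_PHRASES = {
    ("child", "sg"): "il bambino",
    ("child", "pl"): "i bambini",
    ("muffin", "sg"): "il muffin",
    ("muffin", "pl"): "i muffin",
}
ITALIAN_VERBS = {("eat", "sg"): "mangia", ("eat", "pl"): "mangiano"}


def analyze(word):
    """Morphology: look the word up, else split off a suffix "s"."""
    form = word.lower().strip(".,!?")
    if form in LEXICON:
        return LEXICON[form]
    if form.endswith("s") and form[:-1] in LEXICON:
        lemma, pos, _ = LEXICON[form[:-1]]
        # On a noun the suffix marks the plural; on a verb it is ignored here.
        return (lemma, pos, "pl" if pos == "NOUN" else None)
    raise ValueError(f"unknown word: {word!r}")


def parse(sentence):
    """Syntax: split the sentence into subject, verb, and direct object."""
    tokens = [analyze(w) for w in sentence.split()]
    if [pos for _, pos, _ in tokens] != ["DET", "NOUN", "VERB", "DET", "NOUN"]:
        raise ValueError("only DET NOUN VERB DET NOUN sentences are supported")
    return {"subject": tokens[1], "verb": tokens[2], "object": tokens[4]}


def translate(sentence):
    """Generation: apply target vocabulary and subject-verb agreement."""
    tree = parse(sentence)
    subj_lemma, _, subj_number = tree["subject"]
    obj_lemma, _, obj_number = tree["object"]
    verb_lemma = tree["verb"][0]
    return " ".join([
        ITALIAN_NOUN_PHRASES[(subj_lemma, subj_number)],
        ITALIAN_VERBS[(verb_lemma, subj_number)],
        ITALIAN_NOUN_PHRASES[(obj_lemma, obj_number)],
    ])


print(translate("The children eat the muffins."))  # i bambini mangiano i muffin
print(translate("The muffins eat the children."))  # i muffin mangiano i bambini
```

The second call shows why word order matters: in this toy, the position of a noun is the only thing that decides whether it is the subject or the object, so swapping the two nouns really does make the muffins eat the children.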
Morphology can also pose a problem. Slovene distinguishes between two children and three or more using a dual suffix absent in many other languages, while Russian's lack of definite articles might leave you wondering whether the children are eating some particular muffins, or just eat muffins in general. Finally, even when the semantics are technically correct, the program might miss their finer points, such as whether the children "mangiano" the muffins, or "divorano" them.
Another method is statistical machine translation, which analyzes a database of books, articles, and documents that have already been translated by humans. By finding matches between source and translated text that are unlikely to occur by chance, the program can identify corresponding phrases and patterns, and use them for future translations. However, the quality of this type of translation depends on the size of the initial database and the availability of samples for certain languages or styles of writing.
The difficulty that computers have with the exceptions, irregularities, and shades of meaning that seem to come instinctively to humans has led some researchers to believe that our understanding of language is a unique product of our biological brain structure. In fact, one of the most famous fictional universal translators, the Babel fish from "The Hitchhiker's Guide to the Galaxy", is not a machine at all but a small creature that translates the brain waves and nerve signals of sentient species through a form of telepathy.
For now, learning a language the old-fashioned way will still give you better results than any currently available computer program. But this is no easy task, and the sheer number of languages in the world, as well as the increasing interaction between the people who speak them, will only continue to spur greater advances in automatic translation.
Perhaps by the time we encounter intergalactic life forms, we'll be able to communicate with them through a tiny gizmo, or we might have to start compiling that dictionary, after all.
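
The statistical approach mentioned in the talk, finding matches between source and translated text that are unlikely to occur by chance, can also be sketched in Python. The example below is an assumption-heavy illustration rather than how any production system works: `PARALLEL_CORPUS` is a made-up four-sentence English-Italian corpus, `train` and `best_match` are hypothetical helper names, and the Dice coefficient is swapped in as a simple stand-in for the alignment models real systems use. It scores each word pair by how often the two words appear in the same sentence pair relative to how often each appears overall.

```python
from collections import Counter
from itertools import product

# Made-up parallel corpus: (English sentence, human Italian translation).
PARALLEL_CORPUS = [
    ("the children eat", "i bambini mangiano"),
    ("the children sleep", "i bambini dormono"),
    ("the dogs eat", "i cani mangiano"),
    ("the dogs sleep", "i cani dormono"),
]


def train(corpus):
    """Count, per sentence pair, each word and each source/target word pair."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count


def best_match(word, model):
    """Return the target word with the highest Dice score for `word`."""
    src_count, tgt_count, pair_count = model
    if word not in src_count:
        raise KeyError(f"{word!r} never appears in the corpus")

    def dice(target):
        # 1.0 means the two words always appear together and never apart.
        return 2 * pair_count[(word, target)] / (src_count[word] + tgt_count[target])

    return max(tgt_count, key=dice)


model = train(PARALLEL_CORPUS)
for english in ["the", "children", "dogs", "eat", "sleep"]:
    print(english, "->", best_match(english, model))
```

With this corpus, "children" is paired with "bambini" because the two always appear together, while "i" also appears in the sentences about dogs and so scores lower. The sketch also makes the talk's caveat concrete: a word that never appears in the corpus cannot be matched at all, so quality depends directly on how much translated text is available.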
