Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there's only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice)

You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
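Editor's sketch: the source-filter idea described above can be illustrated with a few lines of Python. This is a toy model under loud assumptions, not the speaker's method: the source is a bare impulse train, each vocal-tract resonance is a simple two-pole filter, and the function names and formant values (`AH_FORMANTS`) are hypothetical choices made for illustration only.

```python
import math

def glottal_source(pitch_hz, duration_s, sample_rate=16000):
    """Impulse train: one pulse per pitch period (crude stand-in for voice-box vibration)."""
    n = round(duration_s * sample_rate)
    period = round(sample_rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """Two-pole resonant filter: crude stand-in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1 keeps it stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(pitch_hz, formants, duration_s=0.1, sample_rate=16000):
    """Push the source through each resonance in turn: source, then filter."""
    signal = glottal_source(pitch_hz, duration_s, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Assumed, illustrative resonances for an "ah"-like vowel: (center Hz, bandwidth Hz)
AH_FORMANTS = [(700.0, 110.0), (1200.0, 120.0)]

# Same filter, different sources: a higher-pitched and a lower-pitched voice
child_like = synthesize_vowel(300.0, AH_FORMANTS)
adult_like = synthesize_vowel(120.0, AH_FORMANTS)
```

Keeping the filter fixed and changing only the pitch of the source gives two different-sounding versions of the same vowel, which is the separation the talk relies on.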

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
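Editor's sketch: two of the source cues named above, pitch and loudness, can be estimated from a short sustained vowel. The example below uses plain autocorrelation for pitch and root-mean-square amplitude for loudness. These are textbook techniques swapped in for illustration; the talk does not say which methods the project uses, and the function names and the synthetic test vowel are assumptions.

```python
import math

def rms_loudness(samples):
    """Root-mean-square amplitude, a simple proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate, min_hz=75.0, max_hz=500.0):
    """Pick the lag with the strongest autocorrelation inside a plausible pitch range.

    The sum is not normalized by the number of terms, so shorter lags are
    slightly favored; that helps avoid picking a multiple of the true period.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a sustained vowel: 200 Hz fundamental plus one harmonic
sample_rate = 8000
vowel = [
    math.sin(2 * math.pi * 200 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * i / sample_rate)
    for i in range(800)
]

pitch = estimate_pitch_hz(vowel, sample_rate)   # should land close to 200 Hz
loudness = rms_loudness(vowel)
```

Real recordings are noisier than this test signal, so a practical system would need something more robust; the point here is only that a vowel-like sound carries measurable source information.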

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
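Editor's sketch: the voice bank lookup described above can be pictured as an index from short runs of sounds to the place they were recorded, plus a search that covers a new utterance with those runs. The code below is a minimal illustration, assuming recordings are already labeled as lists of sound symbols. The sound labels, the "hot chocolate" prompt, the function names, and the greedy longest-match strategy are all hypothetical; real concatenative synthesizers also weigh how smoothly neighboring snippets join.

```python
def build_voice_bank(recordings, max_unit_len=2):
    """Index every run of 1..max_unit_len sounds by where it was first recorded.

    recordings: dict mapping a recording id to its list of sound labels.
    Returns a dict: tuple of sounds -> (recording id, start index).
    """
    bank = {}
    for rec_id, sounds in recordings.items():
        for size in range(1, max_unit_len + 1):
            for start in range(len(sounds) - size + 1):
                unit = tuple(sounds[start:start + size])
                bank.setdefault(unit, (rec_id, start))  # keep the first occurrence
    return bank

def select_units(bank, target_sounds, max_unit_len=2):
    """Greedy cover of the target: prefer the longest banked unit at each position."""
    plan = []
    pos = 0
    while pos < len(target_sounds):
        longest = min(max_unit_len, len(target_sounds) - pos)
        for size in range(longest, 0, -1):
            unit = tuple(target_sounds[pos:pos + size])
            if unit in bank:
                plan.append((unit, bank[unit]))
                pos += size
                break
        else:
            raise KeyError(f"no recorded snippet covers sound {target_sounds[pos]!r}")
    return plan

# Hypothetical surrogate recordings, written as rough sound labels
recordings = {
    "i_love_to_sleep": ["AY", "L", "AH", "V", "T", "UW", "S", "L", "IY", "P"],
    "hot_chocolate": ["HH", "AA", "T", "CH", "AO", "K", "L", "AH", "T"],
}
bank = build_voice_bank(recordings)

# A new utterance never recorded as a whole: "I love chocolate"
target = ["AY", "L", "AH", "V", "CH", "AO", "K", "L", "AH", "T"]
plan = select_units(bank, target)
```

Each entry in `plan` names a snippet and where to cut it from; playing those cuts back to back is the concatenation step.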

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)
