Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you. I'm going to start by playing you a sample of a voice that you may recognize. (Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant." Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this. I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders.
I want you to take a listen to how they sound. They're saying the same utterance. (First voice) (Second voice) You probably didn't understand what they said, but I hope that you heard their unique vocal identities. So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over. So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity. Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual. 
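Editor's illustration: the source and filter described above can be sketched in a few lines of code. This is a minimal toy model, not the method used in the talk. It assumes the source is a plain impulse train (one pulse per pitch period) and each vocal-tract resonance is a simple two-pole resonator. The sample rate, function names, and formant values are all hypothetical choices made for this sketch.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def glottal_source(pitch_hz, loudness, n_samples):
    """Toy source: an impulse train with one pulse per pitch period."""
    period = round(SAMPLE_RATE / pitch_hz)
    return [loudness if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz):
    """Two-pole resonator standing in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)  # pole radius, below 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / SAMPLE_RATE)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def vocal_tract_filter(signal, formants):
    """Toy filter: pass the source through each resonance in turn."""
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz)
    return signal


def speak(pitch_hz, loudness, formants, n_samples=1600):
    """Speech as the combination of a source and a filter."""
    source = glottal_source(pitch_hz, loudness, n_samples)
    return vocal_tract_filter(source, formants)


# Hypothetical resonances (center Hz, bandwidth Hz) for an "ah"-like vowel.
AH_LIKE = [(700.0, 80.0), (1200.0, 90.0)]
low_voice = speak(pitch_hz=100.0, loudness=1.0, formants=AH_LIKE)
high_voice = speak(pitch_hz=200.0, loudness=1.0, formants=AH_LIKE)
```

In this sketch the pitch and loudness belong entirely to the source, while the vowel quality belongs entirely to the filter, so the two can be varied independently.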
Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing. So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this. (Video) Voice: Things happen in pairs. I love to sleep. The sky is blue without clouds. RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank.
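Editor's illustration: the voice bank described above can be pictured as a lookup table from small sound units to the recordings that contain them. The sketch below is a toy, not the project's actual pipeline. It assumes each recording has already been transcribed into a list of sound labels, and it ignores the audio itself and how well neighboring units join. The labels, recording ids, and function names are hypothetical.

```python
from collections import defaultdict


def sound_units(sounds):
    """Return the one-sound and two-sound units of a label sequence."""
    singles = [(s,) for s in sounds]
    pairs = [tuple(sounds[i:i + 2]) for i in range(len(sounds) - 1)]
    return singles + pairs


def build_voice_bank(recordings):
    """Map each unit to the ids of the recordings that contain it."""
    bank = defaultdict(list)
    for rec_id, sounds in recordings.items():
        for unit in sound_units(sounds):
            bank[unit].append(rec_id)
    return dict(bank)


def plan_utterance(bank, sounds):
    """Greedily cover a new utterance with units from the bank.

    Prefers a two-sound unit and falls back to a single sound.
    Returns None if some sound is not in the bank at all.
    """
    plan = []
    i = 0
    while i < len(sounds):
        pair = tuple(sounds[i:i + 2])
        single = (sounds[i],)
        if len(pair) == 2 and pair in bank:
            plan.append(pair)
            i += 2
        elif single in bank:
            plan.append(single)
            i += 1
        else:
            return None
    return plan


# Hypothetical sound labels for two short surrogate recordings.
recordings = {
    "rec1": ["ay", "l", "ah", "v"],
    "rec2": ["ch", "aa", "k", "l", "ah", "t"],
}
bank = build_voice_bank(recordings)
plan = plan_utterance(bank, ["ay", "l", "ah", "v", "ch", "aa", "k", "l", "ah", "t"])
```

A plan of `None` means the surrogate never produced one of the needed sounds, which is why the recordings aim to cover the different sound combinations of the language.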
Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance. (Video) Voice: I love chocolate. RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman. This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this. (Video) Samantha: Aaaaaah. RP: So now, Samantha can say this. (Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends. RP: Thank you. (Applause) I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact.
What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality. They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity. So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that. This is what William said: "Never heard me before." Thank you. (Applause)
