Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, and many of them use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of two people who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice) You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next was find out how we could harness these residual vocal abilities and build a technology, voices, that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with a speech disorder. But I thought there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
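
As a rough illustration of the source-filter idea described above, here is a minimal sketch, not the project's actual code. It assumes a pulse train as a crude stand-in for the voice-box vibration and a cascade of simple two-pole resonators as a stand-in for the vocal tract. The function names, the 100 Hz bandwidth, and the resonance frequencies in the usage note are all illustrative assumptions.

```python
import math


def glottal_source(f0_hz, sample_rate, n_samples):
    """Pulse train with one pulse per pitch period (toy 'source')."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, formant_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter emphasizing one frequency (toy 'filter')."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants_hz, sample_rate=16000, n_samples=1600):
    """Push the source through one resonator per vocal-tract resonance."""
    signal = glottal_source(f0_hz, sample_rate, n_samples)
    for formant in formants_hz:
        signal = resonator(signal, formant, 100.0, sample_rate)
    return signal
```

For example, `synthesize_vowel(200, [700, 1200])` returns 1,600 samples (a tenth of a second at 16 kHz) of a vowel-like buzz. Changing `f0_hz` changes the source, while changing the resonance list changes the filter.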

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
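
The mixing step can be pictured with a small data-structure sketch. Everything here is a hypothetical illustration rather than the VocaliD implementation: the `Prosody` and `Talker` types, the use of a tuple of resonance frequencies as the "filter", and the age-and-height distance used to pick a surrogate are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Prosody:
    """The 'source' cues named in the talk: pitch, loudness, tempo."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_s: float


@dataclass(frozen=True)
class Talker:
    name: str
    age_years: float
    height_cm: float
    source: Prosody
    formants_hz: Tuple[float, ...]  # toy stand-in for the "filter"


def pick_surrogate(target: Talker, donors: Sequence[Talker]) -> Talker:
    """Pick the donor closest in age and size (assumed, ad hoc distance)."""
    def distance(donor: Talker) -> float:
        return (abs(donor.age_years - target.age_years)
                + abs(donor.height_cm - target.height_cm) / 10.0)
    return min(donors, key=distance)


def blend(target: Talker, surrogate: Talker) -> Talker:
    """Keep the target's source; borrow the surrogate's filter."""
    return replace(target, formants_hz=surrogate.formants_hz)
```

The blended talker keeps the target's pitch, loudness, and tempo, which carry identity, and takes on the surrogate's articulation, which carries clarity.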

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
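
The "fish through that database" step can be sketched as a lookup over stored snippets. This is a simplified illustration under stated assumptions, not the system described in the talk: snippets are assumed to be already cut and labeled, the sound labels and clip identifiers are made up, and the greedy longest-match rule stands in for the more careful selection a real concatenative synthesizer would perform.

```python
def build_voice_bank(snippets):
    """Map each recorded sound sequence to a clip identifier.

    `snippets` is an iterable of (sound_labels, clip_id) pairs; the first
    clip recorded for a given sequence is the one that is kept.
    """
    bank = {}
    for sound_labels, clip_id in snippets:
        bank.setdefault(tuple(sound_labels), clip_id)
    return bank


def select_segments(bank, utterance):
    """Cover `utterance` (a list of sound labels) with stored clips.

    Greedy rule: at each position, take the longest stored sequence that
    matches. Raises KeyError if some sound is not in the bank at all.
    """
    longest = max((len(key) for key in bank), default=0)
    chosen = []
    i = 0
    while i < len(utterance):
        for size in range(min(longest, len(utterance) - i), 0, -1):
            key = tuple(utterance[i:i + size])
            if key in bank:
                chosen.append(bank[key])
                i += size
                break
        else:
            raise KeyError(f"no recorded snippet covers {utterance[i]!r}")
    return chosen
```

Longer matches are preferred because they need fewer joins between clips. This is also why more recorded speech tends to give a better-sounding voice: a larger bank offers longer matching stretches.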

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)
