I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.
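Today I'd like to introduce something that strongly shapes our identity: the voice. Each of us has a unique voiceprint that reflects our age, our build, even our personality and habits. In the words of the poet Henry Wadsworth Longfellow, "The human voice is the organ of the soul." As a speech scientist, I'm deeply interested in how the voice is produced, and I have a new idea about how voices can be designed and built. That idea is what I want to share with you.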
I'm going to start by playing you a sample of a voice that you may recognize.
Let me start by playing a sample. You may recognize this voice.
(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."
(Recording) Stephen Hawking: "I thought what I said was fairly clear."
Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there's only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, and many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.
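The voice in that recording belongs to Professor Stephen Hawking. What you may not know is that the same voice may also be used by this little girl, who cannot speak because of a neurological condition. In fact, all of these people may be using the same voice, because only a few voices are currently available. In the United States alone, 2.5 million people cannot communicate through speech, and many of them use electronic devices to communicate. That means millions of people around the world are using the same voices, including Professor Hawking, whose voice has an American accent. This lack of individuality in synthetic voices struck me a few years ago at an assistive technology conference. I remember walking into an exhibit hall and seeing a little girl and a grown man talking through their devices. The devices were different, but the voice was the same. I looked around and saw hundreds of people using only a handful of voices, none of which fit their bodies or their personalities. We would never consider fitting a little girl with a grown man's prosthetic limb, so why give her a voice that is not her own? I was deeply moved, and I decided to do something about it.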
I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.
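The next sample I'm going to play is of two people, both of whom have severe speech disorders. I'd like you to listen to how they sound. They are both saying the same thing.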
(First voice)
(Voice one)
(Second voice) You probably didn't understand what they said, but I hope that you heard their unique vocal identities.
(Voice two) You may not have understood what they said, but I hope you noticed the individuality in their voices.
So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.
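What I wanted to do next was find a way to use these remaining vocal characteristics to create a technology designed for them, one that personalizes their voices. I turned to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and he had been building personalized voices for people by collecting earlier recordings of their voices and then reconstructing a voice from them. But the people in Dr. Bunnell's work had lost their speech later in life. We did not have that luxury in our research: for people born with a speech disorder, there are no pre-recorded voice samples. Still, I thought there had to be a way to reverse engineer a voice from whatever little remains.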
So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.
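So that's what we decided to do. With some funding from the National Science Foundation, we set out to build individualized voices that capture their vocal characteristics. We call the project VocaliD, or vocal I.D., short for vocal identity.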
Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
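Before I explain how the voice is made and play it for you, I need to give you a quick speech science lesson. Okay? First, the voice changes dramatically as we grow up. Children sound different from teenagers, and teenagers sound different from adults. We have all gone through these changes. Fact number two: speech starts from a source, the vibrations produced by your voice box, which then pass through the rest of the vocal tract. What the image shows is the inside of the head and neck. These chambers vibrate, and they filter the source sound to produce consonants and vowels. So the source and the filtering together are how we produce speech. And that all happens within one person.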
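The source-and-filter description above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used in the project: the impulse-train source, the two-pole resonators standing in for vocal-tract chambers, and the formant values in `AH_FORMANTS` are all assumptions chosen for the example.

```python
import math

def glottal_source(f0_hz, duration_s, sample_rate=16000):
    """Impulse train: one pulse per pitch period (a crude stand-in for voice-box vibration)."""
    n = int(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """Two-pole resonator: a crude stand-in for one vocal-tract resonance (formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, always below 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Pass the source through each resonance in turn: source + filter = speech sound."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Assumed (center, bandwidth) pairs in Hz, roughly in the range of an "ah"-like vowel.
AH_FORMANTS = [(700, 80), (1200, 90)]

low_voice = synthesize_vowel(110, AH_FORMANTS)   # same filter, lower-pitched source
high_voice = synthesize_vowel(220, AH_FORMANTS)  # same filter, higher-pitched source
```

Changing only `f0_hz` changes the source while the filter stays fixed, which is the separation the talk relies on.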
Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
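I told you earlier that I have spent much of my career studying the source characteristics of people with severe speech disorders. What I found is that even though their filters are impaired, they can still modulate their source: the pitch, the loudness and the tempo of the voice. These are called prosody, and I have spent years documenting that these individuals retain their prosodic abilities. When I realized that the same cues also matter for a speaker's identity, I had an idea. Why not take the source from the person whose voice we want, because it is preserved, borrow the filter from someone of similar age and size, because they can speak clearly, and then mix the two? When we mix them, we get a voice that is as clear as the surrogate talker, the person we borrowed the filter from, and that is similar in identity to our target talker. It's that simple. That is the science behind this research.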
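As a rough illustration of the mixing idea, the sketch below treats a talker as a pair of source features (pitch, loudness, tempo) and filter features, then builds a blended voice that keeps the target's source and the surrogate's filter. The talk does not describe the actual algorithm, so the class names, fields, and numbers here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    """Prosodic cues the talk says are preserved: pitch, loudness, tempo."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_s: float

@dataclass(frozen=True)
class Talker:
    name: str
    source: Source
    formants_hz: tuple          # hypothetical stand-in for the vocal-tract filter
    articulates_clearly: bool

def blend_voice(target, surrogate):
    """Keep the target's source, borrow the surrogate's filter."""
    if not surrogate.articulates_clearly:
        raise ValueError("the surrogate must be able to articulate speech")
    return Talker(
        name=f"{target.name} (personalized)",
        source=target.source,               # identity cues come from the target
        formants_hz=surrogate.formants_hz,  # clarity comes from the surrogate
        articulates_clearly=True,
    )

# Hypothetical example values, not measurements from the talk.
target = Talker("target", Source(260.0, 62.0, 2.5), (900, 1500), False)
surrogate = Talker("surrogate", Source(230.0, 65.0, 4.0), (750, 1250), True)
personalized = blend_voice(target, surrogate)
```

The check on `articulates_clearly` mirrors the stated reason for borrowing the filter: the surrogate is chosen because they can articulate speech.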
So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.
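With that idea in place, how do you go about building the voice? First, you have to find someone willing to be the surrogate. It's not such a daunting task. Being a voice donor only requires you to read a few hundred to a few thousand sentences. Here is the process.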
(Video) Voice: Things happen in pairs.
(Video) Voice: Things happen in pairs.
I love to sleep.
I love sleeping.
The sky is blue without clouds.
The sky is blue and cloudless.
RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
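RP: She will keep reading like this for the next three to four hours. The goal is not to have her say everything the target talker will want to say, but to cover all the combinations of sounds that can occur in the language. The more speech you record, the better the voice will sound. Once the recordings are done, we have to parse them into small segments, combinations of one or two sounds and sometimes whole words, which populate a dataset or database. We call this database a voice bank. The power of the voice bank is that it lets us say any new utterance, such as "I love chocolate" (everyone needs to be able to say that): we search the database and find all the segments needed to produce that utterance.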
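The voice bank lookup can be sketched as a small dictionary search. This is a toy illustration under loud assumptions: letters stand in for individual sounds, snippet IDs stand in for audio clips, and the function names are hypothetical. A real system would work with phonetic units and recorded audio.

```python
def _clean(word):
    """Lowercase a word and drop surrounding punctuation."""
    return word.strip(".,!?").lower()

def build_voice_bank(recordings):
    """Index every whole word and every letter (a stand-in for a sound) by snippet ID."""
    bank = {}
    for rec_id, sentence in enumerate(recordings):
        for word_idx, raw in enumerate(sentence.split()):
            word = _clean(raw)
            # Keep the first recorded instance of each unit.
            bank.setdefault(("word", word), f"rec{rec_id}:word{word_idx}")
            for pos, letter in enumerate(word):
                bank.setdefault(("sound", letter), f"rec{rec_id}:word{word_idx}:pos{pos}")
    return bank

def synthesize(utterance, bank):
    """Return the ordered snippet IDs to concatenate for a new utterance."""
    segments = []
    for raw in utterance.split():
        word = _clean(raw)
        if ("word", word) in bank:
            # Prefer a whole recorded word when one exists.
            segments.append(bank[("word", word)])
            continue
        for letter in word:
            key = ("sound", letter)
            if key not in bank:
                raise KeyError(f"voice bank has no unit for {letter!r}")
            segments.append(bank[key])
    return segments

# The three sentences read aloud in the video clip.
RECORDINGS = [
    "Things happen in pairs.",
    "I love to sleep.",
    "The sky is blue without clouds.",
]
VOICE_BANK = build_voice_bank(RECORDINGS)
plan = synthesize("I love chocolate", VOICE_BANK)
```

Here "I" and "love" are found as whole words, while "chocolate" was never recorded and is assembled from smaller units, which is why coverage of sound combinations matters more than coverage of sentences.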
(Video) Voice: I love chocolate.
(Video) Voice: I like chocolate.
RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.
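RP: That is speech synthesis. It's called concatenative synthesis, and it's the method we use. That is not the novel part. What is novel is how we make it sound like this young woman.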
This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.
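This is Samantha. I first met her when she was nine, and since then my team and I have been working to build her a personalized voice. First we needed a donor, and then we had Samantha produce some sounds. Most of the sounds she can make are vowel-like, but that is enough for us to extract the characteristics of her voice source. What happens next is best described by my daughter's analogy. She is six. She says it is mixing colors to paint voices. It's beautiful, and it's exactly that. Samantha's voice is like a concentrated sample of red food dye, which we can infuse into her surrogate's recordings to get a pink voice like this.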
(Video) Samantha: Aaaaaah.
(Video) Samantha: Aaah.
RP: So now, Samantha can say this.
RP: Now Samantha can say this.
(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.
(Video) Samantha: This voice is mine alone. I can't wait to share my voice with my friends.
RP: Thank you. (Applause)
RP: Thanks.
I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.
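I will never forget the gentle smile that spread across her face when she first heard her own voice. There are millions of people like Samantha in the world, millions, and our work has only just begun. So far, only a few surrogate talkers from across the United States have donated their voices, and we have used those donations to build our first personalized voices. But there is much more work to do. Samantha's surrogate came from the Midwest, a stranger who gave her the gift of a voice. As a scientist, I am excited to take this research out of the laboratory and into the real world, where it can have a real impact. What I want to share next is how I imagine taking this research to the next stage. I imagine a world full of voice donors from all walks of life, of different sizes and ages, coming together in a voice drive to give people voices as colorful as their personalities. As a first step, we have built a website, VocaliD.org, to bring together those who want to donate their voices, those who want to contribute expertise, and those who want to help in any other way.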
They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.
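They say that donating blood can save lives. Well, donating your voice can change lives. All we need is a few hours of speech from the surrogate talker and a few vowels from the target talker to create a unique vocal identity.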
So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.
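That is the science behind our research. I want to end by returning to the human side, which is the real inspiration for this work. About five years ago, we built our first voice for a little boy named William. When his mother first heard the voice, she said, "If William could speak, this is exactly what he would sound like." Then we saw William typing a message on his device. I wondered what he was thinking. Imagine borrowing someone else's voice for nine years and finally having your own. Imagine that.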
This is what William said: "Never heard me before."
This is what William said: "I had never heard myself speak before."
Thank you.
Thank you, everyone.
(Applause)
(Applause)