Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

오늘 제가 말씀드리고자 하는 것은 우리 자신의 강력하고도 기본적인 한 측면에 대한 것입니다. 바로 우리의 목소리죠. 우리 모두는 독특한 성문을 가지고 있는데요. 이 성문은 우리의 나이, 크기, 심지어 생활 양식이나 성격까지도 드러내줍니다. 시인 롱펠로우는 "인간의 목소리는 영혼의 오르간이다." 라고 했습니다. 음성학자로서 저는 음성이 어떻게 만들어지는지에 매료되어 있고 공학적으로 어떻게 만들어질 수 있는지에 대한 아이디어를 가지고 있습니다. 이게 오늘 제가 여러분과 나눌 이야기의 주제입니다.

I'm going to start by playing you a sample of a voice that you may recognize.

여러분이 알아들으실만한 목소리로 예시를 들며 시작해 보겠습니다.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

(녹음) 스티븐 호킹: "제가 생각하는 바로는 제 의도가 아주 명확합니다."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there's only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, and many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

러팰 패텔: 저건 바로 스티븐 호킹 교수의 목소리였습니다. 여러분이 흥미롭게 느끼실만한 점은, 이 어린 소녀가 똑같은 목소리를 사용할 수도 있다는 것입니다. 이 아이는 신경학적 문제로 인해 말을 할 수가 없습니다. 사실 이런 모든 사람들은 똑같은 목소리를 쓰고 있을지도 모릅니다. 그 이유는 선택 가능한 목소리 종류가 그리 많지 않기 때문이에요. 미국에만 말하지 못하는 사람이 250만이나 됩니다. 그들 중 많은 사람들은 컴퓨터화된 기기를 이용하여 의사 소통을 합니다. 그렇다면 세계적으로는 수백만의 사람들이 공통적인 목소리를 사용하는거겠죠. 호킹 교수를 포함해서 말입니다. 그 분은 미국 억양을 가진 목소리를 사용합니다. 이렇듯 기계 음성의 개별화가 부족하다는 것, 이게 바로 문제의 핵심입니다. 제가 보조 기구 회의에 참석했던 적이 있습니다. 몇 해 전이었죠. 그때 저는 전시관을 걸어가다가 어린 소녀와 성인이 각자의 기구를 사용하여 대화하는 것을 보았습니다. 서로 다른 기구였지만 목소리는 똑같았죠. 둘러보니 이런 일이 제 주위에서도 빈번하게 일어나고 있었죠. 말 그대로 수백명의 사람들이 단 몇 가지 목소리를 갖고 있었습니다. 그들의 신체나 성격과는 맞지 않는 목소리였어요. 우리는 어린 소녀에게 성인의 보조 기구를 사용시킬 상상도 하지 않습니다. 그런데 왜 목소리는 똑같은 것을 사용할까요? 이건 정말 제게 큰 충격이었고 저는 뭔가 해야겠다는 생각이 들었습니다.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

여러분들께 어떤 사람의 목소리를 들려 드릴텐데요. 두 사람의 목소리입니다. 이 두 사람에게는 심각한 육성 장애가 있습니다. 어떻게 들리는지 한번 들어보세요. 두 사람은 똑같은 말을 하고 있습니다.

(First voice)

(첫번째 음성)

(Second voice) You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

(두번째 음성) 여러분들은 아마 두 사람이 어떤 말을 나눴는지 모르실 겁니다. 하지만 저는 여러분들이 각각의 독특한 음성을 들으셨기를 바랍니다.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

그래서 제가 그 다음으로 하고자 했던 게 있습니다. 저는 이렇게 남아 있는 발성 능력을 어떻게 이용할 수 있는지 알아내고 싶었습니다. 그리고는 기술을 이용해 각자의 목소리로 개별화할 수 있기를 희망했습니다. 각자의 특성을 지닌 목소리로 말이에요. 저는 제 동료인 팀 버넬을 찾아갔습니다. 버넬 박사님은 음성 합성 전문가인데 그분이 하고 계시던 일은 사람들의 개별적인 목소리를 만드는 일이었어요. 미리 녹음된 사람들의 목소리를 조합하여 개개인을 위한 목소리를 다시 만들어 내는 일이었죠. 이 분들은 후천적으로 자신의 목소리를 잃은 사람들이에요. 우리에겐 선천적으로 음성 장애를 가지고 태어난 사람들이 이미 녹음한 목소리를 가지는 사치 같은 것은 없었습니다. 하지만 아무리 적은 것이라도 남아있으면 그것을 통해 공학적으로 목소리를 되살려 낼 수 있는 방법이 있을거라고 생각했어요.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

바로 그 방법을 써보기로 한겁니다. 우리는 국립과학재단(NSF)으로부터 받은 약간의 재정 지원을 가지고 각자의 유일한 음성적 특성을 갖도록 맞춤 제작형 목소리를 만들기 시작했습니다. 저희는 이 프로젝트를 보컬아이디(VocaliD, 즉 vocal I.D.)라고 불렀습니다. vocal identity에서 따왔죠.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.

저희가 목소리를 어떻게 만들었는지에 대한 자세한 이야기나 목소리를 여러분께 들려드리기 전에 여러분께 아주 짧은 음성 과학 교육을 해야할 필요가 있습니다, 괜찮겠죠? 우선, 우리는 사람의 목소리가 그 주인이 성장함에 따라 급격하게 변한다는 사실을 알고 있습니다. 어린 아이일 때의 음성은 청소년기의 음성과 다르고 또 그건 성인이 되었을 때와도 다릅니다. 우리 모두 이런 경험을 해본 적이 있지요. 두번째 사실은 음성이란 것이 각자의 후두에서 발생하는 진동, 즉 음원과 그 음원이 나머지 성도 부분을 통해 밀려나가며 일어나는 현상의 조화로 만들어진다는 것입니다. 이런 것들은 사람의 머리와 목에 있는 공간인데요. 이들이 진동하고 음성원을 투과시켜 자음과 모음을 만들어 냅니다. 즉, 음성원과 투과 과정의 조합으로 우리가 말하게 되는 겁니다. 그것이 모든 개개인에게서 일어나죠.

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.

앞에서 말씀드린대로 저는 제 일생의 상당 부분을 심각한 음성 장애를 가진 사람들의 음성원이 가진 특성을 이해하고 연구해 왔습니다. 제가 알아낸 것은 그들의 투과 장치가 손상되어도 음성원을 조절할 수 있다는 사실입니다: 목소리의 높낮이, 크기, 빠르기 같은 것들이요. 이런 것들을 운율 체계라고 하는데, 저는 이런 분들의 운율적 능력이 보존된다는 점을 여러 해에 걸쳐 조사해 왔습니다. 그래서 이런 것들이 말하는 사람의 특성에 중요하다는 것에 생각이 미치자 바로 이런 아이디어가 떠올랐습니다. 원하는 목소리를 가진 사람들의 음성원을 가져오면 어떨까하는 것이었죠. 음성원은 보존되니까요. 그리고 이 투과 기관이 음성을 구분지을 수 있으니까 비슷한 나이와 신체 크기를 가진 사람의 투과 기관을 차용하여 음성원과 혼합하면 어떨까하는 생각이었습니다. 왜냐하면 그것들을 합성하면 대리자의 목소리 만큼이나 명확한 목소리를 얻을 수 있으니까요. 투과 기관을 차용해 준 사람이 대리자가 되는 겁니다. 그리고 그 목소리의 특성은 육성을 내고자 하는 사람의 목소리와 비슷하게 되는거죠. 아주 간단하죠. 이게 바로 저희가 하고 있는 일의 과학적 원리라고 할 수 있습니다.
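(Editor's sketch) The source-and-filter mix described above can be illustrated with a toy model. This is not the VocaliD method: it assumes the target talker's source is a plain impulse train at a given pitch and loudness, and that the surrogate's filter is a cascade of two-pole resonators. The sample rate, function names, pitch, and formant values are all hypothetical, chosen only for illustration.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def glottal_source(f0_hz, loudness, duration_s):
    """Impulse train standing in for the target talker's source (pitch, loudness)."""
    n = int(SAMPLE_RATE * duration_s)
    period = max(1, round(SAMPLE_RATE / f0_hz))
    return [loudness if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz):
    """Two-pole resonator: one resonance of the surrogate's vocal-tract filter."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / SAMPLE_RATE)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def mix_voice(target_source, surrogate_formants):
    """Pass the target's source through each surrogate resonance in cascade."""
    signal = target_source
    for freq_hz, bandwidth_hz in surrogate_formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal

# Hypothetical values: a child-like pitch and rough resonances for an "ah" vowel.
source = glottal_source(f0_hz=250.0, loudness=1.0, duration_s=0.05)
formants_ah = [(850.0, 80.0), (1600.0, 90.0), (2800.0, 120.0)]
voice = mix_voice(source, formants_ah)
```

Changing `f0_hz` or `loudness` alters only the source, while changing `formants_ah` alters only the filter, which is the separation the talk relies on.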

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

그럼 이런 것을 염두에 두었을 때, 이런 목소리는 어떻게 만드는 걸까요? 우선 목소리를 빌려줄 사람을 찾아야 합니다. 그건 그리 어려운 일이 아니죠. 목소리를 빌려주는 데에는 몇 백에서 몇 천개쯤의 소리만 내주면 됩니다. 그 과정은 이렇습니다.

(Video) Voice: Things happen in pairs.

(영상) 목소리: 현상은 짝으로 일어난다.

I love to sleep.

나는 잠자는 걸 좋아해.

The sky is blue without clouds.

하늘이 구름 한점없이 맑군.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.

패텔: 이제 그녀는 이렇게 서너 시간을 계속합니다. 요점은 그녀가 이렇게 모든 단어를 말하는 것이 아니라 언어에서 일어날 수 있는 모든 음성의 다른 조합들을 만들어 내는 것이에요. 발음샘플을 더 많이 확보할 수록 더 좋은 목소리를 갖게 됩니다. 이걸 모두 녹음한 다음에 해야 할 일은 이 녹음을 분석하여 말에 쓰이는 작은 조각으로 그리고 이걸 다시 한두가지 소리의 조합으로 만드는데, 때론 전체 단어로 자료 집합 혹은 데이터베이스를 많이 만들어냅니다. 이런 자료 조합을 음성 은행이라고 합니다. 음성 은행의 힘은 그로부터 어떤 새로운 음성이라도 만들어낼 수 있다는 점에 있습니다. --"나는 초코렛을 사랑해." 같은 말도요. 누구나 이 말은 꼭 할 수 있어야 하잖아요. -- 그 자료 집합에서 걸러내 그런 발음을 하는데 필요한 모든 단위를 찾아냅니다.
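(Editor's sketch) The voice bank lookup described above can be sketched as follows. This is only an illustration under loud assumptions, not the project's code: sounds are written as made-up text labels rather than audio, the bank stores only one- and two-sound snippets, the second donor sentence "chalk talk" is invented so the example has the sounds it needs, and the greedy longest-match selection is a stand-in for whatever unit selection a real concatenative synthesizer uses.

```python
def build_voice_bank(recordings):
    """Index every one- and two-sound snippet by where it was first recorded."""
    bank = {}
    for rec_id, sounds in recordings.items():
        for size in (1, 2):
            for start in range(len(sounds) - size + 1):
                unit = tuple(sounds[start:start + size])
                # Keep the first occurrence of each unit as its snippet.
                bank.setdefault(unit, (rec_id, start, size))
    return bank

def select_units(bank, target_sounds):
    """Greedy cover: prefer two-sound snippets, fall back to single sounds."""
    chosen = []
    i = 0
    while i < len(target_sounds):
        for size in (2, 1):
            unit = tuple(target_sounds[i:i + size])
            if len(unit) == size and unit in bank:
                chosen.append(bank[unit])
                i += size
                break
        else:
            raise KeyError(f"no recorded snippet covers {target_sounds[i]!r}")
    return chosen

# Hypothetical sound labels for two donor sentences.
recordings = {
    "I love to sleep": ["ay", "l", "ah", "v", "t", "uw", "s", "l", "iy", "p"],
    "chalk talk": ["ch", "aa", "k", "t", "aa", "k"],
}
bank = build_voice_bank(recordings)

# A new utterance the donor never said: "I love chocolate".
target = ["ay", "l", "ah", "v", "ch", "aa", "k", "l", "ah", "t"]
units = select_units(bank, target)

# Concatenate the selected snippets back into a sound sequence.
spoken = [s for rec_id, start, size in units
          for s in recordings[rec_id][start:start + size]]
```

Each entry in `units` records which recording a snippet came from, where it starts, and how many sounds it spans, so the new utterance is assembled entirely from pieces the donor actually recorded.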

(Video) Voice: I love chocolate.

(영상) 음성: 나는 초코렛을 사랑해.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

패텔: 이것이 음성 합성입니다. '연결 합성(concatenative synthesis)'이라고 하는데요. 우리가 사용하는 방식이 바로 이거에요. 이건 특별히 새로운 것은 아닙니다. 독창적인 부분은 우리가 그 목소리를 이 젊은 여성처럼 들리게 만드는 방법이에요.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

이 여성분의 이름은 사만다라고 해요. 저는 이 아이가 9살때 처음 만났는데 그 이후로, 저희 팀과 저는 이 아이만의 음성을 만들려고 노력했습니다. 처음에 우리는 목소리를 빌려줄 사람을 찾아야만 했습니다. 그리고는 사만다가 몇가지 소리를 내도록 했죠. 그 아이가 낼 수 있는 소리는 대부분 모음처럼 들리는 것이었지만 저희가 그 아이의 특색을 살린 육성을 추출하기에는 충분했습니다. 그 다음으로 했던 일은 제 딸아이가 했던 것과 똑같아요. 이제 6살인데요. 그 애는 소리를 합성하는 것을 목소리에 색을 입힌다고 합니다. 아주 아름다운 비유죠. 딱 그말이 맞습니다. 사만다의 목소리는 빨강색 식용색소를 농축해 놓은 것과 같아서 그걸 기증된 목소리를 녹음한 것에 입혀 이런 핑크빛 목소리를 만들어 냅니다.

(Video) Samantha: Aaaaaah.

(영상) 사만다: 아~~

RP: So now, Samantha can say this.

패텔: 이제 사만다는 이런 말을 할 수 있어요.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

(영상) 사만다: 이 목소리는 오직 나를 위한 목소리다. 새로 받은 목소리로 어서 친구들과 얘기하고 싶어요.

RP: Thank you. (Applause)

패텔: 감사합니다. (박수)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

저는 사만다가 이 목소리를 처음 들었을 때, 아이의 얼굴에 퍼지던 부드러운 미소를 절대로 잊을 수가 없습니다. 전세계에는 사만다와 같은 사람들이 수백만 명이나 있어요. 수백만이요. 저희는 이제 겨우 한걸음 내딛기 시작한 정도구요. 이제까지 우리가 해낸 일은 미국에서 자신의 목소리를 기부하겠다고 해준 몇몇의 봉사자를 찾아냈고 우리는 그 목소리를 이용하여 몇몇 개별화된 목소리를 만들어냈어요. 하지만 아직도 해야할 일이 많습니다. 사만다의 경우에, 그 아이가 빌린 목소리는 중서부 어느 지역에서 받은건데 얼굴도 모르는 사람이 선물을 준 것이지요. 과학자로서 저는 실험실에서 이런 일을 해내고 실제 세상에도 확산할 수 있다는 것에 상당히 고무되어 있습니다. 실생활에 영향을 미칠 수 있는 것이죠. 그 다음으로 여러분들께 이런 작업을 어떻게 다음 단계로 이끌어 갈지에 대해 말씀드리고 싶어요. 저는 목소리 기증자가 전세계적으로 충분히 있어서 어디를 가든, 다른 크기와 나이대에서 있기를 희망합니다. 모두가 이런 목소리 기증에 참여하여 사람들에게 각자의 개성만큼이나 다양한 음색을 줄 수 있기를 바랍니다. 그러기 위한 첫걸음으로 저희가 이런 웹사이트를 개설했어요. VocaliD.org 입니다. 이를 통하여 목소리 기증자로 우리와 함께 하고자 하는 사람들을 한데 모으고 기증도 전문화하는 겁니다. 그리고 어떤 방식으로든 이런 희망을 현실로 만들고자 합니다.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

사람들은 헌혈이 생명을 구할 수 있다고 말합니다. 자, 목소리를 기증하면 누군가의 삶을 바꿀 수 있어요. 해야할 것이라곤 목소리 기증자로 몇 시간 발성하는 것이 전부에요. 목소리가 필요한 사람이 내는 몇 가지 모음으로 각자의 개성에 맞는 음성을 창조해내는 겁니다.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

그게 바로 저희가 하고 있는 일의 밑바탕에 깔린 과학입니다. 저는 인간적인 측면으로 되돌아 가면서 강연을 마치고자 합니다. 이런 작업을 하게 된 이유이기도 하죠. 5년전 즈음, 저희는 윌리엄이란 아이를 위해 처음으로 목소리를 만들었었어요. 아이의 엄마가 그 목소리를 처음 들었을 때 그녀가 이렇게 말했습니다. "이 음성이 윌리엄이 말을 할 수 있다면 내리라고 생각하는 음성이에요." 그리고는 윌리엄이 자신의 기구로 메시지를 입력하는 것을 보았습니다. 저도 궁금했어요. 아이가 어떻게 생각할까? 남의 목소리를 가지고 9년이나 살아오다가 이제서야 자기 자신만의 목소리를 찾게 된 것을 생각해보세요. 그게 어떤 기분일지 한번 상상해 보세요.

This is what William said: "Never heard me before."

윌리엄이 한 말입니다: "이제야 제 목소리를 들어보네요."

Thank you.

감사합니다.

(Applause)

(박수)
