Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice)

RP: You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
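
The source-filter idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the project: the function names, the 16 kHz sample rate, and the formant values are all hypothetical choices, with an impulse train standing in for the voice box and simple two-pole resonators standing in for the vocal tract.

```python
import math


def glottal_source(f0_hz, duration_s, sample_rate=16000):
    """Impulse train with one pulse per pitch period (crude stand-in for the voice box)."""
    n = int(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate=16000):
    """Two-pole resonator (crude stand-in for one vocal-tract resonance)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1 so the filter is stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Push the source through each resonance in turn: source, then filter."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (assumed) values: a 100 Hz pitch and two resonances for a vowel-like sound.
vowel = synthesize_vowel(100, [(700, 130), (1220, 70)])
```

Changing `f0_hz` alters only the source (the pitch), while changing `formants` alters only the filter, which is the separation the talk relies on.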

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorder, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. These are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
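
As a data-structure sketch of that mixing step, the Python below keeps the target talker's source and borrows the surrogate's filter. Every class, field, and number here is a hypothetical placeholder chosen for illustration; the talk does not describe the actual representation or algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Prosodic cues the talk says are preserved: pitch, loudness, tempo."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_s: float


@dataclass(frozen=True)
class VocalFilter:
    """Articulation side of the voice (assumed here to be a list of resonances)."""
    formants_hz: tuple
    intelligible: bool


@dataclass(frozen=True)
class Talker:
    name: str
    source: Source
    vocal_filter: VocalFilter


def blend(target, surrogate):
    """Keep the target's source (identity) and borrow the surrogate's filter (clarity)."""
    return Talker(
        name=f"{target.name} (personalized)",
        source=target.source,
        vocal_filter=surrogate.vocal_filter,
    )


# Made-up example values for a target talker and a surrogate of similar age and size.
target = Talker("target", Source(260.0, 62.0, 2.5), VocalFilter((900.0, 1500.0), False))
surrogate = Talker("surrogate", Source(240.0, 65.0, 4.0), VocalFilter((850.0, 1600.0), True))
personalized = blend(target, surrogate)
```

The result carries the target's pitch, loudness, and tempo together with the surrogate's intelligible articulation.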

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
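
A toy version of the voice bank lookup might look like the following Python sketch. It is a deliberate simplification: letters and letter pairs stand in for the one- and two-sound snippets (a real system would work on recorded audio and speech sounds, not spelling), and the function names and the greedy longest-match strategy are assumptions made for illustration.

```python
def normalize(text):
    """Lowercase the text and keep only the letters of each word."""
    words = ("".join(ch for ch in w if ch.isalpha()) for w in text.lower().split())
    return [w for w in words if w]


def units_of(word):
    """Whole word, then two-letter pairs, then single letters (stand-ins for sounds)."""
    pairs = [word[i:i + 2] for i in range(len(word) - 1)]
    return [word] + pairs + list(word)


def build_voice_bank(recorded_utterances):
    """Map every unit to the index of the first recording that contains it."""
    bank = {}
    for index, text in enumerate(recorded_utterances):
        for word in normalize(text):
            for unit in units_of(word):
                bank.setdefault(unit, index)
    return bank


def select_units(bank, utterance):
    """Greedy longest match: whole word first, then pairs, then single letters."""
    plan = []
    for word in normalize(utterance):
        if word in bank:
            plan.append(word)
            continue
        i = 0
        while i < len(word):
            for size in (2, 1):
                unit = word[i:i + size]
                if len(unit) == size and unit in bank:
                    plan.append(unit)
                    i += size
                    break
            else:
                raise KeyError(f"no recorded unit covers {word[i]!r}")
    return plan


recordings = [
    "Things happen in pairs.",
    "I love to sleep.",
    "The sky is blue without clouds.",
]
bank = build_voice_bank(recordings)
plan = select_units(bank, "I love chocolate")
```

Here "I" and "love" are found as whole recorded words, while "chocolate", which was never recorded, is pieced together from smaller units. That is why the donor only has to cover the sound combinations of the language rather than every possible sentence.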

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)

Related talks

Ellen Jorgensen: Biohacking -- you can do it, too

Julian Treasure: Shh! Sound health in 8 steps

Craig Venter: On the verge of creating synthetic life

Rébecca Kleinberger: Why you don't like the sound of your own voice

Shaylin Schundler: Why does your voice change as you get older?

Roger Ebert: Remaking my voice