Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of two people who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice)

You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next was find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with a speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
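
As an editorial illustration of the source-and-filter idea described above, here is a minimal Python sketch. It is not the speaker's implementation: the sample rate, the impulse-train source, the two-pole resonators, and the rough formant values for "ah" and "ee" are all assumptions chosen only to show one source being shaped by two different filters.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def glottal_source(pitch_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Impulse train with one pulse per pitch period, standing in for the voice box."""
    n = int(round(duration_s * sample_rate))
    period = max(1, round(sample_rate / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonator that emphasizes energy near center_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vocal_tract_filter(signal, formants, sample_rate=SAMPLE_RATE):
    """Pass the signal through one resonator per (center_hz, bandwidth_hz) pair, in series."""
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative (center, bandwidth) pairs in Hz; rough values, not measurements.
VOWEL_AH = [(730, 90), (1090, 110)]
VOWEL_EE = [(270, 60), (2290, 200)]

source = glottal_source(pitch_hz=200, duration_s=0.05)
ah = vocal_tract_filter(source, VOWEL_AH)  # same source, "ah"-like filter
ee = vocal_tract_filter(source, VOWEL_EE)  # same source, "ee"-like filter
```

The same `source` list feeds both calls; only the filter settings differ, which is the sense in which the source carries pitch while the filter shapes the vowel.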

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorders, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. Together these are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
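
One very simplified way to picture the mixing step, offered as an editorial sketch and not as the method used in the project: keep the surrogate's clearly articulated recording, but rescale its pitch contour so that its average matches the pitch measured from the target talker's sustained vowel. The function name `match_pitch_level` and all numbers are hypothetical; loudness and tempo are left out.

```python
from statistics import fmean

def match_pitch_level(surrogate_contour_hz, target_vowel_hz):
    """Scale the surrogate's pitch contour so its voiced average equals the target's.

    Both arguments are lists of per-frame pitch values in Hz.
    A value of 0 marks an unvoiced frame and is left at 0.
    """
    voiced = [f for f in surrogate_contour_hz if f > 0]
    if not voiced or not target_vowel_hz:
        raise ValueError("need at least one voiced frame from each talker")
    scale = fmean(target_vowel_hz) / fmean(voiced)
    return [f * scale for f in surrogate_contour_hz]

# Hypothetical measurements: an adult surrogate and a child's sustained vowel.
surrogate_contour = [110.0, 120.0, 0.0, 130.0, 120.0]
target_vowel = [228.0, 240.0, 252.0]

personalized = match_pitch_level(surrogate_contour, target_vowel)
print(personalized)  # [220.0, 240.0, 0.0, 260.0, 240.0]
```

The rise and fall of the surrogate's intonation is kept, while the overall pitch level moves to the target's.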

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but the idea is to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better sounding voice you're going to have. Once you have those recordings, what we need to do is we have to parse these recordings into little snippets of speech, one- or two-sound combinations, sometimes even whole words that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from this voice bank, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.
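
The lookup described above can be sketched in Python as follows. This is an editorial illustration under stated assumptions: the voice bank is modeled as a dictionary from short sound sequences to clip names, the clip file names and the phone spelling of "I love chocolate" are made up, and a greedy longest-match search stands in for whatever selection strategy a real system uses.

```python
def select_units(phones, voice_bank, max_unit_len=3):
    """Cover the phone sequence with the longest snippets found in the bank, left to right."""
    units = []
    i = 0
    while i < len(phones):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_unit_len, len(phones) - i), 0, -1):
            key = tuple(phones[i:i + length])
            if key in voice_bank:
                units.append(voice_bank[key])
                i += length
                break
        else:
            raise KeyError(f"no snippet in the bank covers phone {phones[i]!r}")
    return units

# Hypothetical voice bank: sound sequences mapped to recorded snippets.
voice_bank = {
    ("AY",): "clip_001.wav",
    ("L", "AH"): "clip_002.wav",
    ("V",): "clip_003.wav",
    ("CH", "AA", "K"): "clip_004.wav",
    ("L", "AH", "T"): "clip_005.wav",
}

# "I love chocolate" written as an assumed phone sequence.
utterance = ["AY", "L", "AH", "V", "CH", "AA", "K", "L", "AH", "T"]
print(select_units(utterance, voice_bank))
# ['clip_001.wav', 'clip_002.wav', 'clip_003.wav', 'clip_004.wav', 'clip_005.wav']
```

Joining the selected clips end to end is the concatenation step. The sketch ignores how smoothly neighboring snippets fit together, which a practical system would also need to consider.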

(Video) Voice: I love chocolate.

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there are millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that, as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)
