Rupal Patel: Synthetic voices, as unique as fingerprints

I'd like to talk today about a powerful and fundamental aspect of who we are: our voice. Each one of us has a unique voiceprint that reflects our age, our size, even our lifestyle and personality. In the words of the poet Longfellow, "the human voice is the organ of the soul." As a speech scientist, I'm fascinated by how the voice is produced, and I have an idea for how it can be engineered. That's what I'd like to share with you.

I'm going to start by playing you a sample of a voice that you may recognize.

(Recording) Stephen Hawking: "I would have thought it was fairly obvious what I meant."

Rupal Patel: That was the voice of Professor Stephen Hawking. What you may not know is that same voice may also be used by this little girl who is unable to speak because of a neurological condition. In fact, all of these individuals may be using the same voice, and that's because there are only a few options available. In the U.S. alone, there are 2.5 million Americans who are unable to speak, many of whom use computerized devices to communicate. Now that's millions of people worldwide who are using generic voices, including Professor Hawking, who uses an American-accented voice. This lack of individuation of the synthetic voice really hit home when I was at an assistive technology conference a few years ago, and I recall walking into an exhibit hall and seeing a little girl and a grown man having a conversation using their devices, different devices, but the same voice. And I looked around and I saw this happening all around me, literally hundreds of individuals using a handful of voices, voices that didn't fit their bodies or their personalities. We wouldn't dream of fitting a little girl with the prosthetic limb of a grown man. So why then the same prosthetic voice? It really struck me, and I wanted to do something about this.

I'm going to play you now a sample of someone who has, two people actually, who have severe speech disorders. I want you to take a listen to how they sound. They're saying the same utterance.

(First voice)

(Second voice)

RP: You probably didn't understand what they said, but I hope that you heard their unique vocal identities.

So what I wanted to do next is, I wanted to find out how we could harness these residual vocal abilities and build a technology that could be customized for them, voices that could be customized for them. So I reached out to my collaborator, Tim Bunnell. Dr. Bunnell is an expert in speech synthesis, and what he'd been doing is building personalized voices for people by putting together pre-recorded samples of their voice and reconstructing a voice for them. These are people who had lost their voice later in life. We didn't have the luxury of pre-recorded samples of speech for those born with a speech disorder. But I thought, there had to be a way to reverse engineer a voice from whatever little is left over.

So we decided to do exactly that. We set out with a little bit of funding from the National Science Foundation, to create custom-crafted voices that captured their unique vocal identities. We call this project VocaliD, or vocal I.D., for vocal identity.

Now before I get into the details of how the voice is made and let you listen to it, I need to give you a real quick speech science lesson. Okay? So first, we know that the voice is changing dramatically over the course of development. Children sound different from teens who sound different from adults. We've all experienced this. Fact number two is that speech is a combination of the source, which is the vibrations generated by your voice box, which are then pushed through the rest of the vocal tract. These are the chambers of your head and neck that vibrate, and they actually filter that source sound to produce consonants and vowels. So the combination of source and filter is how we produce speech. And that happens in one individual.
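
The source-filter idea in this lesson can be sketched in a few lines of Python. This is only an illustration, not the VocaliD implementation: the source is modeled as a bare impulse train at the speaker's pitch, the filter as a chain of two-pole resonators, and the function names, sample rate, and the rough "ah" formant values are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate, in Hz

def glottal_source(f0_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Crude stand-in for the voice box: one unit pulse per pitch period."""
    n = round(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vocal_tract_filter(signal, formants, sample_rate=SAMPLE_RATE):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative, approximate formants for an "ah"-like vowel (assumed values).
AH_FORMANTS = [(700, 130), (1200, 70), (2600, 160)]

# Same filter, two different sources: a lower-pitched and a higher-pitched "ah".
low_ah = vocal_tract_filter(glottal_source(100, 0.1), AH_FORMANTS)
high_ah = vocal_tract_filter(glottal_source(200, 0.1), AH_FORMANTS)
```

Changing `f0_hz` changes only the source, and changing the formant list changes only the filter, which is the separation the rest of the talk relies on.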

Now I told you earlier that I'd spent a good part of my career understanding and studying the source characteristics of people with severe speech disorders, and what I've found is that even though their filters were impaired, they were able to modulate their source: the pitch, the loudness, the tempo of their voice. Together these are called prosody, and I've been documenting for years that the prosodic abilities of these individuals are preserved. So when I realized that those same cues are also important for speaker identity, I had this idea. Why don't we take the source from the person we want the voice to sound like, because it's preserved, and borrow the filter from someone about the same age and size, because they can articulate speech, and then mix them? Because when we mix them, we can get a voice that's as clear as our surrogate talker -- that's the person we borrowed the filter from -- and is similar in identity to our target talker. It's that simple. That's the science behind what we're doing.
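
The mixing step can be sketched as a small data model, under stated assumptions: every class, field, and function name here is hypothetical, the numbers are invented for illustration, and the age/height distance used to pick a surrogate is an arbitrary stand-in for "about the same age and size", not the project's actual matching rule.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Source:
    """Prosodic cues the target talker can still control."""
    pitch_hz: float
    loudness_db: float
    tempo_syllables_per_s: float

@dataclass(frozen=True)
class Filter:
    """Vocal-tract shaping, summarized here as formant frequencies."""
    formants_hz: Tuple[float, ...]

@dataclass(frozen=True)
class Talker:
    name: str
    age_years: float
    height_cm: float
    source: Source
    filter: Filter

def pick_surrogate(target: Talker, donors: List[Talker]) -> Talker:
    """Choose the donor closest in age and size (assumed, arbitrary weighting)."""
    def distance(donor: Talker) -> float:
        return (abs(donor.age_years - target.age_years)
                + abs(donor.height_cm - target.height_cm) / 10.0)
    return min(donors, key=distance)

def blend_voice(target: Talker, surrogate: Talker) -> Tuple[Source, Filter]:
    """Source from the target (identity), filter from the surrogate (clarity)."""
    return target.source, surrogate.filter

# Invented example values, for illustration only.
target = Talker("target", 10, 138, Source(260.0, 62.0, 2.5), Filter((900.0,)))
donors = [
    Talker("adult donor", 35, 180, Source(110.0, 65.0, 4.5), Filter((700.0, 1200.0, 2600.0))),
    Talker("child donor", 11, 142, Source(250.0, 64.0, 4.0), Filter((1000.0, 1700.0, 3300.0))),
]
surrogate = pick_surrogate(target, donors)
voice_source, voice_filter = blend_voice(target, surrogate)
```

The blended voice keeps the target's pitch, loudness, and tempo while taking its articulation from the best-matched donor.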

So once you have that in mind, how do you go about building this voice? Well, you have to find someone who is willing to be a surrogate. It's not such an ominous thing. Being a surrogate donor only requires you to say a few hundred to a few thousand utterances. The process goes something like this.

(Video) Voice: Things happen in pairs.

I love to sleep.

The sky is blue without clouds.

RP: Now she's going to go on like this for about three to four hours, and the idea is not for her to say everything that the target is going to want to say, but to cover all the different combinations of the sounds that occur in the language. The more speech you have, the better-sounding voice you're going to have. Once you have those recordings, we have to parse them into little snippets of speech, one- or two-sound combinations, sometimes even whole words, that start populating a dataset or a database. We're going to call this database a voice bank. Now the power of the voice bank is that from it, we can now say any new utterance, like, "I love chocolate" -- everyone needs to be able to say that -- fish through that database and find all the segments necessary to say that utterance.

(Video) Voice: I love chocolate.
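
The voice-bank lookup described above can be sketched as follows. This is a simplification under loud assumptions: letters stand in for speech sounds, each snippet is a short list of made-up sample values, and the greedy longest-match search is a stand-in for whatever unit selection a real system uses. The function names and the bank contents are hypothetical.

```python
def segment(word, bank, max_len=2):
    """Cover a word with the longest snippets available, left to right."""
    units, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            unit = word[i:i + size]
            if unit in bank:
                units.append(unit)
                i += size
                break
        else:
            # No one- or two-sound snippet starts here: the bank lacks coverage.
            raise KeyError(f"no recorded snippet covers {word[i:]!r}")
    return units

def synthesize(text, bank):
    """Return the units chosen and their samples joined end to end."""
    units = []
    for word in text.lower().split():
        # Prefer a whole recorded word; otherwise fall back to smaller snippets.
        units.extend([word] if word in bank else segment(word, bank))
    samples = [sample for unit in units for sample in bank[unit]]
    return units, samples

# Toy voice bank: snippet label -> recorded samples (invented values).
voice_bank = {
    "i": [0.1],
    "love": [0.2, 0.3],
    "ch": [0.4], "oc": [0.5], "ol": [0.6], "at": [0.7], "e": [0.8],
}

units, samples = synthesize("I love chocolate", voice_bank)
```

Here "i" and "love" are found as whole words, while "chocolate" is assembled from smaller snippets. A greedy search like this can miss a valid segmentation that a more careful search would find, which is one reason more recorded speech gives a better-sounding voice.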

RP: So that's speech synthesis. It's called concatenative synthesis, and that's what we're using. That's not the novel part. What's novel is how we make it sound like this young woman.

This is Samantha. I met her when she was nine, and since then, my team and I have been trying to build her a personalized voice. We first had to find a surrogate donor, and then we had to have Samantha produce some utterances. What she can produce are mostly vowel-like sounds, but that's enough for us to extract her source characteristics. What happens next is best described by my daughter's analogy. She's six. She calls it mixing colors to paint voices. It's beautiful. It's exactly that. Samantha's voice is like a concentrated sample of red food dye which we can infuse into the recordings of her surrogate to get a pink voice just like this.

(Video) Samantha: Aaaaaah.

RP: So now, Samantha can say this.

(Video) Samantha: This voice is only for me. I can't wait to use my new voice with my friends.

RP: Thank you. (Applause)

I'll never forget the gentle smile that spread across her face when she heard that voice for the first time. Now there's millions of people around the world like Samantha, millions, and we've only begun to scratch the surface. What we've done so far is we have a few surrogate talkers from around the U.S. who have donated their voices, and we have been using those to build our first few personalized voices. But there's so much more work to be done. For Samantha, her surrogate came from somewhere in the Midwest, a stranger who gave her the gift of voice. And as a scientist, I'm so excited to take this work out of the laboratory and finally into the real world so it can have real-world impact. What I want to share with you next is how I envision taking this work to that next level. I imagine a whole world of surrogate donors from all walks of life, different sizes, different ages, coming together in this voice drive to give people voices that are as colorful as their personalities. To do that as a first step, we've put together this website, VocaliD.org, as a way to bring together those who want to join us as voice donors, as expertise donors, in whatever way to make this vision a reality.

They say that giving blood can save lives. Well, giving your voice can change lives. All we need is a few hours of speech from our surrogate talker, and as little as a vowel from our target talker, to create a unique vocal identity.

So that's the science behind what we're doing. I want to end by circling back to the human side that is really the inspiration for this work. About five years ago, we built our very first voice for a little boy named William. When his mom first heard this voice, she said, "This is what William would have sounded like had he been able to speak." And then I saw William typing a message on his device. I wondered, what was he thinking? Imagine carrying around someone else's voice for nine years and finally finding your own voice. Imagine that.

This is what William said: "Never heard me before."

Thank you.

(Applause)
